Why try to reinvent the wheel when there are already tools in place that work really, really well?
I’ve been figuring out how to clean and analyze this dataset. It’s not big data, but it’s big enough that my laptop couldn’t handle it, so I taught myself how to work on an AWS instance (which will be the subject of another post one day).
After hacking at it for a day or two, it turns out that using pandas’ aggregation features was the simplest and quickest way. Go figure.
%%time
import pandas as pd

# reading in orders dataset & slicing on relevant columns
df_orders = pd.read_csv(fname_orders)
df_orders = df_orders[['order_id', 'user_id']][df_orders.eval_set == 'prior']

# reading in product order dataset & slicing on relevant columns
df_orders_products = pd.read_csv(fname_orders_products)
df_orders_products = df_orders_products[['order_id', 'product_id']]

# merging the above datasets on order_id
df_merged = df_orders_products.merge(df_orders, on='order_id')

# counting how many times each product was ordered
total_product_count = df_orders_products.product_id.value_counts()

# filtering for products that had more than 500 orders
included_product_ids = sorted(list(total_product_count.index[total_product_count > 500]))

# slicing the merged dataframe on the most-ordered products
df_merged = df_merged[df_merged.product_id.isin(included_product_ids)]

# counting how many times each user ordered each product
agg = df_merged[['user_id', 'product_id']].groupby(by=['user_id', 'product_id']).size()

# unstacking to create a user x product matrix
agg_unstack = agg.unstack()
CPU times: user 34.5 s, sys: 8.9 s, total: 43.4 s
Wall time: 43.4 s
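To make the final groupby/size/unstack step concrete, here’s a minimal sketch on made-up toy data (the IDs and counts below are invented for illustration, not from the real dataset):

import pandas as pd

# made-up toy data: each row is one (user, product) order line
df = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2, 3],
    'product_id': [10, 10, 20, 10, 30, 20],
})

# count how many times each user ordered each product...
counts = df.groupby(['user_id', 'product_id']).size()

# ...then pivot the product IDs out into columns
matrix = counts.unstack()
print(matrix)
# product_id   10   20   30
# user_id
# 1           2.0  1.0  NaN
# 2           1.0  NaN  1.0
# 3           NaN  1.0  NaN

Note that unstack introduces NaN for user/product pairs that never occurred, which silently upcasts the counts to floats; unstack(fill_value=0) keeps them as integers if that matters downstream.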
The sizes of these dataframes:
df_orders shape: (3214874, 2)
df_orders_products shape: (32434489, 2)
df_merged shape: (28785570, 3)
agg shape: (11233928,)
agg_unstack shape: (206139, 8280)
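One caveat those last two shapes hint at: agg has about 11.2M entries, but the unstacked matrix has 206,139 × 8,280 ≈ 1.7 billion cells, so well over 99% of it is NaN padding held in dense memory. If that ever stops fitting, a sparse matrix is one alternative. This is just a rough sketch of that idea, assuming scipy is installed and df_merged is the filtered frame from the cell above:

import numpy as np
from scipy.sparse import coo_matrix

# encode users and products as consecutive integer codes
users = df_merged['user_id'].astype('category')
products = df_merged['product_id'].astype('category')

# one entry per order row; duplicate (user, product) coordinates are
# summed when converting to CSR, reproducing the groupby counts
user_product_counts = coo_matrix(
    (np.ones(len(df_merged), dtype=np.int32),
     (users.cat.codes, products.cat.codes)),
    shape=(users.cat.categories.size, products.cat.categories.size),
).tocsr()

# the row/column labels live in users.cat.categories
# and products.cat.categories

The trade-off is that you lose the labeled-axes convenience of a DataFrame, so for anything that fits in RAM the pandas version above is still the simpler choice.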