[New Kaggle Multimodal Competition] A Baseline for H&M Personalized Fashion Recommendations

peter AINLP 2022-04-20 14:24

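Source: reader submission
Author: peter
Editor: 学姐

The author is currently an undergraduate student and works as an algorithm researcher at the artificial intelligence center of a university in Hong Kong. The author has interned at several leading companies and research groups and has extensive experience in algorithm development.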

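Competition link

https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations

Competition description

In this competition you are given customers' purchase history over a period of time, along with supporting metadata. The available metadata ranges from garment type and customer age, to text from product descriptions, to images of the garments. Your task is to predict which articles each customer will purchase in the 7 days immediately after the training data ends. Customers who make no purchase during that period are excluded from scoring.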

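※ Competition timeline

February 2, 2022 - Competition start.

May 2, 2022 - Entry deadline. You must accept the competition rules before this date in order to compete.

May 2, 2022 - Team merger deadline. This is the last day participants may join or merge teams.

May 9, 2022 - Final submission deadline.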

※ Generous prizes

1st place: $15,000

2nd place: $10,000

3rd place: $8,000

4th place: $7,000

5th place: $5,000

6th place: $5,000

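※ Why we recommend it

Multimodal learning is one of the newer and more popular areas of data science. The field is less crowded than most, so results are easier to achieve and can also be deployed in industry, which makes it a strong choice of direction for anyone entering data science now. Because this is a multimodal competition, you may use any of the data for inference: whether you study algorithms for categorical data or dig into NLP and CV is up to you.

By taking part you can pick up knowledge across recommender systems, CV, and NLP, and a medal will help with job hunting in any of these directions.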

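Data description

The competition provides four kinds of data, three tables and one set of images:

images - the product image for each article_id
articles - detailed metadata for the product behind each article_id
customers - detailed metadata for each customer_id
transactions_train - historical purchase records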

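Data visualization

1. Types of article metadata

2. Distribution of article categories

3. Customer-related data

4. Customer age distribution

5. Customers' perception of new fashion

6. Price distribution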

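Evaluation metric

MAP@12 (Mean Average Precision at 12): there is no penalty for submitting the full 12 predictions for a customer who bought fewer than 12 items, so it is advantageous to make 12 predictions for every customer.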
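
To make the "no penalty" point concrete, here is a minimal sketch of MAP@12 in plain Python. The function names `average_precision_at_k` and `map_at_k` are chosen for illustration and are not part of any competition toolkit; the sketch assumes the usual definition, where precision is summed at every rank that holds a correct item and divided by min(number of purchased items, 12).

```python
from typing import Sequence


def average_precision_at_k(actual: Sequence[str], predicted: Sequence[str], k: int = 12) -> float:
    """AP@k for one customer: sum precision at each rank that holds a hit."""
    if not actual:
        return 0.0
    actual_set = set(actual)
    seen = set()
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # A repeated prediction of the same item is only counted once.
        if item in actual_set and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual_set), k)


def map_at_k(actuals, predictions, k: int = 12) -> float:
    """Mean AP@k over customers who bought at least one item."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions) if a]
    return sum(scores) / len(scores) if scores else 0.0


bought = ["A", "B"]
short = ["A", "X", "B"]                          # only 3 predictions
padded = short + [f"pad{i}" for i in range(9)]   # padded to 12 with wrong guesses
print(round(average_precision_at_k(bought, short), 4))   # 0.8333
print(round(average_precision_at_k(bought, padded), 4))  # 0.8333
```

The nine wrong guesses appended at the end leave the score unchanged, which is why filling all 12 slots can only help.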

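Building the baseline

We use the correlations between customer age groups to inform the predictions for each group. The analysis is based on the matrix of correlation coefficients between the age groups' purchases.

[Figure: correlation matrix of purchases by customer age group]

Our observations from the data:

The two most similar age groups are (49, 59] and (59, 69], with a correlation coefficient of 0.68.
The two least correlated age groups are (-1, 19] and (69, 119], with a correlation coefficient of 0.09.
According to the [EDA](https://www.kaggle.com/hechtjp/eda-based-on-timeseries), (19, 29] is the most populous age group; the age group most correlated with it has a coefficient of 0.59.
The top 100 articles (products) of the age groups differ by at least 30%, so predicting each age group separately should work better than making one prediction for everyone.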
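
A correlation matrix like this can be computed roughly as follows. This is only a sketch on toy data: the two small frames stand in for customers.csv and transactions_train.csv, the bin edges 39 and 49 are assumed (the text names only some of the groups), and the exact procedure used for the original figure is not shown in this post.

```python
import pandas as pd

# Toy stand-ins for customers.csv and transactions_train.csv (hypothetical values).
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "age": [18, 25, 27, 63],
})
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4"],
    "article_id": ["a1", "a2", "a1", "a3", "a1", "a3", "a2", "a2"],
})

# Age-bin edges; 39 and 49 are assumptions, the other edges appear in the text.
bins = [-1, 19, 29, 39, 49, 59, 69, 119]
labels = [f"({lo}, {hi}]" for lo, hi in zip(bins[:-1], bins[1:])]
customers["age_bins"] = pd.cut(customers["age"], bins=bins, labels=labels).astype(str)

merged = transactions.merge(customers[["customer_id", "age_bins"]], on="customer_id")

# Rows: articles, columns: age groups, values: number of purchases.
counts = pd.crosstab(merged["article_id"], merged["age_bins"])

# Pearson correlation between the per-article purchase counts of each pair of age groups.
corr = counts.corr()
print(corr.round(2))
```

Two age groups with a high coefficient buy largely the same articles in similar proportions, so their popular-item lists are close to interchangeable; a low coefficient argues for separate predictions.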

Baseline pipeline

Use a rule-based algorithm
Predict separately for each age group

Prediction code:

import cudf
import numpy as np
import pandas as pd

# listUniBins (the age bins), dfCustomers (customers with an 'age_bins' column)
# and N (size of the fallback list) are assumed to be defined earlier in the
# notebook; they are not shown in this article.

# Loop over the age groups and predict each one separately
for uniBin in listUniBins:
    df = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv',
                       usecols=['t_dat', 'customer_id', 'article_id'],
                       dtype={'article_id': 'int32', 't_dat': 'string', 'customer_id': 'string'})

    # Customers without age information form their own group
    if str(uniBin) == 'nan':
        dfCustomersTemp = dfCustomers[dfCustomers['age_bins'].isnull()]
    else:
        dfCustomersTemp = dfCustomers[dfCustomers['age_bins'] == uniBin]
    dfCustomersTemp = dfCustomersTemp.drop(['age_bins'], axis=1)
    dfCustomersTemp = cudf.from_pandas(dfCustomersTemp)

    # Keep only the transactions of customers in this age group
    df = df.merge(dfCustomersTemp[['customer_id', 'age']], on='customer_id', how='inner')
    print(f'The shape of scope transaction for {uniBin} is {df.shape}. \n')

    # Shorten customer_id to an int64 built from its last 16 hex characters
    df['customer_id'] = df['customer_id'].str[-16:].str.hex_to_int().astype('int64')
    df['t_dat'] = cudf.to_datetime(df['t_dat'])
    last_ts = df['t_dat'].max()

    # ldbw: map every date to the Tuesday on or after it (the end of its sales week)
    tmp = df[['t_dat']].copy().to_pandas()
    tmp['dow'] = tmp['t_dat'].dt.dayofweek
    tmp['ldbw'] = tmp['t_dat'] - pd.to_timedelta(tmp['dow'] - 1, unit='D')
    tmp.loc[tmp['dow'] >= 2, 'ldbw'] += pd.Timedelta(days=7)
    df['ldbw'] = tmp['ldbw'].values

    # Weekly sales count per article
    weekly_sales = df.drop('customer_id', axis=1).groupby(['ldbw', 'article_id']).count().reset_index()
    weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})
    df = df.merge(weekly_sales, on=['ldbw', 'article_id'], how='left')

    # count_targ: sales of the same article in the last week of the training data
    weekly_sales = weekly_sales.reset_index().set_index('article_id')
    df = df.merge(
        weekly_sales.loc[weekly_sales['ldbw'] == last_ts, ['count']],
        on='article_id', suffixes=("", "_targ"))
    df['count_targ'] = df['count_targ'].fillna(0)
    del weekly_sales

    # quotient: last-week sales relative to sales in the week of the purchase
    df['quotient'] = df['count_targ'] / df['count']

    # Fallback list: the N articles with the largest summed quotient in this age group
    target_sales = df.drop('customer_id', axis=1).groupby('article_id')['quotient'].sum()
    general_pred = target_sales.nlargest(N).index.to_pandas().tolist()
    general_pred = ['0' + str(article_id) for article_id in general_pred]
    general_pred_str = ' '.join(general_pred)
    del target_sales

    # Time-decay weight y for each purchase; x = days before the last date, at least 1
    tmp = df.copy().to_pandas()
    tmp['x'] = ((last_ts - tmp['t_dat']) / np.timedelta64(1, 'D')).astype(int)
    tmp['dummy_1'] = 1
    tmp['x'] = tmp[["x", "dummy_1"]].max(axis=1)
    a, b, c, d = 2.5e4, 1.5e5, 2e-1, 1e3
    tmp['y'] = a / np.sqrt(tmp['x']) + b * np.exp(-c * tmp['x']) - d
    tmp['dummy_0'] = 0
    tmp['y'] = tmp[["y", "dummy_0"]].max(axis=1)  # clip negative weights to 0

    # Score each (customer, article) pair and keep each customer's top 12
    tmp['value'] = tmp['quotient'] * tmp['y']
    tmp = tmp.groupby(['customer_id', 'article_id']).agg({'value': 'sum'})
    tmp = tmp.reset_index()
    tmp = tmp.loc[tmp['value'] > 0]
    tmp['rank'] = tmp.groupby("customer_id")["value"].rank("dense", ascending=False)
    tmp = tmp.loc[tmp['rank'] <= 12]

    # Join each customer's articles into one space-separated string, best first
    purchase_df = tmp.sort_values(['customer_id', 'value'], ascending=False).reset_index(drop=True)
    purchase_df['prediction'] = '0' + purchase_df['article_id'].astype(str) + ' '
    purchase_df = purchase_df.groupby('customer_id').agg({'prediction': 'sum'}).reset_index()
    purchase_df['prediction'] = purchase_df['prediction'].str.strip()
    purchase_df = cudf.DataFrame(purchase_df)

    # Build the submission rows for the customers in this age group
    sub = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv',
                        usecols=['customer_id'],
                        dtype={'customer_id': 'string'})
    sub = sub.merge(dfCustomersTemp[['customer_id', 'age']], on='customer_id', how='inner')
    sub['customer_id2'] = sub['customer_id'].str[-16:].str.hex_to_int().astype('int64')
    sub = sub.merge(purchase_df, left_on='customer_id2', right_on='customer_id', how='left',
                    suffixes=('', '_ignored'))
    sub = sub.to_pandas()

    # Customers without history get the fallback list; everyone else is padded with it
    sub['prediction'] = sub['prediction'].fillna(general_pred_str)
    sub['prediction'] = sub['prediction'] + ' ' + general_pred_str
    sub['prediction'] = sub['prediction'].str.strip()
    # 12 ids x 10 characters + 11 spaces = 131 characters
    sub['prediction'] = sub['prediction'].str[:131]
    sub = sub[['customer_id', 'prediction']]
    sub.to_csv(f'submission_{uniBin}.csv', index=False)
    print(f'Saved prediction for {uniBin}. The shape is {sub.shape}. \n')
    print('-' * 50)

print('Finished.\n')
print('=' * 50)
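
The time-decay weight in the middle of the loop is easier to judge in isolation. The sketch below reuses the constants a, b, c, d from the code above; the helper name `recency_weight` is introduced here for illustration only. The exponential term boosts purchases from the last few days, the inverse-square-root term decays slowly, and subtracting d before clipping means the weight reaches zero once a / sqrt(x) falls below d, which is after roughly (a / d)^2 = 625 days.

```python
import numpy as np

a, b, c, d = 2.5e4, 1.5e5, 2e-1, 1e3  # constants from the baseline code


def recency_weight(days_ago):
    """Weight of a purchase made `days_ago` days before the last training day."""
    x = np.maximum(np.asarray(days_ago, dtype=float), 1.0)  # clamp to at least 1 day
    y = a / np.sqrt(x) + b * np.exp(-c * x) - d
    return np.maximum(y, 0.0)  # negative weights are clipped to zero


for days in (1, 7, 30, 365, 700):
    print(days, round(float(recency_weight(days)), 1))
```

In the baseline this weight is multiplied by each purchase's quotient and summed per customer and article, so a recent purchase of an article that is still selling well ranks highest.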

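Thoughts on the competition's challenges

1. How to do more thorough feature engineering based on the results of the data analysis

2. How to use data from multiple modalities in the predictions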
