Implementing UCF Quickly in 500 Lines of SQL

source link: http://www.cnblogs.com/arachis/p/UCF.html

Foreword

UCF is short for User-based Collaborative Filtering. The rough idea: use behaviour data to find each user's group of similar users (neighbours), then recommend the items those neighbours like. Sounds simple enough, so without further ado, let's write the SQL.

SQL

select uid1,uid2,sim
from (
    select uid1
        ,uid2
        ,cnt12 / sqrt(cnt1*cnt2) sim
        ,row_number() over(partition by uid1 order by cnt12 / sqrt(cnt1*cnt2) desc) sim_rn
    from (
        select a.uid uid1
            ,b.uid uid2
            ,count(a.iid) cnt12 
        from tb_behavior a
        join tb_behavior b
        on a.iid = b.iid
        where a.uid <> b.uid
        group by a.uid,b.uid
    ) a12
    join (select uid,count(iid) cnt1 from tb_behavior group by uid) a1
    on a12.uid1 = a1.uid
    join (select uid,count(iid) cnt2 from tb_behavior group by uid) a2
    on a12.uid2 = a2.uid
) tb_neighbour
where sim > 0.1 and sim_rn <= 30

To apply this, just replace tb_behavior above with your own user-behaviour table; iid and uid are the item id and user id respectively.
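
For concreteness, here is a minimal sketch of the schema these queries assume (hypothetical DDL; the column types and storage format are illustrative, not from the original article):

create table if not exists tb_behavior (
    uid bigint comment 'user id',
    iid bigint comment 'item id'
)
comment 'one row per (user,item) interaction, ideally deduplicated'
stored as orc;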

The measure is co-occurrence similarity: the number of items two users both like, divided by the square root of the product of the number of items each user likes. Finally, each user's top 30 most similar users are kept as the neighbours that recommendations will be based on.
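
Written out, with I(u) denoting the set of items user u has interacted with, this is the cosine similarity between the two users' binary preference vectors:

    sim(u1, u2) = |I(u1) ∩ I(u2)| / sqrt(|I(u1)| * |I(u2)|) = cnt12 / sqrt(cnt1 * cnt2)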

With the neighbour table built, the next step is recommending based on the neighbours' preferences. The SQL is as follows:

select uid1,iid
from (
    select uid1
        ,iid
        ,max(sim) score
        ,row_number() over(partition by uid1 order by max(sim) desc) user_rn
    from tb_neighbour a12
    join (select uid,iid from tb_behavior) a2
    on a12.uid2 = a2.uid
    join (select uid,collect_set(iid) iids1 from tb_behavior group by uid) a1
    on a12.uid1 = a1.uid
    where not array_contains(iids1,a2.iid)
    group by uid1,iid
) tb_rec
where user_rn <= 500

Note that both the top-30 neighbour cut above and the top-500 cap on each user's recommendation list are engineering optimizations: truncation saves storage. Readers can set these to whatever their business needs.

A quick note on the tables: a1 holds the items each user has already consumed, and a2 holds the items each of the user's neighbours likes. In other words, we take the items the neighbours like, filter out those the user has already consumed, and rank the rest by co-occurrence similarity.

Thoughts

The idea is simple, but in practice all sorts of problems come up during development. Let me pick the main ones to discuss:

  • 1. Data skew caused by the join: the tb_neighbour table is very large, and hot items often account for 80% of exposure and consumption records. How do we handle that?
  • 2. Incremental updates: in the framework above, tb_behavior is recomputed in full every run. Can we instead update the neighbour table and the recommendations incrementally, and cut the computation time?

Data skew caused by the join

Consider problem 1 first. Since the goal is to find similar neighbours and the join on items only serves to pair users up, the natural idea is to downsample each item's feed of behaviour records; the loss in similarity precision is almost negligible.

Here is an attempt at implementing that idea:

with tb_behavior_sample as (
    select uid,iid 
    from (
        select uid
            ,iid
            ,row_number() over(partition by iid order by rand()) feed_rn
        from tb_behavior
    ) bh
    where feed_rn <= 50000
) 

select uid1,uid2,sim
from (
    select uid1
        ,uid2
        ,cnt12 / sqrt(cnt1*cnt2) sim
        ,row_number() over(partition by uid1 order by cnt12 / sqrt(cnt1*cnt2) desc) sim_rn
    from (
        select a.uid uid1
            ,b.uid uid2
            ,count(a.iid) cnt12 
        from tb_behavior_sample a
        join tb_behavior_sample b
        on a.iid = b.iid
        where a.uid <> b.uid
        group by a.uid,b.uid
    ) a12
    join (select uid,count(iid) cnt1 from tb_behavior group by uid) a1
    on a12.uid1 = a1.uid
    join (select uid,count(iid) cnt2 from tb_behavior group by uid) a2
    on a12.uid2 = a2.uid
) tb_neighbour
where sim > 0.1 and sim_rn <= 30

This uses Hive's with ... as syntax, which readers can look up on their own; space is limited, so I won't expand on it here. feed_rn randomly samples up to 50000 records per item. In practice, first check the item distribution to pick a threshold, e.g. take the occurrence count of the 10th most popular item: then at most the 10 hottest items get truncated, so the similarity numerator cnt12 shrinks by at most 10 (each item contributes at most 1 to a pair's co-occurrence), while the denominator is unchanged because cnt1 and cnt2 are still computed over the full tb_behavior. For most cases this precision is plenty, and because the data skew is gone, the computation time drops dramatically.
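
As a quick way to eyeball the head of the distribution, something like the following helper query works (hypothetical, not part of the pipeline itself):

-- occurrence counts of the 10 most popular items;
-- the 10th count is a reasonable cutoff for feed_rn
select iid, cnt
from (
    select iid
        ,count(*) cnt
        ,row_number() over(order by count(*) desc) pop_rn
    from tb_behavior
    group by iid
) t
where pop_rn <= 10
order by pop_rn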

Incremental updates

Problem 2 is an engineering problem. With a lambda architecture, the initial full batch already produces decent results and can go straight to a gray release, with hourly or daily increments layered on top afterwards. A kappa architecture is comparatively more involved, since the incremental pipeline has to be designed from the start, and precision needs some time to accumulate. Which to choose depends on your data volume and what you're familiar with; here I'll illustrate with the kappa architecture only.

Reviewing the SQL above, we find that the only state we need to carry between runs is cnt12, cnt1, cnt2, and iids1, where iids1 is the array of items a user has already consumed (so tb_neighbour now has to persist the three counts alongside sim). The numeric columns can be updated by simple accumulation; merging the array column is more awkward. One option is to register a UDF; here I take a different approach: concatenate iids1 into a string, and split it back into an array at filtering time.
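
For the accumulation to work, the two state tables could look like this (hypothetical DDL; adjust the types to your ids):

create table if not exists tb_neighbour (
    uid1  bigint,
    uid2  bigint,
    sim   double,
    cnt12 bigint,  -- accumulated co-occurrence count of the pair
    cnt1  bigint,  -- accumulated item count of uid1
    cnt2  bigint   -- accumulated item count of uid2
) stored as orc;

create table if not exists tb_user_consume (
    uid   bigint,
    iids1 string   -- comma-joined ids of items the user has consumed
) stored as orc;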

with tb_behavior_sample_incr as (
    select uid,iid 
    from (
        select uid
            ,iid
            ,row_number() over(partition by iid order by rand()) feed_rn
        from tb_behavior_incr
    ) bh
    where feed_rn <= 50000
) 

insert overwrite table tb_neighbour
select uid1,uid2,sim,cnt12,cnt1,cnt2
from (
    select uid1
        ,uid2
        ,sum(cnt12) cnt12
        ,sum(cnt1) cnt1
        ,sum(cnt2) cnt2
        ,sum(cnt12) / sqrt(sum(cnt1)*sum(cnt2)) sim
        ,row_number() over(partition by uid1 order by sum(cnt12) / sqrt(sum(cnt1)*sum(cnt2)) desc) sim_rn
    from (
        -- counts accumulated by previous runs
        select uid1,uid2,cnt12,cnt1,cnt2
        from tb_neighbour
        union all
        -- counts contributed by the current increment
        select a.uid uid1
            ,b.uid uid2
            ,count(a.iid) cnt12
            ,max(a1.cnt1) cnt1
            ,max(a2.cnt2) cnt2
        from tb_behavior_sample_incr a
        join tb_behavior_sample_incr b
        on a.iid = b.iid
        join (select uid,count(iid) cnt1 from tb_behavior_incr group by uid) a1
        on a.uid = a1.uid
        join (select uid,count(iid) cnt2 from tb_behavior_incr group by uid) a2
        on b.uid = a2.uid
        where a.uid <> b.uid
        group by a.uid,b.uid
    ) a12
    group by uid1,uid2
) tb_neighbour_merged
where sim > 0.1 and sim_rn <= 30

Here tb_behavior_sample_incr and tb_behavior_incr are the incremental counterparts of tb_behavior_sample and tb_behavior; union all plus group by merges the old and new results for each user pair, e.g. a pair with cnt12 = 3 in the existing tb_neighbour and cnt12 = 2 in the increment ends up with cnt12 = 5.

Under the kappa architecture the very first computation is itself just an increment; each increment's result is folded into tb_neighbour. This amounts to replaying lambda's initial full computation increment by increment, until it catches up with the latest time partition.
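
One way to produce tb_behavior_incr for each replay step, assuming tb_behavior is partitioned by day (the dt partition column and the run_dt variable are my own illustration, not from the original):

-- one increment = one partition of the behaviour log;
-- replay run_dt from the oldest partition up to the newest
insert overwrite table tb_behavior_incr
select uid, iid
from tb_behavior
where dt = '${hiveconf:run_dt}';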

insert overwrite table tb_user_consume
select uid,substring_index(concat_ws(",",collect_list(iids1)),",",10000) iids1 
from (
    select uid,concat_ws(",",collect_set(cast(iid as string))) iids1
    from tb_behavior_incr
    union all
    select uid,iids1
    from tb_user_consume
) a
group by uid

select uid1,iid
from (
    select uid1
        ,iid
        ,max(sim) score
        ,row_number() over(partition by uid1 order by max(sim) desc) user_rn
    from tb_neighbour a12
    join (select uid,cast(iid as string) iid from tb_behavior_incr) a2
    on a12.uid2 = a2.uid
    join (select uid,split(iids1,",") iids1 from tb_user_consume) a1
    on a12.uid1 = a1.uid
    where not array_contains(iids1,a2.iid)
    group by uid1,iid
) tb_rec
where user_rn <= 500

tb_user_consume caches up to 10000 of each user's most recently consumed item ids, and the latest items liked by a user's neighbours are then recommended to that user.

Afterword

Phew, finally done! With the pipeline above, UCF recommendation is basically complete. Is there a better way, though? I think that would be the embedding approach: for example, use item2vec to cluster users and recommend within clusters, or use the friend graph to recommend items a user's friends like. The former gives a finer-grained representation, and notably also supports negative sampling and checkpoint-based incremental updates; the latter carries higher trust between friends and is easier to explain.

