教程:使用SQL窗口函数做商务分析
source link: https://zzyongx.github.io/blogs/sql-window-functions-tutorial.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
教程:使用SQL窗口函数做商务分析
原文 statsbot 是一家商业智能公司,本文翻译自它的博客,原文使用PostgreSQL,本文改用MySQL(语法和原文略有不同,并且要求MySQL8),并补上了创建测试数据的sql,方便学习。
注 本文的sql都非常复杂,我估计性能也高不到哪去,非常不适合OLTP业务。
任何一个从事数据分析的人都有过这样的感叹,一个直观上很简单的方法,使用纯SQL解决却非常困难(例如:TOP N)。
如何计算每月的营收增长和总营收?我们可以相信计算结果吗?有没有重复数据的影响?如何找到每月的TOP N订单?所有这些问题都需要把商业语言转成编程语言。可以使用复杂的JOIN、数据导入excel、SQL存储过程等解决问题。窗口函数,让我们可以使用纯SQL解决问题,它更简单易读,更容易调试。
1 准备工作
1.1 安装启动MySQL
yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm percona-release setup ps80 yum install percona-server-server systemctl status mysqld # 首次启动 mysql8会把root密码写入日志,需要手动修改密码:alter user 'root'@'localhost' identified by '********' grep password /var/log/mysqld.log
1.2 准备数据
CREATE TABLE `orders` ( `order_id` int(11) NOT NULL, `customer_id` char(10) DEFAULT NULL, `state` char(2) DEFAULT NULL, `datetime` datetime DEFAULT NULL, `amount` int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8; INSERT INTO orders(order_id, customer_id, state, datetime, amount) VALUES (1, 'A', 'CA', '2017-01-01', 200), (2, 'B', 'CA', '2017-01-05', 250), (3, 'C', 'NY', '2017-01-12', 200), (4, 'A', 'CA', '2017-02-04', 400), (5, 'D', 'CA', '2017-02-05', 250), (5, 'D', 'CA', '2017-02-05 12:00:00', 300), (6, 'C', 'NY', '2017-02-19', 300), (7, 'A', 'CA', '2017-03-01', 150), (8, 'E', 'NY', '2017-03-05', 500), (9, 'F', 'CA', '2017-03-09', 250), (10, 'B', 'CA', '2017-03-21', 600);
这就是我们的测试数据集了:
order_id customer_id state datetime amount 1 A CA 2017-01-01 00:00:00 200 2 B CA 2017-01-05 00:00:00 250 3 C NY 2017-01-12 00:00:00 200 4 A CA 2017-02-04 00:00:00 400 5 D CA 2017-02-05 00:00:00 250 5 D CA 2017-02-05 12:00:00 300 6 C NY 2017-02-19 00:00:00 300 7 A CA 2017-03-01 00:00:00 150 8 E NY 2017-03-05 00:00:00 500 9 F CA 2017-03-09 00:00:00 250 10 B CA 2017-03-21 00:00:00 6002 营收增长
在商务中,月营收增长使用公式 100*(m1-m0)/m0
表示,m1是指定月份的营收,m0是它的前一月份的营收。你可以使用下面SQL计算出每个月的营收。
SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`, SUM(amount) AS revenue FROM orders GROUP BY `month` ORDER BY `month`
month revenue 2017-01-01 650 2017-02-01 1250 2017-03-01 1500
你可以把数据导入excel计算,并做出趋势图。这里我们演示如何使用SQL解决。
WITH monthly_revenue AS ( SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`, SUM(amount) AS revenue FROM orders GROUP BY `month` ), prev_month_revenue AS ( SELECT t1.*, t2.revenue AS prev_month_revenue FROM monthly_revenue t1 LEFT JOIN monthly_revenue t2 ON TIMESTAMPDIFF(MONTH, t2.`month`, t1.`month`)=1 ) SELECT *, ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth FROM prev_month_revenue ORDER BY `month`
month revenue prev_month_revenue revenue_growth 2017-01-01 650 NULL NULL 2017-02-01 1250 650 92.3 2017-03-01 1500 1250 20.0
这里我们先计算月营收,然后自JOIN,这个解法没问题,只是 TIMESTAMPDIFF
的可读性不好。下面介绍如何使用窗口函数解决:
WITH monthly_revenue AS ( SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`, SUM(amount) AS revenue FROM orders GROUP BY `month` ), prev_month_revenue AS ( SELECT *, LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue FROM monthly_revenue ) SELECT *, ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth FROM prev_month_revenue ORDER BY `month`
这两个SQL的结果一样,prev_month_revenue 子句用到了窗口函数。这里解释下 LAG
:
lag
是窗口函数,获取前一行
revenue
是表达式,表示你想从前一行获取revenue
字段over(order by month)
是窗口规格,指定如何定义前一行
,这里通过month排序定义行的顺序
窗口函数的结构基本如此: function (expression, [parameters]) OVER (window specification)
非常简单强大。例如,通过添加 partition by
窗口规格,可以把行分组,本例分别计算每个 state
的营收增长。
WITH monthly_revenue AS ( SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`, `state`, SUM(amount) AS revenue FROM orders GROUP BY `month`, `state` ), prev_month_revenue AS ( SELECT *, LAG(revenue) OVER (PARTITION BY state ORDER BY month) AS prev_month_revenue FROM monthly_revenue ) SELECT *, ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth FROM prev_month_revenue ORDER BY `month`, `state`
month state revenue prev_month_revenue revenue_growth 2017-01-01 CA 450 NULL NULL 2017-02-01 CA 950 450 111.1 2017-03-01 CA 1000 950 5.3 2017-01-01 NY 200 NULL NULL 2017-02-01 NY 300 200 50.0 2017-03-01 NY 500 300 66.7
通过第一个子句按州计算月营收,第二个子句按州计算上个月营收,第三个子句按州计算月营收增长。
3 流水总计
另一个常见问题是计算一段时间的流水,例如,计算到本月为止的总营收:
WITH monthly_revenue AS ( SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`, SUM(amount) AS revenue FROM orders GROUP BY `month` ) SELECT *, SUM(revenue) OVER (ORDER BY `month` rows BETWEEN unbounded preceding AND current row) AS `running_total` FROM monthly_revenue ORDER BY `month`;
month revenue running_total 2017-01-01 650 650 2017-02-01 1250 1900 2017-03-01 1500 3400
这里新出现的 rows BETWEEN unbounded preceding AND current row
称作 frame clause
它定义了感兴趣的行的子集(用于聚合函数),它的结构是 rows between frame_start and frame_end
。
frame_start
和 frame_end
取如下值,但是 frame_start
指定的行必须在 frame_end
前面:
unbounded preceding
窗口第一行unbounded following
窗口最后一行N preceding
前N行N following
后N行current row
当前行
SELECT *, SUM(amount) OVER () AS amount_total, SUM(amount) OVER (ORDER BY order_id rows BETWEEN unbounded preceding AND current row) as running_sum, SUM(amount) OVER (PARTITION BY customer_id ORDER BY `datetime` rows BETWEEN unbounded preceding AND current row) as running_sum_by_customer, AVG(amount) OVER (ORDER BY `datetime` rows BETWEEN 5 preceding AND current row) as trailing_avg FROM orders ORDER BY order_id;
- amount_total 没有指定窗口,默认取所有行
- running_sum 按照订单ID排序,本订单和之前所有订单
- running_sum_by_customer 按每个客户分组,本订单和之前所有订单
- trailing_avg 最近5笔订单的均值
4 处理重复数据
仔细观察数据,发现有两个order_id为5的数据,这是同一笔订单,初始值是250,后来追加到300。这导致计算月增长错误。(应该通过主键的方式避免)
排除重复数据:
SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY `datetime` DESC) AS row_num FROM orders ) row_num_tbl WHERE row_num=1
order_id customer_id state datetime amount row_num 1 A CA 2017-01-01 00:00:00 200 1 2 B CA 2017-01-05 00:00:00 250 1 3 C NY 2017-01-12 00:00:00 200 1 4 A CA 2017-02-04 00:00:00 400 1 5 D CA 2017-02-05 12:00:00 300 1 6 C NY 2017-02-19 00:00:00 300 1 7 A CA 2017-03-01 00:00:00 150 1 8 E NY 2017-03-05 00:00:00 500 1 9 F CA 2017-03-09 00:00:00 250 1 10 B CA 2017-03-21 00:00:00 600 1
下图演示了这个过程:
使用排重后的数据计算月营收增长:
WITH orders_cleaned as ( SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY `datetime` DESC) AS row_num FROM orders ) row_num_tbl WHERE row_num=1 ), monthly_revenue AS ( SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`, SUM(amount) AS revenue FROM orders_cleaned GROUP BY `month` ), prev_month_revenue AS ( SELECT *, LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue FROM monthly_revenue ) SELECT *, ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth FROM prev_month_revenue ORDER BY `month`;
对比错误数据,这个数据看起来更正确,月营收增长没有大幅波动。
month revenue prev_month_revenue revenue_growth 2017-01-01 650 NULL NULL 2017-02-01 1000 650 53.8 2017-03-01 1500 1000 50.0
给排重后的数据创建视图:
CREATE VIEW orders_cleaned AS SELECT order_id, customer_id, `state`, `datetime`, amount FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY `datetime` DESC) AS row_num FROM orders ) t WHERE row_num=1
5 分组计算TOP N
计算每月的TOP 2订单
WITH orders_ranked as ( SELECT *, DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`, ROW_NUMBER() OVER (PARTITION BY DATE_FORMAT(`datetime`, '%Y-%m-01') ORDER BY amount DESC, `datetime`) AS amount_rank FROM orders_cleaned ) SELECT * FROM orders_ranked WHERE amount_rank<=2 ORDER BY `month`
order_id customer_id state datetime amount month amount_rank 2 B CA 2017-01-05 00:00:00 250 2017-01-01 1 1 A CA 2017-01-01 00:00:00 200 2017-01-01 2 4 A CA 2017-02-04 00:00:00 400 2017-02-01 1 5 D CA 2017-02-05 12:00:00 300 2017-02-01 2 10 B CA 2017-03-21 00:00:00 600 2017-03-01 1 8 E NY 2017-03-05 00:00:00 500 2017-03-01 2
row_number
计算时间窗口内的行数,这里分月按照amount排序,行数代表了排名,行数越多,说明排名越靠后。下图演示了这个过程:
6 重复下单行为
重复下单(留存)行为非常重要,衡量用户忠实度。
计算重复下单率和平均客单增长率:
WITH customer_orders AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY `datetime`) AS customer_order_n, LAG(amount) OVER (PARTITION by customer_id ORDER BY `datetime`) AS prev_order_amount FROM orders_cleaned ) SELECT ROUND(100.0*SUM(CASE WHEN customer_order_n=2 THEN 1 END)/COUNT(DISTINCT customer_id),1) AS repeat_purchases, AVG(CASE WHEN customer_order_n=2 THEN 1.0*amount/prev_order_amount END) AS revenue_expansion FROM customer_orders
repeat_purchases revenue_expansion 50.0 1.966666667
下图演示了这个过程:
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK