3

教程:使用SQL窗口函数做商务分析

 2 years ago
source link: https://zzyongx.github.io/blogs/sql-window-functions-tutorial.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

教程:使用SQL窗口函数做商务分析

原文 statsbot 是一家商业智能公司,本文翻译自它的博客,原文使用PostgreSQL,本文改用MySQL(语法和原文略有不同,并且要求MySQL8),并补上了创建测试数据的sql,方便学习。

本文的sql都非常复杂,我估计性能也高不到哪去,非常不适合OLTP业务。

任何一个从事数据分析的人都有过这样的感叹,一个直观上很简单的方法,使用纯SQL解决却非常困难(例如:TOP N)。

如何计算每月的营收增长和总营收?我们可以相信计算结果吗?有没有重复数据的影响?如何找到每月的TOP N订单?所有这些问题都需要把商业语言转成编程语言。可以使用复杂的JOIN、数据导入excel、SQL存储过程等解决问题。窗口函数,让我们可以使用纯SQL解决问题,它更简单易读,更容易调试。

1 准备工作

1.1 安装启动MySQL

yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ps80
yum install percona-server-server
systemctl status mysqld
# 首次启动 mysql8会把root密码写入日志,需要手动修改密码:alter user 'root'@'localhost' identified by '********'
grep password /var/log/mysqld.log

1.2 准备数据

CREATE TABLE `orders` (
  `order_id` int(11) NOT NULL,
  `customer_id` char(10) DEFAULT NULL,
  `state` char(2) DEFAULT NULL,
  `datetime` datetime DEFAULT NULL,
  `amount` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO orders(order_id, customer_id, state, datetime, amount) VALUES
 (1,  'A', 'CA', '2017-01-01', 200),
 (2,  'B', 'CA', '2017-01-05', 250),
 (3,  'C', 'NY', '2017-01-12', 200),
 (4,  'A', 'CA', '2017-02-04', 400),
 (5,  'D', 'CA', '2017-02-05', 250),
 (5,  'D', 'CA', '2017-02-05 12:00:00', 300),
 (6,  'C', 'NY', '2017-02-19', 300),
 (7,  'A', 'CA', '2017-03-01', 150),
 (8,  'E', 'NY', '2017-03-05', 500),
 (9,  'F', 'CA', '2017-03-09', 250),
 (10, 'B', 'CA', '2017-03-21', 600);

这就是我们的测试数据集了:

order_id customer_id state datetime amount 1 A CA 2017-01-01 00:00:00 200 2 B CA 2017-01-05 00:00:00 250 3 C NY 2017-01-12 00:00:00 200 4 A CA 2017-02-04 00:00:00 400 5 D CA 2017-02-05 00:00:00 250 5 D CA 2017-02-05 12:00:00 300 6 C NY 2017-02-19 00:00:00 300 7 A CA 2017-03-01 00:00:00 150 8 E NY 2017-03-05 00:00:00 500 9 F CA 2017-03-09 00:00:00 250 10 B CA 2017-03-21 00:00:00 600

2 营收增长

在商务中,月营收增长使用公式 100*(m1-m0)/m0 表示,m1是指定月份的营收,m0是它的前一月份的营收。你可以使用下面SQL计算出每个月的营收。

SELECT
DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`,
SUM(amount) AS revenue
FROM orders
GROUP BY `month`
ORDER BY `month`

month revenue 2017-01-01 650 2017-02-01 1250 2017-03-01 1500

你可以把数据导入excel计算,并做出趋势图。这里我们演示如何使用SQL解决。

WITH
monthly_revenue AS (
  SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`,
  SUM(amount) AS revenue
  FROM orders
  GROUP BY `month`
),
prev_month_revenue AS (
  SELECT
  t1.*,
  t2.revenue AS prev_month_revenue
  FROM monthly_revenue t1
  LEFT JOIN monthly_revenue t2
  ON TIMESTAMPDIFF(MONTH, t2.`month`, t1.`month`)=1
)
SELECT *,
ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth
FROM prev_month_revenue
ORDER BY `month`

month revenue prev_month_revenue revenue_growth 2017-01-01 650 NULL NULL 2017-02-01 1250 650 92.3 2017-03-01 1500 1250 20.0

这里我们先计算月营收,然后自JOIN,这个解法没问题,只是 TIMESTAMPDIFF 的可读性不好。下面介绍如何使用窗口函数解决:

WITH
monthly_revenue AS (
  SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`,
  SUM(amount) AS revenue
  FROM orders
  GROUP BY `month`
),
prev_month_revenue AS (
  SELECT *,
  LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue
  FROM monthly_revenue
)
SELECT *,
ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth
FROM prev_month_revenue
ORDER BY `month`

这两个SQL的结果一样,prev_month_revenue 子句用到了窗口函数。这里解释下 LAG

  • lag 是窗口函数,获取 前一行
  • revenue 是表达式,表示你想从前一行获取 revenue 字段
  • over(order by month) 是窗口规格,指定如何定义 前一行 ,这里通过month排序定义行的顺序

窗口函数的结构基本如此: function (expression, [parameters]) OVER (window specification) 非常简单强大。例如,通过添加 partition by 窗口规格,可以把行分组,本例分别计算每个 state 的营收增长。

WITH
monthly_revenue AS (
  SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`,
  `state`,
  SUM(amount) AS revenue
  FROM orders
  GROUP BY `month`, `state`
),
prev_month_revenue AS (
  SELECT *,
  LAG(revenue) OVER (PARTITION BY state ORDER BY month) AS prev_month_revenue
  FROM monthly_revenue
)
SELECT *,
ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth
FROM prev_month_revenue
ORDER BY `month`, `state`

month state revenue prev_month_revenue revenue_growth 2017-01-01 CA 450 NULL NULL 2017-02-01 CA 950 450 111.1 2017-03-01 CA 1000 950 5.3 2017-01-01 NY 200 NULL NULL 2017-02-01 NY 300 200 50.0 2017-03-01 NY 500 300 66.7

通过第一个子句按州计算月营收,第二个子句按州计算上个月营收,第三个子句按州计算月营收增长。

3 流水总计

另一个常见问题是计算一段时间的流水,例如,计算到本月为止的总营收:

WITH
monthly_revenue AS (
  SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`,
  SUM(amount) AS revenue
  FROM orders
  GROUP BY `month`
)
SELECT *,
SUM(revenue) OVER (ORDER BY `month` rows BETWEEN unbounded preceding AND current row) AS `running_total`
FROM monthly_revenue
ORDER BY `month`;

month revenue running_total 2017-01-01 650 650 2017-02-01 1250 1900 2017-03-01 1500 3400

这里新出现的 rows BETWEEN unbounded preceding AND current row 称作 frame clause 它定义了感兴趣的行的子集(用于聚合函数),它的结构是 rows between frame_start and frame_end

frame_startframe_end 取如下值,但是 frame_start 指定的行必须在 frame_end 前面:

  • unbounded preceding 窗口第一行
  • unbounded following 窗口最后一行
  • N preceding 前N行
  • N following 后N行
  • current row 当前行
SELECT
*,
SUM(amount) OVER () AS amount_total,
SUM(amount) OVER (ORDER BY order_id rows BETWEEN unbounded preceding AND current row) as running_sum,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY `datetime` rows BETWEEN unbounded preceding AND current row) as running_sum_by_customer,
AVG(amount) OVER (ORDER BY `datetime` rows BETWEEN 5 preceding AND current row) as trailing_avg
FROM orders
ORDER BY order_id;
  • amount_total 没有指定窗口,默认取所有行
  • running_sum 按照订单ID排序,本订单和之前所有订单
  • running_sum_by_customer 按每个客户分组,本订单和之前所有订单
  • trailing_avg 最近5笔订单的均值
order_id customer_id state datetime amount amount_total running_sum running_sum_by_customer trailing_avg 1 A CA 2017-01-01 00:00:00 200 3400 200 200 200.0000 2 B CA 2017-01-05 00:00:00 250 3400 450 250 225.0000 3 C NY 2017-01-12 00:00:00 200 3400 650 200 216.6667 4 A CA 2017-02-04 00:00:00 400 3400 1050 600 262.5000 5 D CA 2017-02-05 12:00:00 300 3400 1600 550 266.6667 5 D CA 2017-02-05 00:00:00 250 3400 1300 250 260.0000 6 C NY 2017-02-19 00:00:00 300 3400 1900 500 283.3333 7 A CA 2017-03-01 00:00:00 150 3400 2050 750 266.6667 8 E NY 2017-03-05 00:00:00 500 3400 2550 500 316.6667 9 F CA 2017-03-09 00:00:00 250 3400 2800 250 291.6667 10 B CA 2017-03-21 00:00:00 600 3400 3400 850 350.0000

4 处理重复数据

仔细观察数据,发现有两个order_id为5的数据,这是同一笔订单,初始值是250,后来追加到300。这导致计算月增长错误。(应该通过主键的方式避免)

排除重复数据:

SELECT *
FROM (
  SELECT *,
  ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY `datetime` DESC) AS row_num
  FROM orders
) row_num_tbl
WHERE row_num=1

order_id customer_id state datetime amount row_num 1 A CA 2017-01-01 00:00:00 200 1 2 B CA 2017-01-05 00:00:00 250 1 3 C NY 2017-01-12 00:00:00 200 1 4 A CA 2017-02-04 00:00:00 400 1 5 D CA 2017-02-05 12:00:00 300 1 6 C NY 2017-02-19 00:00:00 300 1 7 A CA 2017-03-01 00:00:00 150 1 8 E NY 2017-03-05 00:00:00 500 1 9 F CA 2017-03-09 00:00:00 250 1 10 B CA 2017-03-21 00:00:00 600 1

下图演示了这个过程:

sql-window-duplicate-row.jpg

使用排重后的数据计算月营收增长:

WITH
orders_cleaned as (
  SELECT *
  FROM (
    SELECT *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY `datetime` DESC) AS row_num
    FROM orders
  ) row_num_tbl
  WHERE row_num=1
),
monthly_revenue AS (
  SELECT DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`,
  SUM(amount) AS revenue
  FROM orders_cleaned
  GROUP BY `month`
),
prev_month_revenue AS (
  SELECT *,
  LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue
  FROM monthly_revenue
)
SELECT *,
ROUND(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) AS revenue_growth
FROM prev_month_revenue
ORDER BY `month`;

对比错误数据,这个数据看起来更正确,月营收增长没有大幅波动。

month revenue prev_month_revenue revenue_growth 2017-01-01 650 NULL NULL 2017-02-01 1000 650 53.8 2017-03-01 1500 1000 50.0

给排重后的数据创建视图:

CREATE VIEW orders_cleaned AS
SELECT order_id, customer_id, `state`, `datetime`, amount
FROM (
  SELECT *,
  ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY `datetime` DESC) AS row_num
  FROM orders
) t
WHERE row_num=1

5 分组计算TOP N

计算每月的TOP 2订单

WITH orders_ranked as (
  SELECT *,
  DATE_FORMAT(`datetime`, '%Y-%m-01') AS `month`,
  ROW_NUMBER() OVER (PARTITION BY DATE_FORMAT(`datetime`, '%Y-%m-01') ORDER BY amount DESC, `datetime`) AS amount_rank
  FROM orders_cleaned
)
SELECT *
FROM orders_ranked
WHERE amount_rank<=2
ORDER BY `month`

order_id customer_id state datetime amount month amount_rank 2 B CA 2017-01-05 00:00:00 250 2017-01-01 1 1 A CA 2017-01-01 00:00:00 200 2017-01-01 2 4 A CA 2017-02-04 00:00:00 400 2017-02-01 1 5 D CA 2017-02-05 12:00:00 300 2017-02-01 2 10 B CA 2017-03-21 00:00:00 600 2017-03-01 1 8 E NY 2017-03-05 00:00:00 500 2017-03-01 2

row_number 计算时间窗口内的行数,这里分月按照amount排序,行数代表了排名,行数越多,说明排名越靠后。下图演示了这个过程:

sql-window-topn.jpg

6 重复下单行为

重复下单(留存)行为非常重要,衡量用户忠实度。

计算重复下单率和平均客单增长率:

WITH
customer_orders AS (
  SELECT *,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY `datetime`) AS customer_order_n,
  LAG(amount) OVER (PARTITION by customer_id ORDER BY `datetime`) AS prev_order_amount
  FROM orders_cleaned
)
SELECT
ROUND(100.0*SUM(CASE WHEN customer_order_n=2 THEN 1 END)/COUNT(DISTINCT customer_id),1) AS repeat_purchases,
AVG(CASE WHEN customer_order_n=2 THEN 1.0*amount/prev_order_amount END) AS revenue_expansion
FROM customer_orders

repeat_purchases revenue_expansion 50.0 1.966666667

下图演示了这个过程:

sql-window-repeat-purchase.jpg


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK