SQL 案例分析:移动平均值与累计求和

许多常见的聚合函数也可以作为窗口函数使用,包括AVG()、SUM()、COUNT()、MAX()以及MIN()函数等。今天我们介绍两个聚合窗口函数的使用案例。如果你觉得文章有用,欢迎关注❤️、点赞👍、推荐🎁
关于聚合函数的语法可以参考这篇文章。
示例表sales_monthly中存储了不同产品(苹果、香蕉以及桔子)每个月份的销量情况,以下是该表的创建脚本和数据:

-- 创建销量表sales_monthly
-- product表示产品名称,ym表示年月,amount表示销售金额(元)
CREATE TABLE sales_monthly(product VARCHAR(20), ym VARCHAR(10), amount NUMERIC(10, 2));

-- 生成测试数据
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201801',10159.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201802',10211.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201803',10247.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201804',10376.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201805',10400.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201806',10565.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201807',10613.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201808',10696.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201809',10751.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201810',10842.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201811',10900.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201812',10972.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201901',11155.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201902',11202.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201903',11260.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201904',11341.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201905',11459.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('苹果','201906',11560.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201801',10138.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201802',10194.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201803',10328.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201804',10322.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201805',10481.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201806',10502.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201807',10589.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201808',10681.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201809',10798.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201810',10829.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201811',10913.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201812',11056.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201901',11161.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201902',11173.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201903',11288.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201904',11408.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201905',11469.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('香蕉','201906',11528.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201801',10154.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201802',10183.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201803',10245.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201804',10325.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201805',10465.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201806',10505.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201807',10578.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201808',10680.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201809',10788.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201810',10838.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201811',10942.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201812',10988.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201901',11099.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201902',11181.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201903',11302.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201904',11327.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201905',11423.00);
INSERT INTO sales_monthly (product,ym,amount) VALUES ('桔子','201906',11524.00);

移动平均值

AVG()函数作为窗口函数使用时,可以用于计算随着当前行移动的窗口内数据行的平均值。例如,以下语句用于查找不同产品截止到每个月份为止、最近3个月的平均销量:

SELECT product AS "产品", ym "年月", amount "销量",
       AVG(amount) OVER (
         PARTITION BY product
         ORDER BY ym
         ROWS BETWEEN 2PRECEDING AND CURRENT ROW
       ) AS "最近平均销量"
FROM sales_monthly
ORDER BY product, ym;

AVG()函数OVER子句中的PARTITION BY选项表示按照产品进行分区,ORDERBY选项表示按照月份进行排序,ROWS BETWEEN 2 PRECEDING AND CURRENT ROW表示窗口从当前行的前2行开始直到当前行结束。该查询返回的结果如下:

产品|年月   |销量     |最近平均销量     
---|------|--------|------------
桔子|201801|10154.00|10154.000000
桔子|201802|10183.00|10168.500000
桔子|201803|10245.00|10194.000000
桔子|201804|10325.00|10251.000000
桔子|201805|10465.00|10345.000000
桔子|201806|10505.00|10431.666667
...

 对于“桔子”,第一个月份的分析窗口只有1行数据,因此平均销量为10154。第二个月份的分析窗口为第1行和第2行数据,因此平均销量为10168.5((10154+10183)/2)。第三个月份的分析窗口为第1行到第3行数据,因此平均销量为10194((10154+10183+10245)/3)。依此类推,直到计算完“桔子”所有月份的平均销量,然后开始计算其他产品的平均销量。

累计求和

SUM()函数作为窗口函数时,可以用于统计指定窗口内的累计值。例如,以下语句用于查找不同产品截止到当前月份为止的累计销量:

SELECT product AS "产品", ym "年月", amount "销量",
       SUM(amount) OVER (
         PARTITION BY product
         ORDER BY ym
         ROWS BETWEENUNBOUNDED PRECEDING AND CURRENT ROW
       ) AS "累计销量"
FROM sales_monthly
ORDER BY product, ym;

SUM()函数OVER子句中的PARTITION BY选项表示按照产品进行分区,ORDERBY选项表示按照月份进行排序,ROWS BETWEENUNBOUNDED PRECEDING AND CURRENT ROW表示窗口从当前分区第1行开始直到当前行结束。该查询返回的结果如下:

产品|年月   |销量      |累计销量    
---|------|--------|---------
桔子|201801|10154.00| 10154.00
桔子|201802|10183.00| 20337.00
桔子|201803|10245.00| 30582.00
桔子|201804|10325.00| 40907.00
桔子|201805|10465.00| 51372.00
桔子|201806|10505.00| 61877.00
...

对于“桔子”,第一个月份的分析窗口只有1行数据,因此累计销量为10154。第二个月份的分析窗口为第1行和第2行数据,因此累计销量为20337(10154+10183)。第三个月份的分析窗口为第1行到第3行数据,因此累计销量为30582(10154+10183+10245)。依此类推,直到计算完“桔子”所有月份的累计销量,然后开始计算其他产品的累计销量。提示:对于聚合窗口函数,如果我们没有指定ORDER BY选项,默认的窗口大小就是整个分区。如果我们指定了ORDERBY选项,默认的窗口大小就是分区的第一行直到当前行。因此,以上示例语句的中ROWSBETWEEN UNBOUNDED PRECEDING AND CURRENT ROW选项可以省略。

除了使用ROWS关键字以数据行为单位指定窗口的偏移量之外,我们也可以使用RANGE关键字以数值为单位指定窗口的偏移量。示例表transfer_log中记录了一些银行账号的交易日志,以下是该表创建脚本:

-- 创建银行交易日志表transfer_log
-- Oracle、MySQL、PostgreSQL以及SQLite
CREATE TABLE transfer_log
( log_id    INTEGER NOT NULL PRIMARY KEY, -- 交易日志编号
  log_ts    TIMESTAMP NOT NULL, -- 交易时间
  from_user VARCHAR(50) NOT NULL, -- 交易发起账号
  to_user   VARCHAR(50), -- 交易接收账号
  type      VARCHAR(10) NOT NULL, -- 交易类型
  amount    NUMERIC(10) NOT NULL -- 交易金额(元)
);

-- SQL Server
CREATE TABLE transfer_log
( log_id    INTEGER NOT NULL PRIMARY KEY, -- 交易日志编号
  log_ts    DATETIME2 NOT NULL, -- 交易时间
  from_user VARCHAR(50) NOT NULL, -- 交易发起账号
  to_user   VARCHAR(50), -- 交易接收账号
  type      VARCHAR(10) NOT NULL, -- 交易类型
  amount    NUMERIC(10) NOT NULL -- 交易金额(元)
);

-- 生成测试数据
-- Oracle 需要执行以下ALTER语句
-- ALTER SESSION SET nls_timestamp_format = 'YYYY-MM-DD HH24:MI:SS';
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (1,'2021-01-02 10:31:40','62221234567890',NULL,'存款',50000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (2,'2021-01-02 10:32:15','62221234567890',NULL,'存款',100000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (3,'2021-01-03 08:14:29','62221234567890','62226666666666','转账',200000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (4,'2021-01-05 13:55:38','62221234567890','62226666666666','转账',150000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (5,'2021-01-07 20:00:31','62221234567890','62227777777777','转账',300000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (6,'2021-01-09 17:28:07','62221234567890','62227777777777','转账',500000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (7,'2021-01-10 07:46:02','62221234567890','62227777777777','转账',100000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (8,'2021-01-11 09:36:53','62221234567890',NULL,'存款',40000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (9,'2021-01-12 07:10:01','62221234567890','62228888888881','转账',10000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (10,'2021-01-12 07:11:12','62221234567890','62228888888882','转账',8000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (11,'2021-01-12 07:12:36','62221234567890','62228888888883','转账',5000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (12,'2021-01-12 07:13:55','62221234567890','62228888888884','转账',6000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (13,'2021-01-12 07:14:24','62221234567890','62228888888885','转账',7000);
INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (14,'2021-01-

以下语句用于查找短期之内(5天)累计转账超过一百万元的账号:

-- Oracle、MySQL以及PostgreSQL
SELECT log_ts, from_user,total_amount
FROM (
    SELECT log_ts, from_user,
    SUM(amount) OVER (
      PARTITION BY from_user
      ORDER BY log_ts
      RANGE INTERVAL '5' DAYPRECEDING
      ) AS total_amount
    FROM transfer_log
    WHERE TYPE = '转账'
    ) t
WHERE total_amount >= 1000000;

其中,SUM()函数OVER子句中的RANGE选项指定了一个5天之内的时间窗口。该查询返回的结果如下。

log_ts               |from_user      |total_amount
-------------------|--------------|------------
2021-01-10 07:46:02|62221234567890|     1050000

账号“62221234567890”截止2021年01月10日07点46份02秒在最近5天之内累计转账105万。SQLite不支持INTERVAL时间常量,我们可以将时间戳数据转换为整数后使用。例如:

-- SQLite
WITH tl(log_ts, unix, from_user,amount) AS (
  SELECT log_ts, CAST(STRFTIME('%s',log_ts) AS INT), from_user, amount
  FROM transfer_log
  WHERE type = '转账'
)
SELECT log_ts, from_user, total_amount
FROM (
    SELECT log_ts, from_user,
    SUM(amount) OVER (
      PARTITION BY from_user
      ORDER BY unix
      RANGE 5 * 86400PRECEDING
      ) AS total_amount
    FROM tl
    ) t
WHERE total_amount >= 1000000;

我们首先定义了一个CTE,字段unix是将log_ts转换为1970年1月1日以来的整数秒。然后我们在SUM()函数中通过RANGE选项指定了一个5天(5*86400秒)之内的时间窗口。Microsoft SQL Server中的RANGE窗口大小选项只能指定UNBOUNDED PRECEDING、UNBOUNDED FOLLOWING或者CURRENT ROW,不能指定一个具体的数值,因此无法实现以上查询。

以上案例来自图书《SQL编程思想》。

发表评论

登录后才能评论
网站客服
网站客服
申请收录 侵权处理
分享本页
返回顶部