SQL statement recursive (PostgreSQL 9.1.4)
source link: https://www.codesd.com/item/sql-statement-recursive-postgresql-9-1-4.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
SQL statement recursive (PostgreSQL 9.1.4)
PostgreSQL 9.1
Business situation
Every month, there is a new batch of accounts given to a specific process. Every batch can be described by month, number of accounts and total balance of accounts. The goal of the process is to recover some of the balance back from customers. Each batch is than tracked separately on a monthly basis (amount recovered on each month since batch was transferred to the process).
Goal
My goal is to predict what amount will be recovered in the future.
Data definition
create table vintage_data (
granularity date, /* Month when account entered process*/
distance_in_months integer, /* Distance in months from date when accounts entered process*/
entry_accounts integer, /* Number of accounts that entered process in a given month*/
entry_amount numeric, /* Total amount for account that entered process in a given month*/
recovery_amount numeric /* Amount recovered in Nth month on accounts that entered process in a given month */
);
Sample data
insert into vintage_data values('2012-01-31',1,200,100000,1000);
insert into vintage_data values('2012-01-31',2,200,100000,2000);
insert into vintage_data values('2012-01-31',3,200,100000,3000);
insert into vintage_data values('2012-01-31',4,200,100000,3500);
insert into vintage_data values('2012-01-31',5,200,100000,3400);
insert into vintage_data values('2012-01-31',6,200,100000,3300);
insert into vintage_data values('2012-02-28',1,250,150000,1200);
insert into vintage_data values('2012-02-28',2,250,150000,1600);
insert into vintage_data values('2012-02-28',3,250,150000,1800);
insert into vintage_data values('2012-02-28',4,250,150000,1200);
insert into vintage_data values('2012-02-28',5,250,150000,1600);
insert into vintage_data values('2012-03-31',1,200,90000,1300);
insert into vintage_data values('2012-03-31',2,200,90000,1200);
insert into vintage_data values('2012-03-31',3,200,90000,1400);
insert into vintage_data values('2012-03-31',4,200,90000,1000);
insert into vintage_data values('2012-04-30',1,300,180000,1600);
insert into vintage_data values('2012-04-30',2,300,180000,1500);
insert into vintage_data values('2012-04-30',3,300,180000,4000);
insert into vintage_data values('2012-05-31',1,400,225000,2200);
insert into vintage_data values('2012-05-31',2,400,225000,6000);
insert into vintage_data values('2012-06-30',1,100,60000,1000);
Calculation Process
You can imagine the data as a triangular matrix (X values are to be forecasted):
distance_in_months 1 2 3 4 5 6
granularity entry_accounts entry_amount
2012-01-31 200 100000 1000 2000 3000 3500 3400 3300
2012-02-28 250 150000 1200 1600 1800 1200 1600 (X-1)
2012-03-31 200 90000 1300 1200 1400 1000 (X0) (X4)
2012-04-30 300 180000 1600 1500 4000 (X1) (X5) (X8)
2012-05-31 400 225000 2200 6000 (X2) (X6) (X9) (X11)
2012-06-30 100 60000 1000 (X3) (X7) (X10) (X12 (X13)
Algorithm
The goal I have is to forecast all the missing points (future). To illustrate the process, this is the calculation for the point X1
1) Get row totals for previous three months using distance up to 4:
2012-01-31 1000+2000+3000+3500=9500 (d4m3)
2012-02-28 1200+1600+1800+1200=5800 (d4m2)
2012-03-31 1300+1200+1400+1000=4900 (d4m1)
2) Get row totals for previous three months using distance up to 3:
2012-01-31 1000+2000+3000=6000 (d3m3)
2012-02-28 1200+1600+1800=4600 (d3m2)
2012-03-31 1300+1200+1400=3800 (d3m1)
3) Calculate weighted average running rate for distance 3 and distance 4 (weighted by entry_amount):
(d4m3+d4m2+d4m1)/(100000+150000+90000) = (9500+5800+4900)/(100000+150000+90000) = 20200/340000 = 0.0594
(d3m3+d3m2+d3m1)/(100000+150000+90000) = (6000+4600+3800)/(100000+150000+90000) = 14400/340000 = 0.0424
4) Calculate the change between distance 3 and distance 4
((d4m3+d4m2+d4m1)/(100000+150000+90000))/((d3m3+d3m2+d3m1)/(100000+150000+90000)) =
= (20200/340000)/(14400/340000) =
= 0.0594/0.0424 = 1.403 (PredictionRateForX1)
5) Calculate row totals for predicted month using distance up to 3:
2012-04-30 1600+1500+4000=7100
6) Calculate rate using entry_amount for predicted month
7100/180000 = 0.0394
7) Calculate rate predicted for X1
0.0394 * PredictionRateForX1 = 0.05534
8) Calculate amount for X1
(0.05534-0.0394)*180000 = 2869.2
Problem
The problem is how to calculate the rest of the matrix (from x-1 to x13) using SQL statement. It is obvious that this will require some sort of recursive algorithm.
It's a big task, split it up to make it more manageable. I would put that in a plpgsql function with RETURN TABLE
:
Create a temporary table for your "Calculation Process" matrix using a crosstab query You need the
tablefunc
module installed for that. Run (once per database):CREATE EXTENSION tablefunc;
Update the temp table field by field.
- Return table.
The following demo is fully functional and tested with PostgreSQL 9.1.4. Building on the table definition provided in the question:
-- DROP FUNCTION f_forcast();
CREATE OR REPLACE FUNCTION f_forcast()
RETURNS TABLE (
granularity date
,entry_accounts numeric
,entry_amount numeric
,d1 numeric
,d2 numeric
,d3 numeric
,d4 numeric
,d5 numeric
,d6 numeric) AS
$BODY$
BEGIN
--== Create temp table with result of crosstab() ==--
CREATE TEMP TABLE matrix ON COMMIT DROP AS
SELECT *
FROM crosstab (
'SELECT granularity, entry_accounts, entry_amount
,distance_in_months, recovery_amount
FROM vintage_data
ORDER BY 1, 2',
'SELECT DISTINCT distance_in_months
FROM vintage_data
ORDER BY 1')
AS tbl (
granularity date
,entry_accounts numeric
,entry_amount numeric
,d1 numeric
,d2 numeric
,d3 numeric
,d4 numeric
,d5 numeric
,d6 numeric
);
ANALYZE matrix; -- update statistics to help calculations
--== Calculations ==--
-- I implemented the first calculation for X1 and leave the rest to you.
-- Can probably be generalized in a loop or even a single statement.
UPDATE matrix m
SET d4 = (
SELECT (sum(x.d1) + sum(x.d2) + sum(x.d3) + sum(x.d4))
/(sum(x.d1) + sum(x.d2) + sum(x.d3)) - 1
-- removed redundant sum(entry_amount) from equation
FROM (
SELECT *
FROM matrix a
WHERE a.granularity < m.granularity
ORDER BY a.granularity DESC
LIMIT 3
) x
) * (m.d1 + m.d2 + m.d3)
WHERE m.granularity = '2012-04-30';
--- Next update X2 ..
--== Return results ==--
RETURN QUERY
TABLE matrix
ORDER BY 1;
END;
$BODY$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_forcast();
I have simplified quite a bit, removing some redundant steps in the calculation.
The solution employs a variety of advanced techniques. You need to know your way around PostgreSQL to work with this.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK