In-Line Views versus Correlated Sub-queries – tuning adventures in a Data Warehouse Report

Posted on September 30, 2017

Events have taken a worrying turn recently. I’m not talking about Kim Jong Un’s expensive new hobby, although, if his parents had bought him that kite when he was seven…
I’m not talking about the UK’s chief Brexit negotiator David Davis quoting directly from the Agile Manifesto and claiming that the British were “putting people before process” in the negotiations, although Agile as a negotiating strategy is rather…untested.
I’m not even talking about the sunny Bank Holiday Monday we had in England recently, even though this may be a sign of Global Warming (or possibly a portent for the end of days).
The fact is, we have an Ashes series coming up this winter and England still haven’t managed to find a top order that doesn’t collapse like a cheap deckchair in a light breeze.

On top of that, what started out as a relatively simple post – effectively a note to myself about using the row_number analytical function to overcome a recent performance glitch in a Data Warehouse Application – also seems to have developed an unexpected complication…

The Application

I’ve taken the liberty of stripping the application down to its essentials for illustrative purposes.

The application accepts a bulk upload of data into a table called INTERESTING_TIMES, for which the DDL looks like this :

create table interesting_times
(
id number,
worry varchar2(4000),
record_type varchar2(1)
)
/

Users may then amend individual records via some kind of UI which ultimately calls :

create or replace procedure adjust_interesting_times(
i_id interesting_times.id%type,
i_worry interesting_times.worry%type)
as
begin
merge into interesting_times
using dual
on ( id = i_id and record_type = 'A')
when matched then update
set worry = i_worry
when not matched then  insert( id, worry, record_type)
values(i_id, i_worry, 'A');
end;
/

That’s right, when an amendment is made, the application does not perform an update. Instead, it creates a new record with a RECORD_TYPE of A(mended), preserving the O(riginal) record.
For reasons unknown, it’s apparently OK to update an existing A(mended) record.
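By way of illustration – this call is my own hypothetical example against the procedure above, rather than anything from the original application – amending record 100 for the first time would insert a new ‘A’ row, whilst running it again would simply update that row :

exec adjust_interesting_times( 100, q'[England's top order]')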
Periodically, a report is run to list the latest version of each record in the table.

The report is as follows :

select it.id, it.worry, it.record_type
from interesting_times it
where record_type =
nvl((
select 'A'
from interesting_times it1
where it1.record_type = 'A'
and it1.id = it.id), 'O')
order by 1
/

It’s saved in a file called worry_list.sql. For each row, the correlated sub-query returns ‘A’ if an amended version of that record exists; otherwise it returns NULL, which the NVL defaults to ‘O’. The net effect is that only the latest version of each record passes the filter.

During application testing, there was no problem. The table was populated with test records…

insert into interesting_times( id, worry, record_type)
select level as id,
case mod(level, 3)
when 2 then 'North Korean Missile Exports'
when 1 then 'Global Warming'
else 'Brexit'
end as worry,   
'O' as record_type
from dual
connect by level <= 100
union all
select level,
q'[England's top order]',
'A'
from dual
where mod(level, 100) = 0
connect by level <= 100
/
commit;
exec dbms_stats.gather_table_stats(user, 'INTERESTING_TIMES')
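As a quick sanity check – my addition rather than part of the original post – we can confirm the shape of the test data, which should come back as 100 ‘O’ records and a single ‘A’ record :

select record_type, count(*)
from interesting_times
group by record_type
order by record_type
/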

With the data in place, the report ran nice and quickly…

set timing on
@worry_list.sql
set timing off
...
        ID WORRY                                    R
---------- ---------------------------------------- -
        97 Global Warming                           O
        98 North Korean Missile Exports             O
        99 Brexit                                   O
       100 England's top order                      A

100 rows selected.

Elapsed: 00:00:00.053

However, in production, data volumes are (or have grown to be) somewhat more significant…

truncate table interesting_times
/
insert into interesting_times( id, worry, record_type)
select level as id,
case mod(level, 3)
when 2 then 'North Korean Missile Exports'
when 1 then 'Global Warming'
else 'Brexit'
end as worry,   
'O' as record_type
from dual
connect by level <= 100000
union all
select level,
q'[England's top order]',
'A'
from dual
where mod(level, 100) = 0
connect by level <= 100000
/
commit;
exec dbms_stats.gather_table_stats(user, 'INTERESTING_TIMES')

…and with 101,000 rows to wade through (100,000 originals plus 1,000 amendments), the application now has some performance issues…

...
        ID WORRY                                    R
---------- ---------------------------------------- -
     99995 North Korean Missile Exports             O
     99996 Brexit                                   O
     99997 Global Warming                           O
     99998 North Korean Missile Exports             O
     99999 Brexit                                   O
    100000 England's top order                      A

100000 rows selected.

Elapsed: 00:19:58.67

Yes, the runtime has mushroomed to almost 20 minutes and users are not happy.
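If you want to reproduce the trace for yourself, something along these lines should do it. This is a sketch, and an assumption about how the figures were gathered – it isn’t taken from the original post :

-- switch on extended SQL trace for the current session...
alter session set tracefile_identifier = 'worry_list';
exec dbms_monitor.session_trace_enable( waits => true, binds => false)

@worry_list.sql

-- ...then switch it off and feed the resulting trace file to tkprof
exec dbms_monitor.session_trace_disable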

The tkprof output makes it clear where the problem lies…

select it.id, it.worry, it.record_type
from interesting_times it
where record_type =
nvl((
select 'A'
from interesting_times it1
where it1.record_type = 'A'
and it1.id = it.id), 'O')
order by 1
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.02          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     6668   1183.16    1191.72          0   41003158          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670   1183.17    1191.75          0   41003158          0      100000
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 48
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
100000     100000     100000  SORT ORDER BY (cr=41003158 pr=0 pw=0 time=1191452315 us cost=10216576 size=24 card=1)
100000     100000     100000   FILTER  (cr=41003158 pr=0 pw=0 time=1189848194 us)
101000     101000     101000    TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=0 pw=0 time=115893 us cost=116 size=2424000 card=101000)
1992       1992       1992    TABLE ACCESS FULL INTERESTING_TIMES (cr=41002752 pr=0 pw=0 time=1189610376 us cost=113 size=7 card=1)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
SQL*Net message to client                    6668        0.00          0.01
resmgr:cpu quantum                             16        0.13          0.70
SQL*Net message from client                  6668        0.02          6.32
********************************************************************************

The FILTER step re-executes the correlated sub-query’s full scan of INTERESTING_TIMES for virtually every row – hence the 41 million buffer gets against a table that a single scan reads in just 406. Time to rethink our Report query…

It’s row_number(), not ROWNUM

The row_number() analytical function may well come in handy here. Consider…

select id, worry, record_type,
row_number() over( partition by id order by (case nvl( record_type, 'O') when 'A' then 1 else 2 end)) as recnum
from interesting_times
where id between 99995 and 100000
/
        ID WORRY                          R     RECNUM
---------- ------------------------------ - ----------
     99995 North Korean Missile Exports   O          1
     99996 Brexit                         O          1
     99997 Global Warming                 O          1
     99998 North Korean Missile Exports   O          1
     99999 Brexit                         O          1
    100000 England's top order            A          1
    100000 Global Warming                 O          2

7 rows selected.

Because ‘A’ records sort first within each id, the latest version of a record is always the one with a recnum of 1. Using row_number() to order records for each id, we can now eliminate the correlated sub-query from our report…

with ilv as
(
select id, worry, record_type,
row_number() over( partition by id order by case nvl(record_type, 'O') when 'A' then 1 else 2 end) as recnum
from interesting_times
)
select id, worry, record_type
from ilv
where recnum = 1
/

…and experience a stunning improvement in performance…

...
        ID WORRY                                    R
---------- ---------------------------------------- -
     99994 Global Warming                           O
     99995 North Korean Missile Exports             O
     99996 Brexit                                   O
     99997 Global Warming                           O
     99998 North Korean Missile Exports             O
     99999 Brexit                                   O
    100000 England's top order                      A

100000 rows selected.

Elapsed: 00:00:07.89

Yes, you are reading that right – that’s 7.89 seconds. And no, I didn’t forget to flush the cache between running the original report and this new version.
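For completeness, clearing the caches between runs looks something like this – my addition, and strictly a playground-database activity, as it requires the ALTER SYSTEM privilege :

alter system flush buffer_cache;
alter system flush shared_pool;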

Now, if that was all there was to it, this would be a rather short post. However…

Hubble, Bubble, Toil And Trouble…

What happens if we just re-write the query with an in-line view and don’t bother with analytical functions at all ?

with latest_records as
(
select id,
max( case record_type when 'A' then 2 else 1 end) as recnum
from interesting_times
group by id
)
select it.id, it.worry, it.record_type
from interesting_times it
inner join latest_records lr
on lr.id = it.id
and lr.recnum = case it.record_type when 'A' then 2 else 1 end
order by id
/
...
        ID WORRY                                    R
---------- ---------------------------------------- -
     99994 Global Warming                           O
     99995 North Korean Missile Exports             O
     99996 Brexit                                   O
     99997 Global Warming                           O
     99998 North Korean Missile Exports             O
     99999 Brexit                                   O
    100000 England's top order                      A

100000 rows selected.

Elapsed: 00:00:08.06

Yep, the performance is virtually identical to using row_number().

Looking at the respective execution plans provides a clue as to why this is.

First, the row_number query :

with ilv as
(
select id, worry, record_type,
row_number() over( partition by id order by case nvl(record_type, 'O') when 'A' then 1 else 2 end) as recnum
from interesting_times
)
select id, worry, record_type
from ilv
where recnum = 1
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.04          0          0          0           0
Execute      1      0.00       0.01          0          0          0           0
Fetch     6668      0.58       0.63        404        406          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670      0.59       0.69        404        406          0      100000
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 48
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
100000     100000     100000  VIEW  (cr=406 pr=404 pw=0 time=403080 us cost=830 size=8080000 card=101000)
101000     101000     101000   WINDOW SORT PUSHED RANK (cr=406 pr=404 pw=0 time=361666 us cost=830 size=2424000 card=101000)
101000     101000     101000    TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=404 pw=0 time=43133 us cost=113 size=2424000 card=101000)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
SQL*Net message to client                    6668        0.00          0.01
db file sequential read                         1        0.00          0.00
db file scattered read                         19        0.01          0.02
SQL*Net message from client                  6668        0.01          6.45
********************************************************************************

…and now the straight in-line view…

with latest_records as
(
select id,
max( case record_type when 'A' then 2 else 1 end) as recnum
from interesting_times
group by id
)
select it.id, it.worry, it.record_type
from interesting_times it
inner join latest_records lr
on lr.id = it.id
and lr.recnum = case it.record_type when 'A' then 2 else 1 end
order by id
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     6668      0.63       0.65        404        812          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670      0.63       0.66        404        812          0      100000
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 48
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
100000     100000     100000  SORT ORDER BY (cr=812 pr=404 pw=0 time=421638 us cost=852 size=32320 card=1010)
100000     100000     100000   HASH JOIN  (cr=812 pr=404 pw=0 time=329296 us cost=851 size=32320 card=1010)
100000     100000     100000    VIEW  (cr=406 pr=404 pw=0 time=174590 us cost=466 size=806592 card=100824)
100000     100000     100000     HASH GROUP BY (cr=406 pr=404 pw=0 time=145304 us cost=466 size=705768 card=100824)
101000     101000     101000      TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=404 pw=0 time=24266 us cost=113 size=707000 card=101000)
101000     101000     101000    TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=0 pw=0 time=21423 us cost=113 size=2424000 card=101000)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
SQL*Net message to client                    6668        0.00          0.01
db file sequential read                         1        0.00          0.00
db file scattered read                         19        0.00          0.00
asynch descriptor resize                        3        0.00          0.00
SQL*Net message from client                  6668        0.01          6.73
********************************************************************************

Essentially, the Optimizer “materializes” the in-line view – that is, it evaluates it once as a whole (the VIEW and HASH GROUP BY steps in the plan) and hash-joins the result back to the table, rather than re-executing a sub-query for every row. The row_number() version benefits in exactly the same way, its WINDOW SORT being a single pass over the table.
It turns out, then, that it’s actually the humble in-line view – evaluated once rather than row-by-row – that turbo-charges the query, rather than the nifty row_number() analytical function.
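Incidentally, if you’d rather not wrangle trace files, you can eyeball the actual execution plan straight from SQL*Plus. A sketch of one way to do it : re-run the report with the /*+ gather_plan_statistics */ hint added to the outer select, then ask dbms_xplan for the last cursor executed in the session :

-- show the actual (not just estimated) row counts for the last statement run
select *
from table( dbms_xplan.display_cursor( format => 'ALLSTATS LAST'))
/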

That’s one less problem I’ve got to deal with. Now to figure out England’s top order for Brisbane. Hmmm, I may need to resort to DBMS_RANDOM…

