In-Line Views versus Correlated Sub-queries – tuning adventures in a Data Warehouse Report

Posted on September 30, 2017

Events have taken a worrying turn recently. I’m not talking about Kim Jong Un’s expensive new hobby, although, if his parents had bought him that kite when he was seven…
I’m not talking about the UK’s chief Brexit negotiator David Davis quoting directly from the Agile Manifesto and claiming that the British were “putting people before process” in the negotiations, although Agile as a negotiating strategy is rather…untested.
I’m not even talking about the sunny Bank Holiday Monday we had in England recently, even though this may be a sign of Global Warming (or possibly a portent for the end of days).
The fact is, we have an Ashes series coming up this winter and England still haven’t managed to find a top order that doesn’t collapse like a cheap deckchair in a light breeze.

On top of that, what started out as a relatively simple post – effectively a note to myself about using the row_number analytical function to overcome a recent performance glitch in a Data Warehouse Application – also seems to have developed an unexpected complication…

The Application

I’ve taken the liberty of stripping the application down to its essentials for illustrative purposes.

The application accepts a bulk upload of data into a table called INTERESTING_TIMES, for which the DDL looks like this :

create table interesting_times
(
id number,
worry varchar2(4000),
record_type varchar2(1)
)
/

Users may then amend individual records via some kind of UI which ultimately calls :

create or replace procedure adjust_interesting_times(
i_id interesting_times.id%type,
i_worry interesting_times.worry%type)
as
begin
merge into interesting_times
using dual
on ( id = i_id and record_type = 'A')
when matched then update
set worry = i_worry
when not matched then  insert( id, worry, record_type)
values(i_id, i_worry, 'A');
end;
/

That’s right, when an amendment is made, the application does not perform an update. Instead, it creates a new record with a RECORD_TYPE of A(mended), preserving the O(riginal) record.
For reasons unknown, it’s apparently OK to update an existing A(mended) record.
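By way of illustration – this call is my own hypothetical example against the procedure above, rather than anything from the original application – amending record 100 for the first time would insert a new ‘A’ row, whilst running it again would simply update that row :

exec adjust_interesting_times( 100, q'[England's top order]')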
Periodically, a report is run to list the latest version of each record in the table.

The report is as follows :

select it.id, it.worry, it.record_type
from interesting_times it
where record_type =
nvl((
select 'A'
from interesting_times it1
where it1.record_type = 'A'
and it1.id = it.id), 'O')
order by 1
/

It’s saved in a file called worry_list.sql. For each row, the correlated sub-query returns ‘A’ if an amended version of that record exists; otherwise it returns NULL, which the NVL defaults to ‘O’. The net effect is that only the latest version of each record passes the filter.

During application testing, there was no problem. The table was populated with test records…

insert into interesting_times( id, worry, record_type)
select level as id,
case mod(level, 3)
when 2 then 'North Korean Missile Exports'
when 1 then 'Global Warming'
else 'Brexit'
end as worry,   
'O' as record_type
from dual
connect by level <= 100
union all
select level,
q'[England's top order]',
'A'
from dual
where mod(level, 100) = 0
connect by level <= 100
/
commit;
exec dbms_stats.gather_table_stats(user, 'INTERESTING_TIMES')
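As a quick sanity check – my addition rather than part of the original post – we can confirm the shape of the test data, which should come back as 100 ‘O’ records and a single ‘A’ record :

select record_type, count(*)
from interesting_times
group by record_type
order by record_type
/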

With the data in place, the report ran nice and quickly…

set timing on
@worry_list.sql
set timing off
...
        ID WORRY                                    R
---------- ---------------------------------------- -
        97 Global Warming                           O
        98 North Korean Missile Exports             O
        99 Brexit                                   O
       100 England's top order                      A

100 rows selected.

Elapsed: 00:00:00.053

However, in production, data volumes are (or have grown to be) somewhat more significant…

truncate table interesting_times
/
insert into interesting_times( id, worry, record_type)
select level as id,
case mod(level, 3)
when 2 then 'North Korean Missile Exports'
when 1 then 'Global Warming'
else 'Brexit'
end as worry,   
'O' as record_type
from dual
connect by level <= 100000
union all
select level,
q'[England's top order]',
'A'
from dual
where mod(level, 100) = 0
connect by level <= 100000
/
commit;
exec dbms_stats.gather_table_stats(user, 'INTERESTING_TIMES')

…and with 101,000 rows to wade through (100,000 originals plus 1,000 amendments), the application now has some performance issues…

...
        ID WORRY                                    R
---------- ---------------------------------------- -
     99995 North Korean Missile Exports             O
     99996 Brexit                                   O
     99997 Global Warming                           O
     99998 North Korean Missile Exports             O
     99999 Brexit                                   O
    100000 England's top order                      A

100000 rows selected.

Elapsed: 00:19:58.67

Yes, the runtime has mushroomed to almost 20 minutes and users are not happy.
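If you want to reproduce the trace for yourself, something along these lines should do it. This is a sketch, and an assumption about how the figures were gathered – it isn’t taken from the original post :

-- switch on extended SQL trace for the current session...
alter session set tracefile_identifier = 'worry_list';
exec dbms_monitor.session_trace_enable( waits => true, binds => false)

@worry_list.sql

-- ...then switch it off and feed the resulting trace file to tkprof
exec dbms_monitor.session_trace_disable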

The tkprof output makes it clear where the problem lies…

select it.id, it.worry, it.record_type
from interesting_times it
where record_type =
nvl((
select 'A'
from interesting_times it1
where it1.record_type = 'A'
and it1.id = it.id), 'O')
order by 1
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.02          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     6668   1183.16    1191.72          0   41003158          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670   1183.17    1191.75          0   41003158          0      100000
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 48
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
100000     100000     100000  SORT ORDER BY (cr=41003158 pr=0 pw=0 time=1191452315 us cost=10216576 size=24 card=1)
100000     100000     100000   FILTER  (cr=41003158 pr=0 pw=0 time=1189848194 us)
101000     101000     101000    TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=0 pw=0 time=115893 us cost=116 size=2424000 card=101000)
1992       1992       1992    TABLE ACCESS FULL INTERESTING_TIMES (cr=41002752 pr=0 pw=0 time=1189610376 us cost=113 size=7 card=1)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
SQL*Net message to client                    6668        0.00          0.01
resmgr:cpu quantum                             16        0.13          0.70
SQL*Net message from client                  6668        0.02          6.32
********************************************************************************

The FILTER step re-executes the correlated sub-query’s full scan of INTERESTING_TIMES for virtually every row – hence the 41 million buffer gets against a table that a single scan reads in just 406. Time to rethink our Report query…

It’s row_number(), not ROWNUM

The row_number() analytical function may well come in handy here. Consider…

select id, worry, record_type,
row_number() over( partition by id order by (case nvl( record_type, 'O') when 'A' then 1 else 2 end)) as recnum
from interesting_times
where id between 99995 and 100000
/
        ID WORRY                          R     RECNUM
---------- ------------------------------ - ----------
     99995 North Korean Missile Exports   O          1
     99996 Brexit                         O          1
     99997 Global Warming                 O          1
     99998 North Korean Missile Exports   O          1
     99999 Brexit                         O          1
    100000 England's top order            A          1
    100000 Global Warming                 O          2

7 rows selected.

Because ‘A’ records sort first within each id, the latest version of a record is always the one with a recnum of 1. Using row_number() to order records for each id, we can now eliminate the correlated sub-query from our report…

with ilv as
(
select id, worry, record_type,
row_number() over( partition by id order by case nvl(record_type, 'O') when 'A' then 1 else 2 end) as recnum
from interesting_times
)
select id, worry, record_type
from ilv
where recnum = 1
/

…and experience a stunning improvement in performance…

...
        ID WORRY                                    R
---------- ---------------------------------------- -
     99994 Global Warming                           O
     99995 North Korean Missile Exports             O
     99996 Brexit                                   O
     99997 Global Warming                           O
     99998 North Korean Missile Exports             O
     99999 Brexit                                   O
    100000 England's top order                      A

100000 rows selected.

Elapsed: 00:00:07.89

Yes, you are reading that right – that’s 7.89 seconds. And no, I didn’t forget to flush the cache between running the original report and this new version.
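For completeness, clearing the caches between runs looks something like this – my addition, and strictly a playground-database activity, as it requires the ALTER SYSTEM privilege :

alter system flush buffer_cache;
alter system flush shared_pool;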

Now, if that was all there was to it, this would be a rather short post. However…

Hubble, Bubble, Toil And Trouble…

What happens if we just re-write the query with an in-line view and don’t bother with analytical functions at all ?

with latest_records as
(
select id,
max( case record_type when 'A' then 2 else 1 end) as recnum
from interesting_times
group by id
)
select it.id, it.worry, it.record_type
from interesting_times it
inner join latest_records lr
on lr.id = it.id
and lr.recnum = case it.record_type when 'A' then 2 else 1 end
order by id
/
...
        ID WORRY                                    R
---------- ---------------------------------------- -
     99994 Global Warming                           O
     99995 North Korean Missile Exports             O
     99996 Brexit                                   O
     99997 Global Warming                           O
     99998 North Korean Missile Exports             O
     99999 Brexit                                   O
    100000 England's top order                      A

100000 rows selected.

Elapsed: 00:00:08.06

Yep, the performance is virtually identical to using row_number().

Looking at the respective execution plans provides a clue as to why this is.

First, the row_number query :

with ilv as
(
select id, worry, record_type,
row_number() over( partition by id order by case nvl(record_type, 'O') when 'A' then 1 else 2 end) as recnum
from interesting_times
)
select id, worry, record_type
from ilv
where recnum = 1
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.04          0          0          0           0
Execute      1      0.00       0.01          0          0          0           0
Fetch     6668      0.58       0.63        404        406          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670      0.59       0.69        404        406          0      100000
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 48
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
100000     100000     100000  VIEW  (cr=406 pr=404 pw=0 time=403080 us cost=830 size=8080000 card=101000)
101000     101000     101000   WINDOW SORT PUSHED RANK (cr=406 pr=404 pw=0 time=361666 us cost=830 size=2424000 card=101000)
101000     101000     101000    TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=404 pw=0 time=43133 us cost=113 size=2424000 card=101000)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
SQL*Net message to client                    6668        0.00          0.01
db file sequential read                         1        0.00          0.00
db file scattered read                         19        0.01          0.02
SQL*Net message from client                  6668        0.01          6.45
********************************************************************************

…and now the straight in-line view…

with latest_records as
(
select id,
max( case record_type when 'A' then 2 else 1 end) as recnum
from interesting_times
group by id
)
select it.id, it.worry, it.record_type
from interesting_times it
inner join latest_records lr
on lr.id = it.id
and lr.recnum = case it.record_type when 'A' then 2 else 1 end
order by id
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     6668      0.63       0.65        404        812          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670      0.63       0.66        404        812          0      100000
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 48
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
100000     100000     100000  SORT ORDER BY (cr=812 pr=404 pw=0 time=421638 us cost=852 size=32320 card=1010)
100000     100000     100000   HASH JOIN  (cr=812 pr=404 pw=0 time=329296 us cost=851 size=32320 card=1010)
100000     100000     100000    VIEW  (cr=406 pr=404 pw=0 time=174590 us cost=466 size=806592 card=100824)
100000     100000     100000     HASH GROUP BY (cr=406 pr=404 pw=0 time=145304 us cost=466 size=705768 card=100824)
101000     101000     101000      TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=404 pw=0 time=24266 us cost=113 size=707000 card=101000)
101000     101000     101000    TABLE ACCESS FULL INTERESTING_TIMES (cr=406 pr=0 pw=0 time=21423 us cost=113 size=2424000 card=101000)
Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
SQL*Net message to client                    6668        0.00          0.01
db file sequential read                         1        0.00          0.00
db file scattered read                         19        0.00          0.00
asynch descriptor resize                        3        0.00          0.00
SQL*Net message from client                  6668        0.01          6.73
********************************************************************************

Essentially, the Optimizer “materializes” the in-line view – that is, it evaluates it once as a whole (the VIEW and HASH GROUP BY steps in the plan) and hash-joins the result back to the table, rather than re-executing a sub-query for every row. The row_number() version benefits in exactly the same way, its WINDOW SORT being a single pass over the table.
It turns out, then, that it’s actually the humble in-line view – evaluated once rather than row-by-row – that turbo-charges the query, rather than the nifty row_number() analytical function.
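Incidentally, if you’d rather not wrangle trace files, you can eyeball the actual execution plan straight from SQL*Plus. A sketch of one way to do it : re-run the report with the /*+ gather_plan_statistics */ hint added to the outer select, then ask dbms_xplan for the last cursor executed in the session :

-- show the actual (not just estimated) row counts for the last statement run
select *
from table( dbms_xplan.display_cursor( format => 'ALLSTATS LAST'))
/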

That’s one less problem I’ve got to deal with. Now to figure out England’s top order for Brisbane. Hmmm, I may need to resort to DBMS_RANDOM…

