Archive for April, 2011

Analyzing Oracle Data Warehouse with Analytic Functions II.

April 29, 2011

Window and Report Aggregate Family

Window aggregate functions operate on an ordered set of rows and for each row they return an aggregate value for a window rooted at that row.
The window can be unbounded, partially unbounded or bounded. The size of the window can be specified by physically (eg. With number of rows, ROWS) or logically (typically with date/time values, RANGE).

There are several keywords available:

CURRENT ROW – specifies that window starts or ends at the current row

UNBOUNDED PRECEDING – specifies that the window starts at the first row of the partition.

UNBOUNDED FOLLOWING – specifies that the window ends at the last row of the partition.
If we don’t specify explicitly the one of the boundaries the CURRENT ROW is the default.
Example 1:
We want to create a cummulated aggregation (sometimes called running aggregates) for the salaries for the whole table and for each department:

SELECT department_id,employee_id, last_name, salary, 
SUM(salary) 
OVER (ORDER BY salary DESC RANGE UNBOUNDED PRECEDING) cum_range,
SUM(salary) 
OVER (PARTITION BY department_id 
ORDER BY salary DESC RANGE UNBOUNDED PRECEDING) 
AS cum_range_for_depts,
SUM(salary) 
OVER (ORDER BY salary desc ROWS UNBOUNDED PRECEDING)  cum_rows,
SUM(salary) 
OVER (PARTITION BY department_id 
ORDER BY salary DESC ROWS UNBOUNDED PRECEDING)  cum_rows_for_depts
FROM employees
 --ORDER BY department_id ;


Example 2:
Now we define an increasing, decreasing and fix sized window for each department:

SELECT department_id, last_name, salary,
SUM(salary) 
OVER (PARTITION BY department_id ORDER BY salary DESC
ROWS UNBOUNDED PRECEDING)  AS cum_sum,
SUM(salary) 
OVER (PARTITION BY department_id ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  till_current,
AVG(salary) OVER (PARTITION BY department_id ORDER BY salary DESC
RANGE UNBOUNDED PRECEDING) cum_avg , AVG(salary)
OVER ( PARTITION BY department_id 
ORDER BY salary DESC 
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) avg_sal_rest,
AVG(salary)   OVER (PARTITION BY department_id) AS dept_avg
FROM employees


Example 3:
We want to select to 5 most popular products and running total for each quarter in each year.

SELECT CASE WHEN t.rank=1 THEN t.year END year, t.quarter, t.product, 
t.amount, t.quantity, t.rsum_for_quarter, t.rank rank_for_amount
FROM
(SELECT t.calendar_year year,t.calendar_quarter_desc quarter, 
p.prod_name product, SUM(s.amount_sold) amount, 
SUM(SUM(s.amount_sold))
OVER (PARTITION BY t.calendar_year,t.calendar_quarter_desc
ORDER BY SUM(s.amount_sold) DESC RANGE UNBOUNDED PRECEDING) 
rsum_for_quarter, 
SUM(s.quantity_sold) quantity,
RANK()  OVER (PARTITION BY t.calendar_year , t.calendar_quarter_desc
ORDER BY SUM(s.amount_sold) DESC ) rank
FROM sh.sales s, sh.times t, sh.products p
WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id
GROUP BY t.calendar_year,t.calendar_quarter_desc,p.prod_name) t
WHERE t.rank < 5;


In the SUM(SUM(s.amount_sold)) expression the inner SUM is standard group function, which is the argument of the outer SUM that is an analytical function.
Example 4:
Many times we want to see each element’s contribution (rate) for the total value.
Also called percentages or sharing value.
In the following example we compute the percentages of each  salary of subtotal (department level) of grand total:

SELECT department_id, last_name, salary,
RATIO_TO_REPORT(salary)
OVER (PARTITION BY department_id)  AS rate_for_dept,
RATIO_TO_REPORT(salary) OVER ()   AS rate_for_total
FROM employees;


Example 5:
Now we want to compare each employee’s salary to the lowest anf highest salaries of his/her department using the FIRST_VALUE and LAST_VALUE functions.

SELECT t.*,t.salary/t.lowest_sal  rate_for_lowest_sal,
t.salary/t.highest_sal AS rate_for_highest_sal
FROM ( SELECT department_id , last_name, salary,
FIRST_VALUE(salary)
OVER ( PARTITION BY department_id ORDER BY salary
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) lowest_sal,
LAST_VALUE(salary) OVER ( PARTITION BY department_id ORDER BY salary
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) highest_sal
FROM employees 
ORDER BY department_id) t


Example 6:
Many times we need to compute moving values(MAX,SUM,AVG). In this example we calculate moving sum for better understanding:

SELECT last_name, hire_date, salary,
SUM(salary)  
OVER (ORDER BY hire_date RANGE BETWEEN NUMTOYMINTERVAL(3,'MONTH') 
PRECEDING AND NUMTOYMINTERVAL(3,'MONTH') FOLLOWING) AS mov_sum_3_months,
SUM(salary)  
OVER (ORDER BY hire_date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) 
AS mov_sum_3_rows
FROM employees;


Observe the difference between RANGE and ROWS.
With RANGE we specified a six months window, while with ROWS we created a 7 row width window.


To be continued…

Advertisements

Efficient analysis with IN operator

April 28, 2011

SQL developers frequently hear and realize that the subqueries are not efficient,
avoid using them.
Indeed, many times those SELECT statements, which contains subqueries
in the WHERE or HAVING clause are not very efficient.

But this not a general rule!

I'd like to demonstrate that subqueries with IN operator and
correlated subqueries can be very efficient.

Let see the next example:
 
SELECT last_name, department_id, job_id 
FROM employees
WHERE department_id IN 
(SELECT department_id FROM departments WHERE location_id = 1700);

Here is the execution plan for the SELECT statement:

 
SQL_ID  gm32t8f65882q, child number 0
-------------------------------------
SELECT last_name, department_id, job_id FROM   employees WHERE
department_id IN (SELECT department_id FROM departments WHERE  location_id = 1700)
Plan hash value: 1021246405
--------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |       |       |     6 (100)|          |
|   1 |  NESTED LOOPS                |                   |       |       |            |          |
|   2 |   NESTED LOOPS               |                   |    37 |   814 |     6   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL         | DEPARTMENTS       |     4 |    20 |     4   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | EMP_DEPARTMENT_IX |    10 |       |     0   (0)|          |
|   5 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |    10 |   170 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("LOCATION_ID"=1700)
4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
====================================================================================================

Let us see the second solution and the execution plan!

 
SELECT last_name, e.department_id, job_id
FROM employees e,departments d
WHERE e.department_id =d.department_id and location_id = 1700;
 
SQL_ID  gq10rm3ypv7v2, child number 0
-------------------------------------
SELECT last_name, e.department_id, job_id FROM   employees
e,departments d WHERE  e.department_id =d.department_id and
location_id = 1700
Plan hash value: 1021246405
--------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |       |       |     6 (100)|          |
|   1 |  NESTED LOOPS                |                   |       |       |            |          |
|   2 |   NESTED LOOPS               |                   |    37 |   814 |     6   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL         | DEPARTMENTS       |     4 |    20 |     4   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | EMP_DEPARTMENT_IX |    10 |       |     0   (0)|          |
|   5 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |    10 |   170 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("LOCATION_ID"=1700)
4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
====================================================================================================

We can realize that both quires use the same execution plan (The plan Hash values are same!)
so they are same efficient!

Analyzing Oracle Data Warehouse with Analytic Functions I.

April 28, 2011

Introduction. The Rank family

 

The traditional SQL statements and functions do not provide strong support for analytic tasks. Basic business intelligence calculations such as moving averages, rankings and lead/lag comparisons have required extensive programming outside of standard SQL , often with performance challenges.

From Oracle 8.1.6 Oracle provides powerful new family of functions which are useful for analyzing data warehouses or simple big data sets.

Oracle has created four families of analytic functions

  • Ranking family: RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST
  • Window Aggregate family: AVG, SUM, MIN, MAX, COUNT, VARIANCE and STDDEV (typically with fix sized window)
  • Reporting Aggregate family: AVG, SUM, MIN, MAX, COUNT, VARIANCE and STDDEV , RATIO_TO_REPORT (typically with varying sized window)
  • LAG/LEAD family: LAG  and LEAD

It is important to understand the processing order of the functionalyties of the analytic SQL statements

  • The joins will be executed
  • Evaluating the WHERE, GROUP BY, HAVING clauses
  • Partitions created
  • Analytic functions applied to each row in each partition
  • ORDER BY is processed to allow for precise output ordering.

In the following examples we will use the the sample schemas (HR,SH) which are available in all Oracle database.

Ranking family:

Example 1:

Let’s suppose we want rank the salaries in the employees table.

Issuing the following SELECT

SELECT ROWNUM,last_name,salary from
(SELECT * FROM EMPLOYEES ORDER BY salary DESC) t
WHERE ROWNUM<=&counter;

for counter=2  :

1 KING               24000

2 KOCHHAR      17000

for counter=3  :

 1 KING               24000

2 KOCHHAR      17000

3 DE HAAN        17000

Which are the ranks for Kochhar and De Haan?
Solution:

SELECT last_name,salary,
RANK() OVER( ORDER BY salary DESC) normal_rank,
DENSE_RANK() OVER( ORDER BY salary desc) dense_rank
FROM employees;

Example 2:
Let’s see the top 5 best products regarding the revenue for each year:

SELECT t.*
FROM (select t.calendar_year year , p.prod_name product, SUM(s.amount_sold), SUM(s.quantity_sold),
RANK()  OVER (PARTITION BY t.calendar_year ORDER BY SUM(s.amount_sold) DESC ) rank
FROM sh.sales s, sh.times t ,sh.products p
WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id
GROUP BY t.calendar_year,p.prod_name) t
WHERE t.rank<=5;

Example 3:

We can retrieve the 4 best products regarding the revenue for each product category:

SELECT t.*
FROM (select p.prod_category,t.calendar_quarter_desc quarter  , ROUND(SUM(s.amount_sold)) amount ,SUM(s.quantity_sold) quantity,
RANK()  OVER (PARTITION BY P.prod_category
ORDER BY SUM(s.amount_sold) DESC ) rank
FROM sh.sales s, sh.times t ,sh.products p
WHERE s.time_id=t.time_id AND s.prod_id=P.prod_id
GROUP BY p.prod_category,calendar_quarter_desc ) t
WHERE t.rank<=4
/

Example 4:
Now we consider the cumulative distribution of the salaries for each departments:

SELECT department_id, last_name, last_name, salary,
CUME_DIST() OVER(PARTITION BY department_id ORDER BY  Salary)
FROM employees;

Example 5:
We create ranks and subtotals  for amounts in  each year:

SELECT t.*
FROM (select t.calendar_year year , p.prod_name  product , SUM(s.amount_sold) amount,SUM(s.quantity_sold),
RANK()  OVER (PARTITION BY t.calendar_year ORDER BY SUM(s.amount_sold) DESC ) -1 rank,GROUPING(p.prod_name) grp_id
FROM sh.sales s, sh.times t ,sh.products p
WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id
GROUP BY ROLLUP ( t.calendar_year,p.prod_name)) t
WHERE t.rank<=5
ORDER BY year,grp_id,amount DESC
/

Example 6:
The ROW_NUMBER function – which is not equivalent with neither ROWNUM nor RANK – assigns a unique number, in the sequence defined by the ORDER BY clause, to each row within the partition.

SELECT  ROW_NUMBER() OVER(ORDER BY department_id) row_number , ROWNUM ,
ROW_NUMBER() OVER( PARTITION BY department_id ORDER BY salary DESC) dept_row_number,
department_id,last_name,salary,
RANK() OVER( PARTITION BY department_id ORDER BY salary DESC) normal_rank,
DENSE_RANK() OVER( PARTITION BY department_id ORDER BY salary DESC) dense_rank
FROM employees
/
To be continued…