Synopsis. Oracle 10g has extended the already robust features of Structured Query Language (SQL) with a plethora of intriguing new capabilities. This article - the second in this series - reviews the new capabilities Oracle 10g provides that, in concert with analytical functions, permit the creation of "densified" output with a few simple queries.
The previous article in this series reviewed several new SQL features in Oracle 10g, including improved query and access methods for nested tables, upgrades to the MERGE command, and hierarchical query enhancements. Impressive as these new capabilities are, Oracle 10g's new data modeling features allow any Oracle developer or DBA to produce sophisticated, complex calculations using just the database engine, thus eliminating the need to massage data output via a third-party tool. Many of the new Oracle 10g query features depend upon the analytical functions established in Oracle 9i, so I will first delve into those features before discussing partitioned outer joins and data densification.
I have been using SQL since 1985, and in my humble but biased opinion, Oracle's extensions to standard ANSI SQL are superior to those of its competitors. Oracle 9i added a complete suite of analytical functions to provide some impressive reporting and calculation capabilities.
Just about every major system I have worked on has at least one historical table, i.e. a table that stores multiple historical entries to track changes to a particular data domain over time. A good example of an historical table is the JOB_HISTORY table in the HR demo schema, which contains multiple rows for each employee detailing that employee's job assignments over time. A typical reporting requirement usually required against historical tables is the retrieval of either the least or most recent entry- for example, the least recent job assignment for a specific employee based on the employee's start date in that position.
If I were using another DBMS, I would probably have little choice but to construct the following SQL query and utilize a subquery to find the minimum starting date, and then use the result from that subquery to isolate the remainder of the employee's job history information:
SQL> TTITLE 'Earliest Employee Job History Entry (Traditional Method)'
SQL> COL employee_id FORMAT 9999 HEADING 'EmpID'
SQL> COL full_name FORMAT A24 HEADING 'Employee Name'
SQL> COL job_id FORMAT A10 HEADING 'JobID'
SQL> COL start_date FORMAT A10 HEADING 'Start Date'
SQL> COL end_date FORMAT A10 HEADING 'End Date'
SQL>
SQL> SELECT
2 E.employee_id
3 ,E.last_name || ', ' || E.first_name full_name
4 ,JH.job_id
5 ,TO_CHAR(JH.start_date, 'mm-dd-yyyy') start_date
6 ,TO_CHAR(JH.end_date, 'mm-dd-yyyy') end_date
7 FROM
8 hr.employees E
9 ,hr.job_history JH
10 WHERE E.employee_id = JH.employee_id
11 AND JH.start_date = (
12 SELECT MIN(start_date)
13 FROM hr.job_history
14 WHERE employee_id = 114)
15 AND JH.employee_id = 114;
Earliest Employee Job History Entry (Traditional Method)
EmpID Employee Name JobID Start Date End Date
----- ------------------------ ---------- ---------- ----------
114 Raphaely, Den ST_CLERK 03-24-1998 12-31-1999
As of Oracle 9i, however, I have an alternative to this method: I can use an Oracle analytical function to return just the first entry in the list of job history entries for the specific employee. Briefly, here is how analytical functions perform their tasks:
Execution Order. A SQL statement that uses an analytical function first executes all joins and then processes all WHERE, GROUP BY, and HAVING statements. Oracle then hands off the result set to the analytical function for calculation and processing, and then finally any query-level ORDER BY statements are processed.
Partitioning of the Result Set. If the analytical function's interior query has any GROUP BY statements, Oracle will logically divide the result set into partitions (not to be confused with partitioned tables or indexes) based on the columns specified. A single partition can be as large as all the rows in the result set, or the partitions can be relatively tiny, depending on the chosen columns. An analytical function can operate against aggregated values in each result set partition.
The Current Row and "Windowing." As the analytical function processes each row in the result set partition, that row is marked as the current row. The current row becomes a reference point for any windowing operations. Windowing operations occur whenever an analytical function defines a sliding window to limit the selection of rows within a result set partition. This means that analytical processing can be limited to a range of rows based on a displacement of a specific number of rows from the current row in the result set partition (e.g., plus or minus 3 rows from the current row). Analytical processing can also be based on specified inclusion criteria.
Analytical functions are typically used heavily in decision-support and data warehousing reporting because they are extremely powerful for calculating rankings and percentiles within a result set, as well as calculating moving and cumulative aggregations based on a displacement from a specific row in the result set. It is also possible to calculate lagging and leading values within a partition, or first and last values within a partition. Here is an example of how to return the same results using Oracle 9i's FIRST_VALUE analytical function:
SQL> TTITLE 'Earliest Employee Job History Entry (Analytic Function)'
SQL> COL eid FORMAT 9999 HEADING 'EmpID'
SQL> COL name FORMAT A24 HEADING 'Employee Name'
SQL> COL jobid FORMAT A10 HEADING 'JobID'
SQL> COL begin_date FORMAT A10 HEADING 'Start Date'
SQL> COL stop_date FORMAT A10 HEADING 'End Date'
SQL>
SQL> SELECT
2 eid
3 ,NAME
4 ,jobid
5 ,begin_date
6 ,stop_date
7 FROM (
8 SELECT
9 JH.employee_id eid
10 ,E.last_name || ', ' || E.first_name NAME
11 ,JH.job_id jobid
12 ,JH.START_DATE begin_date
13 ,JH.end_date stop_date
14 ,FIRST_VALUE(JH.start_date) OVER (
15 ORDER BY JH.employee_id, JH.start_date
16 ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
17 ) recent_history
18 FROM
19 hr.employees E
20 ,hr.job_history JH
21 WHERE E.employee_id = JH.employee_id
22 AND JH.employee_id IN (114)
23 )
24 WHERE begin_date = recent_history
25 ORDER BY eid;
Earliest Employee Job History Entry (Analytic Function)
EmpID Employee Name JobID Start Date End Date
----- ------------------------ ---------- ---------- ----------
114 Raphaely, Den ST_CLERK 03/24/1998 12/31/1999
In this example, note that the FIRST_VALUE analytical function indicates the retrieval of a single result set partition, sorted within Employee ID and Start Date based on its ORDER BY clause. The ROWS directive tells Oracle to use all entries in each result set partition to determine which entry has the first (and therefore earliest) start date in each partition.
Listing 2.1 shows the DML statements I issued to add rows into the original JOB_HISTORY table to demonstrate these queries and the original queries themselves. An expanded query that shows how the PARTITION BY clause could be used within an analytical query to return the first job history entry for each employee is shown in Listing 2.2. One morning I arrived early at my client site to find a
developer running multiple SQL queries, then feverishly transferring the
resulting output into a Microsoft Excel spreadsheet. When I asked her the
reason for her frantic activity, she told me she needed to perform some complex
financial calculations on the results, but did not know how to perform them
from within Oracle. I took an informal poll of the developers I supported, and I
found out that many of them were still creating queries to extract raw data
from tables and/or views, but then dumping that output to comma-delimited files
and then into a spreadsheet so that the raw data could be manipulated via
Excel's analytical functions. Some typical requirements these developers were
pursuing included these scenarios: The good news is that Oracle 10g has beefed up analytical
function processing so that all these scenarios can be handled by simply
running an appropriate query directly against the database. I will cover the
first scenario in the following sections, and I will tackle the last two
scenarios in my next article when I cover the MODEL clause in depth. Oracle 10g offers a new type of join, a partitioned outer
join, that not only lets me join disparate sets of data, but also allows me
to fill in "gaps" in the result sets because some data might be
missing. For example, even though sales of a particular product may exist in
most of the states in the Midwest, it is entirely possible that no sales exist
in a particular calendar year for that product because of the time period in
which it was introduced. Oracle calls the concept of filling in missing data with
partitioned outer joins data densification. To illustrate, I will create
a new view, SH.CALENDAR_YEARS, that
contains all existing calendar years in the SH.TIMES time periods table and adds in all
current calendar years up to and including 2005. I will also create a new view
named SH.SALES_MIDWEST_ONLY
that will gather only a small subset of sales data from the sales history
table, SALES. (I am
using a small subset only to delineate the results of data densification; in
real life, of course, I would use all the available sales data.) Next, I will construct a query that uses a partitioned outer
join to link together sales data from the SH.SALES_MIDWEST_ONLY view with a larger subset
of Middle Western U.S. states to illustrate how data densification can be
implemented: Listing 2.3
shows the creation of both of these views, the initial query and this complete
result set. The previous query does produce zeroed totals for those
remaining states (i.e. Ohio and Iowa) which are present in the second subquery
but not in the first subquery. That is better than nothing, but unfortunately,
data is only present for the years 1999, 2000 and 2001 (the only time periods
present in the SALES table's data). However, there is nothing to stop me from
employing multiple partitioned outer joins to populate more than one
data dimension. Here is an example of how to accomplish this via one of my
favorite, elegant SQL features that Oracle provides -- the WITH clause
-- to specify the three component subqueries before invoking the multiple
partitioned outer joins. The resulting output of this query will contain a
fully "densified" result set: A fully documented version of this query as well as its
complete result set is shown in Listing 2.4. The new analytical function enhancements in Oracle 10g
give developers and DBAs significantly more horsepower for advanced data
modeling -- without the use of third-party software to perform these
analyses. These new features make Structured Query Language within Oracle 10g even
more robust and flexible. The next article in this series will concentrate on
the new MODEL clause, an extremely powerful enhancement that Oracle 10g
provides that significantly extends Oracle's robust analytical reporting
capabilities. While there is no substitute for direct experience, reading
the manual is not a bad idea, either. I have drawn upon the following Oracle
10g documentation for the deeper technical details of this article: B10736-01 Oracle Database
Data Warehousing Guide, Ch. 21 B10750-01 Oracle Database
New Features Guide B10759-01 SQL Reference
Oracle 10g Enhancements to Analytical Functions
Using Partitioned Outer Joins To "Densify" Data
SQL> TTITLE 'Partitioned Outer Join Example #1'
SQL> COL prod FORMAT A32 HEADING 'Product'
SQL> COL state FORMAT A4 HEADING 'State'
SQL> COL year FORMAT 9999 HEADING 'JobID'
SQL> COL tot_sales FORMAT 999,999,999.99 HEADING 'Total Sales'
SQL>
SQL> SELECT
2 SLS.prod
3 ,RGNS.state
4 ,SLS.year
5 ,NVL(SLS.sales,0) tot_sales
6 FROM
7 (SELECT
8 MWO.state
9 ,MWO.prod
10 ,MWO.year
11 ,SUM(MWO.sale) sales
12 FROM
13 sh.sales_midwest_only MWO
14 GROUP BY
15 MWO.state
16 ,MWO.prod
17 ,MWO.YEAR
18 ) SLS
19 PARTITION BY (SLS.prod)
20 RIGHT OUTER JOIN
21 (SELECT
22 DISTINCT cust_state_province state
23 FROM sh.customers
24 WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI')
25 ) RGNS
26 ON (RGNS.state = SLS.state)
27 ORDER BY SLS.prod, SLS.state, SLS.year;
Partitioned Outer Join Example #1
Product Stat JobID Total Sales
-------------------------------- ---- ----- ---------------
And 2 Crosscourt Tee Kids IL 1998 826.00
And 2 Crosscourt Tee Kids IL 1999 1,187.90
And 2 Crosscourt Tee Kids IL 2000 2,619.40
And 2 Crosscourt Tee Kids IN 1998 266.00
And 2 Crosscourt Tee Kids IN 2000 2,086.00
And 2 Crosscourt Tee Kids MI 1999 53.20
And 2 Crosscourt Tee Kids MI 2000 1,209.60
And 2 Crosscourt Tee Kids WI 1998 686.00
And 2 Crosscourt Tee Kids WI 1999 504.00
And 2 Crosscourt Tee Kids WI 2000 378.00
And 2 Crosscourt Tee Kids IA .00
And 2 Crosscourt Tee Kids OH .00
Coin Pocket Twill Cargo Trousers MI 1999 234.00
Coin Pocket Twill Cargo Trousers MI 2000 78.00
Coin Pocket Twill Cargo Trousers WI 1999 722.15
Coin Pocket Twill Cargo Trousers WI 2000 390.00
Coin Pocket Twill Cargo Trousers IA .00
Coin Pocket Twill Cargo Trousers IL .00
Coin Pocket Twill Cargo Trousers OH .00
Coin Pocket Twill Cargo Trousers IN .00
Gurfield& Murks Pleated Trousers IL 1998 9,100.00
Gurfield& Murks Pleated Trousers IL 1999 13,825.00
Gurfield& Murks Pleated Trousers IL 2000 2,100.00
Gurfield& Murks Pleated Trousers IN 1999 7,525.00
Gurfield& Murks Pleated Trousers MI 1998 26,040.00
Gurfield& Murks Pleated Trousers MI 1999 12,600.00
Gurfield& Murks Pleated Trousers MI 2000 19,425.00
Gurfield& Murks Pleated Trousers WI 1998 175.00
Gurfield& Murks Pleated Trousers IA .00
Gurfield& Murks Pleated Trousers OH .00
Kahala Pleated Chino Short IL 1998 504.00
Kahala Pleated Chino Short IL 1999 3,738.00
Kahala Pleated Chino Short IA .00
Kahala Pleated Chino Short IN .00
Kahala Pleated Chino Short MI .00
Kahala Pleated Chino Short OH .00
Kahala Pleated Chino Short WI .00
37 rows selected.N-Dimensional Data Densification
SQL> TTITLE 'Partitioned Outer Join Densifying Across Two Dimensions'
SQL> COL prod FORMAT A32 HEADING 'Product'
SQL> COL state FORMAT A4 HEADING 'State'
SQL> COL year FORMAT 9999 HEADING 'JobID'
SQL> COL tot_sales FORMAT 999,999,999.99 HEADING 'Total Sales'
SQL>
SQL> WITH
2 -- Gather all Midwest-only sales data
3 Q1 AS (
4 SELECT
5 prod
6 ,state
7 ,year
8 ,SUM(sale) sale
9 FROM
10 sh.sales_midwest_only
11 GROUP BY state, prod, year
12 ),
13 -- Gather a list of distinct states
14 Q2 AS (
15 SELECT DISTINCT cust_state_province state
16 FROM sh.customers
17 WHERE cust_state_province IN ('IA', 'IL','IN','MI','OH','WI')
18 ),
19 -- Gather a list of distinct calendar years
20 Q3 AS (
21 SELECT calendar_year year
22 FROM sh.calendar_years
23 )
24 SELECT
25 Q4.prod
26 ,Q4.state
27 ,Q3.year
28 ,NVL(Q4.sale, 0) tot_sales
29 FROM
30 (SELECT
31 Q1.prod
32 ,Q2.state
33 ,Q1.year
34 ,Q1.sale
35 FROM Q1
36 PARTITION BY (prod)
37 RIGHT OUTER JOIN Q2
38 ON (Q1.state = Q2.state)
39 ) Q4
40 PARTITION BY (prod,state)
41 RIGHT OUTER JOIN Q3
42 ON (Q4.YEAR = Q3.year)
43 ORDER BY 1, 2, 3;Conclusion
References and Additional Reading