Oracle Performance and Engineering

Random notes on database features, code, practice, optimization and techniques.

Wednesday, August 23, 2006

One thing I learnt today (or, something I knew was wrong!)

A few months back, I was investigating a slow query for a client that had a huge IN-list (over 1100 items in IN-list). The IN-listed column was a unique key and yet the query was not using an index (complicated query fired on a UNION ALL-ed view -- each part of the UNION ALL - separately fired - would use the index however). I was trying to play with statistics and asked for generating statistics with histogram across the schema. Or, asked to fire the following -



exec dbms_stats.gather_schema_stats(ownname => 'my_owner',method => 'FOR ALL COLUMNS SIZE AUTO', cascade => TRUE);

I *thought* that the "AUTO" will decide which columns to collect histograms on. Though the strategy had solved the problem and confirmed my wrong assumption - today I learnt that "AUTO" option will only work - if at all - on a database that's being used in full production capacity for a while; or in other words, has all typical SQL statements fired by the application in the cache.

Given AUTO option, Oracle actually queries SYS.COL_USAGE$ table (that stores details of columns used in WHERE clause) and determines what columns should it collect histograms on. This column usage monitoring is apparently *not* governed by "alter table monitoring on;" (as given wrongly in some documents - including this one on Oracle website). SYS.COL_USAGE$ population depends on an undocumented parameter _COLUMN_TRACKING_LEVEL.

Thus, to collect histograms on the columns with most skewed data - sepcially after upgrade, db restart, migration etc. - we should really use "SKEWONLY" option. While this is time consuming, this one looks like the best option. For subsequent execution - we can always use "REPEAT" (thereby not repeating same calculation to see which columns actually have skewed enough value to warrant a histogram).

I also learnt about a new Oracle 10g-R2 (10.2) optimizer parameter called optimizer_secure_view_merging

Wasn't this achieved with MERGE / NO_MERGE and PUSH_PRED / NO_PUSH_PRED hints? Have to do more testing on this parameter.

posted by Nilendu # 4:44 PM 0 comments

Monday, August 21, 2006

Why correct datatype is SO important in schema design!

We often, randomly, use a VARCHAR2 for a field that should really be a NUMBER (or DATE). This could have drastic impact on product performance as followed in this case study.

The following SQL was found to about 75% of the total load (logical I/O) from AWRStats report on a 10.1.0.4 database.


SELECT * FROM V WHERE report_type_id = :1;

On querying, I found the view looks like -


CREATE OR REPLACE VIEW V as
SELECT ROWNUM AS MIGRATOR_KEY
     , PARENT_RT.report_type_id AS PARENT_MIGRATOR_KEY
     , 'RT_SG_REVERSED' AS RESTRICTION_TYPE
     , 'Reversed image of RT_SG restrictions' AS RESTRICTION_TYPE_NAME
     , PARENT_RT.report_type_id
     , PARENT_RT.report_type_name
     , E.entity_name
     , CHILD_SG.security_group_id
     , CHILD_SG.security_group_name
  FROM KNTA_Entities E
     , KNTA_REPORT_TYPES PARENT_RT
     , KNTA_SECURITY_GROUPS CHILD_SG
 WHERE CHILD_SG.enabled_flag = 'Y'
   AND CHILD_SG.security_group_id NOT IN (
       SELECT ER.child_value
         FROM KNTA_ENTITY_RESTRICTIONS ER
        WHERE ER.restriction_type = 'RT_SG'
          AND ER.parent_value = TO_CHAR( PARENT_RT.report_type_id )
       )
   AND PARENT_RT.report_view_access_code <> 'ALL'
   AND E.entity_id = 15 -- security group entity id;

There are (to me) two things wrong with the highlighted part.
(a) It refers to an OUTER table within a NOT IN block.
(b) It compares VARCHAR2 with TO_CHAR(NUMBER)

So, I rewrote the view with a more fitting NOT EXISTS -


CREATE OR REPLACE VIEW v1 AS
SELECT ROWNUM AS MIGRATOR_KEY
     , PARENT_RT.report_type_id AS PARENT_MIGRATOR_KEY
     , 'RT_SG_REVERSED' AS RESTRICTION_TYPE
     , 'Reversed image of RT_SG restrictions' AS RESTRICTION_TYPE_NAME
     , PARENT_RT.report_type_id
     , PARENT_RT.report_type_name
     , E.entity_name
     , CHILD_SG.security_group_id
     , CHILD_SG.security_group_name
  FROM KNTA_ENTITIES E
     , KNTA_REPORT_TYPES PARENT_RT
     , KNTA_SECURITY_GROUPS CHILD_SG
 WHERE CHILD_SG.enabled_flag = 'Y'
 AND NOT EXISTS (
       SELECT  ER.child_value
         FROM KNTA_ENTITY_RESTRICTIONS ER
        WHERE ER.restriction_type = 'RT_SG'
          AND ER.parent_value =  PARENT_RT.report_type_id
          AND er.child_value = child_sg.security_group_id)
   AND PARENT_RT.report_view_access_code <> 'ALL'
   AND E.entity_id = 15 -- security group entity id;

And, changed the data types of PARENT_VALUE and CHILD_VALUE from VARCHAR2(80) to NUMBER (easy with 10g and all -- create a new field with desired datatype; update it to fill with values; drop the old column; rename the new column; rebuild the indexes).

The plan changed from (25000 LIO, 2 sec+)

Operation OBJECT Name ROWS Bytes Cost OBJECT Node IN/OUT PStart PStop

SELECT STATEMENT Optimizer MODE=ALL_ROWS 14 K 45912.0421527342
VIEW MITG.KNTA_RESTRICT_RT_SG_MIGE_V 14 K 4 M 45912.0421527342
COUNT
FILTER
NESTED LOOPS 14 K 1 M 332.450479722286
NESTED LOOPS 113 7 K 6.02673530593326
TABLE ACCESS BY INDEX ROWID MITG.KNTA_ENTITIES 1 20 1.00224632611494
INDEX UNIQUE SCAN MITG.KNTA_ENTITIES_U1 1 .00111934092369478
TABLE ACCESS FULL MITG.KNTA_REPORT_TYPES 113 4 K 5.02448897981832
TABLE ACCESS FULL MITG.KNTA_SECURITY_GROUPS 132 3 K 2.88870570279958
TABLE ACCESS FULL MITG.KNTA_ENTITY_RESTRICTIONS 1 34 5.11315481360784

to (doing just about 50LIO -- returning within a fraction of second)


Operation          OBJECT Name     ROWS    Bytes    Cost     OBJECT Node      IN/OUT   PStart   PStop

SELECT STATEMENT Optimizer MODE=ALL_ROWS                    1                      338.287484705643                                                          
  VIEW  MITG.V1           1          297       338.287484705643                                                         
    COUNT                                                                                                         
      HASH JOIN RIGHT ANTI                   1          127       338.287484705643                                                         
        TABLE ACCESS FULL       MITG.KNTA_ENTITY_RESTRICTIONS     224       7 K       5.10449694166571                                                               
        MERGE JOIN CARTESIAN             14 K     1 M       332.450479722286                                                         
          NESTED LOOPS             113       7 K       6.02673530593326                                                         
            TABLE ACCESS BY INDEX ROWID        MITG.KNTA_ENTITIES   1          20         1.00224632611494                                                               
              INDEX UNIQUE SCAN  MITG.KNTA_ENTITIES_U1         1                      .00111934092369478                                                      
            TABLE ACCESS FULL   MITG.KNTA_REPORT_TYPES   113       4 K       5.02448897981832                                                          
          BUFFER SORT               132       3 K       327.425990742468                                                         
            TABLE ACCESS FULL   MITG.KNTA_SECURITY_GROUPS         132       3 K       2.88870570279958

posted by Nilendu # 2:51 PM 1 comments

Friday, August 18, 2006

Yet another bug discovered - MATERIALIZE hint bringing wrong results (9i databases)


WITH temp AS
     (
        SELECT /*+ MATERIALIZE */
               rsc_pool_id
          FROM krsc_rsc_pools p
         WHERE linked_entity_id = 313
           AND linked_entity_primary_key IN (
                  SELECT DISTINCT org_unit_id
                             FROM krsc_org_units o
                       START WITH org_unit_id IN (
                                     SELECT linked_entity_primary_key
                                       FROM krsc_rsc_pools
                                      WHERE linked_entity_id = 313
                                        AND rsc_pool_id IN (30014))
                       CONNECT BY PRIOR org_unit_id = parent_org_unit_id))
SELECT *
  FROM temp;

This query brings totally different results between (a) when it's using MATERIALIZE hint and (b) when it's not (i.e., by default using the INLINE mechanism).

On reading the trace file - I found Oracle is totally re-writing the query and dropped a few filters. This is how the temp table is created (materialized) -


INSERT      /*+ APPEND BYPASS_RECURSIVE_CHECK */INTO sys_temp_0fd9d6607_e098df80
   SELECT rsc_pool_id
     FROM krsc_rsc_pools p
    WHERE linked_entity_id = 313
      AND linked_entity_primary_key =
             ANY (SELECT DISTINCT org_unit_id
                             FROM krsc_org_units o
                       CONNECT BY parent_org_unit_id = PRIOR o.org_unit_id
                       START WITH org_unit_id =
                                           ANY (SELECT linked_entity_primary_key
                                                  FROM krsc_rsc_pools))

Notice the filters “WHERE linked_entity_id=313 AND rsc_pool_id IN (30014)” are totally gone!!

Advised by Oracle support, I tried the following three tweaks, none worked -



alter session set "_pred_move_around"=false ;
alter session set "_complex_view_merging"=false ;
alter session set "_unnest_subquery"=false ;

The issue does not reproduce in 10g databases. Finally, Oracle opened a bug - #5421368 on this.

posted by Nilendu # 12:06 PM 0 comments

Severe Performance issues using TABLESPACE GROUPS w/ multiple TEMP TABLESPACES

In short, please refrain from using “Tablespace Groups” feature on Oracle 10g as there’re some very expensive, ill-optimized recursive checks that could put enormous load on the system – particularly during sorting (or other set- ) operations.

Issue

All "portlets" (snapshot dashboard reports that do sorting) taking (initially) 30~60 sec on first execution. On immediate second execution the time comes down to low single digit figure.

Resolution Details

We resolved the issue in two-phase improvements. Right now, most of the portlets come back within 3~5 seconds on any given execution.

We also expect the same resolution to effect *all* 10.1.0.3 customers to give a much better portlet performance and bring down the total database load on the client DB by around 20%.

Phase One – Data Skewness: Response time improved from 50-60 sec to about 20-30 sec.

We found that more than 95% of the requests belong to just one request type. In other words, requests are not evenly distributed across the request types. We deleted the column statistics on REQUEST_TYPE_ID to trick Oracle into believing that requests are evenly distributed – and thus using the index on request_type_id. This gave us relatively smaller, but very visible improvement of up to 30%-50% on many portlets. Still some of the portlets were taking up to 20 sec!

Phase Two – Oracle Bug /Configuration Issue: Response time improved from 20 sec to 0.2 sec

Observation One - We found that every time there’s a “sort” operation (ORDER BY) in a portlet – the first execution takes up to 20 sec. Immediate next execution comes down to <>


select min(bitmapped)
from
ts$ where dflmaxext =:1 and  bitand(flags,1024) =1024

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 261649 10.06 9.72 0 0 0 0
Execute 261649 8.24 7.53 0 0 0 0
Fetch 523298 68.57 68.96 0 18577079 0 261649
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 1046596 86.87 86.22 0 18577079 0 261649

Observation Three – We also found that the said Oracle user uses a “Temporary Tablespace Group” (a new feature in Oracle10g) rather than a single temporary tablespace.

A temporary tablespace group consists of several temporary tablespace and the recursive query above tries to determine which of these tablespaces is best suited for a particular sort-ing (ORDER BY – present in all portlets!) operation. For some reason (probably an Oracle bug) this recursive SQL is ill-optimized and consumes a lot of resource.

Thus, we changed the user from using a “Temporary Tablespace Group” to a “Temporary Tablespace” – so that the above check does not have to occur.

After making the change, we experienced a dramatic improvement.

Here's more on temporary tablespace groups.

posted by Nilendu # 11:59 AM 0 comments