Oracle Performance and Engineering

Random notes on database features, code, practice, optimization and techniques.

Tuesday, February 01, 2005

 

CLUSTERING FACTOR - How important this stuff is?

Howard Rogers has neatly defined "Clustering Factor" as "a count of how many visits to the table would you have to make if each entry in the index was read in turn, with consecutive visits to the same table block not being counted".

In other words, CLUSTERING FACTOR of an index (CF, hereafter ;) ) is really the measure of total IO if the whole table is read via that index. When database reads index, it gets the data block address of the actual row in the table and reads it. In other words, the column in an ordinary index is always ordered (sorted), but not necessarily the table has sorted indexed column(s) on its data blocks. CF is a measure of how "ordered" or "sorted" the indexed column(s) really are in the table.

Observed Notes on CF -

(1) The maximum number of times you can read a table block-by-clock is once per row (if the row size is, of course, less than the amount of IO the system can perform in each run). Thus the maximum value of clustering factor could be the number of rows in that table. At that point, the indexed data is pretty badly ordered in the table - and thus index-range scan may be effected.

(2) The minimum number of times rows can be read on a block-by-block scheme is equal to the number of non-empty data blocks.

If the table has many empty blocks, they will not be read during an index scan since the index entry in the leaf blocks will not have a ROWID that maps to an empty block.

(Note: A deleted value has a NULL rowid in an index!)

Thus, CF can actually be less than the number of total blocks in a table!

(3) Rebuilding an index or coalescing it will not impact the CF. This is normal since the CF is really a table property than an index property. It measures "orderliness of indexed data in the table". It does not measure anything of the index.

(4) However, rebuilding the table (such as, lowering PCTFREE thus ensuring more number of rows in each block OR "create table as select ...order by indexed_column(s) ") will impact Clustering Factor. In both cases above, Clustering Factor should be lower after the table reorg.

(5) Someone equated CF with the number of "Consistent Gets" when the table is read through the index here. But it's only partially true. Let's see how the optimizer is affected by the CF.

According to Wolfgang Breitling (well, this is database basic - but he gave a nice perspective of the same w.r.t. Oracle), COST of a table access through index range is derived from the formula -
_________________________________________________

COST = blevel + (FF * leaf_blocks) + (FF * Clustering_Factor)
_________________________________________________


Where FF = "Filter Factor" or selectivity; or the fraction of the rows being read.
For an UNIQUE index scan - selectivity is ZERO.
When all rows are read (say, no filter is applied) - selectivity is 1.

Thus for a unique index scan the cost comes out to be independent of Clustering_Factor, since then the FF = 0. Logically this also makes sense, since if only one key of the unique index is read - the physical distribution of that value in the table (only one ROWID!) does not matter at all.

While reading *all* the rows (i.e., FF = 1), thus the COST of access through the index becomes LARGER (much larger - will be dealt with in a later article) than the Full-table scan (which, at worse, will equal to CF!).

From the above equation, it also seems the larger % of rows are read the more important CF becomes in calculating COST. Thus we can say, CF plays a big role in determining large index-range scans.

(6) A quick (and provocative!) way to try to enable the range-scan of a large index is to rebuild the table as

"CREATE TABLE AS SELECT * FROM ORDER BY INDEXED_COLUMN(S)"


While this will ORDER the data on the data blocks of the table for the indexed column(s) thus making indexed-access to multiple rows cheaper, but at the same time it will break some other sorting in the table (if there was one). Mostly, if an ordinary table contains a sequence generated primary key - the table is more-or-less sorted by that.

(7) I found a more detailed formula to calculate cost from this article. According to this, the cost of a table access through index scan can be equated as -

_________________________________________________
COST =
(BLEVEL * 9742.88
+ CEIL(FF * Leaf_Blocks) * 9742.88 -- INDEX ACCESS COST
+ CEIL(FF * CF) * 9742.88 -- TABLE BLOCK ACCESS COST
+ FF * CF * 330 -- ROW PROCESSING COST
+ FF * CF * NCOL * 20 ) / 10,000 -- COLUMN PROCESSING COST
_________________________________________________

Where, FF = Filtering Factor or SELECTIVITY
CF = Clustering Factor
NCOL = Number of columns accessed

On the same note, in future articles I plan to discuss some more hands on CF and other index parameters.

Here's another good article by Dan Hotka on Clustering Factor of an index.

Comments: Post a Comment



<< Home

Archives

January 2005   February 2005   March 2005   April 2005   May 2005   March 2006   April 2006   August 2006   September 2006   October 2006   January 2007   May 2008  

This page is powered by Blogger. Isn't yours?