Oracle Performance and Engineering

Random notes on database features, code, practice, optimization and techniques.

Wednesday, March 30, 2005

 

Storage of (fixed or variable) Multi-byte data in varchar2

Took the 10g upgrade exam this Sunday. It was *long*. It took me about 2 hours 15 mins to crack all 161 questions. Was a good jump-start on 10g.

One of the applications I advise had client issues yesterday. Apparently, they are "UTF8 enabled," but when customers tried to store 250 multi-byte characters in a VARCHAR2(250) column it failed with "ORA-12899: value too large for column <>".

I was asked if there's something to be done on the Oracle side. Of course there is. In fact, this is an issue that's supposed to be handled entirely on the Oracle side.

Here's what I wrote in different emails back and forth on the issue -

"Yes, there's an Oracle cure :)

Oracle lets you declare a data type's length in "characters" rather than "bytes" (the default). That is, you can declare a field "name" to be a VARCHAR2 of 40 bytes or of 40 characters.

In a single-byte character set the name will contain 40 characters (= 40 bytes). In a double-byte character set the name will still contain 40 characters (80 bytes): Oracle auto-adjusts the byte allocation when you declare the length semantics in characters.

The syntax looks like -

varchar2(20 byte)  -- the default if BYTE is not specified
varchar2(20 char)

This is doable from 9i onwards.
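As a quick sketch (the table and column names are made up for illustration), the two declarations side by side look like this:

```sql
-- Length semantics can be stated explicitly per column (9i and later).
CREATE TABLE customers (
  name_b VARCHAR2(40 BYTE),  -- holds 40 bytes: only ~13 characters if each takes 3 bytes in UTF8
  name_c VARCHAR2(40 CHAR)   -- holds 40 characters, whatever their byte width
);
```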

We can change the "default way" Oracle stores strings by adjusting NLS_LENGTH_SEMANTICS (i.e., whether Oracle should follow "byte" or "character" semantics). The default length semantics, naturally, is byte. However, this parameter should not be changed repeatedly on a database, since doing so gives way to a mixed-semantics database.

Any table created or altered *after* the NLS_LENGTH_SEMANTICS change will inherit the "char" semantics. However, old tables will continue to use the "byte" semantics.

I tried different use cases against a 9i database. Here are the notes on my findings -

(1) On changing "byte" semantic to "char" -

If we change NLS_LENGTH_SEMANTICS to CHAR, all the old tables are still maintained with "byte semantics" and new tables are maintained with "char semantics".
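This can be verified from the data dictionary; a sketch (table names T_OLD and T_NEW are invented):

```sql
-- Tables created after the switch show CHAR_USED = 'C' in the
-- dictionary; pre-existing ones stay 'B'.
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;

CREATE TABLE t_new (col VARCHAR2(40));   -- picked up as 40 CHAR

SELECT table_name, column_name, char_used   -- 'B' = byte, 'C' = char
  FROM user_tab_columns
 WHERE table_name IN ('T_OLD', 'T_NEW');
```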

(2) Effect of "char" semantics on named PL/SQL -

If we change NLS_LENGTH_SEMANTICS to CHAR, named PL/SQL code will use character semantics, i.e., a VARCHAR2(20) inside a *named* PL/SQL unit (as opposed to an anonymous PL/SQL block) will behave as VARCHAR2(20 CHAR).

Note, though, that PL/SQL picks up the length semantics at compile time: already-compiled named units have to be recompiled under the new setting, while anonymous blocks are compiled at each execution and pick it up automatically.
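A sketch of recompiling a stored unit under the new setting (my_proc is a hypothetical procedure name):

```sql
-- After the session (or system) setting changes, recompiling makes the
-- unit's VARCHAR2(20) declarations behave as VARCHAR2(20 CHAR).
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
ALTER PROCEDURE my_proc COMPILE;
```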

(3) Whether VARCHAR2 can store more than 4000 bytes with "char" semantics -

Even if we change NLS_LENGTH_SEMANTICS, we are still bound by the maximum byte limit (4000) of VARCHAR2.

Let's do some math here. The maximum number of bytes a character can take is 3 for UTF8 (Oracle 8i); 4 for AL32UTF8 (Oracle 9i - we're going to use this for the database, i.e., for VARCHAR2s); and 2 for AL16UTF16 (used for the NVARCHAR2 data type).

Thus the maximum number of characters a VARCHAR2(4000) is guaranteed to hold is 1333 in UTF8 (8i) and 1000 in AL32UTF8 (9i).

Here's a nice table I found on Oracle metalink -


Data type    UTF8 (1-3 bytes)    AL32UTF8 (1-4 bytes)    AL16UTF16 (2 bytes)
             Max      Min        Max      Min             Max      Min
CHAR         2000     666        2000     500             N/A      N/A
VARCHAR2     4000     1333       4000     1000            N/A      N/A
NCHAR        2000     666        N/A      N/A             1000     1000
NVARCHAR2    4000     1333       N/A      N/A             2000     2000

(Max = characters stored when every character is single-byte; Min = characters guaranteed when every character takes the encoding's maximum byte width. N/A means not possible!)

However, if the declared width is well below the maximum (4000 bytes), the full declared number of characters can always be stored.

i.e., a VARCHAR2(100 CHAR) will store 100 characters, taking at most 400 bytes with AL32UTF8 (Unicode 3.0 encoding).
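The character/byte gap can be observed directly with LENGTH (characters) versus LENGTHB (bytes); the sample string is just illustrative:

```sql
-- In an AL32UTF8 database, each of these characters takes 3 bytes,
-- so it counts once for LENGTH but three times for LENGTHB.
SELECT LENGTH('日本語')  AS char_count,   -- 3
       LENGTHB('日本語') AS byte_count    -- 9 in AL32UTF8
  FROM dual;
```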

Note - PL/SQL code can handle up to 32K (32,767) bytes in a VARCHAR2 variable!

(4) What if we need a column to hold its full declared count of multi-byte characters without changing the length semantics -

We need to resort to using NVARCHAR2.

NVARCHAR2 falls under the "national character set" and *ALWAYS* stores data with character semantics, even if NLS_LENGTH_SEMANTICS is set to "byte".
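A minimal sketch (the table name is invented) of declaring such a column:

```sql
-- NVARCHAR2 lengths are always counted in characters, regardless of
-- NLS_LENGTH_SEMANTICS; here 250 characters are guaranteed to fit.
CREATE TABLE t_national (
  comments NVARCHAR2(250)
);
```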

(5) What's the easiest way to solve the problem "a VARCHAR2(250) field cannot take 250 characters because it was declared with 'byte' semantics" -

Case 1 -- Easiest for an already up and running database --

ALTER TABLE <table> MODIFY (problem_column VARCHAR2(250 CHAR));

OR

ALTER TABLE <table> MODIFY (problem_column NVARCHAR2(250));

Also change the database NLS_LENGTH_SEMANTICS to CHAR so that PL/SQL handles the extra bytes for multi-byte characters.
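After either modification, the data dictionary confirms which semantics the column now uses; a sketch:

```sql
-- CHAR_USED is 'C' for character semantics, 'B' for byte semantics;
-- DATA_LENGTH remains the byte limit Oracle enforces underneath.
SELECT column_name, data_length, char_length, char_used
  FROM user_tab_columns
 WHERE column_name = 'PROBLEM_COLUMN';
```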


Case 2 -- Easiest for a "fresh" Install --

Alter system set nls_length_semantics = char;

-- create all the tables or Import them fresh.

If the problem happens only with a few tables, I would say Case 1 is the better way to go. For a fresh database, or for instance-wide multi-byte data, Case 2 is the one.

(6) What's the performance implication of "char" semantic?

Well, when Oracle stores strings as "characters," Oracle

(a) cannot do the simpler "byte" arithmetic for string computations. It has to count characters, not bytes, to find where a substring starts and ends. So string-intensive operations - particularly SUBSTR(:string, 1, 40) types - are slower. I have yet to benchmark just how much slower. I would guess the LIKE operator would also be slower.
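A rough sketch of the kind of micro-benchmark I have in mind (string contents, iteration counts, and sizes are arbitrary; comparing against SUBSTRB on a single-byte string would give the byte-arithmetic baseline):

```sql
-- Times repeated character-based SUBSTR calls on a multi-byte string.
DECLARE
  v  VARCHAR2(4000);
  s  VARCHAR2(200);
  t0 PLS_INTEGER;
BEGIN
  -- Build a 1000-character test string (each character is 3 bytes in AL32UTF8).
  FOR i IN 1 .. 1000 LOOP
    v := v || '日';
  END LOOP;

  t0 := DBMS_UTILITY.GET_TIME;      -- centiseconds
  FOR i IN 1 .. 100000 LOOP
    s := SUBSTR(v, 1, 40);          -- character-based substring
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('elapsed (cs): ' || (DBMS_UTILITY.GET_TIME - t0));
END;
/
```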

But then I received an interesting feedback from a co-worker -

"In regard to performance implications resulting from not being able to do byte-level arithmetic for "short-cutting" string manipulation and comparison calculations, the detrimental effect of this is probably marginal.

For example, all string-manipulation done internally by win32 (MS Windows NT and later) is multi-byte. This doesn't have cause major slow-down (not even in heavily multi-byte versions, such as Chinese, Japanese, Korean). Unless oracle does something radically different, and wildly inefficient, this is probably not a concern for us."

(b) (obviously) increases the space requirements.

