PRIMARY INDEX IN TERADATA
In this article, we will learn about PRIMARY INDEX IN TERADATA
In Teradata, Parallelism is one of the main feature. As we already know AMPs are the one which perform the actual operation of the query. So, to maintain the parallelism each AMP needs to balance work, so that, query completes within less time. But the question here is, who can help to maintain the balanced data for the table across the AMPs.
with the help of primary index in teradata, we can achieve this balanced data distribution.
What is primary index :
Primary Index is the physical mechanism for storing and retrieving data row in Teradata AMP. Each table in Teradata must have at least one column as Primary Index. It is defined at the time of creating table. It is used to specify where the data resides in Teradata. It is used to specify which AMP gets the data row. Each table in Teradata is required to have a primary index defined. If the primary index is not defined, Teradata automatically assigns the primary index. Primary index provides the fastest way to access the data. A primary may have a maximum of 64 columns.PI can’t be altered or modified.
It is the most preferred and important index for below reasons:
- Data Distribution
- Known access path
- Improves Join performance
Guidelines for choosing primary index :
There are three essential factors to consider when choosing a primary index:
- Uniform data distribution (the most important consideration)
- Optimal access to the data
- The volatility of indexed column values
Use the following guidelines for selecting columns to be used as primary indexes:
>> Select columns that consist mainly of unique, distinct values.
This is the most important consideration. Columns with distinct values distribute data rows evenly across all AMPs in the configuration. This maximizes parallel processing by allowing all the processors to participate in processing the data rows for a request.
>> Avoid using columns with a small number of distinct values that are repeated frequently or columns that have many nulls. This will cause uneven distribution of data rows resulting in some AMPs having more rows to process than others, and will increase the potential for performance bottlenecks.
Recommendation: Select columns having significantly more distinct values than the number of AMPs in the configuration.
>> Select columns that are most frequently used in equality predicate conditions.
>> Select columns that are most frequently used to access rows.
The PI provides the most efficient method to access data in a table. Therefore, choosing the PI on the most frequently used access path provides for optimal data access. For example, if the table is frequently joined with a specific set of tables, consider defining the PI on the column set that is typically used as the join condition.
>> Equality conditions permit direct access to the row. Inequality conditions require additional processing.
>> Select columns that do not have any of the following data types: BLOB, CLOB, Period, UDT, ST_Geometry, MBR.
>> Choose as few columns as possible for the PI to optimize its generality.
>> Select primary indexes so that query plans use AMP-local processing and avoid row redistributions.
>> Select columns that are not volatile.
Everyone assumes primary index is same as primary key. Primary key is just to identify row for a table but primary index can be used to distribute the data as well as for accessing the data.
We have two types of primary indexes.
- Unique Primary Index (UPI).
- Non-Unique Primary Index(NUPI)
Unique Primary Index :
As the name suggests a UPI allows unique values i.e. no duplicates are allowed. It is a one AMP operation and data distribution is even. It can contain one null value
CREATE SET TABLE Salary (
UNIQUE PRIMARY INDEX(EmployeeNo);
For eg: We have an Employee table where EMP_NO is the primary index (we have chosen this as EMP_NO is unique to all).
Data distribution using UPI ( Unique primary index ) :
Sample query :
INSERT INTO DBNAME.EMPOYEE VALUES (011,'Wilson',20,'2010-10-26',5000);
When a user submits a insert query for a table with Primary Index the following processes occur:
The index value goes through a hashing algorithm and gives out a 32-bit Row-hash value something like this 0011 0011 0101 0101 0000 0001 0110 0001 for EMP_NO 011.
First 20 bit of this 32 bit Row hash value determines the AMP on which the row will reside. This is decided from the Hash map which contains 1 million hash bucket. Hash map looks something like this for 4 AMP system.
So now the hash value will point to the particular amp from HASH MAP. e.g : Our value 0011 0101 0101 0000 0001 0110 0001 points to 2nd row , 1st column i.e. AMP 4. Now we have AMP 3 where the row will reside. 5. The PE will send the row to the AMP with the hash value attached to it, something like this:
An uniqueness value is defined for each row. As EMP_NO is unique to all, in case of UPI the uniqueness value will be 1 for all. This can be well understood while we will study NUPI. So, this is how the row can be distributed to the AMP. Same is the process for retrieval.
Non-Unique Primary Indexes ( NUPI ) :
If the table is defined to be having NUPI, then the column deemed as UPI can accept duplicate values.
CREATE TABLE sample_1
PRIMARY INDEX (col_a);
Data distribution and accessing is same as UPI but if allows duplicate values for PI. All duplicate values belong to same PI goes to same AMP.