Teradata : DATA DISTRIBUTION AND PRIMARY INDEX

Share via:

DATA DISTRIBUTION AND PRIMARY INDEX

 

 

DATA DISTRIBUTION AND PRIMARY INDEX

Each AMP is designed in a way to hold portion of rows from each table i.e. data is distributed across all the AMPs in the system and each AMP operates upon its share of data for any operation. Teradata database uses hashing algorithm to randomly distribute data across all the AMPs. Normally, the data of all the tables can be distribute across all the amps except certain scenarios when the number of AMPs are more than the number of rows in a table.

Ideally, the rows of each table will be evenly distributed across all the AMPs. Even distribution is desirable so that each AMP has equal amount of work to do on equal amount of data. When the data distribution is not even, the response of the any query will be as fast as the slowest amp as the amp having more data will need more time to serve the request (shared nothing architecture). The column(s) of a table based on which the data distribution takes place is known as a Primary Index of the table.

 

Primary Indexes:

Primary indexe is the physical mechanism for assigning a data row to an amp and its location on disks. It is also an index which is used to directly access row without having to go for a full table scan. Choosing a parimary index is a very crucial and important decision as it is on this column that the data distribution takes place and we want to make sure that the primary index is a column which is as unique as possible so that the hashing algorithm direct the data to different AMPs. Non unique values give the same output after the hashing algorithm and hence go on same amp. So higher the duplicates, the worse the data distribution and hence worse the performance of queries on the said table.

There are two types of primary indexes:

  • Unique Primary Index (UPI)
  • Non Unique Primary Index (NUPI)

These will be covered in detail in the next article. A primary index is defined at the time of creation of a table and can be either a single column or a combination of upto 64 columns. Once primary index is defined for a table it cannot be changed and the table must be dropped and recreated with new PI if the PI needs to be changed. That is why choosing a good PI is very important for Teradata’s performance.

Note: While the PI can’t be changed, we can change the data in PI column, once it is changed, it simply gets rehashed to a different amp and the row moves there.

 

Data Distribution via Hashing:

 

 

 

Teradata database has a hashing algorithm that it uses to distribute rows of data across amps. The primary index value is fed to the Teradata Hashing Algorithm which generates a 32 bit row hash. The high order 16 bits is referred to as hash bucket number and is used to identify the hash map entry. This is used to identify the amp the row will go to. The remaining 16 bits are used to locate the amp.

When system grows, new amps are usually added. Change in number of amps need to have a change in hash map as well.

 

Share via:
Note: Please test scripts in Non Prod before trying in Production.
1 Star2 Stars3 Stars4 Stars5 Stars (22 votes, average: 5.00 out of 5)
Loading...

6 thoughts on “Teradata : DATA DISTRIBUTION AND PRIMARY INDEX

Add Comment