EVALUATING AND REFINING DATA MODELS

After you've produced a physical model, you'll want to evaluate and refine your table designs to help ensure optimal performance.

PARTITION SIZE CALCULATION
The first thing to check is whether your tables will have partitions that are too large, or, to put it another way, too wide. Partition size is measured by the number of cells (values) stored in the partition. Cassandra's hard limit is 2 billion cells per partition, but you'll probably run into performance problems before you get there.
In order to calculate the size of partitions, use the following formula:
$$N_v = N_r (N_c - N_{pk} - N_s) + N_s$$
The number of values (or cells) in the partition (Nv) is equal to the number of static columns (Ns) plus the product of the number of rows (Nr) and the number of values per row. The number of values per row is the total number of columns (Nc) minus the number of primary key columns (Npk) and static columns (Ns).
Although tables can be altered at runtime, the number of columns is usually fixed. As a result, the number of rows in the partition is the primary driver of partition size. This is an important factor when assessing whether a partition has the potential to grow too large. Two billion values may sound like a lot, but in a sensor system where tens or hundreds of values are measured every millisecond, the number of values adds up quickly.
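To make the sensor example concrete, here is a rough back-of-the-envelope sketch. The write rate of 100 values per millisecond into a single partition is a hypothetical figure chosen for illustration, not a measurement from a real system.

```python
# Hard limit on cells per partition quoted in the text.
CELL_LIMIT = 2_000_000_000

# Hypothetical write rate: 100 values per millisecond into one partition.
values_per_ms = 100
values_per_second = values_per_ms * 1000

# Time until a single, never-split partition reaches the limit.
seconds_to_limit = CELL_LIMIT / values_per_second
hours_to_limit = seconds_to_limit / 3600

print(f"{seconds_to_limit:.0f} seconds (about {hours_to_limit:.1f} hours)")
```

Under that assumption the limit is reached in well under a day, which is why a time-based component in the partition key is worth considering for this kind of workload.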
Let's examine one of the tables to work out its partition size. Look at the available_rooms_by_hotel_date table, because it has a wide partition design with one partition per hotel. The table has four columns in total (Nc = 4), three of which are primary key columns (Npk = 3), and no static columns (Ns = 0). Plugging these values into the formula gives:
$$N_v = N_r (4 - 3 - 0) + 0 = N_r$$
So the number of values in a partition of this table equals the number of rows in it.
With the 73,000 rows per partition assumed in the disk-size calculation later in this article (for example, 100 rooms tracked over 730 days of inventory), this relatively small number of rows per partition should not cause you any problems. If you start keeping more dates of inventory, or don't manage the size of the inventory correctly using TTL, you could run into trouble. You may still want to consider breaking up this wide partition, which you'll learn how to do shortly.
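The calculation above can be sketched in a few lines of Python. The function name partition_values is made up for this illustration, and the row count of 73,000 is the figure assumed in the disk-size calculation later in this article.

```python
def partition_values(n_rows: int, n_columns: int,
                     n_primary_key: int, n_static: int = 0) -> int:
    """Nv = Nr * (Nc - Npk - Ns) + Ns"""
    values_per_row = n_columns - n_primary_key - n_static
    return n_rows * values_per_row + n_static


# Hard limit on cells per partition quoted in the text.
CELL_LIMIT = 2_000_000_000

# available_rooms_by_hotel_date has four columns
# (hotel_id, date, room_number, is_available), three of them in the primary key.
n_v = partition_values(n_rows=73_000, n_columns=4, n_primary_key=3, n_static=0)

print(n_v)                      # number of cells in one partition
print(n_v / CELL_LIMIT * 100)   # percentage of the hard limit used
```

Re-running the function with a pessimistic row count is a quick way to see how much headroom the design leaves.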
When doing sizing calculations, it is tempting to assume the nominal or average case for variables such as the number of rows. Consider estimating the worst case as well, because such forecasts have a habit of coming true in successful systems.

 

CALCULATING SIZE ON DISK
In addition to determining the size of a partition, it is a good idea to estimate how much disk space will be needed for each table you intend to keep in the cluster. To determine the size St of a partition, apply the following formula:
$$S_t = \sum_i sizeOf(c_{k_i}) + \sum_j sizeOf(c_{s_j}) + N_r \times \left( \sum_k sizeOf(c_{r_k}) + \sum_l sizeOf(c_{c_l}) \right) + N_v \times sizeOf(t_{avg})$$
👉 In this formula, ck stands for partition key columns, cs stands for static columns, cr stands for regular columns, and cc stands for clustering columns.
👉 The term tavg refers to the average number of bytes of metadata, such as timestamps, stored per cell. This is typically estimated at 8 bytes.
👉 The number of rows Nr and number of values Nv will be familiar from earlier calculations.
👉 The sizeOf() function returns the byte size of each referenced column’s CQL data type.
The first term requires you to add the lengths of the partition key columns. The available_rooms_by_hotel_date table in this example includes a single partition key column, the hotel_id, which is of type text. Assuming that hotel identifiers are basic 5-character codes, you have a 5-byte value, which means that the sum of the partition key column sizes is 5 bytes.
The second term asks you to add the sizes of the static columns. Because this table has no static columns, that term is 0 bytes.
The third term is the most involved, and for good reason: it calculates the size of the cells in the partition. Add the sizes of the clustering columns and the regular columns. The two clustering columns are the date (4 bytes) and the room_number (2 bytes), for a total of 6 bytes. There is only one regular column, is_available, which is 1 byte in size. The regular column size (1 byte) plus the clustering column size (6 bytes) comes to 7 bytes. To complete the term, multiply this value by the number of rows (73,000), yielding 511,000 bytes (0.51 MB).
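The fourth term is the per-cell metadata: the number of values (73,000, equal to the number of rows for this table) multiplied by 8 bytes, which gives 584,000 bytes (0.58 MB).
Adding these terms together, you get a final estimate:
Partition size = 5 bytes + 0 bytes + 0.51 MB + 0.58 MB ≈ 1.1 MB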
This formula is an approximation of the real size of a partition on disk, but it is close enough to be useful. Given that the partition must be able to fit on a single node, the table design appears to place little demand on disk storage.
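The four terms of the disk-size formula can be reproduced with a short Python sketch. The function name partition_size_bytes and its parameter names are invented for this illustration. The column sizes are the ones stated in the text: a 5-byte hotel_id partition key, no static columns, 4-byte and 2-byte clustering columns, one 1-byte regular column, and 8 bytes of metadata per cell.

```python
def partition_size_bytes(partition_key_sizes, static_sizes,
                         regular_sizes, clustering_sizes,
                         n_rows, n_values, t_avg=8):
    """Approximate on-disk size of one partition, in bytes."""
    key_term = sum(partition_key_sizes)           # partition key columns
    static_term = sum(static_sizes)               # static columns
    row_term = n_rows * (sum(regular_sizes) + sum(clustering_sizes))
    metadata_term = n_values * t_avg              # per-cell metadata
    return key_term + static_term + row_term + metadata_term


size = partition_size_bytes(
    partition_key_sizes=[5],   # hotel_id as a 5-character text code
    static_sizes=[],           # no static columns
    regular_sizes=[1],         # is_available
    clustering_sizes=[4, 2],   # date, room_number
    n_rows=73_000,
    n_values=73_000,           # Nv equals Nr for this table
)

print(size)                    # bytes
print(size / 1_000_000)        # megabytes (decimal)
```

Because the row and metadata terms dominate, the size of the partition key barely affects the total of roughly 1.1 MB.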

 

Author    : Neha Kasanagottu
LinkedIn : https://www.linkedin.com/in/neha-kasanagottu-5b6802272