2

Can Disk Space Be Saved in MySQL by Adding a Primary Key?

 6 months ago
source link: https://www.percona.com/blog/can-disk-space-be-saved-in-mysql-by-adding-a-primary-key/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Can Disk Space Be Saved in MySQL by Adding a Primary Key?

February 9, 2024

Przemysław Malkowski

Historically, MySQL does not require explicit primary key defined on tables, and it’s like that by default till this day (MySQL version 8.3.0). Such a requirement is imposed through two replication methods, though: Group Replication and Percona XtraDB Cluster (PXC), where using tables without a primary key is not allowed by default. There are many well-known negative performance implications for tables lacking a primary key, where the most painful is terrible replication speed.

Today, I would like to make a quick point about one more reason to have a primary key: data size on disk!

Let’s consider this very simple example table:

MySQL
mysql > show create table test1\G
*************************** 1. row ***************************
       Table: test1
Create Table: CREATE TABLE `test1` (
  `a` int NOT NULL,
  `b` bigint DEFAULT NULL,
  KEY `a` (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

Populated with 10M test rows, it takes 748M on disk. Now, given that my test table has column “a” with unique values:

MySQL
mysql > select count(*) from test1;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (1.34 sec)
mysql > select count(DISTINCT(a)) from test1;
+--------------------+
| count(DISTINCT(a)) |
+--------------------+
|           10000000 |
+--------------------+
1 row in set (5.25 sec)

I will change the (secondary) index type to primary:

MySQL
mysql > alter table test1 add primary key(a), drop key a;
Query OK, 0 rows affected (48.90 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql > show create table test1\G
*************************** 1. row ***************************
       Table: test1
Create Table: CREATE TABLE `test1` (
  `a` int NOT NULL,
  `b` bigint DEFAULT NULL,
  PRIMARY KEY (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

The table was re-created as a result, and its size on disk was reduced to 588M, quite significantly! Why has this happened? We have exactly the same data, and both columns are indexed in both cases! Let’s check more details about the table before and after the change.

Before, without PK, when both columns were indexed via a secondary key, we could see the following:

MySQL
mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name, CLUST_INDEX_SIZE, OTHER_INDEX_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) JOIN information_schema.INNODB_TABLESTATS ts WHERE t.NAME=ts.NAME AND t.NAME='db1/test1'\G
*************************** 1. row ***************************
           SPACE: 50
        INDEX_ID: 232
      index_name: a
      table_name: db1/test1
CLUST_INDEX_SIZE: 24699
OTHER_INDEX_SIZE: 22242
*************************** 2. row ***************************
           SPACE: 50
        INDEX_ID: 231
      index_name: b
      table_name: db1/test1
CLUST_INDEX_SIZE: 24699
OTHER_INDEX_SIZE: 22242
*************************** 3. row ***************************
           SPACE: 50
        INDEX_ID: 230
      index_name: GEN_CLUST_INDEX
      table_name: db1/test1
CLUST_INDEX_SIZE: 24699
OTHER_INDEX_SIZE: 22242
3 rows in set (0.00 sec)

So, there is a third index! As seen in a more detailed view of each index via the innodb_ruby tool, it’s the biggest one in terms of size (id=230):

$ innodb_space -f msb_8_3_0/data/db1/test1.ibd space-indexes
id      name  root        fseg        fseg_id     used        allocated   fill_factor
230           4           internal    3           27          27          100.00%    
230           4           leaf        4           24634       24672       99.85%      
231           5           internal    5           21          21          100.00%    
231           5           leaf        6           12627       12640       99.90%      
232           6           internal    7           13          13          100.00%    
232           6           leaf        8           9545        9568        99.76%

This is how the InnoDB engine works; if no explicit PK is defined, it will add an internal one named GEN_CLUST_INDEX. As it contains the whole data row, it’s size overhead is significant.

After replacing the secondary index with the explicit primary key, the hidden one is no longer needed. Therefore, we’re left with two indexes in total:

MySQL
mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name, CLUST_INDEX_SIZE,OTHER_INDEX_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) JOIN information_schema.INNODB_TABLESTATS ts WHERE t.NAME=ts.NAME AND t.NAME='db1/test1'\G
*************************** 1. row ***************************
           SPACE: 54
        INDEX_ID: 237
      index_name: b
      table_name: db1/test1
CLUST_INDEX_SIZE: 23733
OTHER_INDEX_SIZE: 13041
*************************** 2. row ***************************
           SPACE: 54
        INDEX_ID: 236
      index_name: PRIMARY
      table_name: db1/test1
CLUST_INDEX_SIZE: 23733
OTHER_INDEX_SIZE: 13041
2 rows in set (0.01 sec)
$ innodb_space -f msb_8_3_0/data/db1/test1.ibd space-indexes
id      name  root        fseg        fseg_id     used        allocated   fill_factor
236           4           internal    3           21          21          100.00%    
236           4           leaf        4           20704       23712       87.31%      
237           5           internal    5           17          17          100.00%    
237           5           leaf        6           11394       13024       87.48%

Hidden (internal) clustered index (GEN_CLUST_INDEX) vs. generated invisible primary key (GIPK)

Every InnoDB table has a clustered key, so not defining one will not save any disk space, and sometimes, on the contrary, as demonstrated above. So, even if none of the existing columns of your problematic table is unique, it is still best to add another unique column to act as the primary key. The internal GEN_CLUST_INDEX one is not exposed to the upper MySQL layer, and only the InnoDB engine is aware of it, hence it is useless for replication speed. Therefore, an explicit PK is always a better solution.
However, if adding a new PK column is out of the question due to a legacy application problem, you should still enforce the primary key by using an invisible one. This way, you will gain performance benefits, and at the same time, the change is transparent to applications.

Let’s see how it works in practice:

MySQL
mysql > set sql_require_primary_key=1;
Query OK, 0 rows affected (0.00 sec)
mysql > create table nopk (a int);
ERROR 3750 (HY000): Unable to create or change a table without a primary key, when the system variable 'sql_require_primary_key' is set. Add a primary key to the table or unset this variable to avoid this message. Note that tables without a primary key can cause performance problems in row-based replication, so please consult your DBA before changing this setting.
mysql > set sql_generate_invisible_primary_key=1;
Query OK, 0 rows affected (0.00 sec)
mysql > create table nopk (a int);
Query OK, 0 rows affected (0.02 sec)
mysql > show create table nopk\G
*************************** 1. row ***************************
       Table: nopk
Create Table: CREATE TABLE `nopk` (
  `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT /*!80023 INVISIBLE */,
  `a` int DEFAULT NULL,
  PRIMARY KEY (`my_row_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
mysql > select * from nopk;
+------+
+------+
|  100 |
+------+
1 row in set (0.00 sec)

So, our application is not aware of the new column at all. But we can still use it if needed, for example, to easily split table reads or writes into predictable chunks:

MySQL
mysql > select my_row_id,a from nopk;
+-----------+------+
| my_row_id | a    |
+-----------+------+
|         1 |  100 |
+-----------+------+
1 row in set (0.00 sec)

Please note that for existing schemas lacking a primary key, before you enforce the sql_require_primary_key variable, best to first enable the sql_generate_invisible_primary_key and re-create the data using logical dump and restore. A simple table optimization will not add the invisible PK. In any case, having an invisible PK should be a win-win solution for legacy applications.

To summarize:
• It may be worth checking if changing the index type can save disk space!
• If adding a primary key wasn’t possible due to the application limits, consider using an invisible one!

Percona Distribution for MySQL is a complete, stable, scalable, and secure open source MySQL solution, delivering enterprise-grade database environments for your most critical business applications. Deploy anywhere and implement easily with one-to-one compatibility with MySQL Community Edition.

Try Percona Distribution for MySQL today!

Share This Post!

Subscribe
Connect with
guest
Label
0 Comments

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK