
pt-archiver Misbehaving With Secondary Index

source link: https://www.percona.com/blog/pt-archiver-misbehaving-with-secondary-index/


Not long ago, we found a long-running query on the server of one of our Managed Services clients. While reviewing this long-running query, we found it was generated by a pt-archiver operation. That doesn’t sound good, right?

I was able to reproduce the issue on my test environment. In this article, I will show you how to use pt-archiver to purge data from a table using a secondary index. First, we need to have the full picture before going any deeper. Here is a representation of the table structure:

MySQL
Create Table: CREATE TABLE `sbtest1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=4000001 DEFAULT CHARSET=latin1

As you can see, this test table was generated with sysbench. The criterion defined for the purge was to remove all rows where column “k” is lower than 1,500,000. In my test environment, this represents around 42K rows.

MySQL
mysql> select count(1) from `dbtest`.`sbtest1` where (k < 1500000);
+----------+
| count(1) |
+----------+
|    41687 |
+----------+
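
As a side note, if you want to reproduce a similar dataset, the table above is sysbench’s standard OLTP schema, so something along these lines should rebuild it (host, credentials, and sizes below are assumptions, not taken from the original case):

Shell
sysbench oltp_insert \
  --mysql-host=localhost --mysql-port=3306 \
  --mysql-user=root --mysql-password=secret \
  --mysql-db=dbtest --tables=1 --table-size=4000000 \
  prepare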

Examining the pt-archiver operation, we found the options chosen were logical given the archiving criteria and the table structure. These are the pt-archiver options used:

Shell
pt-archiver \
--source=h=localhost,P=3306,D=dbtest,t=sbtest1,b=0,i=k_1 \
--where='k < 1500000' --limit=1000 \
--primary-key-only --bulk-delete --purge

As you can see, it is suggesting the usage of index “k_1”, which references the column used in the purging criteria, and it is also using the “--primary-key-only” option, which is suggested if you just want to purge data, as in our case. The “--primary-key-only” option makes pt-archiver include only the primary key columns in the SELECT clause to avoid fetching all columns from the table; this simple feature speeds up the purge process, especially on wide tables.
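
While pt-archiver runs, you can watch the statements it actually issues from another session; one simple way (the LIKE filter on the table name is just an illustration) is to query the processlist:

MySQL
mysql> SELECT id, time, info FROM information_schema.processlist
    -> WHERE info LIKE '%sbtest1%' AND id <> CONNECTION_ID();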

However, reviewing the statement executed against the database, I found that the running query was not using the suggested index:

MySQL
SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `dbtest`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE (k < 1500000) AND (`id` < '4000000') ORDER BY `id` LIMIT 1000

Now, if we get the actual explain plan for this query, we can see it is not optimal at all: it examines almost two million rows while forcing the usage of the “PRIMARY” index. No wonder this was a long-running query.

MySQL
mysql> explain format=json SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `dbtest`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE (k < 1500000) AND (`id` < '4000000') ORDER BY `id` LIMIT 1000\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "783367.33"
    },
    "ordering_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "sbtest1",
        "access_type": "range",
        "possible_keys": [
          "PRIMARY"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "id"
        ],
        "key_length": "4",
        "rows_examined_per_scan": 1928044,
        "rows_produced_per_join": 642617,
        "filtered": "33.33",
        "cost_info": {
          "read_cost": "654843.92",
          "eval_cost": "128523.41",
          "prefix_cost": "783367.33",
          "data_read_per_join": "117M"
        },
        "used_columns": [
          "id",
          "k"
        ],
        "attached_condition": "((`dbtest`.`sbtest1`.`k` < 1500000) and (`dbtest`.`sbtest1`.`id` < '4000000'))"
      }
    }
  }
}

But why is pt-archiver doing this? Why isn’t it using the suggested index? After playing around with pt-archiver options, I realized that the “--primary-key-only” option inhibits the usage of the suggested secondary index. This was initially reported in ticket PT-157, and it was marked as fixed. However, we can confirm that the most recent version of pt-archiver, 3.4.0, is still not behaving as expected, so I opened ticket PT-2112 to follow up on the fix.

Anyway, we still needed to fix the issue for our client, and to do so, we just removed the “--primary-key-only” option, which allows the suggested index to be used. The downside is that all columns are fetched from the table. We can see the SQL statements pt-archiver will generate by specifying the “--dry-run” option; this is a handy feature to check whether the generated statements are optimal.

Shell
michael@testbox1:~$ pt-archiver \
> --source=h=localhost,P=3306,D=dbtest,t=sbtest1,b=0,i=k_1 \
> --where='k < 1500000' --limit=1000 \
> --bulk-delete --purge --dry-run
SELECT /*!40001 SQL_NO_CACHE */ `id`,`k`,`c`,`pad` FROM `dbtest`.`sbtest1` FORCE INDEX(`k_1`) WHERE (k < 1500000) ORDER BY `k` LIMIT 1000
SELECT /*!40001 SQL_NO_CACHE */ `id`,`k`,`c`,`pad` FROM `dbtest`.`sbtest1` FORCE INDEX(`k_1`) WHERE (k < 1500000) AND ((`k` >= ?)) ORDER BY `k` LIMIT 1000
DELETE FROM `dbtest`.`sbtest1` WHERE (((`k` >= ?))) AND (((`k` <= ?))) AND (k < 1500000) LIMIT 1000

If we examine the explain plan for this generated statement, we can see that the query cost is about 7x lower than the previous one (111,060 vs. 783,367), and it examines only 79K rows compared to the almost 2M of the previous query. So, definitely a better query.

MySQL
mysql> explain format=json SELECT /*!40001 SQL_NO_CACHE */ `id`,`k`,`c`,`pad` FROM `dbtest`.`sbtest1` FORCE INDEX(`k_1`) WHERE (k < 1500000) ORDER BY `k` LIMIT 1000\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "111060.21"
    },
    "ordering_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "sbtest1",
        "access_type": "range",
        "possible_keys": [
          "k_1"
        ],
        "key": "k_1",
        "used_key_parts": [
          "k"
        ],
        "key_length": "4",
        "rows_examined_per_scan": 79328,
        "rows_produced_per_join": 79328,
        "filtered": "100.00",
        "index_condition": "(`dbtest`.`sbtest1`.`k` < 1500000)",
        "cost_info": {
          "read_cost": "95194.61",
          "eval_cost": "15865.60",
          "prefix_cost": "111060.21",
          "data_read_per_join": "14M"
        },
        "used_columns": [
          "id",
          "k",
          "c",
          "pad"
        ]
      }
    }
  }
}

Conclusion

Hopefully, we will see this bug fixed in a future release, but for now, we can use pt-archiver as shown above to take advantage of secondary indexes. Also, as a recommendation, always use the “--dry-run” option to test your pt-archiver command and make sure the generated queries are optimal for your database.

