“Look: I/O thread is waiting for disk space!”

MySQL 8.0.1 introduced a work with replication threads mutexes in order to improve performance. In MySQL 8.0.2 the same work was extended, focusing in usability, and revamped how replication deals with disk-full conditions, improving the responsiveness of both monitoring commands and administrative commands such as KILL, as well as making status messages much more precise and helpful.

Before this work, a troubleshooting session may look like:

Operator 1

Operator 2

Operator 2 connects to the MySQL instance.

Operator 2

Operator 2 queries performance_schema.replication_connection_status table.

Operator 2

Operator 1

Operator 2 connects a new client session and queries performance_schema.threads table.

Operator 2

Operator 1

Operator 2

Operator 2 connects to the server host to check the MySQL instance error log.

Operator 2

Operator 1

The above could happen because when the replication receiver thread gets disk full while writing new events into the relay log, the server goes to a loop where it polls the disk every 60 seconds to see if enough space has been freed. It used to hold a mutex while doing so, and that prevented queries into performance_schema.replication_connection_* tables (and other monitoring statements like SHOW SLAVE STATUS) from executing.

Checking replication threads statuses also gave almost no clue about what was happening:

slave57> USE performance_schema;

slave57> SELECT name, processlist_state, processlist_time FROM threads WHERE name LIKE '%slave%';

+----------------------+----------------------------------------+------------------+

| name | processlist_state | processlist_time |

+----------------------+----------------------------------------+------------------+

| thread/sql/slave_io | Queueing master event to the relay log | 106 |

| thread/sql/slave_sql | Reading event from the relay log | 120 |

+----------------------+----------------------------------------+------------------+

2 rows in set (0,00 sec)

The only effective hint about the disk space issue was a set of messages into the server error log like:

[ERROR] [000000] Disk is full writing './slave-relay-bin.000002' (Errcode: 28 - No space left on device). Waiting for someone to free space...

[ERROR] [000000] Retry in 60 secs. Message reprinted in 600 secs

Trying to kill the receiver thread would make the server send the kill signal to the process, but the signal would only be checked by the thread when it was actually checking for disk space. So, as it slept for 60 seconds between attempts, the receiver thread sometimes waited up to 1 minute before being actually killed.

Also, when killed, the receiver thread left any event’s partial content already written into the relay log. Trying to show the contents of the relay log (using mysqlbinlog client program) would lead to an error reading the last (incomplete) event in the relay log file. In 5.7 (since 8.0.1 this should not happen), this partial content would lead to a SQL thread error:

slave57> USE performance_schema;

slave57> SELECT * FROM replication_connection_status\G

*************************** 1. row ***************************

CHANNEL_NAME:

GROUP_NAME:

SOURCE_UUID: cb3480c2-adc5-11e7-bb36-b86b232ba6ca

THREAD_ID: NULL

SERVICE_STATE: OFF

COUNT_RECEIVED_HEARTBEATS: 0

LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00

RECEIVED_TRANSACTION_SET: cb3480c2-adc5-11e7-bb36-b86b232ba6ca:1-7

LAST_ERROR_NUMBER: 1595

LAST_ERROR_MESSAGE: Relay log write failure: could not queue event from master

LAST_ERROR_TIMESTAMP: 2017-10-10 15:19:15

1 row in set (0,01 sec)

slave57> SELECT * FROM replication_applier_status_by_worker\G

*************************** 1. row ***************************

CHANNEL_NAME:

WORKER_ID: 0

THREAD_ID: NULL

SERVICE_STATE: OFF

LAST_SEEN_TRANSACTION: cb3480c2-adc5-11e7-bb36-b86b232ba6ca:8

LAST_ERROR_NUMBER: 1594

LAST_ERROR_MESSAGE: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

LAST_ERROR_TIMESTAMP: 2017-10-10 15:19:15

1 row in set (0,00 sec)

Improved usability

In MySQL 8.0.2 we introduced an improvement where the receiver thread no longer holds the mutex required for monitoring activities while writing into the relay log.

This means that, when blocked waiting for disk space, the receiver thread no longer blocks monitoring threads.

In order to allow operators to know what is happening without the need of inspecting server error log, the receiver thread will now state that it is waiting for disk space in any place showing replication thread states.

slave80> USE performance_schema;

slave80> SELECT name, processlist_state, processlist_time FROM threads WHERE name LIKE '%slave%';

+----------------------+--------------------------------------------------------+------------------+

| name | processlist_state | processlist_time |

+----------------------+--------------------------------------------------------+------------------+

| thread/sql/slave_io | Waiting for disk space | 28 |

| thread/sql/slave_sql | Slave has read all relay log; waiting for more updates | 44 |

+----------------------+--------------------------------------------------------+------------------+

2 rows in set (0,38 sec)

We also changed the wait for disk space procedure to check for the kill signal every second (freed disk space is still checked every minute). So, a KILL of the receiver thread waiting for disk space will make it stop almost promptly (no need to wait for up to a minute anymore). Also, when killed, the receiver thread will now truncate the relay log file in the end of the last completely received event (there will be no trash in the relay log file anymore).

[ERROR] Disk is full writing '/media/relay_log/slave-relay-bin.000004' (Errcode: 28 - No space left on device). Waiting for someone to free space...

[ERROR] Retry in 60 secs. Message reprinted in 600 secs

[ERROR] Slave I/O for channel '': Relay log write failure: failed to write event to the relay log file, Error_code: 1595

[Note] Relaylog file /media/relay_log/slave-relay-bin.000004 size was 1912832, but was truncated at 1904837.

Conclusion

When you don’t know exactly what is happening with replication threads, it takes longer to identify the effective actions to fix the issue, which leads to further replication delays or incorrect actions that don’t deal with the problem. We have now improved the usability of commands to monitor replication threads, allowing a receiver thread to report that it is waiting for disk space as a thread state, and also allowing replication monitoring commands to not be blocked while the receiver thread is waiting for disk space. Time to update monitoring procedures!

Another day, in an operations room with MySQL 8…

Operator 1

Operator 2

Operator 2 connects to the MySQL instance and checks replication threads statuses.

Operator 2

Operator 1

32,824 total views, 46 views today

About João Gramacho

João Gramacho is a Principal Software Developer for the MySQL Replication Core team at Oracle. Before joining Oracle, he worked for more than ten years with IT infrastructure support (servers hardware, operating systems, networks, management and security) and did a PhD in high-performance computing. View all posts by João Gramacho →

2 thoughts on ““Look: I/O thread is waiting for disk space!””

I have the same problem happened in slave running group replication in version 8.0.17. I am wondering how can I recover my slave. the slave server shut down and when I restarted the server MySQL error log shows the relay log or master log is corrupted
error log
] Disk is full writing ‘./ic-1-relay-bin-group_replication_applier.000039’ (OS errno 28 – No space left on device). Waiting for someone to free space… Retry in 60 secs. Message reprinted in 600 secs.
2019-08-21T19:02:37.095609Z 33 [ERROR] [MY-010584] [Repl] Slave SQL for channel ‘group_replication_applier’: Could not execute Write_rows event on table sbtest.sbtest3; Error writing file ‘/tmp/MLfd=28’ (OS errno 22019-08-22T06:55:36.766622Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.17) starting as process 910
2019-08-22T06:55:43.510632Z 1 [Warning] [MY-012637] [InnoDB] 81920 bytes should have been written. Only 77824 bytes written. Retrying for the remaining bytes.
2019-08-22T06:55:43.510669Z 1 [Warning] [MY-012638] [InnoDB] Retry attempts for writing partial data failed.
2019-08-22T06:55:43.510683Z 1 [ERROR] [MY-012639] [InnoDB] Write to file ./#innodb_temp/temp_6.ibt failed at offset 0, 81920 bytes should have been written, only 77824 were written. Operating system error number 28. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
….
Error number 28 means ‘No space left on device’
2019-08-22T07:02:37.278732Z 1 [ERROR] [MY-012267] [InnoDB] Could not set the file size of ‘./ibtmp1’. Probably out of disk space
2019-08-22T07:02:37.278746Z 1 [ERROR] [MY-012926] [InnoDB] Unable to create the shared innodb_temporary.
2019-08-22T07:02:37.278764Z 1 [ERROR] [MY-012930] [InnoDB] Plugin initialization aborted with error Generic error.
2019-08-22T07:02:37.776328Z 1 [ERROR] [MY-010334] [Server] Failed to initialize DD Storage Engine
2019-08-22T07:02:37.776504Z 0 [ERROR] [MY-010020] [Server] Data Dictionary initialization failed.
2019-08-22T07:02:37.779913Z 0 [ERROR] [MY-010119] [Server] Aborting
…
[MY-010818] [Server] Error reading GTIDs from relaylog: -1
Slave SQL for channel ‘group_replication_applier’: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master’s binary log is corrupted (you can check this by running ‘mysqlbinlog’ on the binary log), the slave’s relay log is corrupted (you can check this by running ‘mysqlbinlog’ on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master’s or slave’s MySQL code. If you want to check the master’s binary log or slave’s relay log, you will be able to know their names by issuing ‘SHOW SLAVE STATUS’ on this slave. Error_code: MY-013121
1. Hi Dan,
  
  From the logs you posted I guess:
  – GR could not write into GR applier relay log;
  – InnoDB complained about not being able to write to disk;
  – There was probably a server restart;
  – InnoDB complained about disk space and the server failed to load its plug-in;
  – Server DD failed to initialize (probably because of no InnoDB plug-in loaded);
  – Finally, there is a GR applier channel initialization failure.
  
  If the interpretation of the logs are correct, you can try to recover your slave by:
  
  1) Fixing the server disk space issue (ensure server initializes InnoDB and DD correctly);
  2) If GR still doesn’t start correctly, try cleaning up its applier relay log files with RESET SLAVE FOR CHANNEL 'group_replication_applier';
  3) Try to start GR again;

“Look: I/O thread is waiting for disk space!”

Improved usability

Conclusion

Another day, in an operations room with MySQL 8…

About João Gramacho

Post navigation

2 thoughts on ““Look: I/O thread is waiting for disk space!””

Leave a Reply Cancel reply

Recommend

Why Psychology Should Play a Part in Your Packaging

Sentiment Analysis: First Steps With Python's NLTK Library

记者以职务身份开设自媒体，将被纳入记者证核验重点！

第五届中国康养产业发展论坛开幕攀枝花发布康养产业标准体系

新晋程序员奶爸的云原生之路

2021可转债投资（双低脉冲+长期绩优）

腐烂的尸体不能说话，却可以拆穿不在场证明

Objective evaluation of NLG texts – Ehud Reiter's Blog

PyQt5系列教程（二）：基本功能之显示提示语

劝降书！致双低可转债持有者

About Joyk