The Underlying Importance of the server_id Parameter

January 8, 2024

Eduardo Krieg

One of the most overlooked parameters when configuring MySQL is server_id, an integer that uniquely identifies a server inside a replication topology. Note that two servers within a replication set can’t have the same server_id value.

It is generally set to a “random” number, just different from the ones configured on the other replicas, and once set it is rarely reviewed or changed later. That is usually fine, but if overlooked it can lead to an unwanted skip of transactions in a recovery scenario like the one I will describe below.
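
A quick way to confirm what each node is actually using is to query the variable directly, or to check the value persisted in the configuration file (the my.cnf path and the values below are just examples):

mysql> SELECT @@server_id;
+-------------+
| @@server_id |
+-------------+
|          81 |
+-------------+

# /etc/my.cnf, [mysqld] section
server_id = 81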

Suppose we have the following topology:

db2 - primary - server_id = 82
  \_ db1 - replica - server_id = 81

Once the traffic was moved off the previous primary (db1), we could start the maintenance on it. During the maintenance, a disk issue left the data in MySQL inconsistent.

Don’t worry, we have the backups taken from db2 earlier that same day. Let’s use those!

  • We copy the backup from db2 into db1 using rsync.
  • We run the steps on how to restore a compressed backup from here.
  • We start MySQL with the same configuration that db1 had before the crash (server_id=81).
  • When setting up replication, we use the replication coordinates from xtrabackup_binlog_info, as we need db2’s position at the time the consistent backup was taken (see the example below).
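
For reference, those coordinates come from the xtrabackup_binlog_info file that Percona XtraBackup writes into the backup directory. A hypothetical example, consistent with the positions used later in this post (a GTID set column may also appear when GTIDs are enabled):

$ cat xtrabackup_binlog_info
mysql-bin.034322    375677230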

Between the time the backup was taken and the time of the restore, many binlogs were generated that needed to be applied, so we expected replication to take some time to fully catch up with the primary.

Once the replication was configured, we noticed that the binlogs were advancing too fast.

Suppose we configured replication using the following positions:

mysql> CHANGE MASTER TO
    -> MASTER_HOST='10.10.10.10',
    -> MASTER_USER='repluser',
    -> MASTER_PASSWORD='XXXXXXXXXXXXXXXX',
    -> MASTER_LOG_FILE='mysql-bin.034322',
    -> MASTER_LOG_POS=375677230;
Query OK, 0 rows affected, 2 warnings (0.36 sec)
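
After setting the coordinates, replication is started as usual (START SLAVE shown here to match the syntax above; on MySQL 8.0.22+, START REPLICA is the equivalent):

mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)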

We review the replication status:

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.10.10.10
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.034324
          Read_Master_Log_Pos: 3898727534
               Relay_Log_File: mysql-relay-bin.000006
                Relay_Log_Pos: 369
        Relay_Master_Log_File: mysql-bin.034324
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
          Exec_Master_Log_Pos: 3898727534
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2002
                  Master_UUID: 94318777-57fe-11ee-a884-b496916f3834
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400

Noticing something odd? Yes, Seconds_Behind_Master: 0, even though there were around 40+ binlogs still to apply.
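
One way to see the real gap, even when Seconds_Behind_Master reports 0, is to compare the replica’s Relay_Master_Log_File / Exec_Master_Log_Pos with the primary’s current position (a rough sketch, output omitted):

-- On the primary (db2):
mysql> SHOW MASTER STATUS;
-- Compare File/Position with Relay_Master_Log_File/Exec_Master_Log_Pos from
-- SHOW SLAVE STATUS on the replica; here the primary was still dozens of
-- binlog files ahead of what the replica had executed.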

It kept running at a “fast pace”; it advanced 40 binlogs in a matter of a couple of minutes, until it failed:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.10.10.10
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.034372
          Read_Master_Log_Pos: 107962899
               Relay_Log_File: mysql-relay-bin.000086
                Relay_Log_Pos: 6294
        Relay_Master_Log_File: mysql-bin.034364
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table test.mysql1; Can't find record in mysql1, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.034364, end_log_pos 123596930

Why? Was our backup broken?

Upon checking the binlogs, we found that the binlog events that were processed “too fast” had been generated with server_id=81, which is the same as the current replica’s (those records were written on db1, which was the primary when the backup was taken). By default, when a replica encounters events carrying its own server_id, it skips them.
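
This can be verified directly: every binlog event header carries the server_id of the server that originally wrote it. A quick sketch against one of the primary’s binlog files from that range (the file name is taken from the error above):

$ mysqlbinlog mysql-bin.034364 | grep -o 'server id [0-9]*' | sort | uniq -c
# Most events in these files carried "server id 81" -- written while db1 was
# still the primary, i.e., the same server_id as the replica applying them.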

Then, once the failover had been done, the binlog events were generated with server_id=82 (db2). This is when the replica started to actually apply events, but since it had skipped the earlier events carrying its own server_id, it missed many updates and inserts to the DB, leaving it in an inconsistent state.

In this case, there’s no other option but to retry the restore, this time changing the server_id value on the restored server before starting replication.
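
A minimal sketch of that change, assuming 83 is not used anywhere else in the topology: set the new value before MySQL is started and replication is configured (server_id can also be changed at runtime with SET GLOBAL, but putting it in my.cnf keeps it persistent across restarts):

# /etc/my.cnf on the restored server -- 83 is just an example value
[mysqld]
server_id = 83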

Another approach, other than changing the server_id, would have been to enable the replicate-same-server-id option.

As per the documentation:

“This option is for use on replicas. The default is 0 (FALSE). With this option set to 1 (TRUE), the replica does not skip events that have its own server ID. This setting is normally useful only in rare configurations.”
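
If that route were taken, it is a replica-side option, set for example in my.cnf (shown only as a sketch; as the quote says, it is normally useful only in rare configurations):

# my.cnf on the replica -- use with caution
[mysqld]
replicate-same-server-id = 1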

Conclusion

The scenario described in this blog may not apply to all restorations, but it is a good reminder to review the server_id when performing restores on a replication environment. It can avoid headaches!

As was also briefly mentioned, the parameter “replicate-same-server-id” is not recommended unless we’re sure of what we’re doing.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

Try Percona Distribution for MySQL today!
