WL#1720: Semi-synchronous replication
source link: https://dev.mysql.com/worklog/task/?id=1720
SUMMARY

- The interface for semi-synchronous replication was done in WL#4398.
- This WL#1720 will be closed when the components have been pushed to the main tree and are part of the server releases.

RATIONALE

Get HA by having the slave acknowledge a transaction before it is committed. At COMMIT time on the master, wait for an ACK from the slave before returning to the user. This ensures that the data always exists in two places.

DESIGN LEFT TO DO

1. How to configure the master? Especially if there are multiple slaves.
2. Ensure that the solution is general enough, e.g. any engine can use it.
3. Check if there can be a clean interface to the "replicator".

DESCRIPTION

This is a feature request from a customer, who calls it "semi-synchronous replication":

- user sends COMMIT to master
- master writes the transaction to the binlog
- master waits for the slave to say "I have received the transaction and saved it into my relay log" (ACK of the slave I/O thread) or the same plus "I have executed the transaction" (ACK of the slave SQL thread).

More precisely, the thread that writes to the binlog waits for a global variable "slave_pos" (and "slave_binlog") to be updated (a condition should be signaled), and the Binlog_dump thread is responsible for updating this global variable: the Binlog_dump thread sends the events to the slave, waits for an "OK" packet from the slave, and then broadcasts the condition.

Ideally, for multi-statement transactions, the Binlog_dump thread should not wait for a slave "OK" for every event, only for the events that really update the slave's data: for InnoDB, the COMMIT; for MyISAM, any statement. In other words, all events which get directly written to the disk binary log, plus the COMMIT event. "This event requires an OK" could be a flag in the event.

If the master does not receive an OK before a timeout is exceeded, the master switches this slave to asynchronous replication and prints a message to the error log: "slave XXX was switched to async rep (i.e. no more waits on COMMIT) because a timeout was exceeded for binary log YYY position ZZZ" (once we have transaction IDs, we can print the ID too).

The DBA should be able to request semi-sync or async replication on the fly. SHOW SLAVE HOSTS would report whether each slave is doing semi-sync or async replication, and why (timeout exceeded, or the DBA requested this mode).

The customer needs "ACK that it's in the relay log" more urgently than "ACK that it's executed by the slave", and would agree to sponsor it. The customer also suggested that, if the master has more than one slave, the master waits for a configurable number of slaves to have ACKed the transaction. But the customer does not need this extension immediately.

This new feature is somewhat similar to 2-phase commit: the master waits for the slave's commit before saying "ok" to the user. However, it does not roll back if the slave fails, so it may be better for critical sites which need always-ongoing operations.

Guilhem should continue the discussion with the customer to clarify and amend the above description, with Brian too. Monty presently agrees _on_the_idea_ of this type of replication.

References:

- Google implemented a version of this feature: http://mysqlha.blogspot.com/2007/05/semi-sync-replication-for-mysql-5037.html

CAN WE USE THIS INTERFACE?
==========================

We should consider whether we can use (a superset/subset of) this interface for semi-sync. The following is a draft.

Components
----------

MySQL      - MySQL Server
Replicator - Library that is linked with the MySQL server

Structure
---------

+--------------------------------+
|             MySQL              |
+--------------------------------+
| Service and Plugin Interfaces  |
+--------------------------------+
|           Replicator           |
+--------------------------------+

Distributed concurrency control
===============================

A separate thread in the Replicator checks whether the prioritized transaction and the local transactions conflict. If so, the local transaction is rolled back.
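As an illustration of the conflict check just described, here is a minimal C++ sketch. The worklog does not define what "conflict" means, so this sketch assumes that two transactions conflict when their write sets (sets of row keys) overlap. The names Trx, conflicts() and find_victims() are hypothetical and not part of any MySQL interface.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical transaction descriptor. ASSUMPTION: a transaction is
// characterized by the set of row keys it has written.
struct Trx {
  unsigned long id;
  std::set<std::string> write_set;
};

// ASSUMPTION: two transactions conflict if they wrote a common row key.
bool conflicts(const Trx &a, const Trx &b) {
  // Iterate over the smaller write set and probe the larger one.
  const bool a_is_smaller = a.write_set.size() <= b.write_set.size();
  const Trx &small = a_is_smaller ? a : b;
  const Trx &large = a_is_smaller ? b : a;
  for (const std::string &key : small.write_set)
    if (large.write_set.count(key) != 0) return true;
  return false;
}

// Returns the ids of the local transactions that conflict with the
// prioritized transaction and therefore would be rolled back.
std::vector<unsigned long> find_victims(const Trx &prioritized,
                                        const std::vector<Trx> &local) {
  std::vector<unsigned long> victims;
  for (const Trx &t : local)
    if (conflicts(prioritized, t)) victims.push_back(t.id);
  return victims;
}
```

In the design above, the Replicator's checker thread would run something like find_victims() for each prioritized transaction and then ask MySQL to roll back each returned id (see the rollback request call in the interface).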
This proposal supports only an optimistic approach. For the conservative approach, we also need a notifier hook at the beginning of the transaction.

MySQL - Replicator Interface
----------------------------

The interface has four main calls:

1) Commit requestor: request_commit(trx_id) (MySQL -> Replicator)

Before MySQL commits a transaction, it sends a request to the Replicator with the following information:

a) Transaction Id

The Replicator replies with Yes or No, indicating whether the transaction is allowed to commit. If "No", the transaction is rolled back. If "Yes", MySQL can commit the transaction.

This should be implemented in handler.cc:ha_commit_trans(), after the 2PC prepare and right before the call to ha_commit_one_phase().

Timeline:

  1   2             3             4       5        6         7   8
  ----|-------------|-------------|-------|--------|---------|--->
   prepare          |             ^     Store   commit    commit
                    |             |      Xid      SE1       SE2
                    v             |
                 Commit        Approval (Set seq no)

NOTE Recovery: If the MySQL server crashes between 4 and 5, the Replicator will send the transaction to the other nodes, while it is not recovered by this MySQL server. This means that the Replicator needs to ask the server for its last executed xid and then send the lost local transactions back to the MySQL server for reapplication, as part of the recovery protocol.

NOTE: The DBMS is not allowed to fail after step 5. The global commit has already happened.

Sample code:

    /* Request commit from replicator */
    if (!replicator->request_commit(xid))                   /* NEW */
    {                                                       /* NEW */
      ha_rollback_trans(thd, all);                          /* NEW */
      error= 1;                                             /* NEW */
      goto end;                                             /* NEW */
    }                                                       /* NEW */
    error=ha_commit_one_phase(thd, all) ? (cookie ? 2 : 1) : 0;
    DBUG_EXECUTE_IF("crash_commit_before_unlog", abort(););
    if (cookie)
      tc_log->unlog(cookie, xid);
    DBUG_EXECUTE_IF("crash_commit_after", abort(););
  end:
    if (is_real_trans)
      start_waiting_global_read_lock(thd);
    /* Report commit completed to replicator (only if it succeeded) */
    if (!error)                                             /* NEW */
      replicator->report_commit(xid);                       /* NEW */

2) Commit report (MySQL -> Replicator)

After MySQL has committed the transaction, it must report the transaction id to the Replicator. See the code above.

3) Abort report (MySQL -> Replicator)

Add the call in: handler.cc:ha_rollback_trans(THD *thd, bool all)

4) Master: Report rows: log_subscriber::report(event) (MySQL -> Replicator)

The binlog event will be sent to the subscriber.

Extractor
---------

Every row that is applied through the handler interface is caught in the same way as in row-based replication and reported to the Replicator with:

a) Row being changed/binlog event
b) Transaction Id

Alternatives:

a) Trigger level: Catch rows
b) Handler level: Catch rows
c) Binlogging level: Catch binlog events (log.cc:MYSQL_BIN_LOG::write(IO_CACHE*))

DECISION: Try with c.

Serialization
-------------

Will be implemented using the log_event.cc:write_data_header() functions, rewritten to keep the data in a buffer instead of writing it to a file.

5) Transaction start (MySQL -> Replicator)

Called when a transaction is started, with:

a) Transaction id

6) Rollback request (Replicator -> MySQL)

The Replicator can at any time call MySQL with a request to roll back an ongoing local transaction. MySQL then needs to roll it back.

7) Slave Executor

Alternatives:

a) SQL level - outside client
b) SQL level - mysql_parse()
c) Binlog events
d) Individual rows, execute on handler level

DECISION: Try with c.

FILTERING
---------

All events except rows events (3 types) and table map events are discarded.

EXECUTION
---------

This would probably mean taking the sql_binlog.cc:mysql_client_binlog_statement(THD* thd) function, stripping away the base64 handling, and using the same method to execute events.
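The FILTERING rule above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the EventType enum and its member names are hypothetical stand-ins rather than the server's actual event codes, and the "3 types" of rows events are assumed to be the write, update and delete rows events.

```cpp
#include <cassert>
#include <vector>

// Hypothetical event tags for illustration only; these are NOT the
// server's real event type codes.
enum class EventType {
  Query, Xid, Rotate, TableMap, WriteRows, UpdateRows, DeleteRows
};

struct Event {
  EventType type;
};

// Keep only table map events and the three rows events (assumed here to
// be write, update and delete); everything else is discarded.
bool is_kept_by_slave_executor(EventType t) {
  switch (t) {
    case EventType::TableMap:
    case EventType::WriteRows:
    case EventType::UpdateRows:
    case EventType::DeleteRows:
      return true;
    default:
      return false;
  }
}

// Returns the events that survive the filter, in their original order.
std::vector<Event> filter_events(const std::vector<Event> &in) {
  std::vector<Event> out;
  for (const Event &e : in)
    if (is_kept_by_slave_executor(e.type)) out.push_back(e);
  return out;
}
```

The surviving events would then be handed to whatever execution path is chosen under EXECUTION; order is preserved so that each table map event still precedes the rows events that depend on it.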
8) Recovery function: si::get_last_executed_transaction()

At recovery time, the Replicator needs to know the last executed transaction so that it can replay any local transactions that were lost if the server failed between steps 4 and 5 above, as well as any remote transactions that were not yet applied. This is probably some functionality in log.cc for XA recovery.
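The recovery step can be sketched as below. This is a rough C++ illustration, not the worklog's design: it assumes that globally committed transactions carry strictly increasing sequence numbers starting at 1 (the "seq no" set at approval) and that the Replicator keeps them in an ordered log. The Server and LoggedTrx types and the recover() function are hypothetical; only the name get_last_executed_transaction() comes from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of a globally committed transaction held by the
// Replicator. ASSUMPTION: seq_no is strictly increasing, starting at 1.
struct LoggedTrx {
  std::uint64_t seq_no;
};

// Hypothetical stand-in for the server side of the recovery interface.
struct Server {
  std::uint64_t last_executed = 0;
  std::vector<std::uint64_t> applied;  // what was re-applied, for inspection

  std::uint64_t get_last_executed_transaction() const {
    return last_executed;
  }
  void apply(const LoggedTrx &t) {
    applied.push_back(t.seq_no);
    last_executed = t.seq_no;
  }
};

// Replays every logged transaction newer than the server's last executed
// one and returns how many transactions were replayed.
std::size_t recover(Server &server, const std::vector<LoggedTrx> &global_log) {
  const std::uint64_t last = server.get_last_executed_transaction();
  std::uint64_t prev = 0;
  std::size_t replayed = 0;
  for (const LoggedTrx &t : global_log) {
    assert(t.seq_no > prev);  // the log must be in increasing order
    prev = t.seq_no;
    if (t.seq_no <= last) continue;  // already executed before the crash
    server.apply(t);
    ++replayed;
  }
  return replayed;
}
```

Under these assumptions the same loop covers both cases named in the text: lost local transactions and remote transactions not yet applied are simply the log entries whose sequence number is greater than the server's last executed one.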