WL#1720: Semi-synchronous replication

source link: https://dev.mysql.com/worklog/task/?id=1720

SUMMARY
- The interface for semi-synchronous replication was done in WL#4398.
- This WL#1720 will be closed when the components have been pushed
  to the main tree and are part of the server releases.

RATIONALE

Get HA by having the slave acknowledge transactions before they are committed.
At COMMIT time on the master, wait for an ACK from the slave before returning
to the user. This ensures that the data always exists in two places.

DESIGN LEFT TO DO

1. How to configure the master?  Especially if there are multiple slaves.
2. Ensure that the solution is general enough, e.g. any engine
   can use it.
3. Check if there can be a clean interface to the "replicator".

DESCRIPTION

That's a feature request from a customer, who calls it "semi-synchronous
replication":
- user sends COMMIT to master
- master writes the transaction to the binlog
- master waits for the slave to say "I have received the transaction and saved
it into my relay log" (ACK of the slave I/O thread) or the same plus "I have
executed the transaction" (ACK of the slave SQL thread).
More precisely, the thread which writes to the binlog waits for a global
variable "slave_pos" (and "slave_binlog") to be updated (a condition should be
signaled), and the Binlog_dump thread is responsible for updating this global
variable: the Binlog_dump thread sends the events to the slave, waits for an
"OK" packet from the slave, then broadcasts the condition.
Ideally, for multi-statement transactions, the Binlog_dump thread should not
wait for a slave "OK" for every event, but only for the events which really
update the slave's data: for InnoDB, the COMMIT; for MyISAM, any statement. In
other words, all events which get directly written to the disk binary log, plus
the COMMIT event. "This event requires an OK" could be a flag in the event.

If the master does not receive an OK before a timeout is exceeded, the master
switches this slave to asynchronous replication and prints a message to the
error log: "slave XXX was switched to async rep (i.e. no more waits on COMMIT)
because a timeout was exceeded for binary log YYY position ZZZ" (once we have
transaction IDs we can print the ID too).
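
The wait and the timeout fallback described above could be sketched in C++ as
follows. This is only an illustration under stated assumptions: the class and
method names are hypothetical, the binlog file name ("slave_binlog") is left
out so that a position is a single increasing number, and writing the message
to the error log is reduced to a comment.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical per-slave state shared by committing threads and the
// Binlog_dump thread. Names are illustrative, not server identifiers.
class SlaveAckState {
 public:
  // Binlog_dump thread: the slave sent "OK" for everything up to acked_pos.
  void on_slave_ok(std::uint64_t acked_pos) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      if (acked_pos > slave_pos_) slave_pos_ = acked_pos;
    }
    cond_.notify_all();  // broadcast the condition to all waiting committers
  }

  // Committing thread: wait until the slave has acknowledged commit_pos.
  // Returns true if the ACK arrived in time. On timeout the slave is
  // switched to asynchronous mode and later commits no longer wait.
  bool wait_for_ack(std::uint64_t commit_pos,
                    std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(mutex_);
    if (!semi_sync_) return false;  // already async: return immediately
    const bool acked = cond_.wait_for(
        lock, timeout, [&] { return slave_pos_ >= commit_pos; });
    if (!acked) semi_sync_ = false;  // the server would also log the switch
    return acked;
  }

  bool semi_sync() {
    std::lock_guard<std::mutex> guard(mutex_);
    return semi_sync_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::uint64_t slave_pos_ = 0;  // last position acknowledged by the slave
  bool semi_sync_ = true;        // false once switched to async
};
```

A committing thread would call wait_for_ack() right after its binlog write.
Once a timeout has occurred, later calls return at once, which matches "no
more waits on COMMIT".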

The DBA should be able to request semi-sync or async replication on the fly.

SHOW SLAVE HOSTS would show whether each slave is doing semi-sync or async
replication, and why (timeout exceeded, or DBA requested this mode).

The customer needs "ACK that it's in the relay log" more urgently than
"ACK that it's executed by the slave". The customer would agree to sponsor it.

The customer also suggested that, if the master has more than one slave, the
master wait for a configurable number of slaves to have ACKed the transaction.
But the customer does not need this extension immediately.
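
A possible way to decide when "a configurable number of slaves" have ACKed is
sketched below. It assumes that the master keeps the last acknowledged binlog
position for each slave and that positions are plain numbers greater than
zero. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Highest binlog position acknowledged by at least required_acks slaves.
// Returns 0 if fewer than required_acks slaves are known.
inline std::uint64_t quorum_acked_pos(
    std::vector<std::uint64_t> acked_positions, std::size_t required_acks) {
  assert(required_acks > 0);
  if (acked_positions.size() < required_acks) return 0;
  // Move the required_acks-th largest value to index required_acks - 1.
  auto nth = acked_positions.begin() +
             static_cast<std::ptrdiff_t>(required_acks - 1);
  std::nth_element(acked_positions.begin(), nth, acked_positions.end(),
                   std::greater<std::uint64_t>());
  return *nth;
}

// A commit written at commit_pos may return to the user once this is true.
inline bool commit_acked(const std::vector<std::uint64_t>& acked_positions,
                         std::size_t required_acks,
                         std::uint64_t commit_pos) {
  return quorum_acked_pos(acked_positions, required_acks) >= commit_pos;
}
```

With slaves at positions 100, 250 and 180 and two ACKs required, a commit at
position 150 is acknowledged and a commit at position 200 is not.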

This new feature is somewhat similar to 2-phase commit: the master waits for
the slave's commit before saying "ok" to the user. However, it does not roll
back if the slave failed; hence it may be better for critical sites which need
always ongoing operations.

Guilhem should continue discussion with the customer to clarify and amend the
above description, with Brian too. Monty presently agrees _on_the_idea_ of this
type of replication.

References:
- Google implemented a version of this feature:
  http://mysqlha.blogspot.com/2007/05/semi-sync-replication-for-mysql-5037.html


CAN WE USE THIS INTERFACE?
==========================
We should consider if we can use (a superset/subset of)
this interface for semi-sync.  What follows is a draft.

Components
----------
MySQL - MySQL Server
Replicator - Library that is linked with MySQL server

Structure
---------
  +--------------------------------+
  |             MySQL              |
  +--------------------------------+
  | Service and Plugin Interfaces  |
  +--------------------------------+
  |           Replicator           |
  +--------------------------------+


  
Distributed concurrency control
===============================
A separate thread in the Replicator checks if the prioritized transaction
and the local transactions conflict.  If so then the local transaction
is rolled back.

This proposal supports only an optimistic approach.  For the
conservative approach, we also need a notifier hook at the beginning of
the transaction.
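
The optimistic check could look roughly like the C++ sketch below. How a
conflict is detected is not specified above. Comparing the sets of changed
rows is an assumption made for this example, and all type and function names
are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical transaction summary kept by the Replicator.
struct Trx {
  std::uint64_t id;
  std::set<std::string> write_set;  // keys of the rows the transaction changed
};

// True if the two transactions changed at least one common row.
inline bool conflicts(const Trx& a, const Trx& b) {
  for (const std::string& key : a.write_set)
    if (b.write_set.count(key) != 0) return true;
  return false;
}

// Ids of the local transactions that must be rolled back because they
// conflict with the prioritized transaction.
inline std::vector<std::uint64_t> local_trx_to_roll_back(
    const Trx& prioritized, const std::vector<Trx>& local) {
  std::vector<std::uint64_t> victims;
  for (const Trx& t : local) {
    assert(t.id != prioritized.id);  // a local trx is never the prioritized one
    if (conflicts(prioritized, t)) victims.push_back(t.id);
  }
  return victims;
}
```

The checking thread would then ask MySQL to roll back each returned
transaction (see call 6 below).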

MySQL - Replicator Interface 
----------------------------
The interface consists of the following calls and components:

1) Commit requestor: request_commit(trx_id) (MySQL -> Replicator)

   Before MySQL commits a transaction, it sends a request to the
   Replicator with the following information:

   a) Transaction Id

   Replicator replies with Yes or No, indicating if the transaction is
   allowed to commit or not.  

   If "No", then the transaction is rolled back.
   If "Yes", then MySQL can commit the transaction.

   This should be implemented in handler.cc:ha_commit_trans(), after
   2PC prepare and right before the call to ha_commit_one_phase().

   Timeline:

        1             2   3     4   5       6        7         8
    ----|-------------|-------------|-------|--------|---------|--->
          prepare         |     ^     Store   commit   commit   
                          |     |      Xid     SE1      SE2    
                          v     |     
                      Commit Approval
                       (Set seq no)

   NOTE Recovery:
   If the MySQL server crashes between 4 and 5, then the Replicator will
   send the transaction to other nodes, while it is not being
   recovered by this MySQL server.  This means that the Replicator
   needs to ask the server for its last executed xid, and then
   send lost local transactions back to the MySQL server for
   reapplication, as part of the recovery protocol.

   NOTE:
   The DBMS is not allowed to fail after step 5.  The global commit
   has already happened.

   Sample code:

     /* Request commit from replicator; roll back if it answers "No" */
     if (!replicator->request_commit(xid)) {  /* NEW */
       ha_rollback_trans(thd, all);           /* NEW */
       error= 1;                              /* NEW */
       goto end;                              /* NEW */
     }                                        /* NEW */
     error=ha_commit_one_phase(thd, all) ? (cookie ? 2 : 1) : 0;
     DBUG_EXECUTE_IF("crash_commit_before_unlog", abort(););
     if (cookie)
       tc_log->unlog(cookie, xid);
     DBUG_EXECUTE_IF("crash_commit_after", abort(););
   end:
     if (is_real_trans)
       start_waiting_global_read_lock(thd);
     /* Report commit completed to replicator (not after a rollback) */
     if (!error)                              /* NEW */
       replicator->report_commit(xid);        /* NEW */

2) Commit report (MySQL -> Replicator)

   After MySQL has committed the transaction, it must report the 
   transaction id to Replicator.

   See code above.

3) Abort report  (MySQL -> Replicator)

   Add call in:
   handler.cc:ha_rollback_trans(THD *thd, bool all)
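
Calls 1) to 3) could be gathered into one abstract interface, as in the C++
sketch below. request_commit and report_commit follow the names used above.
report_abort, ToyReplicator and commit_trans are hypothetical names introduced
only for this example, and the storage engine work is reduced to a comment.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Calls 1) to 3) as seen from MySQL.
class Replicator {
 public:
  virtual ~Replicator() = default;
  virtual bool request_commit(std::uint64_t trx_id) = 0;  // true means "Yes"
  virtual void report_commit(std::uint64_t trx_id) = 0;
  virtual void report_abort(std::uint64_t trx_id) = 0;
};

// Toy replicator that answers "No" for the transactions listed in vetoed.
class ToyReplicator : public Replicator {
 public:
  std::set<std::uint64_t> vetoed, committed, aborted;

  bool request_commit(std::uint64_t id) override {
    return vetoed.count(id) == 0;
  }
  void report_commit(std::uint64_t id) override {
    assert(vetoed.count(id) == 0);  // a refused transaction must never commit
    committed.insert(id);
  }
  void report_abort(std::uint64_t id) override { aborted.insert(id); }
};

// Stand-in for the commit path: returns 0 on commit, 1 on rollback.
inline int commit_trans(Replicator& r, std::uint64_t trx_id) {
  if (!r.request_commit(trx_id)) {
    r.report_abort(trx_id);  // in the server: from ha_rollback_trans()
    return 1;
  }
  // ... the storage engines would commit here ...
  r.report_commit(trx_id);
  return 0;
}
```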

4) Master: Report rows: log_subscriber::report(event) (MySQL -> Replicator)

   The binlog event will be sent to the subscriber.

   Extractor
   ---------
   Every row that is applied in the handler interface is caught in the
   same way as in row-based replication and reported to the Replicator:

   a) Row being changed/binlog event
   b) Transaction Id

   Alternatives:
   a) Trigger level: Catch rows
   b) Handler level: Catch rows
   c) Binlogging level: Catch binlog events 
      (log.cc:MYSQL_BIN_LOG::write(IO_CACHE*))

   DECISION: Try with c.

   Serialization
   -------------
   Will be implemented using the log_event.cc:write_data_header()
   functions, rewritten so that they do not write to a file but
   keep the data in a buffer.
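
The idea of serializing into a buffer instead of a file is sketched below in
C++. The event layout (type, transaction id, length, row image) is made up
for this example and is not the binlog format. RowEvent and put_le are
hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Buffer = std::vector<unsigned char>;

// Append the low `bytes` bytes of v in little-endian order.
inline void put_le(Buffer& out, std::uint64_t v, int bytes) {
  assert(bytes >= 1 && bytes <= 8);
  for (int i = 0; i < bytes; ++i)
    out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xff));
}

// Hypothetical event with an invented layout.
struct RowEvent {
  std::uint8_t type;
  std::uint64_t trx_id;
  Buffer row_image;

  // Writes into a caller-supplied buffer rather than into a file.
  void write(Buffer& out) const {
    put_le(out, type, 1);
    put_le(out, trx_id, 8);
    put_le(out, row_image.size(), 4);
    out.insert(out.end(), row_image.begin(), row_image.end());
  }
};
```

The filled buffer can then be handed to the subscriber without touching the
disk.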

5) Transaction start (MySQL -> Replicator)

   Called when a transaction is started.

   a) Transaction id

6) Rollback request (Replicator -> MySQL)

   Replicator can at any time call MySQL with a request to rollback an
   ongoing local transaction.  MySQL then needs to roll it back.

7) Slave Executor

   Alternatives:
   a) SQL level - outside client
   b) SQL level - mysql_parse()
   c) Binlog events
   d) Individual rows, execute on handler level

   DECISION: Try with c.

   FILTERING
   ---------
   All events except rows events (3 types) and table map events 
   are discarded. 

   EXECUTION
   ---------
   This would probably mean taking the
   sql_binlog.cc:mysql_client_binlog_statement(THD* thd)
   function, stripping away the base64 handling, and using the
   same method to execute events.
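
The FILTERING rule above can be sketched in C++ as a simple predicate. The
enum is local to this example. It lists only a few event kinds and does not
reproduce the server's event type codes.

```cpp
#include <cassert>
#include <vector>

// Local, simplified list of event kinds for this sketch.
enum class EventType {
  QUERY, TABLE_MAP, WRITE_ROWS, UPDATE_ROWS, DELETE_ROWS, XID
};

// Keep the three rows event types and table map events, discard the rest.
inline bool keep_event(EventType t) {
  switch (t) {
    case EventType::TABLE_MAP:
    case EventType::WRITE_ROWS:
    case EventType::UPDATE_ROWS:
    case EventType::DELETE_ROWS:
      return true;
    default:
      return false;
  }
}

// Returns the events that the slave executor would apply, in order.
inline std::vector<EventType> filter_events(const std::vector<EventType>& in) {
  std::vector<EventType> out;
  for (EventType t : in)
    if (keep_event(t)) out.push_back(t);
  assert(out.size() <= in.size());
  return out;
}
```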

8) Recovery function  si::get_last_executed_transaction()

   At recovery time the Replicator needs to know the last executed
   transaction, so that it can replay any local transactions that
   were lost if the server failed between 4 and 5 above, as well as
   any remote transactions that were not yet applied.

   This is probably some functionality in log.cc for XA recovery.
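
The replay decision could be sketched in C++ as below. It assumes that the
Replicator keeps its transactions ordered by the sequence number set at commit
approval, starting at 1, and that the value returned by
si::get_last_executed_transaction() can be mapped to such a sequence number.
LoggedTrx and to_reapply are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical entry in the Replicator's ordered transaction log.
struct LoggedTrx {
  std::uint64_t seq_no;  // global order, set at commit approval
  bool local;            // true if the transaction originated on this server
};

// Transactions to send back to the server after a crash: everything that is
// ordered after the last transaction the server reports as executed.
inline std::vector<LoggedTrx> to_reapply(const std::vector<LoggedTrx>& log,
                                         std::uint64_t last_executed_seq_no) {
  std::vector<LoggedTrx> out;
  std::uint64_t prev = 0;
  for (const LoggedTrx& t : log) {
    assert(t.seq_no > prev);  // the log must be strictly ascending
    prev = t.seq_no;
    if (t.seq_no > last_executed_seq_no) out.push_back(t);
  }
  return out;
}
```

The result covers both cases named above: lost local transactions and remote
transactions that were not yet applied.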

