MongoDB 4.4 复制机制变化 —— oplogTruncateAfterPoint 在主库上的应用

背景

在内部针对 MongoDB 4.4 的测试中，我们发现在带写压力的情况下对主库进行 unclean shutdown （kill -9），主库在重启之后少了很多数据，通过分析日志可以看到在加载完 stable checkpoint，进行 replication recovery，即 apply oplog 之前，有很多在 oplogTruncateAfterPoint 之后的 oplog 被 truncate 掉了，导致只有很少的 oplog 被加载。

{"t":{"$date":"2021-01-27T16:40:22.601+08:00"},"s":"I",  "c":"REPL",     "id":21557,   "ctx":"SwitchWarmStandbyThread","msg":"Removing unapplied oplog entries after oplogTruncateAfterPoint","attr":{"oplogTruncateAfterPoint":{"":{"$timestamp":{"t":1611736805,"i":1799}}}}}
{"t":{"$date":"2021-01-27T16:40:22.651+08:00"},"s":"I",  "c":"REPL",     "id":21553,   "ctx":"SwitchWarmStandbyThread","msg":"Truncating oplog from truncateAfterOplogEntryTimestamp (non-inclusive)","attr":{"truncateAfterOplogEntryTimestamp":{"$timestamp":{"t":1611736805,"i":1799}},"oplogTruncateAfterPoint":{"$timestamp":{"t":1611736805,"i":1799}}}}
{"t":{"$date":"2021-01-27T16:40:22.656+08:00"},"s":"I",  "c":"REPL",     "id":21554,   "ctx":"SwitchWarmStandbyThread","msg":"Replication recovery oplog truncation finished","attr":{"durationMillis":55}}
{"t":{"$date":"2021-01-27T16:40:22.657+08:00"},"s":"I",  "c":"REPL",     "id":21544,   "ctx":"SwitchWarmStandbyThread","msg":"Recovering from stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1611736795,"i":1259}},"topOfOplog":{"ts":{"$timestamp":{"t":1611736805,"i":1799}},"t":1},"appliedThrough":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"oplogTruncateAfterPoint":{"$timestamp":{"t":0,"i":0}}}}
...
{"t":{"$date":"2021-01-27T16:40:25.860+08:00"},"s":"I",  "c":"REPL",     "id":21536,   "ctx":"SwitchWarmStandbyThread","msg":"Completed oplog application for recovery","attr":{"numOpsApplied":14901,"numBatches":19,"applyThroughOpTime":{"ts":{"$timestamp":{"t":1611736805,"i":1799}},"t":1}}}

在 4.2 及之前的版本中，oplogTruncateAfterPoint 只用于备库，用来保证 oplog batch 应用的原子性，最终目的是确保主备数据的一致性，但是在 4.4 中主库也会维护 oplogTruncateAfterPoint ，而这个问题根本原因就是由 oplogTruncateAfterPoint 在 4.4 中的作用变化导致，下面简要分析一下。

分析

在 4.2 中 oplogTruncateAfterPoint 只有备库会维护，用来保证 oplog batch write 的原子性，主库是不需要维护的（有 replset.oplogTruncateAfterPoint集合，但是对应的 value 总是TImestamp(0,0)空值），所以主库在 replication recovery 时不会进行 oplog 的 truncate。

但是在 4.4 中，主库也会维护 oplogTruncateAfterPoint，这个做的原因是和 oplog hole 相关。

oplog hole 是因为事务在 WiredTiger 层面的提交顺序和 oplog 顺序不一致导致，复制时如果要保证主备一致性需要避开这个 hole，这个大家应该都比较熟悉了，不做展开。

但是我们考虑一个场景，op1，op2，op3 三条 oplog，对应三个事务，在 WT 层面的提交顺序为 op1，op3，op2，那么当 op2 也成功提交后，1~3 这三条 oplog 对应备库来说显然是 no hole，可以放心复制。但是即使 no hole，我们也需要需要区分是 in-memory no hole ，还是 on-disk no hole 。

因为 op2 对应的 WAL 在 unclean shutdown 的情况下，因为在尾部是可能丢失的。如果备库在 in-memory no hole 的条件下就对 op 1~3 做了复制，那么在主库在重启的时候，不管是继续做主还是降级为从，都会和副本集其他成员数据不一致（缺失了 op2）。

在分析 4.2 和 4.4 是怎么避免上面这个问题之前，我们需要先了解下 in-memory no hole 是怎么维护的，这个在 4.2 和 4.4 中的方式是一样的。

在更早的版本，比如 3.2，3.4 ，是通过在 server 层自己维护一个队列来跟踪 oplog 对应的事务的提交情况来确定到哪个点是 no hole 的。由于在 4.0 之后，server 层都使用了带时间戳的事务，而这个时间戳实际上是 oplog 中的 ts 字段（ PS：这个时间戳在事务开始前就申请好了），所以可以依靠引擎层（WT）来告知我们截止到哪个时间点之前的事务都提交了，这个事情对于引擎来说很容易做到，也更应该由它们来做。这个时间点在 WT 中被称之为 all_durable ，但是千万不要被这个坑爹的 “durable” 给欺骗了，它并不是真正的说事务被持久化了，见这段注释，

/**
 * Manages oplog visibility.
 *
 * On demand, queries WiredTiger's all_durable timestamp value and updates the oplog read timestamp.
 * This is done asynchronously on a thread that startVisibilityThread() will set up.
 *
 * The WT all_durable timestamp is the in-memory timestamp behind which there are no oplog holes
 * in-memory. Note, all_durable is the timestamp that has no holes in-memory, which may NOT be
 * the case on disk, despite 'durable' in the name.
 *
 * The oplog read timestamp is used to read from the oplog with forward cursors, in order to ensure
 * readers never see 'holes' in the oplog and thereby miss data that was not yet committed when
 * scanning passed. Out-of-order primary writes allow writes with later timestamps to be committed
 * before writes assigned earlier timestamps, creating oplog 'holes'.
 */

4.2 和 4.4 正是依靠从 WT 查询 all_durable timestamp 来获得 in-memory no hole point，并设置 oplogReadTimestamp，所有需要需要读 opog 的场景（复制，changeStream 等）都需要基于 oplogReadTimestamp 来读，oplogReadTimestamp 之后的 oplog 对读者不可见（可能有 hole(坑) :-)）。

既然从 WT 拿到的是 in-memory no hole 的时间戳，4.2 是如何避免上面那个 unclean shutdown 导致的问题的？4.2 在从 WT 获取到 all_durable ts 之后，还会等待这个 ts 之前的事务全部都持久化（WAL 刷盘）才会设置新的 oplogReadTimestamp，

// 4.2: src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp
        const uint64_t newTimestamp = fetchAllDurableValue(sessionCache->conn());

        // The newTimestamp may actually go backward during secondary batch application,
        // where we commit data file changes separately from oplog changes, so ignore
        // a non-incrementing timestamp.
        if (newTimestamp <= _oplogReadTimestamp.load()) {
            LOG(2) << "No new oplog entries were made visible: " << Timestamp(newTimestamp);
            continue;
        }

        // In order to avoid oplog holes after an unclean shutdown, we must ensure this proposed
        // oplog read timestamp's documents are durable before publishing that timestamp.
        sessionCache->waitUntilDurable(/*forceCheckpoint=*/false, false); // 这里显式等待

这个方式实际上是杜绝了备库可能复制到比主库更多数据的可能，但是这么做的缺点是会导致复制延迟的增加（毕竟需要等到事务提交并且落盘才能复制到备库），而 unclean shutdown 显然是一种很不常见的 case，不应该因为这个影响通用场景。所以在 4.4 中对这里做了优化，主库更新 oplogReadTimestamp 不再调用 waitUntilDurable 等待刷盘，而是 WAL 刷盘的过程中，异步的去更新 oplogTruncateAfterPoint，主库上的 oplogTruncateAfterPoint 表示这之前的 oplog 是 on-disk no hole 的。

// 4.4: src/mongo/db/repl/replication_coordinator_external_state_impl.cpp
JournalListener::Token ReplicationCoordinatorExternalStateImpl::getToken(OperationContext* opCtx) {
    // If in state PRIMARY, the oplogTruncateAfterPoint must be used for the Durable timestamp in
    // order to avoid majority confirming any writes that could later be truncated.
    //
    // TODO (SERVER-45847): temporary hack for the ephemeral storage engine that passes in a nullptr
    // for the opCtx. The ephemeral engine does not do parallel writes to cause oplog holes,
    // therefore it is safe to skip updating the oplogTruncateAfterPoint that tracks oplog holes.
    if (MONGO_likely(opCtx)) {
        auto truncatePoint = repl::ReplicationProcess::get(opCtx)
                                 ->getConsistencyMarkers()
                                 ->refreshOplogTruncateAfterPointIfPrimary(opCtx); // 更新主库 oplogTruncateAfterPoint
        if (truncatePoint) {
            return truncatePoint.get();
        }
    }

    // All other repl states use the last applied.
    return repl::ReplicationCoordinator::get(_service)->getMyLastAppliedOpTimeAndWallTime();
}

在真的出现 unclean shutdown 时，主库重启时会把 oplogTruncateAfterPoint 之后的 oplog 清理掉，避免出现和副本集其他成员不一致，而自己还不感知的情况。

至此就分析清楚了，最开始问题的原因，因为测试过程中总是 kill -9 的（unclean shutdown），在带写压力的情况下切换，oplogTruncateAfterPoint 肯定是没有更新到最新的 oplog ts，导致新主库部分 oplog 没有加载。

验证：通过 kill（clean shutdown），而不是 kill -9 来触发切换，可以看到数据不会变少。

影响

初看之下，主库允许备库复制自己还没有完全持久化的 oplog，似乎副本集的持久化能力降低了，但仔细思考一下，持久化能力实际是增强了。

考虑一下上面那个 op 1-3 的例子，op2 最后提交，如果说在 op2 持久化之前，主库 crash 了，在 4.2 里面，新的主库实际上是没有 op1-3 的数据的（虽然对用户返回写成功了），但是在 4.4 里面因为不等 op2 持久化就复制，那么新主库有这个数据。在分布式环境下，复制也是持久化能力的一部分，更低的复制延迟就意味着在整个分布式系统层面更好的持久性。

背景

分析

影响

Recommend

成人“哄睡”这门生意

深度拆解贾跃亭57页造车PPT：恒大是最大股东？

Netflix之后，如何用Spring Cloud 新组件构建微服务架构？

在 WebAssembly 中实现回调的方式

RocketMQ 常用消息类型

独家｜前菜鸟网络CTO谷雪梅加入网约车平台「T3出行」，担任首席科学家一职

程序员必备的 10 大 GitHub 仓库，前端占了 7 个！

智能汽车“增量部件”争夺战：特斯拉的“弃子”激光雷达，缘何成为华为、小鹏、蔚来的“香...

仅使用 HTML 和 CSS 实现的标签云效果

Elasticsearch简史：源自给老婆开发烹饪App

About Joyk