
MongoDB Oplog Stones Implementation Analysis

source link: https://mongoing.com/archives/75500

This article is based on the 4.2 source code.

Background

The Oplog collection is, first of all, implemented as a capped collection. But a plain capped collection checks the collection size after every write operation, and if it exceeds the configured maximum, it synchronously deletes documents.

The deletion procedure is roughly:

  • Compute: position an end cursor at the head of the capped collection, then repeatedly call cursor->next(), accumulating value sizes to determine how many documents need to be deleted
  • Delete:
    • If the number of documents to delete is 3 or fewer, delete them by calling cursor->remove() in a loop
    • Otherwise, set up two cursors, start and end, with start positioned at the head of the capped collection, and call session->truncate(session, start, end) to delete the documents in bulk; session->truncate() ultimately calls cursor->remove() as well

For the details, see: WiredTigerRecordStore::_cappedDeleteAsNeeded_inlock
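The flow above can be sketched against a simple in-memory stand-in for a capped collection (all names here are illustrative; in the real code the scan and the removal both go through WiredTiger cursors):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy record: a monotonically increasing key plus an on-disk size.
struct Record {
    int64_t key;
    size_t size;
};

// Phase 1 ("compute"): walk from the head, accumulating sizes, until the
// remaining data would fit under maxSize; returns how many leading docs to drop.
size_t docsToDelete(const std::vector<Record>& coll, size_t maxSize) {
    size_t total = 0;
    for (const auto& r : coll) total += r.size;
    size_t n = 0, freed = 0;
    while (n < coll.size() && total - freed > maxSize) {
        freed += coll[n].size;
        ++n;
    }
    return n;
}

// Phase 2 ("delete"): few docs -> per-document cursor->remove(); many docs ->
// one session->truncate(start, end) over the range. In this toy model both
// branches reduce to the same erase, which matches the observation that
// truncate ultimately calls cursor->remove() anyway.
void cappedDeleteAsNeeded(std::vector<Record>& coll, size_t maxSize) {
    const size_t n = docsToDelete(coll, maxSize);
    if (n == 0) return;
    if (n <= 3) {
        coll.erase(coll.begin(), coll.begin() + n);  // per-document remove
    } else {
        coll.erase(coll.begin(), coll.begin() + n);  // range truncate
    }
}
```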

As you can see, determining how many documents to delete is the expensive part: calling cursor->next() to tally value sizes is really just a sequence of reads. For the oplog collection, every user write is recorded in the oplog, so once the oplog is full, every user write triggers a synchronous delete pass, which is clearly inefficient.

MongoDB therefore solves this problem with a strategy of marking truncate points and then deleting in batches.

Oplog Stones Initialization

A so-called Oplog Stone is simply a logical region of the oplog delimited by truncate points. A truncate point is essentially the ts field of an oplog entry, which also serves as the RecordId, i.e., the key of the corresponding record in the WiredTiger table.
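The ts-to-RecordId correspondence is the usual 64-bit packing, sketched below (the debug patch later in this article relies on exactly this when it reconstructs a Timestamp from record->id.repr()):

```cpp
#include <cassert>
#include <cstdint>

// Pack a timestamp (seconds, increment) into the signed 64-bit RecordId key:
// high 32 bits = seconds since epoch, low 32 bits = ordinal within that second.
int64_t recordIdFromTs(uint32_t secs, uint32_t inc) {
    return static_cast<int64_t>((static_cast<uint64_t>(secs) << 32) | inc);
}

uint32_t secsFromRecordId(int64_t id) {
    return static_cast<uint32_t>(static_cast<uint64_t>(id) >> 32);
}

uint32_t incFromRecordId(int64_t id) {
    return static_cast<uint32_t>(static_cast<uint64_t>(id) & 0xFFFFFFFFu);
}
```

Because the packing preserves ts order, key order in the WiredTiger table is exactly oplog time order, which is what makes range truncation by ts possible.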

MongoDB does not persist Oplog Stone information; instead, it re-initializes the stones on every restart.

The overall initialization logic is fairly simple. First, the number of stones is computed from cappedMaxSize:

    // The minimum oplog stone size should be BSONObjMaxInternalSize.
    const unsigned int oplogStoneSize =
        std::max(gOplogStoneSizeMB * 1024 * 1024, BSONObjMaxInternalSize);

    // IDL does not support unsigned long long types.
    const unsigned long long kMinStonesToKeep = static_cast<unsigned long long>(gMinOplogStones);
    const unsigned long long kMaxStonesToKeep =
        static_cast<unsigned long long>(gMaxOplogStonesDuringStartup);

    unsigned long long numStones = maxSize / oplogStoneSize;
    size_t numStonesToKeep = std::min(kMaxStonesToKeep, std::max(kMinStonesToKeep, numStones));

Several defaults feed into this computation. The key takeaways: there are at most 100 oplog stones and at least 10, and if oplogSizeMB is configured above 2 GB, you essentially end up with 100 stones by default. (This baseline is derived from the capped collection's configured maximum size; if the collection is not yet full, the count is scaled from its actual size and there will be fewer stones, but in any case the cap is 100.)
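A self-contained sketch of the computation above, with the minimum and maximum stone counts hard-coded to the defaults described in the text (10 and 100) and the per-stone size passed in as a parameter rather than derived from server parameters:

```cpp
#include <algorithm>
#include <cassert>

// Mirrors the snippet above: clamp maxSize / stoneSize into [kMin, kMax].
// kMin/kMax stand in for gMinOplogStones and gMaxOplogStonesDuringStartup.
unsigned long long stonesToKeep(unsigned long long maxSizeBytes,
                                unsigned long long stoneSizeBytes) {
    const unsigned long long kMinStonesToKeep = 10;
    const unsigned long long kMaxStonesToKeep = 100;
    const unsigned long long numStones = maxSizeBytes / stoneSizeBytes;
    return std::min(kMaxStonesToKeep, std::max(kMinStonesToKeep, numStones));
}
```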

Once the number of stones is determined, the next step is to establish each stone's boundary. There are two cases:

  • If the oplog collection currently holds too few records, fewer than 20 times the required sample count (with 100 stones, each sampled 10 times by default, that threshold is 20,000 records), the boundaries are determined by a full table scan. The logic is simple: accumulate the size of each scanned oplog entry, and every time the running total exceeds the per-stone size limit, emit and save a stone, until the scan finishes. (Code: WiredTigerRecordStore::OplogStones::_calculateStonesByScanning)
  • Otherwise, a full scan would be too inefficient. MongoDB instead uses the random cursor provided by WiredTiger to sample the collection and quickly determine each stone's boundary. (Code: WiredTigerRecordStore::OplogStones::_calculateStonesBySampling)
    1. In principle, sampling 100 times for 100 stones would seem to be enough to set all the boundaries, but to keep the boundaries as accurate as possible, MongoDB oversamples: each stone gets 10 samples (by default), so 100 stones means 1000 samples. The samples are then sorted by key (opTime), and each stone takes the last of its 10 samples as its boundary; counting from 0, samples 9, 19, 29, ... are, in order, the boundaries of the corresponding stones.
    2. In later versions, MongoDB also adopted an optimized random cursor for more accurate sampling: the next_random_sample_size option tells WT how many random samples will be taken, and WT then divides the underlying table into next_random_sample_size segments and draws one sample from each. With that partitioning, every 10th sample falls much closer to the intended stone boundary.
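The sort-and-pick-every-Nth boundary selection from step 1 can be sketched as follows (illustrative code, using plain 64-bit keys in place of RecordIds):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort the sampled keys, then take every samplesPerStone-th one (0-based
// indices 9, 19, 29, ... for the default of 10) as a stone boundary.
std::vector<int64_t> stoneBoundaries(std::vector<int64_t> samples,
                                     size_t samplesPerStone) {
    std::sort(samples.begin(), samples.end());
    std::vector<int64_t> bounds;
    for (size_t i = samplesPerStone - 1; i < samples.size(); i += samplesPerStone)
        bounds.push_back(samples[i]);
    return bounds;
}
```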

That is how oplog stones are initialized at process startup. As new writes arrive, new oplog stones are created, and their sizes are exact, because at write time it is easy to track precisely how much data has gone into the latest stone. So even if imprecise boundaries at startup cause slightly too much or too little oplog to be deleted, it is not a big problem; the error is washed out by subsequent updates.
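A sketch of that write-path bookkeeping (all types and names here are illustrative, loosely modeled on WiredTigerRecordStore::OplogStones; the per-stone byte budget is a constructor parameter):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct Stone {
    int64_t lastRecord;   // boundary key (ts) of the sealed stone
    long long records;
    long long bytes;
};

struct OplogStones {
    explicit OplogStones(long long minBytes) : minBytesPerStone(minBytes) {}

    long long minBytesPerStone;
    long long currentRecords = 0;
    long long currentBytes = 0;
    std::deque<Stone> stones;

    // Called after each oplog insert. The counters for the in-progress stone
    // are exact, so stones created at runtime have precise sizes, unlike the
    // sampled boundaries computed at startup.
    void updateAfterInsert(int64_t lastRecord, long long bytesInserted) {
        currentRecords += 1;
        currentBytes += bytesInserted;
        if (currentBytes >= minBytesPerStone) {
            stones.push_back({lastRecord, currentRecords, currentBytes});
            currentRecords = 0;
            currentBytes = 0;
        }
    }
};
```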

Oplog Reclamation

With oplog stones in place, reclaiming the oplog no longer requires computing the number of documents to delete and deleting them synchronously on every write. Instead, when the current oplog stone fills up and a new one is created, the oldest oplog stone is simply dropped.

Deleting by stone typically removes a fairly large number of documents at once, so the oplog deletion work is handed to a background OplogTruncaterThread. It calls session->truncate directly, using the oldest oplog stone's boundary as the upper bound of the range truncate.

// WiredTigerRecordStore::reclaimOplog
setKey(cursor, stone->lastRecord);
invariantWTOK(session->truncate(session, nullptr, nullptr, cursor, nullptr));
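Modeling the oplog as an ordered map, the reclaim step amounts to popping the oldest stone and range-deleting every record up to and including its lastRecord (a toy model of the truncate above, not the actual WiredTiger API):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <string>

// The oplog as an ordered key -> entry map; stones as a queue of boundary
// keys, oldest first. Reclaiming drops the oldest stone and erases every
// record with key <= its lastRecord, mirroring a range truncate whose stop
// cursor is positioned at the stone boundary.
void reclaimOplog(std::map<int64_t, std::string>& oplog,
                  std::deque<int64_t>& stoneEnds) {
    if (stoneEnds.empty()) return;
    const int64_t lastRecord = stoneEnds.front();
    stoneEnds.pop_front();
    oplog.erase(oplog.begin(), oplog.upper_bound(lastRecord));  // inclusive
}
```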

Deletion also has to account for the stable checkpoint's dependency on the oplog; that logic will be analyzed in a later post.

Time Cost Analysis of Oplog Stones Initialization

From the code alone, it is clear that the vast majority of the initialization time goes into random cursor sampling; the other steps are simple in-memory computation that finishes within a few milliseconds. The total initialization time is proportional to the number of samples, which is capped at 1000, so initialization time has an upper bound and does not grow without limit with the size of the oplog collection.

To verify this conclusion, let's construct a test scenario.

First, build the data set. Create replica set 1 and generate data with the benchmark tool built into the mongo shell:

benchRun(
  {
  "host": "127.0.0.1:9111",
  "ops": [
    {
      "ns": "test.foo",
      "op": "insert",
      "doc": {
        "x": {
          "#RAND_STRING": [
            64
          ]
        },
        "y": {
          "#RAND_STRING": [
            128
          ]
        }
      }
    }
  ],
  "parallel": 16,
  "seconds": $seconds
  }
)

This generates roughly 150 million oplog entries, 33 GB (that is the compressed size; the raw data is around 50 GB):

xdjmgset-dbfs1:PRIMARY> show dbs
admin    0.000GB
config   0.000GB
local   33.363GB
test    35.284GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
150531637

Some logging was added to the code to measure the total random-cursor time as well as the cost of each individual random sample:

diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
index f2c3d1c220..7f029b788d 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
@@ -460,8 +460,14 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
     // each logical section.
     auto cursor = _rs->getRandomCursorWithOptions(opCtx, extraConfig);
     std::vector<RecordId> oplogEstimates;
+    const std::uint64_t startWaitTime = curTimeMicros64();
     for (int i = 0; i < numSamples; ++i) {
+        const std::uint64_t startWaitTime = curTimeMicros64();
         auto record = cursor->next();
+        auto waitTime = curTimeMicros64() - startWaitTime;
+        LOG(1) << "WT cursor random sample " << i << ", "
+               << Timestamp(record->id.repr()).toStringPretty() << ", took "
+               << waitTime / 1000.0 << "ms";
         if (!record) {
             // This shouldn't really happen unless the size storer values are far off from reality.
             // The collection is probably empty, but fall back to scanning the oplog just in case.
@@ -471,6 +477,8 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
         }
         oplogEstimates.push_back(record->id);
     }
+    auto waitTime = curTimeMicros64() - startWaitTime;
+    LOG(1) << "WT cursor random sampling total took " << waitTime/1000.0 << "ms";
     std::sort(oplogEstimates.begin(), oplogEstimates.end());

     for (int i = 1; i <= wholeStones; ++i) {

Restart replica set 1 and watch the log:

2020-10-27T15:34:09.058+0800 I  STORAGE  [initandlisten] Taking 755 samples and assuming that each section of oplog contains approximately 1991955 records totaling to 687194924 bytes
2020-10-27T15:34:09.058+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Oct 26 19:54:57:7347, took 0.415ms
2020-10-27T15:34:09.063+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Oct 26 19:57:47:6215, took 4.488ms
2020-10-27T15:34:09.067+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Oct 26 15:37:47:1030, took 4.608ms
2020-10-27T15:34:09.072+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Oct 26 15:44:15:4619, took 4.471ms
2020-10-27T15:34:09.076+0800 D1 STORAGE  [initandlisten] WT cursor random sample 4, Oct 26 15:46:51:2640, took 4.597ms
2020-10-27T15:34:09.081+0800 D1 STORAGE  [initandlisten] WT cursor random sample 5, Oct 26 15:49:22:10335, took 4.556ms
2020-10-27T15:34:09.086+0800 D1 STORAGE  [initandlisten] WT cursor random sample 6, Oct 26 15:52:03:10684, took 4.746ms
2020-10-27T15:34:09.090+0800 D1 STORAGE  [initandlisten] WT cursor random sample 7, Oct 26 15:54:14:4494, took 4.586ms
2020-10-27T15:34:09.095+0800 D1 STORAGE  [initandlisten] WT cursor random sample 8, Oct 26 15:56:46:1960, took 4.889ms
2020-10-27T15:34:09.100+0800 D1 STORAGE  [initandlisten] WT cursor random sample 9, Oct 26 15:59:18:7246, took 4.695ms
2020-10-27T15:34:09.105+0800 D1 STORAGE  [initandlisten] WT cursor random sample 10, Oct 26 16:02:05:4727, took 4.895ms
2020-10-27T15:34:09.110+0800 D1 STORAGE  [initandlisten] WT cursor random sample 11, Oct 26 16:04:30:5742, took 4.673ms
2020-10-27T15:34:09.115+0800 D1 STORAGE  [initandlisten] WT cursor random sample 12, Oct 26 16:06:45:1917, took 4.881ms
2020-10-27T15:34:09.119+0800 D1 STORAGE  [initandlisten] WT cursor random sample 13, Oct 26 16:08:50:5188, took 4.786ms
2020-10-27T15:34:09.124+0800 D1 STORAGE  [initandlisten] WT cursor random sample 14, Oct 26 16:11:13:7634, took 4.449ms
2020-10-27T15:34:09.129+0800 D1 STORAGE  [initandlisten] WT cursor random sample 15, Oct 26 16:13:25:6775, took 5.204ms
...
2020-10-27T15:34:12.463+0800 D1 STORAGE  [initandlisten] WT cursor random sample 752, Oct 26 15:56:15:232, took 4.923ms
2020-10-27T15:34:12.467+0800 D1 STORAGE  [initandlisten] WT cursor random sample 753, Oct 26 15:58:47:1953, took 4.399ms
2020-10-27T15:34:12.472+0800 D1 STORAGE  [initandlisten] WT cursor random sample 754, Oct 26 16:01:28:5317, took 4.598ms
2020-10-27T15:34:12.472+0800 D1 STORAGE  [initandlisten] WT cursor random sampling total took 3414.51ms

As shown, this instance took 755 samples for a total of 3414 ms, with each individual sample fairly steady at 4–5 ms.

MongoDB's serverStatus command also exposes a section showing the total time spent initializing oplog stones at startup, along with the initialization method used (scan or sampling):

xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(3418164),
        "processingMethod" : "sampling",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}

As you can see, everything outside the sampling took about 4 ms, less than 1% of the total.

To verify that the initialization time is proportional to the number of samples, another data set was built the same way: 25 GB, 113 million oplog entries:

xdjmgset-dbfs1:PRIMARY> show dbs
admin    0.000GB
config   0.000GB
local   25.145GB
test    26.517GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
113211477

Check the log output after restarting:

2020-10-27T15:43:02.121+0800 I  STORAGE  [initandlisten] Taking 568 samples and assuming that each section of oplog contains approximately 1991875 records totaling to 687195044 bytes
2020-10-27T15:43:02.121+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Oct 27 12:33:29:5201, took 0.216ms
2020-10-27T15:43:02.125+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Oct 27 12:36:06:5577, took 4.489ms
2020-10-27T15:43:02.130+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Oct 27 12:38:30:1191, took 4.417ms
2020-10-27T15:43:02.134+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Oct 27 12:40:51:1654, took 4.526ms
2020-10-27T15:43:02.139+0800 D1 STORAGE  [initandlisten] WT cursor random sample 4, Oct 27 12:43:12:9085, took 4.51ms
2020-10-27T15:43:02.144+0800 D1 STORAGE  [initandlisten] WT cursor random sample 5, Oct 27 12:45:36:3523, took 4.465ms
2020-10-27T15:43:02.148+0800 D1 STORAGE  [initandlisten] WT cursor random sample 6, Oct 27 12:48:09:6883, took 4.63ms
2020-10-27T15:43:02.153+0800 D1 STORAGE  [initandlisten] WT cursor random sample 7, Oct 27 12:50:09:6716, took 4.484ms
2020-10-27T15:43:02.157+0800 D1 STORAGE  [initandlisten] WT cursor random sample 8, Oct 27 12:52:24:1495, took 4.531ms
2020-10-27T15:43:02.162+0800 D1 STORAGE  [initandlisten] WT cursor random sample 9, Oct 27 12:54:39:3871, took 4.705ms
2020-10-27T15:43:02.167+0800 D1 STORAGE  [initandlisten] WT cursor random sample 10, Oct 27 12:57:15:3946, took 4.661ms
2020-10-27T15:43:02.171+0800 D1 STORAGE  [initandlisten] WT cursor random sample 11, Oct 27 12:59:36:5033, took 4.74ms
2020-10-27T15:43:02.176+0800 D1 STORAGE  [initandlisten] WT cursor random sample 12, Oct 27 13:01:52:6908, took 4.424ms
2020-10-27T15:43:02.181+0800 D1 STORAGE  [initandlisten] WT cursor random sample 13, Oct 27 13:04:22:2838, took 4.637ms
2020-10-27T15:43:02.186+0800 D1 STORAGE  [initandlisten] WT cursor random sample 14, Oct 27 13:06:42:6574, took 5.21ms
...
2020-10-27T15:43:04.771+0800 D1 STORAGE  [initandlisten] WT cursor random sample 567, Oct 27 12:17:32:2820, took 4.397ms
2020-10-27T15:43:04.771+0800 D1 STORAGE  [initandlisten] WT cursor random sampling total took 2650.65ms

This run took 568 samples, for a total of 2650 ms, and 2650 ms is close to (568.0/755) × 3414 ms = 2568 ms, so the conclusion that the time is proportional to the sample count holds.

In summary, given that a single random-cursor sample costs about 4–5 ms and the sample count is capped at 1000, the upper bound on oplog stones initialization time is around 5 s (on NVMe SSD).

