SOSP 2021 Lightweight Formal Methods Validate a Key-Value Storage in Amazon S3

O ever youthful,O ever weeping

Introduce

SOSP 2021的Best paper之一，介绍Amazon采用一种“草量级” formal methods验证S3的key-value存储系统。之前没有研究过formal methdos，最多的认知就停留在lamport的TLA+，通过形式化的语言描述系统的状态和状态转移，并验证系统的状态和属性是否满足期望，类似Paxos、Raft等共识算法，可以通过这种方法验证算法理论上的正确性。但是软件系统是很多模块和代码完美配合出来的，complex and frequntly changing，算法正确，系统的实现未必正确。为了更好地验证系统的正确性，Amazon抛弃了传统formal methods完整的形式验证，提出了一个 “草量级” formal methods，使得其可以更容易自动化的使用在日常的测试开发中，保证kv存储系统的正确性。

为什么 “lightweight” 是 ”草量级“，因为最近关注UFC 草量级世界冠军二番战

Lightweight Formal Methods

（1）Executable reference models as specifications

例如：实际的k-v存储系统可能是一个很复杂的LSM tree结构，但是对标的reference model，可以简单使用内存的hashmap实现即可，保证k-v storage完全具有相同语义的API-level接口，那么验证系统的specifications（或者说properties），就变成了比较reference models和实际k-v storage的状态是否一致就可以了。其实这个思想其实非常朴素，我想大家实际测试验证过程中可能都用过。

（2）Automated tools to check implementations against models

通过Property-Based Testing自动化的生成测试inputs，并且验证系统实现和model是否一致

（3）Coverage tools to track effectiveness over time

系统和真实场景往往是复杂的，而且软件系统在不停的变更，如何保证测试尽可能覆盖绝大数的场景，除了采用一些Argument bias的方式尽可能完善我们的测试case之外，还会引入code coverage metrics来发现一些盲点，这一点很重要。

通过放弃formal method完整的形式化验证，使得相关工作变得更加容易在日常开发中开展。

为了更好的验证k-v存储系统的持久性和一致性，具体实践的时候分3种情况分别进行验证：

（1）Sequential correctness: refinement of the reference model

第一种，没有并发的顺序操作，不考虑故障，直接验证和reference model是否一致

（2）For sequential crashing executions：refinement against a weaker reference model

第二种，没有并发的顺序操作，考虑故障，正确性验证还是和reference model对比，增加的fault injection主要考虑了三类：

Fail-stop crashes (e.g., power outages, kernel panics)
Transient or permanent disk IO failures (e.g., HDD failures, timeouts)
Resource exhaustion(e.g.,out of memory or diskspace)

随着混沌工程的普及，fault injection现在已经深入人心，毕竟确实是分布式系统测试验证的利器

（3）For concurrent crash-free executions, we write separate reference models and check linearizability

第三种，并发操作，不考虑故障，采用一种stateless mode checking的方法来验证并发情况下的linearizability，没研究过stateless model checking，这里先记个TODO。

对于并发场景的验证是一个强需求，但也确实是一个难点，因为一旦并发，验证就是一个难题了。类似的，Raft线性一致性验证工具Jepsen就是个例子，Client发起并发读写，并且记录下所有op history，底层分布式注入各种crash，最后验证history是否可线性化来验证系统是否满足线性一致性，这个方法优势很明显，端到端，自动化，支持并发，支持故障注入，但是有一个致命的缺陷，op history验证是否可线性化是一个NP难的问题，线性时间无解，所以测试的时间不能太长，太长op history就会很大，正确性验证所需要的时间就不能收敛了。

Summary

因为要兼得高性能、高可用和高可靠三者，分布式存储系统自然也就成为了最难搞的系统，良好的顶层设计和开发实践是基础，完善科学的测试验证体系也同样“一个都不能少”，否则亦是浮沙筑高台。最后用论文Lessons Learned三个关键字总结对于测试验证在分布式系统中实施的最佳实践吧：Early、Continuous、Future。

（其实对formal methods、stateless model checking不懂，所以可能对很多描述理解上有偏差和错误，大家凑合着看，放这里只是简单整理记录下）

Notes

限于作者水平，难免有理解和描述上有疏漏或者错误的地方，欢迎共同交流；部分参考已经在正文和参考文献中列表注明，但仍有可能有疏漏的地方，有任何侵权或者不明确的地方，欢迎指出，必定及时更正或者删除；文章供于学习交流，转载注明出处

References

[1]. Bornholt J, Joshi R, Astrauskas V, et al. Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3[C]//Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles CD-ROM. 2021: 836-850.

SOSP 2021 Lightweight Formal Methods Validate a Key-Value Storage in Amazon S3

SOSP 2021 Lightweight Formal Methods Validate a Key-Value Storage in Amazon S3

Introduce

Lightweight Formal Methods

Summary

References

Recommend

Improve UDP performance in RHEL 8.5

Microsoft GitHub CEO Nat Friedman replaced by Thomas Dohmke

团购类网站倒计时的js实现

ByteDance founder Zhang Yiming steps down as chairman amid reshuffle

【链向FM】Huobi Global已支持SubDAO竞拍波卡平行链插槽,DappRadar集成Axie Infinity...

OYO，孙正义的另外一个烂摊子

Bill Gates doubts aim to limit global warming to 1.5 degrees achievable

TSMC, Samsung and Intel have a huge carbon footprint

Realme GT Neo2 reaches Thailand and Malaysia

Sony puts out more videos showcasing the camera prowess on the Xperia Pro-I

About Joyk