pmem.io: KV-store improved & measured


As promised in the previous post about the kv-store implementation, I'm back with new results after implementing the optimizations I devised a month ago. As a bonus, I implemented a red-black tree to allow a fair comparison between two data structures that allocate a similar number of nodes.

tl;dr: I was right about crit-bit :)

Test platform

The same server was used to run the benchmarks, but with the latest 4.2 kernel, which contains numerous DAX improvements.

What changed?

- Various allocator performance improvements
  - Mostly small things here and there
  - Improved multithreaded scaling (linear now!)
- Cache for small memory blocks added to the transaction
  - The cache is reused across transactions, meaning less work to start a transaction
- Range tree to detect and discard overlapping transaction memory ranges
  - pmemobj_tx_add_range_direct behaves exactly like pmemobj_tx_add_range
- Crit-bit leaf nodes are now embedded into internal ones
  - Halves the number of allocations
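
To make the overlapping-range item more concrete, here is a minimal sketch of the idea: remember which byte ranges have already been added to the transaction, and only treat the not-yet-covered part of a new range as work. This is not the libpmemobj implementation. The post says a range tree is used; the sketch swaps in a small sorted array for brevity, and the names `range_set`, `range_set_add` and `MAX_RANGES` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 64 /* hypothetical fixed capacity, enough for a sketch */

struct range {
    size_t begin; /* inclusive */
    size_t end;   /* exclusive */
};

/* Sorted, pairwise disjoint (and non-touching) ranges already snapshotted. */
struct range_set {
    struct range r[MAX_RANGES];
    size_t n;
};

/*
 * Record [off, off + size) and return how many of its bytes were NOT
 * covered before, i.e. how many bytes would still need a snapshot.
 */
size_t range_set_add(struct range_set *s, size_t off, size_t size)
{
    assert(size > 0);
    size_t begin = off, end = off + size, covered = 0;
    size_t i = 0;

    /* Skip ranges that end strictly before the new range starts. */
    while (i < s->n && s->r[i].end < begin)
        i++;
    size_t first = i;

    /* Absorb every range that overlaps or touches the new one. */
    while (i < s->n && s->r[i].begin <= end) {
        size_t lo = s->r[i].begin > off ? s->r[i].begin : off;
        size_t hi = s->r[i].end < off + size ? s->r[i].end : off + size;
        if (hi > lo)
            covered += hi - lo;
        if (s->r[i].begin < begin)
            begin = s->r[i].begin;
        if (s->r[i].end > end)
            end = s->r[i].end;
        i++;
    }

    size_t absorbed = i - first;
    if (absorbed == 0)
        assert(s->n < MAX_RANGES); /* a new slot is needed */

    /* Replace the absorbed ranges with the single merged range. */
    memmove(&s->r[first + 1], &s->r[i], (s->n - i) * sizeof(s->r[0]));
    s->n = s->n - absorbed + 1;
    s->r[first].begin = begin;
    s->r[first].end = end;

    return size - covered;
}
```

With an empty set, adding 50 bytes at offset 100 reports 50 new bytes. Adding 10 bytes at offset 120 afterwards reports 0, so a second snapshot of that region could be discarded entirely.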

Performance results

Let’s dive straight into the interesting stuff.

Inserting 1 million entries:

| Structure | Outer TX | Time      |
|-----------|----------|-----------|
| B-Tree    | Yes      | 3.17193s  |
| B-Tree    | No       | 14.46045s |
| Crit-bit  | Yes      | 10.29985s |
| Crit-bit  | No       | 9.09759s  |
| RB-Tree   | Yes      | 10.00846s |
| RB-Tree   | No       | 21.53144s |

Removing 1 million entries:

| Structure | Outer TX | Time      |
|-----------|----------|-----------|
| B-Tree    | Yes      | 4.69913s  |
| B-Tree    | No       | 19.11727s |
| Crit-bit  | Yes      | 9.48193s  |
| Crit-bit  | No       | 8.95918s  |
| RB-Tree   | Yes      | 26.56952s |
| RB-Tree   | No       | 39.01479s |

The B-Tree performance changed a bit: removes are significantly faster and inserts are a tiny bit slower. This is probably because of a bug fix to the tree I made some time ago. It doesn't matter that much though, since we are more interested in the relative performance. I also changed the methodology to be a little more scientific, so these results are probably more accurate.
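
Since relative performance is what matters here, it can help to turn the timings above into ratios. The sketch below does only that arithmetic on the published numbers. The struct and function names (`result`, `tx_ratio`, `fastest_without_outer_tx`) are invented for this example and are not part of the benchmark code.

```c
#include <assert.h>
#include <stddef.h>

/* One row pair from the result tables: seconds for 1 million operations. */
struct result {
    const char *name;
    double outer_tx_s;    /* whole run wrapped in a single outer transaction */
    double no_outer_tx_s; /* one transaction per operation */
};

static const struct result insert_results[] = {
    { "B-Tree",   3.17193,  14.46045 },
    { "Crit-bit", 10.29985, 9.09759  },
    { "RB-Tree",  10.00846, 21.53144 },
};

static const struct result remove_results[] = {
    { "B-Tree",   4.69913,  19.11727 },
    { "Crit-bit", 9.48193,  8.95918  },
    { "RB-Tree",  26.56952, 39.01479 },
};

/* How many times slower per-operation transactions are than one outer TX. */
double tx_ratio(const struct result *r)
{
    assert(r->outer_tx_s > 0.0);
    return r->no_outer_tx_s / r->outer_tx_s;
}

/* The structure with the lowest time when there is no outer transaction. */
const struct result *fastest_without_outer_tx(const struct result *rs, size_t n)
{
    assert(n > 0);
    const struct result *best = &rs[0];
    for (size_t i = 1; i < n; i++) {
        if (rs[i].no_outer_tx_s < best->no_outer_tx_s)
            best = &rs[i];
    }
    return best;
}
```

For inserts this gives roughly 4.6x for the B-Tree, 2.2x for the RB-Tree and 0.88x for crit-bit, the only structure that does not get slower without the outer transaction.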

The difference between crit-bit inserts with and without the outer TX can be attributed to the fact that a long-running transaction must dynamically allocate cache instances. A single transaction that inserts a lot of nodes makes pmemobj allocate thousands of cache instances, whereas starting one transaction per insert needs just one.

The new challenger does not fare so well. It's a little better than crit-bit at bulk inserting nodes, but that's as expected. At regular inserts the red-black tree falls flat compared to the other data structures. This just confirms that algorithms that intensively modify their data structures won't do well in persistent memory.

As for B-Tree vs crit-bit, bulk inserts are still quite a lot faster in the B-Tree, but for regular inserts crit-bit takes the lead with its consistent performance. Because this data structure modifies very little during one insert/remove operation (just one node), it takes a far smaller performance hit when run in an undo-log transaction. This is going to be my go-to data structure for all persistent in-order collections.
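
To illustrate why a crit-bit insert touches so little, here is a minimal volatile (malloc-based) sketch over 64-bit keys with leaves embedded in their parent nodes. It is an illustration written for this post's argument, not the code that was benchmarked: values, removal, freeing and all persistence are left out, and the two counters exist only to show how much existing state an insert changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct node;

/* A slot is a leaf (child == NULL, key valid) or a link to an internal node.
 * Leaves are embedded in their parent, so they need no allocation. */
struct entry {
    uint64_t key;
    struct node *child;
};

struct node {
    int diff;                /* most significant bit where both sides differ */
    struct entry entries[2]; /* indexed by the value of bit `diff` */
};

struct ctree {
    struct entry root;
    size_t size;
    size_t nodes_allocated;   /* bookkeeping for this illustration only */
    size_t slots_overwritten; /* existing slots changed by inserts */
};

static int bit_is_set(uint64_t key, int bit)
{
    return (int)((key >> bit) & 1);
}

static int crit_bit(uint64_t a, uint64_t b)
{
    assert(a != b);
    uint64_t x = a ^ b;
    int bit = 63;
    while (!bit_is_set(x, bit))
        bit--;
    return bit;
}

int ctree_contains(const struct ctree *t, uint64_t key)
{
    if (t->size == 0)
        return 0;
    const struct entry *p = &t->root;
    while (p->child != NULL)
        p = &p->child->entries[bit_is_set(key, p->child->diff)];
    return p->key == key;
}

/* Returns 1 if inserted, 0 if already present, -1 if allocation failed. */
int ctree_insert(struct ctree *t, uint64_t key)
{
    struct entry *p = &t->root;
    if (t->size > 0) {
        /* Find the leaf the key would collide with. */
        while (p->child != NULL)
            p = &p->child->entries[bit_is_set(key, p->child->diff)];
        if (p->key == key)
            return 0;

        struct node *n = malloc(sizeof(*n));
        if (n == NULL)
            return -1;
        n->diff = crit_bit(p->key, key);
        int d = bit_is_set(key, n->diff);
        n->entries[d].key = key;
        n->entries[d].child = NULL;

        /* Nodes are ordered by diff, descending: find the slot to take over. */
        p = &t->root;
        while (p->child != NULL && p->child->diff > n->diff)
            p = &p->child->entries[bit_is_set(key, p->child->diff)];
        n->entries[!d] = *p; /* the old subtree moves into the new node */
        t->nodes_allocated++;

        key = 0; /* link slots carry no key */
        p->child = n;
    } else {
        p->child = NULL;
    }
    /* The only write to pre-existing memory: in a persistent version this
     * slot is what would go to the undo log, e.g. with
     * pmemobj_tx_add_range_direct(p, sizeof(*p)). */
    p->key = key;
    t->slots_overwritten++;
    t->size++;
    return 1;
}
```

After inserting three distinct keys into a zero-initialized `struct ctree`, the counters show two allocated nodes and three overwritten slots: one new node and one modified slot per insert, regardless of tree size.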

The contents of this web site and the associated GitHub repositories are BSD-licensed open source.
