# Memtable Design & Lifecycle
NoKV’s write path mirrors RocksDB: every write lands in the WAL and in a memtable backed by a selectable in-memory index (skiplist or ART). The implementation lives in `lsm/memtable.go` and ties directly into the flush manager (`lsm/flush`).
## 1. Structure

```go
type memTable struct {
	lsm        *LSM
	segmentID  uint32 // WAL segment backing this memtable
	index      memIndex
	maxVersion uint64
	walSize    int64
}
```
The memtable index is an interface that can be backed by either a skiplist or ART:
```go
type memIndex interface {
	Add(*kv.Entry)
	Search([]byte) ([]byte, kv.ValueStruct)
	NewIterator(*utils.Options) utils.Iterator
	MemSize() int64
	IncrRef()
	DecrRef()
}
```
- Memtable engine – `Options.MemTableEngine` selects `art` (default) or `skiplist` via `newMemIndex`. ART is not a generic trie: it is an internal-key-only memtable index. It uses a reversible mem-comparable route key so trie ordering matches the LSM internal-key comparator; `skiplist` remains available as the simpler baseline alternative.
- Arena sizing – both `utils.NewSkiplist` and `utils.NewART` use `arenaSizeFor` to derive arena capacity from `Options.MemTableSize`.
- WAL coupling – every `Set` uses `kv.EncodeEntry` to materialise the payload to the active WAL segment before inserting into the chosen index. `walSize` tracks how much of the segment is consumed so flush can release it later.
- Segment ID – `LSM.NewMemtable` atomically increments `levels.maxFID`, switches the WAL to a new segment (`wal.Manager.SwitchSegment`), and tags the memtable with that FID. This matches RocksDB’s `logfile_number` field.
- ART specifics – ART stores prefix-compressed inner nodes (Node4/16/48/256). Each leaf keeps both the private route key used by the trie and the original canonical internal key returned to callers. The main concurrency model is still copy-on-write payload/node cloning with CAS installs; the only retained writer-side OLC-lite fast path is a narrow in-place `replaceChild` update. Reads stay lock-free and do not run full version validation.
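The engine-selection and arena-sizing bullets above can be sketched as follows. `newMemIndex` and `arenaSizeFor` mirror the names used in this doc, but the bodies here are illustrative stand-ins under assumed heuristics, not the real implementation:

```go
package main

import "fmt"

// Stand-in for the memIndex interface (only MemSize shown for brevity).
type memIndex interface {
	MemSize() int64
}

type skiplistIndex struct{ arena int64 }

func (s *skiplistIndex) MemSize() int64 { return s.arena }

type artIndex struct{ arena int64 }

func (a *artIndex) MemSize() int64 { return a.arena }

// arenaSizeFor derives arena capacity from Options.MemTableSize; the
// 12.5% headroom for index metadata is an assumption for this sketch.
func arenaSizeFor(memTableSize int64) int64 {
	return memTableSize + memTableSize/8
}

// newMemIndex picks the engine: "skiplist" on request, ART otherwise
// (matching the doc's statement that art is the default).
func newMemIndex(engine string, memTableSize int64) memIndex {
	arena := arenaSizeFor(memTableSize)
	switch engine {
	case "skiplist":
		return &skiplistIndex{arena: arena}
	default:
		return &artIndex{arena: arena}
	}
}

func main() {
	idx := newMemIndex("art", 64<<20)
	fmt.Printf("%T %d\n", idx, idx.MemSize())
}
```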
## 2. Lifecycle

```mermaid
sequenceDiagram
    participant WAL
    participant MT as MemTable
    participant Flush
    participant Manifest
    WAL->>MT: Append+Set(entry)
    MT->>Flush: freeze (walSize + incomingEstimate > limit)
    Flush->>Manifest: LogPointer + AddFile
    Manifest-->>Flush: ack
    Flush->>WAL: Release segments ≤ segmentID
```
- Active → Immutable – when `mt.walSize + estimate` exceeds `Options.MemTableSize`, the memtable is rotated and pushed onto the flush queue. The new active memtable triggers another WAL segment switch.
- Flush – the flush manager drains immutable memtables, builds SSTables, logs manifest edits, and releases the WAL segment ID recorded in `memTable.segmentID` once the SST is durably installed.
- Recovery – `LSM.recovery` scans WAL files, reopens memtables per segment (the most recent becomes active), and deletes segments ≤ the manifest’s log pointer. Entries are replayed via `wal.Manager.ReplaySegment` into fresh indexes and the active in-memory state is rebuilt.
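The Active → Immutable transition reduces to a size check plus a queue push. A minimal sketch, assuming the `walSize`/`segmentID` fields shown earlier; locking and the actual `wal.Manager.SwitchSegment` call are elided, and `lsmState` is a hypothetical container for this example:

```go
package main

import "fmt"

type memTable struct {
	segmentID uint32
	walSize   int64
}

// lsmState is a hypothetical stand-in for the LSM bookkeeping this doc
// describes: a size budget, the active memtable, and the flush queue.
type lsmState struct {
	memTableSize int64
	active       *memTable
	immutables   []*memTable // flush queue of frozen memtables
	nextFID      uint32
}

// shouldRotate mirrors the diagram's "walSize + incomingEstimate > limit".
func (l *lsmState) shouldRotate(incomingEstimate int64) bool {
	return l.active.walSize+incomingEstimate > l.memTableSize
}

// rotate freezes the active memtable and starts a fresh one tagged with
// the next WAL segment FID (the real code also switches the WAL segment).
func (l *lsmState) rotate() {
	l.immutables = append(l.immutables, l.active)
	l.nextFID++
	l.active = &memTable{segmentID: l.nextFID}
}

func main() {
	l := &lsmState{memTableSize: 1024, active: &memTable{segmentID: 1}, nextFID: 1}
	l.active.walSize = 1000
	if l.shouldRotate(64) { // 1000 + 64 > 1024, so rotate
		l.rotate()
	}
	fmt.Println(len(l.immutables), l.active.segmentID)
}
```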
Badger follows the same pattern, while RocksDB often uses skiplist-backed arenas with reference counting—NoKV reuses Badger’s arena allocator for simplicity.
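The recovery pruning rule (drop segments at or below the manifest log pointer, make the newest survivor active) can be modeled independently of replay. `recoverSegments` is a hypothetical helper written for this sketch, not the real `LSM.recovery`:

```go
package main

import (
	"fmt"
	"sort"
)

// recoverSegments keeps only WAL segments newer than the manifest log
// pointer (older ones are already persisted into SSTs), then designates
// the most recent survivor as the active memtable's segment. Replay of
// entries into fresh indexes is elided.
func recoverSegments(segmentIDs []uint32, logPointer uint32) (active uint32, immutables []uint32) {
	var live []uint32
	for _, id := range segmentIDs {
		if id <= logPointer {
			continue // covered by flushed SSTs; segment can be deleted
		}
		live = append(live, id)
	}
	sort.Slice(live, func(i, j int) bool { return live[i] < live[j] })
	if len(live) == 0 {
		return 0, nil
	}
	return live[len(live)-1], live[:len(live)-1]
}

func main() {
	active, imm := recoverSegments([]uint32{3, 5, 4, 6}, 4)
	fmt.Println(active, imm)
}
```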
## 3. Read Semantics

`memTable.Get` looks up the chosen index and returns a borrowed, ref-counted `*kv.Entry` from the internal pool. The index search returns the matched internal key plus value struct, so memtable hit entries carry the concrete version key instead of the query sentinel key. Internal callers must release borrowed entries with `DecrRef` when done. `MemTable.IncrRef`/`DecrRef` delegate to the index, allowing iterators to hold references while the flush manager processes immutable tables, mirroring RocksDB’s `MemTable::Ref`/`Unref` lifecycle.

- WAL-backed values that exceed the value threshold are stored as pointers; the memtable stores the encoded pointer, and the transaction/iterator logic reads from the vlog on demand.
- `DB.Get` returns detached entries; callers must not call `DecrRef` on them. `DB.GetInternalEntry` returns borrowed entries; callers must call `DecrRef` exactly once.
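A toy model of the borrow contract clarifies why `DecrRef` must be called exactly once: the entry returns to the pool only when the last holder releases it. The `Entry`/`entryPool` types below are illustrative stand-ins, not NoKV’s real pool:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Entry is a toy ref-counted entry; the last DecrRef recycles it.
type Entry struct {
	Value []byte
	refs  int32
	pool  *entryPool
}

func (e *Entry) IncrRef() { atomic.AddInt32(&e.refs, 1) }

func (e *Entry) DecrRef() {
	if atomic.AddInt32(&e.refs, -1) == 0 {
		e.pool.put(e) // last borrower released: back to the pool
	}
}

type entryPool struct{ recycled int }

func (p *entryPool) put(*Entry) { p.recycled++ }

// borrow hands out an entry with one reference held by the caller,
// analogous to the doc's GetInternalEntry borrowed path.
func (p *entryPool) borrow(v []byte) *Entry {
	return &Entry{Value: v, refs: 1, pool: p}
}

func main() {
	pool := &entryPool{}
	e := pool.borrow([]byte("v1"))
	e.IncrRef()                // e.g. an iterator pins the entry
	e.DecrRef()                // caller done; iterator still holds it
	fmt.Println(pool.recycled) // 0: not yet recycled
	e.DecrRef()                // iterator done; entry returns to pool
	fmt.Println(pool.recycled) // 1
}
```

A `DB.Get`-style detached entry would instead be a copy the caller owns outright, with no `DecrRef` obligation.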
## 4. Integration with Other Subsystems
| Subsystem | Interaction |
|---|---|
| Distributed 2PC | `kv.Apply` + percolator write committed MVCC versions through the same WAL/memtable pipeline in raft mode. |
| Manifest | Flush completion logs `EditLogPointer(segmentID)` so restart can discard WAL files already persisted into SSTs. |
| Stats | `Stats.Snapshot` pulls FlushPending/Active/Queue counters via `lsm.FlushMetrics`, exposing how many immutables are waiting. |
| Value Log | `lsm/flush` emits discard stats keyed by `segmentID`, letting the value log GC know when entries become obsolete. |
## 5. Comparison
| Aspect | RocksDB | BadgerDB | NoKV |
|---|---|---|---|
| Data structure | Skiplist + arena | Skiplist + arena | Skiplist or ART + arena (`art` default) |
| WAL linkage | `logfile_number` per memtable | Segment ID stored in vlog entries | `segmentID` on `memTable`, logged via manifest |
| Recovery | Memtable replays from WAL, referencing MANIFEST | Replays WAL segments | Replays WAL segments, prunes ≤ manifest log pointer |
| Flush trigger | Size/entries/time | Size-based | WAL-size budget (`walSize`) with explicit queue metrics |
## 6. Operational Notes

- Tuning `Options.MemTableSize` affects WAL segment count and flush latency. Larger memtables reduce flush churn but increase crash recovery time.
- ART currently uses noticeably more memindex arena memory than skiplist because it stores both route keys and original internal keys in leaves; in local measurements the ART memindex is roughly 2x the skiplist memindex footprint for the same key/value set.
- Monitor `NoKV.Stats.flush.*` fields to catch stalled immutables; an ever-growing queue often indicates slow SST builds or manifest contention.
- Because memtables carry WAL segment IDs, deleting WAL files manually can lead to recovery failures; always rely on the engine’s manifest-driven cleanup.
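Putting the tuning knobs together, a hedged configuration sketch; the `Options` struct here is a stand-in carrying only the two fields this doc mentions, and the real NoKV options type may name or default them differently:

```go
package main

import "fmt"

// Options is a stand-in with only the fields discussed in this doc.
type Options struct {
	MemTableSize   int64  // WAL-size budget per memtable
	MemTableEngine string // "art" (default) or "skiplist"
}

func main() {
	opt := Options{
		// Larger memtables mean fewer flushes but longer crash-recovery replay.
		MemTableSize: 64 << 20,
		// Choose skiplist when ART's ~2x memindex arena footprint matters.
		MemTableEngine: "skiplist",
	}
	fmt.Println(opt.MemTableSize, opt.MemTableEngine)
}
```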
See [docs/flush.md](flush.md) for the end-to-end flush scheduler and [docs/architecture.md](architecture.md#3-end-to-end-write-flow) for where memtables sit in the write pipeline.