# Memtable Design & Lifecycle

NoKV’s write path mirrors RocksDB: every write lands in the WAL and in a memtable backed by a selectable in-memory index (skiplist or ART). The implementation lives in `lsm/memtable.go` and ties directly into the flush manager (`lsm/flush`).


## 1. Structure

```go
type memTable struct {
    lsm        *LSM
    segmentID  uint32 // WAL segment backing this memtable
    index      memIndex
    maxVersion uint64
    walSize    int64
}
```

The memtable index is an interface that can be backed by either a skiplist or ART:

```go
type memIndex interface {
    Add(*kv.Entry)
    Search([]byte) kv.ValueStruct
    NewIterator(*utils.Options) utils.Iterator
    MemSize() int64
    IncrRef()
    DecrRef()
}
```

- Memtable engine – `Options.MemTableEngine` selects `skiplist` (default) or `art` via `newMemIndex`. Skiplist favors simpler writes; ART favors tighter memory and ordered scans.
- Arena sizing – `utils.NewSkiplist` uses `arenaSizeFor`; `utils.NewART` uses `arenaSizeForART` to reserve more space for variable node payloads and prefix spills.
- WAL coupling – every `Set` uses `kv.EncodeEntry` to materialise the payload to the active WAL segment before inserting into the chosen index. `walSize` tracks how much of the segment is consumed so flush can release it later (see the sketch after this list).
- Segment ID – `LSM.NewMemtable` atomically increments `levels.maxFID`, switches the WAL to a new segment (`wal.Manager.SwitchSegment`), and tags the memtable with that FID. This matches RocksDB’s `logfile_number` field.
- ART specifics – ART stores prefix-compressed inner nodes (Node4/16/48/256), uses optimistic version checks for reads with localized locks for writes, and its iterators walk the tree in key order.
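
To make the WAL coupling concrete, here is a minimal sketch of a single write, assuming the WAL manager exposes an `Append` that returns the number of bytes written and that `kv.EncodeEntry` returns the encoded payload. `kv.EncodeEntry`, `memIndex.Add`, `walSize`, and `maxVersion` are the names used on this page; everything else (the `wal` field, the `Entry.Version` field, the exact signatures) is illustrative, not the verified NoKV code:

```go
// set appends the entry to the WAL segment backing this memtable, then
// mirrors it into the in-memory index. Sketch only: the WAL manager's
// method name and signature are assumptions; the ordering (WAL first,
// index second) is the documented behaviour.
func (m *memTable) set(e *kv.Entry) error {
	// 1. Materialise the entry into the active WAL segment.
	payload := kv.EncodeEntry(e)
	n, err := m.lsm.wal.Append(m.segmentID, payload)
	if err != nil {
		return err
	}
	m.walSize += int64(n) // lets flush know how much of the segment this memtable owns

	// 2. Insert into the chosen index (skiplist or ART).
	m.index.Add(e)

	// 3. Track the newest version so recovery can rebuild the oracle's clock.
	if e.Version > m.maxVersion {
		m.maxVersion = e.Version
	}
	return nil
}
```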

## 2. Lifecycle

```mermaid
sequenceDiagram
    participant WAL
    participant MT as MemTable
    participant Flush
    participant Manifest
    WAL->>MT: Append+Set(entry)
    MT->>Flush: freeze (Size() >= limit)
    Flush->>Manifest: LogPointer + AddFile
    Manifest-->>Flush: ack
    Flush->>WAL: Release segments ≤ segmentID
```

1. Active → Immutable – when `mt.Size()` crosses the threshold (`Options.MemTableSize`), the memtable is swapped out and pushed onto the flush queue (sketched after this list). The new active memtable triggers another WAL segment switch.
2. Flush – the flush manager drains immutable memtables, builds SSTables, logs manifest edits, and releases the WAL segment ID recorded in `memTable.segmentID` once the SST is durably installed.
3. Recovery – `LSM.recovery` scans WAL files, reopens memtables per segment (the most recent becomes active), and deletes segments ≤ the manifest’s log pointer. Entries are replayed via `wal.Manager.ReplaySegment` into fresh indexes, rebuilding `maxVersion` for the oracle.
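
A minimal sketch of the active→immutable swap from step 1, assuming the LSM keeps the active memtable and an immutable queue behind a mutex; the `mu`, `active`, `immutables`, and `flushCh` names are illustrative, while `NewMemtable` and `Options.MemTableSize` are the hooks described above:

```go
// maybeRotate freezes the active memtable once it crosses the size threshold.
// Sketch only: field names and locking discipline are assumptions.
func (l *LSM) maybeRotate() {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.active.Size() < l.option.MemTableSize {
		return // still room in the active memtable
	}

	// Freeze: the current memtable becomes immutable and is handed to flush.
	frozen := l.active
	l.immutables = append(l.immutables, frozen)
	l.flushCh <- frozen

	// Start a fresh memtable backed by a brand-new WAL segment, so the frozen
	// one owns its segmentID exclusively and flush can release it later.
	l.active = l.NewMemtable()
}
```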

Badger follows the same pattern, while RocksDB often uses skiplist-backed arenas with reference counting—NoKV reuses Badger’s arena allocator for simplicity.


## 3. Read Semantics

- `memTable.Get` looks up the chosen index and returns a copy of the entry (see the sketch below). MVCC versions stay encoded in the key suffix (`KeyWithTs`), so iterators naturally merge across memtables and SSTables.
- `MemTable.IncrRef`/`DecrRef` delegate to the index, allowing iterators to hold references while the flush manager processes immutable tables—mirroring RocksDB’s `MemTable::Ref`/`Unref` lifecycle.
- WAL-backed values that exceed the value threshold are stored as pointers; the memtable stores the encoded pointer, and the transaction/iterator logic reads from the vlog on demand.
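
An illustrative point lookup using the `memIndex.Search` signature quoted above; the `kv.ValueStruct` field names, the package holding `KeyWithTs`, and the not-found sentinel are assumptions rather than the exact NoKV API:

```go
// Get returns a copy of the newest version of key visible at readTs.
// Sketch only: exact field and error names may differ in lsm/memtable.go.
func (m *memTable) Get(key []byte, readTs uint64) (*kv.Entry, error) {
	// Versions live in the key suffix, so the lookup key carries the read
	// timestamp and the index resolves the newest version <= readTs.
	vs := m.index.Search(kv.KeyWithTs(key, readTs))
	if vs.Version == 0 && len(vs.Value) == 0 {
		return nil, utils.ErrKeyNotFound // assumed not-found sentinel
	}
	// Copy out of the arena so callers never alias memtable memory that the
	// flush manager may recycle after DecrRef.
	return &kv.Entry{
		Key:     append([]byte{}, key...),
		Value:   append([]byte{}, vs.Value...),
		Version: vs.Version,
		Meta:    vs.Meta,
	}, nil
}
```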

## 4. Integration with Other Subsystems

| Subsystem | Interaction |
| --- | --- |
| Transactions | `Txn.commitAndSend` writes entries into the active memtable after the WAL append; pending writes bypass the memtable until commit so per-txn isolation is preserved. |
| Manifest | Flush completion logs `EditLogPointer(segmentID)` so restart can discard WAL files already persisted into SSTs (sketched below). |
| Stats | `Stats.Snapshot` pulls `FlushPending`/`Active`/`Queue` counters via `lsm.FlushMetrics`, exposing how many immutables are waiting. |
| Value Log | `lsm.flush` emits discard stats keyed by `segmentID`, letting the value log GC know when entries become obsolete. |
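
A sketch of the ordering the Manifest row implies: the manifest edit must be durable before the memtable’s WAL segment is released. Only `EditLogPointer`, `AddFile`, and `segmentID` appear on this page; the `flushManager` type, `LogEdits`, and `ReleaseUpto` are illustrative names, not the real API:

```go
// onFlushed installs a freshly built SST and only then lets go of the WAL.
// Sketch: the manifest/WAL method names are assumptions about the real code.
func (f *flushManager) onFlushed(mt *memTable, meta *manifest.FileMeta) error {
	// One manifest edit batch records both the new SST and the WAL position
	// it covers, so recovery can skip segments <= mt.segmentID.
	if err := f.manifest.LogEdits(
		manifest.EditAddFile(meta),
		manifest.EditLogPointer(mt.segmentID),
	); err != nil {
		return err
	}
	// Releasing the segment before the manifest edit is durable would lose
	// data on crash; do it strictly afterwards.
	return f.wal.ReleaseUpto(mt.segmentID)
}
```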

## 5. Comparison

| Aspect | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Data structure | Skiplist + arena | Skiplist + arena | Skiplist or ART + arena |
| WAL linkage | `logfile_number` per memtable | Segment ID stored in vlog entries | `segmentID` on `memTable`, logged via manifest |
| Recovery | Memtable replays from WAL, referencing MANIFEST | Replays WAL segments | Replays WAL segments, prunes ≤ manifest log pointer |
| Flush trigger | Size/entries/time | Size-based | Size-based with explicit queue metrics |

## 6. Operational Notes

- Tuning `Options.MemTableSize` affects WAL segment count and flush latency: larger memtables reduce flush churn but increase crash recovery time (see the sketch after this list).
- Monitor the `NoKV.Stats.Flush.*` metrics to catch stalled immutables—an ever-growing queue often indicates slow SST builds or manifest contention.
- Because memtables carry WAL segment IDs, deleting WAL files manually can lead to recovery failures; always rely on the engine’s manifest-driven cleanup.
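
A hypothetical tuning snippet illustrating the trade-off in the first note. The import path, `DefaultOptions`, `Open`, and `Close` are assumptions made purely for illustration; only `MemTableSize` and `MemTableEngine` come from this page:

```go
package main

import (
	"log"

	nokv "github.com/example/NoKV" // hypothetical import path
)

func main() {
	// Larger memtables mean fewer flushes but a longer WAL replay on recovery.
	opt := nokv.DefaultOptions("/var/lib/nokv") // assumed constructor
	opt.MemTableSize = 128 << 20                // 128 MiB memtables
	opt.MemTableEngine = "art"                  // or "skiplist" (default), see section 1

	db, err := nokv.Open(opt) // assumed entry point
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```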

See [docs/flush.md](flush.md) for the end-to-end flush scheduler and [docs/architecture.md](architecture.md#3-end-to-end-write-flow) for where memtables sit in the write pipeline.