We implement the Block-STM algorithm (Gelashvili et al., PPoPP 2023) — the parallel execution engine behind the Aptos blockchain — in C++. Given a block of N ordered transactions, our engine speculatively executes them in parallel across multiple threads, dynamically detects read/write conflicts at runtime, aborts and re-executes conflicting transactions via cascading rollback, and guarantees the final state is identical to sequential execution in the preset order. The core data structure is a lock-free multi-version memory store where each key maintains a version chain accessed concurrently by all threads using std::atomic operations. We evaluate scaling behavior across thread counts and contention levels, and analyze abort rates, cache behavior, and the overhead of optimistic concurrency control.

Team
Hua-Yo Wu (huayow)
Pi-Jung Chang (pijungc)
Platform
GHC machines
Multi-core CPUs (Intel i7-9700, 8 cores) + NVIDIA RTX 2080 GPU for stretch goal
Technologies
C++20 · OpenMP · std::atomic / lock-free · CUDA (stretch)
System Architecture
  ┌─────────────────────────────────────────────────────────────────────┐
  │                    Collaborative Scheduler                         │
  │                                                                     │
  │  Shared atomic counters:  execution_idx  |  validation_idx          │
  │                                                                     │
  │  ┌─── Thread 0 ───┐  ┌─── Thread 1 ───┐       ┌─── Thread N ───┐  │
  │  │  1. get task    │  │  1. get task    │  ...  │  1. get task    │  │
  │  │  2. execute     │  │  2. execute     │       │  2. execute     │  │
  │  │  3. validate    │  │  3. validate    │       │  3. validate    │  │
  │  │  4. abort/commit│  │  4. abort/commit│       │  4. abort/commit│  │
  │  └────────┬────────┘  └────────┬────────┘       └────────┬────────┘  │
  │           │                    │                          │          │
  │           ▼                    ▼                          ▼          │
  │  ┌──────────────────────────────────────────────────────────────┐   │
  │  │              Multi-Version Data Store (MVMemory)             │   │
  │  │                                                              │   │
  │  │  Key A: ─── [v1: Tx₁ wrote 100] ── [v3: Tx₃ wrote 250] ──  │   │
  │  │  Key B: ─── [v2: Tx₂ wrote 50]  ── [ESTIMATE: Tx₅ ...]  ── │   │
  │  │  Key C: ─── [v1: Tx₁ wrote 300] ──────────────────────────  │   │
  │  │                                                              │   │
  │  │  Lock-free version chains · std::atomic CAS operations       │   │
  │  └──────────────────────────────────────────────────────────────┘   │
  └─────────────────────────────────────────────────────────────────────┘

  Transaction lifecycle:    ReadyToExecute ──► Executing ──► Validated ──► Committed
                                   ▲                            │
                                   └──── Aborted ◄──────────────┘
                                      (conflict detected)
Project Timeline
Week | Task | Status
Week 1 (3/26–4/3) | Workload generator + sequential baseline + MVMemory data structure + proposal | in progress
Week 2 (4/4–4/10) | Collaborative scheduler + speculative execution + validation + abort logic | todo
Week 3 (4/11–4/15) | Integration + correctness testing (sequential equivalence) + ESTIMATE marker optimization + milestone report | todo
Week 4 (4/16–4/22) | Performance experiments: scaling study, contention sweep, abort rate analysis, cache profiling with perf | todo
Week 5 (4/23–4/30) | CUDA stretch goal + final benchmarks + final report + poster | todo

Summary

We implement the Block-STM algorithm — a parallel execution engine for ordered transactions built on Software Transactional Memory principles — in C++ with OpenMP. Given a block of N transactions with a preset sequential order, our engine speculatively executes them in parallel, detects read/write conflicts at runtime through a validation phase, and aborts and re-executes conflicting transactions while guaranteeing the final result is identical to sequential execution. The core challenge is building a lock-free multi-version data store and a collaborative scheduler that coordinates speculative execution, validation, and cascading aborts using std::atomic operations. Our primary goal is to characterize speedup, abort rates, and cache behavior across varying thread counts and contention levels.

Background

The Problem: Parallel Execution with Preset Order

In systems like blockchains, a block contains N ordered transactions (T₁, T₂, ..., Tₙ) that must be executed such that the final state matches strictly sequential execution in that order. The naive approach — execute them one by one — is safe but slow. The challenge is: how do we execute transactions in parallel without knowing their dependencies upfront, while guaranteeing the result is identical to sequential execution?

This is a well-studied problem in the database and transactional memory literature. Block-STM (Gelashvili et al., PPoPP 2023) solves it using optimistic concurrency control: execute first, validate later, abort and retry on conflict. The key insight is that the preset order — often seen as a constraint — is actually a performance blessing, because it eliminates the need for complex serialization ordering that general-purpose STM libraries must handle.

Algorithm Overview

Block-STM operates in a continuous loop across all worker threads. Each thread repeatedly picks the next available task (either an execution or a validation) from a shared scheduler:

Phase 1 — Speculative Execution: A thread picks transaction Tₖ and executes it against the multi-version data store (MVMemory). For each read, the thread finds the most recent version written by a transaction with ID < k. The read location and version are recorded in Tₖ's read-set. Write results go into Tₖ's write-set and are published to MVMemory tagged with version k.

Phase 2 — Validation: After execution, Tₖ enters validation. The system re-checks every entry in Tₖ's read-set: has any other transaction written a newer version to that location since Tₖ read it? If yes, Tₖ read stale data — it must be aborted.

Phase 3 — Abort & Re-execution: When Tₖ is aborted, its writes in MVMemory are replaced with ESTIMATE markers (not deleted). Any later transaction Tⱼ (j > k) that attempts to read a location and encounters an ESTIMATE marker knows it has a dependency on a transaction currently being re-executed — it suspends rather than reading stale data and getting aborted later. This dramatically reduces cascading aborts. Aborted transactions are re-queued with priority given to lower-ID transactions to guarantee progress and prevent livelocks.

Commit: The entire block commits atomically once all transactions have been validated and no ongoing execution or validation tasks remain.
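The phases above are driven by a per-thread loop that pulls tasks from two shared counters. The following is a minimal single-threaded sketch of that driver, assuming illustrative names (`Scheduler`, `next_task`, `Kind`) rather than the paper's pseudocode; a real implementation re-checks the claimed index after `fetch_add` and handles re-queued aborts.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the collaborative scheduler's task dispatch: two
// shared atomic counters hand out execution and validation tasks in
// transaction order. Validating lower-ID transactions takes priority over
// executing higher-ID ones, so commits are never starved by speculation.
struct Scheduler {
    std::atomic<uint32_t> execution_idx{0};   // next transaction to execute
    std::atomic<uint32_t> validation_idx{0};  // next transaction to validate
    uint32_t block_size;

    explicit Scheduler(uint32_t n) : block_size(n) {}

    enum class Kind { Execute, Validate, Done };
    struct Task { Kind kind; uint32_t tx_id; };

    Task next_task() {
        uint32_t v = validation_idx.load();
        uint32_t e = execution_idx.load();
        // Prefer validating a transaction that has already executed.
        if (v < e && v < block_size)
            return {Kind::Validate, validation_idx.fetch_add(1)};
        if (e < block_size)
            return {Kind::Execute, execution_idx.fetch_add(1)};
        return {Kind::Done, 0};  // block commits when no tasks remain
    }
};
```

Each worker thread simply loops on `next_task()` until it returns `Done`, which corresponds to the atomic whole-block commit described above.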

Multi-Version Data Store (MVMemory)

MVMemory is a concurrent hash map where each key maps to a version chain — a sorted list of (tx_id, value) entries. The core operations are:

  • write(location, tx_id, value): Insert or update the entry for tx_id in the version chain.
  • read(location, tx_id): Find the entry written by the transaction with the largest ID strictly less than tx_id. If that entry is an ESTIMATE marker, return READ_ERROR (dependency detected). If no such entry exists, fall back to the pre-block storage snapshot.

Both operations must be safe under concurrent access from all threads without coarse-grained locking.
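To make the read rule concrete, here is a deliberately non-concurrent sketch of one key's version chain using a sorted map; the names (`VersionChain`, `ReadResult`) and the `ESTIMATE` sentinel encoding are our illustrative choices. The lock-free implementation replaces the `std::map` with an atomically updated chain.

```cpp
#include <cstdint>
#include <map>

// Sequential sketch of MVMemory's per-key read/write rules: read(tx_id)
// returns the latest write by a strictly lower transaction ID, surfacing
// ESTIMATE markers as dependencies.
constexpr int64_t ESTIMATE = INT64_MIN;  // sentinel marking an aborted write

struct VersionChain {
    std::map<uint32_t, int64_t> entries;  // tx_id -> value, sorted by tx_id

    enum class Status { Ok, ReadError, NotFound };
    struct ReadResult { Status status; int64_t value; uint32_t writer; };

    void write(uint32_t tx_id, int64_t value) { entries[tx_id] = value; }
    void mark_estimate(uint32_t tx_id) { entries[tx_id] = ESTIMATE; }

    ReadResult read(uint32_t tx_id) const {
        auto it = entries.lower_bound(tx_id);      // first entry >= tx_id
        if (it == entries.begin())
            return {Status::NotFound, 0, 0};       // fall back to pre-block storage
        --it;                                      // latest writer with ID < tx_id
        if (it->second == ESTIMATE)
            return {Status::ReadError, 0, it->first};  // dependency detected
        return {Status::Ok, it->second, it->first};
    }
};
```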

Why is this suitable for parallelism?

Under low-contention workloads (transactions touching mostly disjoint memory locations), almost all speculative executions succeed validation on the first try, yielding near-linear speedup. Under high contention, the ESTIMATE marker mechanism and priority-based scheduling keep abort rates manageable. The algorithm dynamically discovers and exploits whatever parallelism exists in the workload — no static dependency analysis is required.

The Challenge

Dynamic conflict detection: Unlike static work decomposition (e.g., tiling an image), conflicts between transactions are unknown until runtime. A transaction's read-set depends on the values it reads, which depend on what earlier transactions wrote, which may themselves be speculative. This creates a chain of dynamic dependencies that must be resolved correctly under concurrency.

Lock-free multi-version data store: MVMemory is the central shared data structure accessed by every thread on every read and write. Using mutexes would serialize access and destroy scalability. We implement version chain operations using std::atomic compare-and-swap (CAS) loops, which requires careful reasoning about ABA problems, memory ordering (memory_order_acquire/release/seq_cst), and correctness under relaxed consistency.

Cascading aborts: When transaction Tₖ is aborted and re-executed, its new writes may differ from the old ones. Every transaction Tⱼ (j > k) that read Tₖ's old output must also be aborted. In the worst case, a single abort at T₁ can cascade through the entire block. The ESTIMATE marker mechanism mitigates this by converting "abort after wasted execution" into "suspend before wasted execution," but implementing this correctly — especially the transition from ESTIMATE back to a real value after re-execution, and waking up suspended threads — involves subtle synchronization.

Livelock prevention: Without careful scheduling, two transactions can repeatedly abort each other: T₅ aborts T₈ which re-executes and aborts T₅ in a cycle. Block-STM prevents this by always prioritizing lower-ID transactions for re-execution and ensuring that a re-executing transaction can only be aborted by a transaction with a strictly lower ID. Implementing this priority scheme correctly in the scheduler using only atomic operations is non-trivial.
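The asymmetric abort rule can be sketched as a single CAS on a per-transaction status word. `TxStatus` and `try_abort` are our illustrative names, and this toy omits the incarnation counters a full implementation needs to distinguish re-executions.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the priority rule: an abort request succeeds only if the
// requester's ID is strictly lower than the victim's, and the CAS ensures
// at most one concurrent aborter wins.
struct TxStatus {
    enum State : uint32_t { Executing, Aborted };
    std::atomic<uint32_t> state{Executing};
    uint32_t tx_id;

    explicit TxStatus(uint32_t id) : tx_id(id) {}

    bool try_abort(uint32_t requester_id) {
        if (requester_id >= tx_id) return false;  // only lower IDs may abort
        uint32_t expected = Executing;
        return state.compare_exchange_strong(expected, Aborted);
    }
};
```

Because T₈ can never abort T₅ under this rule, the mutual-abort cycle above cannot occur, and the lowest-ID uncommitted transaction always makes progress.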

Cache behavior: The multi-version data store is a pointer-heavy structure (hash map → version chains → entries) with poor spatial locality. Under high concurrency, false sharing on adjacent version chain entries and on the scheduler's atomic counters can degrade performance. We plan to analyze this with perf and explore cache-line padding and memory layout optimizations.
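The padding fix we plan to evaluate can be sketched as follows; `PaddedCounter` and `PaddedScheduler` are hypothetical names, and the feature-test fallback reflects that not all standard libraries define the interference-size constants.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <new>

// Sketch of cache-line padding for the scheduler's hot counters: unpadded,
// both counters share one 64-byte line, so every fetch_add on one
// invalidates the other's line in remote caches (false sharing).
#ifdef __cpp_lib_hardware_interference_size
constexpr size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr size_t kCacheLine = 64;  // typical x86 cache line
#endif

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<uint64_t> value{0};
    char pad[kCacheLine - sizeof(std::atomic<uint64_t>)];  // fill the line
};

struct PaddedScheduler {
    PaddedCounter execution_idx;   // each counter owns its own cache line
    PaddedCounter validation_idx;
};
```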

Resources

Hardware: GHC machines — Intel i7-9700 (8 cores), NVIDIA RTX 2080 GPU (for stretch goal), 16 GB RAM.

Software: C++20, OpenMP, CUDA toolkit (stretch goal), perf for cache profiling.

Code base: Starting from scratch — no existing codebase.

References

[1] Gelashvili, Spiegelman, Xiang, et al. "Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing." PPoPP, 2023. (Core implementation reference — Section 3 contains pseudocode for the scheduler, MVMemory, and validation logic.)
[2] Shavit, Touitou. "Software Transactional Memory." PODC, 1995. (Foundational STM paper — Block-STM adapts OCC principles from this work to the deterministic, preset-order setting.)
[3] Cooper et al. "Benchmarking Cloud Serving Systems with YCSB." SoCC, 2010. (Workload generation methodology.)

Goals and Deliverables

Plan to Achieve (MVP)

  1. Sequential baseline: Execute all N transactions in order. This is the correctness reference and performance baseline.
  2. Workload generator: Synthetic transaction generator with tunable parameters: block size (1K–100K transactions), number of memory locations (accounts), read/write set sizes, and compute intensity. Contention is controlled by the ratio of transactions to accounts (2 accounts = nearly sequential; 10K accounts = nearly independent).
  3. Multi-version data store (MVMemory): Lock-free concurrent hash map with version chains. Support read, write, and ESTIMATE marker operations using std::atomic CAS.
  4. Collaborative scheduler: Shared atomic counters for execution and validation index. Dynamic task assignment: each thread atomically increments the appropriate counter to claim the next task. Priority-based re-execution for aborted transactions.
  5. Speculative execution + validation + abort: Full Block-STM loop: execute → record read/write sets → validate → abort with ESTIMATE markers or commit. Cascading abort propagation.
  6. Correctness verification: Automated comparison of parallel execution output against sequential baseline for every test case.
  7. Performance evaluation:
    • Scaling study: 1, 2, 4, 8 threads × 5 contention levels (2, 10, 100, 1K, 10K accounts)
    • Throughput (transactions/sec) and speedup curves
    • Abort rate and cascading abort depth vs. contention
    • Cache miss analysis with perf (L1/L2 miss rates, false sharing detection)
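The workload generator's contention knob from item 2 can be sketched as below; the parameter names (`make_block`, `n_accounts`, and the fixed seed) are our own, not from any existing codebase.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical workload generator sketch: contention is controlled by the
// ratio of transactions to accounts — with 2 accounts nearly every pair of
// transactions conflicts; with 10K accounts they are nearly independent.
struct Txn {
    std::vector<uint32_t> read_set;
    std::vector<uint32_t> write_set;
};

std::vector<Txn> make_block(uint32_t n_txns, uint32_t n_accounts,
                            uint32_t reads_per_txn, uint32_t writes_per_txn,
                            uint64_t seed = 42) {
    std::mt19937_64 rng(seed);  // fixed seed for reproducible blocks
    std::uniform_int_distribution<uint32_t> pick(0, n_accounts - 1);
    std::vector<Txn> block(n_txns);
    for (auto& t : block) {
        for (uint32_t i = 0; i < reads_per_txn; ++i)
            t.read_set.push_back(pick(rng));
        for (uint32_t i = 0; i < writes_per_txn; ++i)
            t.write_set.push_back(pick(rng));
    }
    return block;
}
```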

Hope to Achieve (Stretch Goals)

  1. CUDA port: Map speculative execution to GPU — one thread per transaction. Analyze why GPU is suboptimal for this workload: warp divergence from abort/retry branching, global memory contention on MVMemory, lack of efficient cross-warp synchronization for the collaborative scheduler. Quantify GPU vs CPU throughput across contention levels.
  2. Cache-line padding optimization: Pad version chain entries and atomic counters to eliminate false sharing; measure the impact on throughput.
  3. Dependency-aware scheduling: After first execution round, use observed read/write sets to build a dependency graph and pre-schedule independent transactions, reducing abort rates on subsequent blocks with similar access patterns.

Fallback Plan

If the full lock-free MVMemory proves too complex, we fall back to fine-grained locking (one lock per key, not a global lock). This still preserves the core Block-STM algorithm and allows meaningful scaling and contention analysis, while being simpler to implement correctly.

Platform Choice

GHC machines provide 8-core CPUs suitable for evaluating thread scaling from 1 to 8, and NVIDIA GPUs for the stretch goal. C++20 is chosen for std::atomic with explicit memory ordering control and low-level control over memory layout. OpenMP provides a straightforward thread pool without the boilerplate of raw pthreads.

Evaluation Plan

Scaling Experiments

Experiment | Configuration | Key Metric
Thread scaling | 1, 2, 4, 8 threads; block = 10K txns; accounts = 1K (medium contention) | Throughput (txn/sec), speedup
Contention sweep | 8 threads; block = 10K txns; accounts = 2, 10, 100, 1K, 10K | Throughput, abort rate, cascading abort depth
Block size scaling | 8 threads; block = 1K, 5K, 10K, 50K, 100K; accounts = 1K | Throughput, scheduler overhead
Cache analysis | 8 threads; block = 10K; accounts = 100 (high contention) | L1/L2 cache miss rate, false sharing (perf c2c)

Correctness Tests

Test | Description
Zero-contention | 10K transactions on 10K disjoint accounts — parallel result must match sequential
Full-contention | 10K transactions on 2 accounts — effectively sequential, tests abort/retry correctness
Mixed workload | Realistic read/write patterns with varying hot-spot ratios
Determinism | Run same block 100 times — output must be bit-identical every time

Schedule

Week | Dates | Tasks
1 | 3/26–4/3 | Workload generator (tunable contention) + sequential baseline + MVMemory data structure (lock-free version chains with std::atomic CAS). Submit proposal.
2 | 4/4–4/10 | Collaborative scheduler (atomic counters, task dispatch, priority re-execution). Speculative execution + validation + abort logic with ESTIMATE markers.
3 | 4/11–4/15 | Integration + correctness testing (sequential equivalence). ESTIMATE optimization + livelock prevention. Milestone report (due 4/14).
4 | 4/16–4/22 | Performance experiments: scaling study, contention sweep, abort rate analysis. Cache profiling with perf.
5 | 4/23–4/30 | CUDA stretch goal (GPU speculative execution). Final benchmarks + final report + poster.

Milestone report will be posted here after April 14th.

Final report will be posted here after April 30th.