VHE: GPU-Accelerated Gate-Level Simulation at Zero License Cost

The Problem

Our NPU design hit 1.4 million gates. Verilator started a convolution test.

Runtime: 139 billion cycles
VCD trace: 56 GB
Status: Killed after 3 days

Commercial emulators cost a lot. We're a startup in India. That wasn't happening.

What We Built

VHE (Virtual Hardware Emulator) — GPU-accelerated gate-level simulation.

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Yosys     │───▶│   Parser    │───▶│  Levelizer  │───▶│    CUDA     │
│  JSON Net   │    │  (Python)   │    │  (DAG sort) │    │   Kernel    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                               │
                                                               ▼
                                                        ┌─────────────┐
                                                        │  Simulation │
                                                        │   Output    │
                                                        └─────────────┘
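As a rough illustration of the Parser stage, here is a minimal sketch of reading a Yosys JSON netlist (the output of `write_json`) into a flat gate list. The embedded netlist, the `parse_gates` helper, and the tuple layout are assumptions made for this example, not VHE's actual code.

```python
import json

# Tiny hand-written netlist in the layout produced by Yosys `write_json`.
# The module, cell, and port names here are made up for illustration.
NETLIST = json.loads("""
{
  "modules": {
    "top": {
      "ports": {
        "a": {"direction": "input",  "bits": [2]},
        "b": {"direction": "input",  "bits": [3]},
        "y": {"direction": "output", "bits": [5]}
      },
      "cells": {
        "g0": {
          "type": "$_AND_",
          "port_directions": {"A": "input", "B": "input", "Y": "output"},
          "connections": {"A": [2], "B": [3], "Y": [4]}
        },
        "g1": {
          "type": "$_NOT_",
          "port_directions": {"A": "input", "Y": "output"},
          "connections": {"A": [4], "Y": [5]}
        }
      }
    }
  }
}
""")


def parse_gates(netlist, top):
    """Return a list of (name, type, input_nets, output_net) tuples.

    Assumes every cell has exactly one single-bit output, which holds for
    simple gate cells but not for every cell type Yosys can emit.
    """
    gates = []
    for name, cell in netlist["modules"][top]["cells"].items():
        ins, out = [], None
        for port, bits in cell["connections"].items():
            # Bits are integer net ids; constants appear as "0"/"1" strings.
            if cell["port_directions"][port] == "input":
                ins.extend(bits)
            else:
                out = bits[0]
        gates.append((name, cell["type"], ins, out))
    return gates


gates = parse_gates(NETLIST, "top")
for gate in gates:
    print(gate)
```

Net ids are plain integers, so a gate's fan-in can be found by matching its input ids against other gates' output ids.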

The Journey (Real Numbers)

Design	Gates	VHE Speed	vs Verilator
PicoRV32	8K	11,063 cyc/s	100× faster
mor1kx	1.25M	2,941 cyc/s	27× faster
GEMMX	1.4M	1,465 cyc/s	13× faster
WZ-NPU	6.7M	3,444 cyc/s	Verilator: DNF

Architecture Deep Dive

Phase 1: Levelization

Gates form a DAG. We topologically sort them into "levels" — gates at level N depend only on gates at levels < N.

changed = True
while changed:
    changed = False
    for gate in gates:
        level = max((src.level for src in gate.inputs), default=-1) + 1
        changed, gate.level = changed or level != gate.level, level

Our 6.7M gate NPU: 447 logic levels.
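The loop above re-scans every gate until nothing changes. A single-pass alternative is Kahn's algorithm, sketched below. This is a swapped-in technique for illustration, not necessarily what VHE does; the `levelize` function and the `fanin` dictionary layout are hypothetical.

```python
from collections import deque


def levelize(fanin):
    """fanin maps each gate name to the list of gates that drive it.

    Gates with no drivers (primary inputs, constants) get level 0.
    Raises ValueError if the netlist contains a combinational loop.
    """
    fanout = {g: [] for g in fanin}
    pending = {g: len(srcs) for g, srcs in fanin.items()}
    for gate, srcs in fanin.items():
        for src in srcs:
            fanout[src].append(gate)

    level = {g: 0 for g in fanin}
    ready = deque(g for g, n in pending.items() if n == 0)
    done = 0
    while ready:
        gate = ready.popleft()
        done += 1
        for sink in fanout[gate]:
            # A sink sits one level above its deepest driver.
            level[sink] = max(level[sink], level[gate] + 1)
            pending[sink] -= 1
            if pending[sink] == 0:
                ready.append(sink)

    if done != len(fanin):
        raise ValueError("combinational loop: not a DAG")
    return level


levels = levelize({"a": [], "b": [], "g0": ["a", "b"], "g1": ["g0"], "g2": ["g1", "a"]})
print(levels)
```

Each gate and each connection is visited once, so the cost grows linearly with netlist size and no iteration cap is needed.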

Phase 2: GPU Dispatch

Each level is a CUDA kernel launch. Gates within a level evaluate in parallel.

Level 0: 12,847 gates  → 1 kernel, 12,847 threads
Level 1: 8,234 gates   → 1 kernel, 8,234 threads
...
Level 447: 156 gates   → 1 kernel, 156 threads
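To make the level-by-level schedule concrete, here is a CPU-only sketch in plain Python. The inner loop stands in for the threads of one kernel launch. The `OPS` table, the `simulate` function, and the data layout are assumptions for illustration; the real CUDA kernel is not shown in this post.

```python
from collections import defaultdict

# Hypothetical 2-input gate functions, keyed by Yosys-style cell type names.
OPS = {
    "$_AND_": lambda a, b: a & b,
    "$_OR_": lambda a, b: a | b,
    "$_XOR_": lambda a, b: a ^ b,
}


def simulate(gates, levels, inputs):
    """Evaluate a levelized combinational netlist.

    gates:  name -> (type, input_a, input_b)
    levels: name -> level (primary inputs are level 0 and not in `gates`)
    inputs: primary input name -> 0 or 1
    """
    by_level = defaultdict(list)
    for name in gates:
        by_level[levels[name]].append(name)

    values = dict(inputs)
    for lvl in sorted(by_level):  # one "kernel launch" per level
        # Every gate in this batch reads only values from earlier levels,
        # so iterations are independent: on a GPU each would be one thread.
        results = {}
        for name in by_level[lvl]:
            kind, a, b = gates[name]
            results[name] = OPS[kind](values[a], values[b])
        values.update(results)  # stands in for the sync between launches
    return values


gates = {"s": ("$_XOR_", "a", "b"), "c": ("$_AND_", "a", "b"), "o": ("$_OR_", "s", "c")}
levels = {"s": 1, "c": 1, "o": 2}
out = simulate(gates, levels, {"a": 1, "b": 1})
print(out["s"], out["c"], out["o"])
```

Writing results into a separate dictionary before merging mirrors the rule that gates within a level never read each other's outputs.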

Challenges We Hit

Levelization cap: Initial algorithm hit a 100-iteration limit. Fixed with proper visited tracking.
Memory management: 6.7M gates × 4 bytes × 2 (current + next) ≈ 54 MB of state. Fits in GPU memory.
Timing accuracy: Simulation is currently zero-delay (functional only). SDF timing annotation is in progress.
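The memory figure can be checked with a few lines of arithmetic. The `state_bytes` helper is hypothetical, and it assumes one 4-byte value per gate per buffer, as the estimate above does.

```python
def state_bytes(num_gates, bytes_per_gate=4, buffers=2):
    """Bytes needed to hold one value per gate in each buffer."""
    return num_gates * bytes_per_gate * buffers


total = state_bytes(6_700_000)
print(f"{total / 1e6:.1f} MB")  # 53.6 MB, rounded up to 54 MB in the text
```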

Why Not Verilator?

Verilator is great for RTL. But at gate-level with millions of cells:

VCD traces explode (56 GB for one test)
Single-threaded evaluation by default (multithreading requires Verilator's --threads option)
No GPU acceleration

VHE trades generality for speed. We only support gate-level netlists from Yosys. That's all we need.

The Proof

We used VHE to verify WZ-NPU (our open-source NPU):

Test	Description	Result
VHE-F1	Deterministic GEMM	✅ PASS
VHE-F2	Random GEMM (10 seeds)	✅ PASS
VHE-F3	Reset/Start torture	✅ PASS
VHE-P1	Tile scaling equivalence	✅ PASS
VHE-S1	Backpressure stress	✅ PASS

What's Next

SDF timing annotation (post-layout accuracy)
4-value logic (X, Z propagation)
Waveform export (VCD/FST)
Integration with formal tools
