VHE: GPU-Accelerated Gate-Level Simulation at Zero License Cost

How we built a GPU simulator to verify a 6.7M gate NPU when Verilator failed

The Problem

Our NPU design hit 1.4 million gates. We started a convolution test in Verilator.

Runtime: 139 billion cycles
VCD trace: 56 GB
Status: Killed after 3 days

Commercial emulators cost a lot. We're a startup in India. That wasn't happening.

What We Built

VHE (Virtual Hardware Emulator) — GPU-accelerated gate-level simulation.

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Yosys     │───▶│   Parser    │───▶│  Levelizer  │───▶│    CUDA     │
│  JSON Net   │    │  (Python)   │    │  (DAG sort) │    │   Kernel    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                               │
                                                               ▼
                                                        ┌─────────────┐
                                                        │  Simulation │
                                                        │   Output    │
                                                        └─────────────┘
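
To make the first two boxes of the pipeline concrete, here is a minimal sketch of reading a Yosys JSON netlist in Python. It assumes the layout written by Yosys `write_json` (a `modules` object whose `cells` carry `type`, `port_directions`, and `connections`). The tiny netlist, the module name `top`, and the helper `parse_gates` are made up for illustration and are not VHE's actual parser.

```python
import json

# Tiny hand-written netlist in the layout produced by Yosys `write_json`.
# The module name "top" and the cell names g0/g1 are illustrative only.
NETLIST = """
{
  "modules": {
    "top": {
      "ports": {
        "a": {"direction": "input",  "bits": [2]},
        "b": {"direction": "input",  "bits": [3]},
        "y": {"direction": "output", "bits": [5]}
      },
      "cells": {
        "g0": {"type": "$_AND_",
               "port_directions": {"A": "input", "B": "input", "Y": "output"},
               "connections": {"A": [2], "B": [3], "Y": [4]}},
        "g1": {"type": "$_NOT_",
               "port_directions": {"A": "input", "Y": "output"},
               "connections": {"A": [4], "Y": [5]}}
      }
    }
  }
}
"""

def parse_gates(netlist_text, module_name):
    """Return a list of (name, type, input_bits, output_bits) tuples.

    Hypothetical helper. Real netlists may also contain constant bits
    written as strings ("0", "1", "x", "z"), which this sketch does not
    treat specially.
    """
    module = json.loads(netlist_text)["modules"][module_name]
    gates = []
    for name, cell in module["cells"].items():
        ins, outs = [], []
        for port, bits in cell["connections"].items():
            target = ins if cell["port_directions"][port] == "input" else outs
            target.extend(bits)
        gates.append((name, cell["type"], ins, outs))
    return gates

gates = parse_gates(NETLIST, "top")
for gate in gates:
    print(gate)
```

Each integer is a net bit ID, so a gate whose input list contains bit 4 depends on whichever gate drives bit 4. That dependency information is what the levelizer needs.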

The Journey (Real Numbers)

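| Design   | Gates | VHE Speed    | vs Verilator   |
|----------|-------|--------------|----------------|
| PicoRV32 | 8K    | 11,063 cyc/s | 100× faster    |
| mor1kx   | 1.25M | 2,941 cyc/s  | 27× faster     |
| GEMMX    | 1.4M  | 1,465 cyc/s  | 13× faster     |
| WZ-NPU   | 6.7M  | 3,444 cyc/s  | Verilator: DNF |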

Architecture Deep Dive

Phase 1: Levelization

Gates form a DAG. We topologically sort them into "levels" — gates at level N depend only on gates at levels < N.

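changed = True
while changed:  # repeat until no level changes; assumes every gate.level starts at 0
    changed = False
    for gate in gates:
        new_level = max((src.level for src in gate.inputs), default=-1) + 1  # no inputs: level 0
        changed, gate.level = changed or new_level != gate.level, new_level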

Our 6.7M gate NPU: 447 logic levels.
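
For comparison, the same levels can be computed in a single pass with Kahn's algorithm, which also detects combinational loops instead of iterating forever. This is a sketch of that alternative technique, not necessarily what VHE does. The input format (a dict from gate name to fan-in gate names) and the function name `levelize` are assumptions, and a sequential design would presumably treat flip-flop outputs as level-0 sources so the graph stays acyclic.

```python
from collections import defaultdict, deque

def levelize(gates):
    """Assign a level to every gate using Kahn's algorithm.

    `gates` maps gate name -> list of fan-in gate names (hypothetical
    format). Gates with no fan-in get level 0. Raises ValueError if the
    graph contains a combinational loop.
    """
    fanout = defaultdict(list)
    pending = {}                      # unresolved fan-in count per gate
    for name, fanin in gates.items():
        pending[name] = len(fanin)
        for src in fanin:
            fanout[src].append(name)

    level = {name: 0 for name in gates}
    ready = deque(name for name, count in pending.items() if count == 0)
    done = 0
    while ready:
        name = ready.popleft()
        done += 1
        for dst in fanout[name]:
            # A gate sits one level above its deepest input.
            level[dst] = max(level[dst], level[name] + 1)
            pending[dst] -= 1
            if pending[dst] == 0:
                ready.append(dst)

    if done != len(gates):
        raise ValueError("combinational loop: some gates never became ready")
    return level

example = {
    "a": [], "b": [],
    "and0": ["a", "b"],
    "not0": ["and0"],
    "or0": ["not0", "a"],
}
levels = levelize(example)
print(levels)
```

Every gate is visited once and every edge once, so the cost grows linearly with netlist size rather than with the number of fixed-point iterations.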

Phase 2: GPU Dispatch

Each level is a CUDA kernel launch. Gates within a level evaluate in parallel.

Level 0: 12,847 gates  → 1 kernel, 12,847 threads
Level 1: 8,234 gates   → 1 kernel, 8,234 threads
...
Level 447: 156 gates   → 1 kernel, 156 threads
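
The dispatch pattern can be emulated on the CPU to show why it is safe to run a whole level in parallel. The sketch below is plain Python, not CUDA, and the gate encoding `(op, input_a, input_b, output)` over a flat state list is an assumption for illustration. The outer loop stands in for one kernel launch per level and each gate in the inner list stands in for one thread.

```python
# Each gate is (op, input_a, input_b, output); the integers index a flat
# net-state list. This encoding is illustrative, not VHE's actual format.
OPS = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a, b: a ^ 1,   # second input is ignored
}

def evaluate(levels, state):
    """Evaluate levels in order; gates inside one level are independent."""
    for gates in levels:                       # one kernel launch per level on a GPU
        # Compute every result before writing any of them, so the order
        # of gates within a level cannot affect the outcome.
        results = [(out, OPS[op](state[a], state[b])) for op, a, b, out in gates]
        for out, value in results:             # one thread per gate on a GPU
            state[out] = value
    return state

# Half adder plus an inverter: nets 0 and 1 are inputs,
# net 2 = sum, net 3 = carry, net 4 = NOT carry.
levels = [
    [("XOR", 0, 1, 2), ("AND", 0, 1, 3)],
    [("NOT", 3, 3, 4)],
]
state = evaluate(levels, [1, 1, 0, 0, 0])
print(state)  # [1, 1, 0, 1, 0]
```

Because a gate only reads nets written by earlier levels, the only synchronization needed is the boundary between levels, which successive kernel launches on the same stream provide.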

Challenges We Hit

  1. Levelization cap: The initial algorithm hit its 100-iteration limit. Fixed with proper visited tracking.

  2. Memory management: 6.7M gates × 4 bytes × 2 (current + next) = 54 MB state. Fits in GPU memory.

  3. Timing accuracy: Simulation is currently zero-delay (functional only). SDF timing annotation is in progress.
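
The memory figure in item 2 is easy to check, and the "current + next" pair suggests double buffering. The sketch below reproduces the arithmetic and shows one plausible use of two buffers: letting every flip-flop sample pre-edge values at a clock edge. The flop encoding `(d_net, q_net)` and the function names are hypothetical, and the article does not confirm that VHE uses its buffers this way.

```python
def state_bytes(num_gates, bytes_per_value=4, buffers=2):
    """Bytes needed to hold one value per gate in each buffer."""
    return num_gates * bytes_per_value * buffers

def clock_edge(current, nxt, flops):
    """Apply one clock edge using two buffers, then swap their roles.

    `flops` is a list of (d_net, q_net) index pairs (hypothetical
    encoding). Every flop reads `current` and writes `nxt`, so a flop
    never sees a value another flop produced on the same edge.
    """
    nxt[:] = current
    for d_net, q_net in flops:
        nxt[q_net] = current[d_net]
    return nxt, current

total = state_bytes(6_700_000)
print(f"{total / 1e6:.1f} MB")  # 53.6 MB

# Two-stage shift register: net 0 feeds flop 1, whose output feeds flop 2.
flops = [(0, 1), (1, 2)]
current, nxt = [1, 0, 0], [0, 0, 0]
current, nxt = clock_edge(current, nxt, flops)
print(current)  # [1, 1, 0]
```

With a single buffer updated in place, the 1 on net 0 would ripple through both flops in one edge and give [1, 1, 1]. The second buffer is what keeps the shift register one stage per cycle.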

Why Not Verilator?

Verilator is great for RTL. But at gate-level with millions of cells:

  • VCD traces explode (56 GB for one test)

  • Single-threaded evaluation by default (multithreading is opt-in via --threads)

  • No GPU acceleration

VHE trades generality for speed. We only support gate-level netlists from Yosys. That's all we need.

The Proof

We used VHE to verify WZ-NPU (our open-source NPU):

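| Test   | Description              | Result  |
|--------|--------------------------|---------|
| VHE-F1 | Deterministic GEMM       | ✅ PASS |
| VHE-F2 | Random GEMM (10 seeds)   | ✅ PASS |
| VHE-F3 | Reset/Start torture      | ✅ PASS |
| VHE-P1 | Tile scaling equivalence | ✅ PASS |
| VHE-S1 | Backpressure stress      | ✅ PASS |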

What's Next

  • SDF timing annotation (post-layout accuracy)

  • 4-value logic (X, Z propagation)

  • Waveform export (VCD/FST)

  • Integration with formal tools