Our Benchmark Methodology: Measuring TOPS/W on Neuromorphic Silicon

Andrei Volkov · August 12, 2025 · 14 min read

Test harness setup for neuromorphic chip power and latency measurement

TOPS/W as a metric is simultaneously the most-cited number in neuromorphic computing marketing and one of the most easily misrepresented. The ambiguity is structural: "operations" in TOPS can mean MAC operations, synaptic operations, or some vendor-specific definition; "W" can mean peak power, average power during inference, or average power over a full duty cycle including sleep. A hardware vendor claiming 50 TOPS/W and a software team reporting 0.8 TOPS/W can both be technically correct while describing the same physical process under different measurement conditions.

This post documents Neurmorph's benchmark methodology v1 — the specific measurement procedures, hardware configurations, and calculation conventions we use for every number published on our benchmarks page. The goal is reproducibility: if you follow this procedure on the same hardware, you should get numbers within ±10% of ours.

Defining TOPS/W for neuromorphic inference

For SNN inference on neuromorphic hardware, the appropriate operation unit is the synaptic operation (SOP) — a single weighted accumulation into a post-synaptic membrane potential triggered by a pre-synaptic spike. SOPs are the relevant computational primitive; they directly map to energy because a SOP only occurs when a spike fires, unlike a MAC which always occurs regardless of activation value.
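As a minimal sketch of that definition, SOPs for a dense feed-forward network can be tallied from the spike counts recorded during an inference. The function name `count_sops` and the example spike counts are hypothetical, chosen only to illustrate the bookkeeping; they are not taken from our harness.

```python
def count_sops(spikes_per_layer, fan_out_per_layer):
    """Count synaptic operations for one inference in a dense feed-forward SNN.

    spikes_per_layer[i]  : total spikes emitted by source layer i during the inference
    fan_out_per_layer[i] : number of post-synaptic neurons each of those spikes drives
    """
    if len(spikes_per_layer) != len(fan_out_per_layer):
        raise ValueError("need one fan-out value per spiking source layer")
    # Each spike triggers one weighted accumulation per outgoing synapse.
    return sum(s * f for s, f in zip(spikes_per_layer, fan_out_per_layer))


# Hypothetical spike counts, for illustration only.
sops = count_sops(spikes_per_layer=[1000, 400], fan_out_per_layer=[256, 128])
print(sops)  # 1000*256 + 400*128 = 307200
```

Because the count depends on the spikes actually emitted, the same network produces different SOP totals on different inputs, which is why the measurement is tied to a specific dataset.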

Our TOPS calculation is therefore:

effective_TOPS = (SOPs_per_inference × inferences_per_second) / 1e12

# Where SOPs_per_inference is measured from the actual spike count during
# inference on the benchmark dataset (not the theoretical maximum at 100% firing rate)

# For a keyword spotting model (512→256→128 LIF layers) on SHD test set:
# Average SOPs per inference: 580,000 (at ~10% average firing rate)
# Inferences per second (sustained): 100
# Effective TOPS = 580,000 × 100 / 1e12 = 5.8×10⁻⁵ TOPS

# Power during inference (measured): 420 µW
# Effective TOPS/W = 5.8×10⁻⁵ / 420×10⁻⁶ = 0.138 TOPS/W

This number looks small compared to published TOPS/W values, because it reflects actual SOPs per real inference on real data — not peak theoretical throughput. Converting to a more intuitive unit: 138 GSOP/W, which is how we publish it. We're not saying other vendors are wrong to publish TOPS/W with a different denominator; we're saying our methodology uses measured SOP count on benchmark test data, and we believe that's the more physically meaningful metric for comparing inference efficiency across implementations.
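The arithmetic can be checked with a short sketch. The function name `efficiency` is hypothetical, and the energy-per-inference value assumes the measured power is the average over sustained operation at the stated inference rate.

```python
def efficiency(sops_per_inference, inferences_per_second, power_watts):
    """Return (effective TOPS, GSOP/W, energy per inference in µJ)."""
    sops_per_second = sops_per_inference * inferences_per_second
    effective_tops = sops_per_second / 1e12
    gsop_per_watt = sops_per_second / power_watts / 1e9
    # Assumes power_watts is the average over sustained inference at this rate.
    energy_per_inference_uj = power_watts / inferences_per_second * 1e6
    return effective_tops, gsop_per_watt, energy_per_inference_uj


# Values from the keyword spotting example above.
tops, gsop_w, e_inf_uj = efficiency(580_000, 100, 420e-6)
print(f"{gsop_w:.0f} GSOP/W, {e_inf_uj:.1f} µJ per inference")
```

With the figures quoted above, this comes to roughly 138 GSOP/W and, under the stated assumption, about 4.2 µJ per inference.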

Hardware measurement setup

Power measurement instrumentation

We use a four-wire Kelvin-sense measurement setup with a 0.1 Ω precision shunt resistor in series with the VDD supply rail. Current is sampled at 1 MHz using a 24-bit ADC (ADS1256 on a custom measurement board). With a 3.3 V full-scale range, one LSB corresponds to about 0.2 µV across the shunt, or roughly 2 µA of supply current, which is fine enough for the current levels relevant to neuromorphic inference. The measurement PCB is fully isolated from the device under test's digital ground to eliminate ground loop contamination.

# Measurement configuration
shunt_resistance = 0.1    # Ohm, ±0.01% tolerance, 25ppm/°C
adc_sample_rate  = 1e6    # Hz
adc_resolution   = 24     # bits
adc_lsb          = 3.3 / 2**24  # ~0.197 µV/LSB
current_lsb      = adc_lsb / shunt_resistance  # ~1.97 µA/LSB

# Temperature-controlled environment
ambient_temp     = 25.0   # °C ±0.5°C (in environmental chamber)
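The configuration above implies a simple conversion from raw ADC codes to supply current and power. The sketch below assumes unsigned, unity-gain codes over a 3.3 V full-scale range; the function names and the 1.0 V rail voltage are hypothetical and only serve the illustration.

```python
SHUNT_OHMS = 0.1          # shunt resistance from the configuration above
ADC_FULL_SCALE_V = 3.3    # full-scale range implied by the LSB calculation
ADC_BITS = 24


def code_to_current_amps(adc_code):
    """Convert a raw unsigned ADC code (voltage across the shunt) to supply current."""
    shunt_volts = adc_code * ADC_FULL_SCALE_V / 2**ADC_BITS
    return shunt_volts / SHUNT_OHMS  # Ohm's law: I = V / R


def current_to_power_watts(current_amps, rail_volts):
    """Instantaneous power drawn from the rail at the given supply voltage."""
    return current_amps * rail_volts


# Hypothetical reading: code 213 on an assumed 1.0 V rail.
current = code_to_current_amps(213)
print(f"{current * 1e6:.1f} µA, {current_to_power_watts(current, 1.0) * 1e6:.1f} µW")
```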

Measurements are taken in a temperature-controlled chamber at 25°C ±0.5°C. This matters because idle leakage in digital CMOS varies by approximately 5–10% per degree Celsius at operating temperature. A drift of only a few degrees can therefore shift measured idle power by well over 10%, enough to swamp real differences between configurations.
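To put rough numbers on that sensitivity, the sketch below compounds the per-degree leakage change over a temperature shift. Treating the change as compounding per degree is an assumption made for illustration, and `leakage_ratio` is a hypothetical helper rather than part of our tooling.

```python
def leakage_ratio(delta_t_c, pct_per_deg):
    """Relative idle-leakage change for a temperature shift, compounded per degree."""
    return (1 + pct_per_deg / 100) ** delta_t_c


# Compare the chamber tolerance (0.5 °C) with an uncontrolled 3 °C drift.
for delta_t in (0.5, 3.0):
    low = (leakage_ratio(delta_t, 5) - 1) * 100
    high = (leakage_ratio(delta_t, 10) - 1) * 100
    print(f"{delta_t} °C shift: +{low:.1f}% to +{high:.1f}% idle power")
```

Under this assumption, the chamber tolerance keeps the leakage change to a few percent, while a 3°C drift would move idle power by roughly 16% to 33%.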

Latency measurement

Inference latency is measured as the wall-clock time from the first byte of the input event buffer being written to the inference engine's input FIFO to the assertion of the inference-complete interrupt by the hardware. We do not include host-side Python overhead in latency measurement — only the hardware inference cycle. All latency measurements are reported as p50 and p99 over 10,000 consecutive inferences on the benchmark dataset.
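The percentile summary can be sketched as follows. The nearest-rank definition used here is an assumption, since percentile conventions differ slightly between tools, and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(latencies_us):
    """Return the p50 and p99 latency for a list of per-inference latencies in µs."""
    return {
        "p50": percentile_nearest_rank(latencies_us, 50),
        "p99": percentile_nearest_rank(latencies_us, 99),
    }


# Hypothetical latencies: 1..100 µs, one sample each.
print(summarize_latency(list(range(1, 101))))
```

Reporting p99 alongside p50 exposes tail behavior that a mean would hide, which matters when an application has a hard latency budget.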

Benchmark datasets and tasks

SHD (Spiking Heidelberg Digits)

Primary benchmark for audio classification. SHD contains roughly 10,000 recordings of spoken digits (0–9) in German and English, pre-processed into 700-channel cochlear filterbank spike trains at 1 ms resolution. The spike trains are directly consumable as SNN input without additional encoding. Network: two hidden LIF layers (700→512→256→10, with softmax decoding). Training: BPTT with FastSigmoid surrogate gradients, T=20, Adam optimizer, weight decay 1e-4.

SSC (Spiking Speech Commands)

Secondary benchmark using the neuromorphic variant of Google Speech Commands. 35-class keyword spotting from 700-channel spike encodings. More challenging than SHD due to class count and speaker variability. Same network architecture scaled to handle 35 output classes.

DVS-Gesture

Benchmark for event-camera-based gesture recognition. 11-class hand gestures recorded with a DVS128 camera (128×128 pixel, 1 µs temporal resolution). Input encoding: 1 ms time bins, 128×128×2 polarity channels. Network: convolutional SNN (2×Conv-LIF + 2×FC-LIF). DVS-Gesture tests spatial feature extraction in the SNN domain.
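The input encoding described above can be sketched as a sparse accumulation of events into 1 ms bins. The event tuple layout `(t_us, x, y, polarity)` and the choice to keep per-bin counts rather than binary flags are assumptions made for this illustration.

```python
from collections import Counter


def bin_events(events, bin_us=1000):
    """Accumulate DVS events into sparse (time_bin, polarity, y, x) counts.

    events: iterable of (t_us, x, y, polarity), with polarity in {0, 1}.
    """
    counts = Counter()
    for t_us, x, y, polarity in events:
        counts[(t_us // bin_us, polarity, y, x)] += 1
    return counts


# Hypothetical events: three at pixel (5, 6) with ON polarity, one OFF event at (0, 0).
events = [(0, 5, 6, 1), (999, 5, 6, 1), (1000, 5, 6, 1), (1500, 0, 0, 0)]
for key, count in sorted(bin_events(events).items()):
    print(key, count)
```

A sparse representation keeps memory proportional to the number of events instead of to the full 128×128×2 frame for every millisecond.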

Custom vibration anomaly dataset (VIB-1K)

An internal benchmark dataset of 1,000 vibration recordings from industrial accelerometers mounted on electric motors and pump housings, with labeled anomaly events (bearing wear, imbalance, cavitation). Recordings span 10 machine types and 3 severity levels. Rate-encoded from FFT features (64-bin, 256-point FFT, 50% overlap). This dataset is not yet public; we plan to release it in 2026. We include it in our internal benchmarks to check that performance on academic benchmarks transfers to inputs with the characteristics of industrial signals.

What we measure and what we report

Metrics

Metric	Definition	Unit
Accuracy	Top-1 classification accuracy on test split	%
E/inf	Mean energy per inference, measured on test set	µJ
GSOP/W	Giga-SOPs per watt (measured SOPs, measured watts)	GSOP/W
Latency p50	Median inference latency, hardware only	µs
Latency p99	99th percentile inference latency	µs
Idle power	Average power with network loaded, no inference	µW
Avg firing rate	Mean spike rate across all layers, all test samples	% of max

What we explicitly do not report

We do not report peak TOPS/W calculated from theoretical maximum firing rate, because this number is not physically meaningful for real workloads. We do not report energy measurements taken outside a temperature-controlled environment. We do not report accuracy from models trained with more compute than is typical for the described task, and we do not cherry-pick overfit checkpoints. We do not compare against MCU baselines at artificially reduced clock frequencies to make the efficiency ratio look larger.

Baseline comparison methodology

Comparisons against MCU baselines use commercial development boards with well-characterized power measurement test points (STM32H743ZI Nucleo board, nRF5340-DK). The MCU models are INT8-quantized equivalents of the SNN models, built with CMSIS-NN kernels on the STM32H743 and with TFLite-Micro on the nRF5340. We use the same inference task (same dataset, same class count) and keep model capacity as close as the network architecture permits (parameter count within ±20%).
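The capacity-matching rule can be expressed as a simple check. This sketch assumes dense layers and measures the ±20% band relative to the SNN parameter count; both function names and the MCU parameter count in the example are hypothetical.

```python
def dense_param_count(layer_sizes, bias=True):
    """Parameter count of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + (n_out if bias else 0)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def within_capacity_band(snn_params, mcu_params, tolerance=0.20):
    """True if the MCU model's parameter count is within ±tolerance of the SNN's."""
    return abs(mcu_params - snn_params) <= tolerance * snn_params


snn_params = dense_param_count([700, 512, 256, 10])
# Hypothetical MCU baseline with 450,000 parameters.
print(snn_params, within_capacity_band(snn_params, 450_000))
```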

We measure MCU inference energy by the same shunt-resistor method, gating measurement windows from inference-trigger GPIO to inference-complete GPIO to exclude the MCU's idle current from the per-inference measurement. This produces a fair per-inference energy comparison rather than a misleading "total system power" comparison where the MCU's peripheral keep-alive power is charged against the neuromorphic system's inference-only measurement.
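A minimal sketch of the gated integration follows. It assumes a constant rail voltage, rectangular integration over the sample period, and a gate trace already aligned sample-for-sample with the current trace; the function names and example values are hypothetical.

```python
def gated_energy_joules(current_amps, gate, rail_volts, sample_rate_hz):
    """Integrate energy over the samples where the gate is high.

    current_amps : per-sample supply current derived from the shunt voltage
    gate         : per-sample booleans, True between the trigger and complete GPIO edges
    """
    if len(current_amps) != len(gate):
        raise ValueError("current and gate traces must be the same length")
    dt = 1.0 / sample_rate_hz
    return sum(i * rail_volts * dt for i, g in zip(current_amps, gate) if g)


def energy_per_inference(current_amps, gate, rail_volts, sample_rate_hz):
    """Divide the gated energy by the number of gate rising edges in the trace."""
    rising_edges = sum(
        1 for prev, cur in zip([False] + list(gate), gate) if cur and not prev
    )
    if rising_edges == 0:
        raise ValueError("no gated inference window found")
    total = gated_energy_joules(current_amps, gate, rail_volts, sample_rate_hz)
    return total / rising_edges


# Hypothetical trace: 1 mA constant draw, two 3-sample inference windows, 3.3 V rail.
gate = [False, True, True, True, False, False, True, True, True, False]
print(energy_per_inference([0.001] * 10, gate, 3.3, 1e6))
```

Samples outside the gate contribute nothing to the total, which is how idle current is kept out of the per-inference figure.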

Known limitations and future methodology evolution

Our v1 methodology has several limitations we intend to address:

No multi-chip measurements: all current benchmarks are single-chip. Multi-chip deployments (Loihi 2 mesh boards) have substantially different inter-chip routing energy that our single-chip numbers don't capture.
Static weight precision: we benchmark a single weight precision per model. The compiler's precision selection pass can produce mixed-precision networks where different layers use INT8 vs INT4; we don't yet have benchmark configurations for these.
No aging characterization: for deployment lifetime predictions, we need power measurements at the 100, 1000, and 8760-hour marks to characterize any drift in neuromorphic chip power characteristics. This requires long-duration test infrastructure we're building.
No temperature characterization: we measure only at 25°C. Industrial applications span -20°C to +85°C. The methodology will be extended with measurements at 0°C, 40°C, and 70°C.

The benchmark methodology is versioned (v1.0) and all numbers on our website are tagged with the methodology version that produced them. When we update the methodology, we will re-run affected benchmarks and clearly mark which numbers changed and why. Benchmark integrity is a long-term asset — numbers we can't defend under scrutiny are worse than no numbers at all.