Why Spike Coding Beats 8-bit Quantization for Always-On Inference

Leila Farrokhzad · January 14, 2025 · 8 min read

Spike coding visualization showing sparse event patterns versus dense quantized activations

A 320×240 keyword-spotting model quantized to INT8 draws roughly 8–12 mW on a Cortex-M4 running at 64 MHz. The same functional model, compiled to a spike-coded representation and mapped to a neuromorphic core, draws 40–80 µW in steady-state inference. That is not a rounding error. The gap is structural, and understanding it requires looking below the operator abstraction that most edge-AI frameworks hide.

What quantization actually trades away

Quantization-aware training (QAT) reduces weight and activation precision — typically from FP32 to INT8, sometimes INT4 — and therefore shrinks the memory footprint and lowers the multiply-accumulate (MAC) cost per operation. On a Cortex-M7 with a SIMD unit capable of 4× INT8 MACs per cycle, INT8 inference can approach 200 GMAC/s·W. That sounds impressive until you measure what "inference" means for an always-on task.

An always-on keyword spotter doesn't run inference once and sleep. It runs continuously, on 100 ms windows at a 10 ms stride, because the trigger event could arrive at any time. At 8 mW of continuous inference power, a 3 V CR2032 (nominal 660 mWh) lasts roughly 3.4 days (660 mWh ÷ 8 mW ≈ 82 hours). No amount of quantization changes the underlying reason: the model is dense. Every inference window touches every weight, every activation, every layer, regardless of whether the audio contains anything meaningful.

Sparsity is where quantization falls short. INT8 doesn't model time. A zero-valued INT8 activation still participates in the matrix multiply unless the hardware has explicit structured-sparsity support (which most Cortex-M series do not). The arithmetic cost is fixed to the model's dense footprint.

The energy mechanics of spike coding

Spiking Neural Networks communicate via binary events — spikes — propagated asynchronously through time. The canonical Leaky Integrate-and-Fire (LIF) neuron accumulates weighted incoming spikes into a membrane potential V_m, and only emits an output spike when V_m exceeds threshold V_th. Between spikes, the neuron leaks according to a time constant τ and consumes no dynamic power for the arithmetic it isn't doing.
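The LIF behaviour described above can be sketched in a few lines of discrete-time Python. This is a minimal illustration, not any vendor's neuron model: the function name `simulate_lif`, the exponential leak per step, the reset-to-zero after a spike, and the parameter values (tau, v_th, dt) are all assumptions chosen for readability.

```python
import math

def simulate_lif(inputs, tau=20.0, v_th=1.0, dt=1.0):
    """Simulate one LIF neuron over a sequence of input values.

    inputs[t] is the summed weighted spike input arriving at step t.
    Returns a list of 0/1 output spikes, one per step.
    """
    decay = math.exp(-dt / tau)  # fraction of V_m kept after one step of leak
    v_m = 0.0
    spikes = []
    for i_t in inputs:
        v_m = v_m * decay + i_t  # leak, then integrate the new input
        if v_m > v_th:           # spike only when V_m exceeds V_th
            spikes.append(1)
            v_m = 0.0            # hard reset (assumption)
        else:
            spikes.append(0)
    return spikes

quiet = simulate_lif([0.0] * 50)  # no input: the neuron never fires
busy = simulate_lif([0.3] * 50)   # steady input: periodic spikes

print("spikes with no input:    ", sum(quiet))
print("spikes with steady input:", sum(busy))
```

With no input the membrane potential stays at zero and nothing fires, which is the property the energy argument depends on: no spike means no downstream synaptic operation.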

This matters because the fundamental cost unit on neuromorphic silicon isn't FLOPS or MACs — it's synaptic operations (SOPs): the act of accumulating a weighted spike into a post-synaptic membrane. On BrainChip Akida AKD1000, one SOP costs roughly 0.8–1.2 pJ. On Intel Loihi 2, SOP energy is in the 1–5 pJ range depending on fanout and routing distance. Compare this to a full INT8 MAC on a Cortex-M7 SIMD unit: ~50–100 pJ when you account for the full pipeline power.

The energy per inference then becomes: E_inf = SOP_count × E_SOP. And SOP_count is determined by average firing rate across the network — which for natural signals with strong temporal redundancy is often 5–15% of the theoretical maximum. A spike-coded model inferring a quiescent audio frame fires almost nothing. A spike-coded model encountering a keyword fires in a structured burst along relevant feature pathways. The energy is signal-proportional.

Measuring sparsity in practice

In our internal benchmarks on an SHD (Spiking Heidelberg Digits) classification task using a two-layer LIF network (512×256 neurons), average firing rates converge to 8–11% per layer after training with surrogate gradients (FastSigmoid, β=10). This yields an effective SOP count of roughly 580K per 100 ms inference window, which translates to approximately 0.6–0.7 µJ per inference on Loihi 2 neurocores (about 1.0–1.2 pJ per SOP, the low end of the Loihi 2 range quoted above).

The equivalent INT8 dense model on a Cortex-M4 at 64 MHz consumes approximately 8 µJ per inference window when accounting for memory access patterns (the bottleneck is usually SRAM bandwidth, not compute). At the inference level, before considering idle power, the spike-coded path is therefore roughly 12× more efficient (8 µJ against 0.6–0.7 µJ).
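To make the arithmetic easy to re-run, here is a small sketch of E_inf = SOP_count × E_SOP using the figures quoted in this section. The helper name `inference_energy_uj` and the three sampled E_SOP values are illustrative choices; only the 580K SOP count, the 1–5 pJ Loihi 2 range, and the 8 µJ INT8 figure come from the text.

```python
def inference_energy_uj(sop_count, e_sop_pj):
    """E_inf = SOP_count x E_SOP, converted from picojoules to microjoules."""
    return sop_count * e_sop_pj * 1e-6

SOP_COUNT = 580_000    # SOPs per 100 ms window (from the benchmark above)
INT8_ENERGY_UJ = 8.0   # dense INT8 energy per window on Cortex-M4 (from the text)

# Sample the quoted Loihi 2 range of 1-5 pJ per SOP.
for e_sop_pj in (1.0, 1.2, 5.0):
    e_uj = inference_energy_uj(SOP_COUNT, e_sop_pj)
    ratio = INT8_ENERGY_UJ / e_uj
    print(f"{e_sop_pj:.1f} pJ/SOP -> {e_uj:.2f} uJ per inference, "
          f"INT8/SNN ratio {ratio:.1f}x")
```

The roughly 12× figure corresponds to the low end of the SOP energy range. At 5 pJ per SOP the same SOP count gives a ratio under 3×, so the per-SOP energy actually achieved after mapping matters as much as the firing rate.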

Where the comparison gets complicated

We're not saying INT8 quantization is a poor engineering choice in general — it remains the pragmatic baseline for many real-time inference tasks where latency matters more than power. A wrist-worn gesture recognizer that must respond within 50 ms to a single gesture may well prefer INT8 on an nRF5340 over the added complexity of neuromorphic compilation. The tradeoff calculus shifts when the task is always-on with irregular event arrival.

The honest comparison also has to account for accuracy. SNN models trained with surrogate gradient methods typically reach within 1–3% accuracy of their ANN equivalents on classification tasks — but this gap widens on tasks that require dense spatial precision. Object detection with precise bounding-box regression is still harder in the SNN domain because rate coding requires more timesteps (T=20–50) to accumulate stable probability estimates, increasing latency. Temporal coding approaches (encoding information in first-spike timing rather than spike rate) can reduce this to T=4–8 but require more careful training.

Rate coding vs temporal coding: the energy is not the same

Rate coding over T=20 timesteps means a neuron that should fire at a 50% rate will emit 10 spikes. Temporal coding for the same neuron emits a single spike at timestep t=10 (spike latency decreases as stimulus intensity increases). The SOP count under temporal coding is substantially lower, often by 3–5×, but the required precision of spike timing places stronger demands on the compiler's scheduling pass and the hardware's timestamp resolution.
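The difference in spike count is easy to see in a toy encoder. The sketch below assumes a normalized intensity x in [0, 1], a deterministic evenly spaced rate code, and a linear time-to-first-spike mapping; `rate_code` and `latency_code` are hypothetical helper names, and real encoders are usually stochastic or learned.

```python
def rate_code(x, T=20):
    """Rate code: spread round(x * T) spikes evenly over T timesteps."""
    n = round(x * T)
    train = [0] * T
    for k in range(n):
        train[(k * T) // n] = 1  # evenly spaced, distinct slots for n <= T
    return train

def latency_code(x, T=20):
    """Time-to-first-spike code: stronger input fires earlier, zero input never fires."""
    train = [0] * T
    if x > 0:
        t = min(T - 1, round((1.0 - x) * T))  # linear latency mapping (assumption)
        train[t] = 1
    return train

x = 0.5
rate_train = rate_code(x)
latency_train = latency_code(x)

print("rate-coded spikes:   ", sum(rate_train))
print("latency-coded spikes:", sum(latency_train))
print("latency spike index: ", latency_train.index(1))
```

For x = 0.5 and T = 20 the rate code emits 10 spikes and the latency code emits one spike at index 10, matching the example above. Each spike triggers one SOP per post-synaptic target, so the spike count feeds directly into SOP_count.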

On SynSense Xylo, which targets audio-frequency neuromorphic inference with an analog frontend, temporal coding maps well because the hardware's event queue natively handles sub-millisecond spike timestamps. On BrainChip Akida, which operates on rate-coded input from standard frame-based sensor data, rate coding is the practical path. The compiler must know which encoding to target — it's a hardware-topology decision, not just a training decision.

The always-on duty-cycle advantage

The deeper advantage of spike coding for always-on deployments is architectural. Neuromorphic cores can enter a wake-on-spike state where the core clock is gated and only the spike router is active, drawing sub-µA quiescent current. The core wakes only when an incoming spike arrives. For a keyword-spotting application with a 0.1% duty cycle (one keyword per 1000 audio frames), the time-averaged power with wake-on-spike approaches:

P_avg = P_active × duty_cycle + P_sleep × (1 - duty_cycle)
      = 400 µW × 0.001 + 0.8 µW × 0.999
      ≈ 1.2 µW

At 1.2 µW average from a 660 mWh CR2032, the theoretical lifetime exceeds 60 years — which means the physical battery degradation mechanisms, not energy capacity, become the limiting factor. In practice, with realistic leakage and event overhead, targets of 3–5 years on a coin cell are achievable for sensors with low event rates.

No amount of INT8 quantization plus sleep mode on a standard MCU approaches this. An STM32L4 in Stop2 mode draws ~3 µA. Even at that floor, 660 mWh at 3 V gives you roughly 8 years (3 µA × 3 V = 9 µW), but Stop2 means no inference. The moment you wake the MCU for inference (typically requiring a full 1–5 ms wake-up sequence plus inference time), average power climbs rapidly. Waking 10 times per second to run a 5 ms INT8 inference at 8 mW is already 400 µW average, before memory leakage, peripheral keep-alive current, and ADC sampling.
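The duty-cycle comparison can be reproduced with a short script. This is a sketch under the figures quoted in this section; the function names are illustrative, and the model ignores wake-up energy, battery self-discharge, and regulator losses, so the lifetimes are upper bounds rather than predictions.

```python
HOURS_PER_YEAR = 24 * 365.25

def average_power_uw(p_active_uw, p_sleep_uw, duty_cycle):
    """P_avg = P_active * duty_cycle + P_sleep * (1 - duty_cycle), in microwatts."""
    return p_active_uw * duty_cycle + p_sleep_uw * (1.0 - duty_cycle)

def lifetime_years(capacity_mwh, avg_power_uw):
    """Ideal battery lifetime: capacity divided by average power."""
    hours = capacity_mwh * 1000.0 / avg_power_uw  # mWh -> uWh, then uWh / uW = h
    return hours / HOURS_PER_YEAR

CR2032_MWH = 660.0

# Neuromorphic core with wake-on-spike: 400 uW active, 0.8 uW asleep, 0.1% duty.
neuromorphic_uw = average_power_uw(400.0, 0.8, 0.001)

# MCU that never wakes: Stop2 floor of 3 uA at 3 V.
mcu_sleep_uw = 3.0 * 3.0

# MCU polling: 10 inferences per second, 5 ms each, 8 mW while active.
mcu_polling_uw = average_power_uw(8000.0, mcu_sleep_uw, 10 * 0.005)

for name, p in [("neuromorphic wake-on-spike", neuromorphic_uw),
                ("MCU Stop2 only (no inference)", mcu_sleep_uw),
                ("MCU polling 10 Hz", mcu_polling_uw)]:
    print(f"{name}: {p:.2f} uW, {lifetime_years(CR2032_MWH, p):.2f} years")
```

The polling case lands slightly above the 400 µW quoted above because this sketch also charges the Stop2 floor for the remaining 95% of the time.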

Compiler implications: sparsity isn't free

The energy advantage of spike coding only materializes if the compiler correctly exploits it. A naive compilation of an SNN to a neuromorphic core that doesn't account for spike sparsity in its scheduling pass will over-allocate time slots and leave cores active waiting for spikes that don't arrive. Dead neuron elimination — removing neurons that train to zero firing rate — is a prerequisite pass before any energy estimate is meaningful.
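As a rough illustration of dead neuron elimination, the sketch below drops hidden neurons whose measured firing rate is zero and removes the matching incoming and outgoing weights. The function name `prune_dead_neurons`, the list-of-lists weight layout, and the `eps` threshold are assumptions for this example, not the data structures of any particular compiler.

```python
def prune_dead_neurons(w_in, w_out, firing_rates, eps=0.0):
    """Remove hidden neurons whose measured firing rate is <= eps.

    w_in[j]      : incoming weight row of hidden neuron j
    w_out[k][j]  : weight from hidden neuron j to output neuron k
    firing_rates : average firing rate per hidden neuron on calibration data
    Returns (pruned_w_in, pruned_w_out, kept_indices).
    """
    keep = [j for j, rate in enumerate(firing_rates) if rate > eps]
    pruned_in = [w_in[j] for j in keep]                      # drop dead rows
    pruned_out = [[row[j] for j in keep] for row in w_out]   # drop dead columns
    return pruned_in, pruned_out, keep

# Three hidden neurons; neuron 1 never fired during calibration.
w_in = [[1, 2], [3, 4], [5, 6]]
w_out = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6]]
rates = [0.10, 0.0, 0.08]

new_in, new_out, kept = prune_dead_neurons(w_in, w_out, rates)
print("kept neurons:", kept)
print("incoming weights:", new_in)
print("outgoing weights:", new_out)
```

A neuron that never spikes sends nothing downstream, so removing it leaves outputs unchanged on the calibration data. Whether it stays silent on unseen inputs depends on how representative that data is.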

Fanout splitting is equally critical. A neuron with 1024 post-synaptic targets that maps to a single neurocore on Loihi 2 will saturate that core's axon routing bandwidth at high firing rates, introducing stall cycles that burn power without advancing computation. The compiler needs to analyze fanout distributions and split high-fanout neurons across multiple cores with local spike replication, trading core area for routing efficiency.
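A fanout-splitting pass can be sketched as a simple partitioning step. The code below is a toy version that assumes a single per-replica fanout limit; the limit of 256, the function names, and the dictionary layout are hypothetical and do not reflect Loihi 2's actual routing constraints.

```python
def split_fanout(targets, max_fanout):
    """Partition one neuron's post-synaptic targets into replicas of bounded size."""
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    return [targets[i:i + max_fanout] for i in range(0, len(targets), max_fanout)]

def plan_splits(fanout_by_neuron, max_fanout):
    """Return {neuron_id: list of replica target lists} for neurons over the limit."""
    plan = {}
    for neuron_id, targets in fanout_by_neuron.items():
        if len(targets) > max_fanout:
            plan[neuron_id] = split_fanout(targets, max_fanout)
    return plan

# One neuron with 1024 targets and one with 64; hypothetical limit of 256 per replica.
network = {
    "n0": list(range(1024)),
    "n1": list(range(64)),
}
plan = plan_splits(network, max_fanout=256)

for neuron_id, replicas in plan.items():
    print(neuron_id, "->", len(replicas), "replicas of sizes",
          [len(r) for r in replicas])
```

A real pass would also weigh each neuron's firing rate and the placement of its targets, since a high-fanout neuron that rarely fires may not be worth the extra core area.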

This is the gap that most academic SNN implementations miss. They report raw inference energy at a fixed firing rate but don't model the routing overhead. In our benchmarks on Loihi 2 with a standard SHD task, unoptimized mapping produces 1.8–2.4× higher energy than the compiler-optimized mapping — because routing stalls and idle-core wakeup overhead dominate at low firing rates.

Practical implications for hardware selection

If your deployment is keyword spotting, vibration anomaly detection, or any always-on classification task with sparse natural inputs, spike coding on neuromorphic silicon is the right engineering choice — but only when:

The input signal has genuine temporal structure that rate or temporal coding can exploit (audio, vibration, event camera streams, ECG/EEG signals work well; dense RGB video at 30 fps is harder).
You have a training pipeline that produces well-calibrated firing rates (5–20% average per layer). Models with firing rates above 40% approach the energy cost of dense computation.
The compiler correctly handles fanout, dead neuron pruning, and sleep-mode scheduling.

INT8 quantization is the right choice when the inference task is compute-bound rather than power-bound: a video frame that must be processed within 30 ms regardless of content, or a model that runs on a system already powered by a larger battery where latency SLA matters more than µW budget.

The neuromorphic inference story is fundamentally about signal-proportional energy. The physical intuition is correct — if the input carries no information, the network should do almost no work. Spike coding is the mechanism by which that intuition becomes measurable silicon efficiency.
