SNN vs ANN Latency Trade-offs at the Edge: A Practitioner's Guide

[Figure: latency comparison chart, SNN versus ANN on edge inference hardware]

The most common question we get from engineers evaluating neuromorphic inference is some version of: "What's the latency penalty?" The implicit model is that SNNs trade latency for power. That framing is partially right and partially misleading, and getting the distinction wrong leads to using the wrong tool for the application.

The T-step problem: where latency comes from in SNNs

An ANN forward pass on a Cortex-M7 at 200 MHz running INT8 is essentially a sequence of MAC operations, so latency is determined by the operation count and the memory access pattern. For a small 4-layer MLP (512→256→128→10), INT8 inference on a Cortex-M7 with CMSIS-NN takes approximately 0.8–1.2 ms end-to-end, dominated by SRAM bandwidth.

An SNN forward pass requires T timesteps to complete an inference. Each timestep propagates spikes through the network and updates membrane potentials. The total inference latency is approximately T × t_step, where t_step is the time to execute one timestep on the hardware. For rate-coded SNNs, T typically ranges from 20 to 100 to achieve reliable classification — because the rate code needs enough spikes to represent a stable probability estimate.

At T=20 with t_step=2 ms on Loihi 2 neurocores (synchronous timestep mode), total latency is ~40 ms. For the same task, INT8 on Cortex-M7 takes 0.8–1.2 ms, so the rate-coded SNN is slower by well over an order of magnitude. This is a real, significant latency disadvantage for rate-coded SNNs. We're not going to argue otherwise.
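The arithmetic behind that comparison can be written out as a short sketch. The function name is made up for illustration, and the numbers are the ones quoted in this section; the T × t_step model only applies to synchronous timestep execution.

```python
def snn_latency_ms(timesteps: int, t_step_ms: float) -> float:
    """Synchronous-mode estimate: total latency is T x t_step."""
    return timesteps * t_step_ms

# Figures quoted in the text above.
snn_ms = snn_latency_ms(timesteps=20, t_step_ms=2.0)  # rate-coded SNN, T=20
ann_ms_range = (0.8, 1.2)                             # INT8 MLP on Cortex-M7

# Slowdown of the SNN relative to the slow and fast ends of the ANN range.
slowdown_min = snn_ms / ann_ms_range[1]
slowdown_max = snn_ms / ann_ms_range[0]
print(f"SNN: {snn_ms:.0f} ms, {slowdown_min:.0f}x to {slowdown_max:.0f}x slower than the ANN")
```

With these inputs the rate-coded SNN comes out roughly 33× to 50× slower than the INT8 baseline.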

Where temporal coding changes the calculation

Temporal coding encodes information in first-spike timing rather than in spike count across T timesteps. A neuron representing high stimulus intensity fires at timestep t=1; a neuron representing low intensity fires at t=10 (or doesn't fire at all). Classification can be read from the first layer of post-synaptic integrations after the first spike propagation, which in well-trained networks means an effective T of 4 to 8.
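A minimal sketch of a time-to-first-spike encoder makes the idea concrete. The linear intensity-to-latency mapping, the `threshold` cutoff, and the function name are assumptions chosen for illustration; real encoders often use a logarithmic or learned mapping.

```python
import numpy as np

def ttfs_encode(intensity, t_max=10, threshold=0.05):
    """Map normalized intensities in [0, 1] to first-spike timesteps.

    Strong inputs fire early (t=1), weak inputs fire late (t=t_max),
    and inputs below `threshold` never fire (encoded as -1).
    """
    x = np.clip(np.asarray(intensity, dtype=float), 0.0, 1.0)
    # Linear latency code (an assumption): x=1 -> t=1, x=0 -> t=t_max.
    t = np.rint(t_max - x * (t_max - 1)).astype(int)
    return np.where(x >= threshold, t, -1)

spike_times = ttfs_encode([1.0, 0.6, 0.1, 0.0])
print(spike_times)
```

Each input produces at most one spike, which is why a readout can decide after only a handful of effective timesteps.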

On Loihi 2 in asynchronous execution mode (event-driven rather than synchronous timestep), a first-spike temporal-coded inference on a 4-layer network completes in 60–120 µs wall-clock time for a keyword spotting task. That is roughly an order of magnitude below the ~1 ms INT8 inference on Cortex-M7 cited above. The catch: temporal coding requires precise spike timing, which places stronger demands on the training procedure (a time-to-first-spike loss function rather than standard cross-entropy) and makes the network sensitive to inter-chip timing jitter in multi-chip deployments.

N-MNIST benchmark comparison

On the N-MNIST dataset (neuromorphic MNIST from a DVS camera, input as event streams), temporal-coded SNN inference on Loihi 2 achieves:

  • Latency: 80–150 µs per sample (asynchronous mode, T_eff ≈ 5–8)
  • Accuracy: 98.7–99.1% (comparable to dense ANN)
  • Energy: 0.8–1.4 µJ per inference

A comparable dense ANN (similar parameter count) on Cortex-M7 at 200 MHz, INT8:

  • Latency: 2.1 ms per sample (including image preprocessing from raw DVS events)
  • Accuracy: 99.3%
  • Energy: 16 µJ per inference at 8 mW average

Here the SNN wins on both latency and energy, but only because the input is already an event stream from a DVS camera, which eliminates the preprocessing overhead that the ANN must pay. Fed dense frame data directly, the ANN is the faster of the two; when it must first convert events to frames or compute optical flow, it is slower.
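As a sanity check on the figures above, energy per inference is average power times latency. The sketch below only uses numbers from the two lists; the helper name is illustrative.

```python
def energy_uj(avg_power_mw: float, latency_ms: float) -> float:
    """Energy in microjoules: mW x ms = uJ."""
    return avg_power_mw * latency_ms

# ANN on Cortex-M7: 8 mW average over 2.1 ms, quoted above as about 16 uJ.
ann_energy_uj = energy_uj(avg_power_mw=8.0, latency_ms=2.1)

# SNN on Loihi 2: ranges quoted above.
snn_energy_uj = (0.8, 1.4)
snn_latency_us = (80.0, 150.0)

# Ratios of ANN to SNN at the pessimistic and optimistic ends of the SNN ranges.
energy_ratio = (ann_energy_uj / snn_energy_uj[1], ann_energy_uj / snn_energy_uj[0])
latency_ratio = (2100.0 / snn_latency_us[1], 2100.0 / snn_latency_us[0])
print(f"ANN energy: {ann_energy_uj:.1f} uJ")
print(f"energy ratio: {energy_ratio[0]:.0f}x to {energy_ratio[1]:.0f}x")
print(f"latency ratio: {latency_ratio[0]:.0f}x to {latency_ratio[1]:.0f}x")
```

On these numbers the SNN advantage is roughly 12× to 21× in energy and 14× to 26× in latency for event-stream input.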

Input modality as a latency determinant

This is the key insight that practitioners often miss: latency comparisons between SNNs and ANNs are largely determined by the input modality and the encoding/decoding overhead, not the inference topology itself.

Event-stream input (DVS camera, event microphone, spike-encoded sensor)

When input arrives as a native event stream, the SNN starts processing immediately on the first event. Temporal-coded networks can produce classifications after T=4–8 effective timesteps with event-driven execution. Latency advantage goes to SNN.

Frame-based input (standard camera, ADC-sampled accelerometer)

Frame-based input must be rate-encoded before SNN inference — typically by Poisson spike generation proportional to pixel/sample intensity. This encoding step takes 0.5–2 ms and adds to the total latency. Rate-coded networks then require T=20–50 timesteps for reliable inference. For standard 30-fps video input, the SNN latency (encoding + T-step inference) is 15–60 ms, compared to 5–20 ms for INT8 ANN on a comparable embedded processor. Here, latency disadvantage goes to SNN.
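The rate-encoding step can be sketched as follows. This uses a per-timestep Bernoulli draw, a common discrete-time approximation of a Poisson process; the function name, the `max_rate` parameter, and the fixed seed are illustrative assumptions.

```python
import numpy as np

def poisson_rate_encode(frame, timesteps=20, max_rate=1.0, seed=0):
    """Rate-encode a normalized frame into a boolean spike train.

    Each pixel spikes independently at every timestep with probability
    intensity * max_rate. Returns an array of shape (timesteps, *frame.shape).
    """
    rng = np.random.default_rng(seed)
    p = np.clip(np.asarray(frame, dtype=float), 0.0, 1.0) * max_rate
    return rng.random((timesteps, *p.shape)) < p

frame = np.array([[0.0, 0.25], [0.5, 1.0]])
spikes = poisson_rate_encode(frame, timesteps=1000)
# Empirical firing rates approach the pixel intensities as T grows.
print(spikes.mean(axis=0))
```

The estimate is noisy at small T, which is why rate-coded networks need tens of timesteps before the spike counts stabilize.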

The duty-cycle latency trade

For always-on applications with irregular event arrival (keyword spotting, anomaly detection, occupancy sensing), the relevant latency metric is time from event to classification output, not inference cycle time. An SNN in wake-on-spike mode begins processing immediately on the first input event, with no wake-up delay. An MCU in Stop2 mode requires 1–3 ms to resume from deep sleep before inference begins. For low-duty-cycle applications, the MCU wake-up latency can dominate, giving the SNN the lower effective latency despite its higher per-inference timestep count.
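A rough way to reason about this is to add the wake-up term to the inference term. The sketch below reuses figures quoted earlier in the post; pairing the MCU numbers with the temporal-coded asynchronous SNN figure is an assumption made for illustration, since those measurements come from different workloads.

```python
def event_to_output_ms(wake_ms: float, inference_ms: float) -> float:
    """Latency from input event to classification: wake-up plus inference."""
    return wake_ms + inference_ms

# MCU path: 1-3 ms resume from Stop2, then 0.8-1.2 ms INT8 inference.
mcu_best = event_to_output_ms(wake_ms=1.0, inference_ms=0.8)
mcu_worst = event_to_output_ms(wake_ms=3.0, inference_ms=1.2)

# SNN path: wake-on-spike, so no wake-up term. The 0.06-0.12 ms inference time
# is the temporal-coded asynchronous figure; reusing it here is an assumption.
snn_best = event_to_output_ms(wake_ms=0.0, inference_ms=0.06)
snn_worst = event_to_output_ms(wake_ms=0.0, inference_ms=0.12)

# Fraction of the worst-case MCU latency that is spent waking up.
wake_share_worst = 3.0 / mcu_worst
print(f"MCU: {mcu_best:.1f}-{mcu_worst:.1f} ms, SNN: {snn_best:.2f}-{snn_worst:.2f} ms")
print(f"wake-up share of worst-case MCU latency: {wake_share_worst:.0%}")
```

The same arithmetic shows the limit of the argument: a rate-coded SNN at T=20 and 2 ms per step (40 ms) would still lose to the MCU, so the wake-up advantage matters mainly when SNN inference time is in the low-millisecond range or below.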

Practical classification by application type

| Application | Input type | Latency target | Better choice | Reason |
|---|---|---|---|---|
| Keyword spotting (always-on) | PDM microphone (frame) | < 500 ms | SNN (temporal) | Wake-on-spike advantage; rate tolerance |
| Gesture recognition (DVS) | DVS event stream | < 50 ms | SNN | Native event input; T_eff = 4–8 |
| Vibration anomaly (accel) | ADC-sampled 3-axis | < 100 ms | Either | Rate encoding adds ~2 ms; both viable |
| Object detection (RGB) | Standard camera frame | < 100 ms | ANN (INT8) | Rate coding overhead; T=50 too slow |
| Heart rate monitoring (PPG) | Analog time-series | < 1 s | SNN | Temporal spike encoding; low-frequency signal |

The BPTT training overhead and its deployment irrelevance

SNNs are typically trained with Backpropagation Through Time (BPTT) using surrogate gradients — because the Heaviside threshold function is non-differentiable, and a smooth approximation (SuperSpike, FastSigmoid, or arctan) is substituted during the backward pass. BPTT over T=20 timesteps is roughly 20× more expensive than standard backprop for the same network depth, making SNN training noticeably slower than ANN training on the same hardware.
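To illustrate the substitution, here is a minimal NumPy sketch of a spike forward pass and a fast-sigmoid-style surrogate derivative. The function names and the `slope` value are assumptions; training frameworks wire the surrogate into their autograd machinery rather than calling it by hand.

```python
import numpy as np

def heaviside_spike(v, threshold=1.0):
    """Forward pass: emit a spike (1.0) where membrane potential reaches threshold."""
    return (np.asarray(v, dtype=float) >= threshold).astype(float)

def fast_sigmoid_surrogate_grad(v, threshold=1.0, slope=10.0):
    """Backward-pass stand-in: derivative of x / (1 + slope * |x|),
    evaluated at x = v - threshold. Peaks at 1.0 on the threshold and
    decays smoothly on both sides."""
    x = np.asarray(v, dtype=float) - threshold
    return 1.0 / (1.0 + slope * np.abs(x)) ** 2

v = np.array([0.5, 0.9, 1.0, 1.1, 2.0])
print(heaviside_spike(v))              # hard 0/1 spikes used in the forward pass
print(fast_sigmoid_surrogate_grad(v))  # smooth gradient used in the backward pass
```

The forward pass keeps the hard threshold, so the deployed network still emits binary spikes; only the gradient is approximated.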

This training-time cost is sometimes conflated with inference-time latency — they are completely separate. Training happens once, offline. Inference latency is what matters in deployment. The fact that training a 4-layer SNN takes 8 hours instead of 25 minutes is an engineering inconvenience, not a product requirement.

Synchronous vs asynchronous execution modes

Most neuromorphic hardware supports both synchronous (timestep-clocked) and asynchronous (event-driven) execution modes. In synchronous mode, all neurocores advance one timestep on each clock tick — predictable timing, easy to profile, but cores that receive no spikes in a timestep still advance (burning clock energy). In asynchronous mode, cores wake only when spikes arrive — lower average power, but timing is non-deterministic, which complicates the definition of "inference latency" (latency depends on input event rate).
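A toy model shows why the two modes differ in work done. It counts core updates only; it is not a power model, and the raster shape and 10% activity level are hypothetical.

```python
import numpy as np

def core_updates(input_spikes):
    """Count core updates under each execution mode.

    input_spikes: boolean array of shape (timesteps, cores), True where a
    core receives at least one spike in that timestep.
    Returns (sync_updates, async_updates).
    """
    s = np.asarray(input_spikes, dtype=bool)
    sync_updates = s.size          # every core advances on every clock tick
    async_updates = int(s.sum())   # cores wake only when a spike arrives
    return sync_updates, async_updates

rng = np.random.default_rng(0)
raster = rng.random((8, 16)) < 0.1  # hypothetical: 8 timesteps, 16 cores, ~10% activity
sync_n, async_n = core_updates(raster)
print(f"synchronous: {sync_n} updates, asynchronous: {async_n} updates")
```

The synchronous count is fixed by T and the core count, which is what makes its latency predictable; the asynchronous count tracks input activity, which is what makes its latency input-dependent.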

For latency-sensitive applications, synchronous mode with T set as low as possible (T=4–8 with temporal coding) gives deterministic, measurable latency. For energy-sensitive always-on applications, asynchronous event-driven mode with wake-on-spike minimizes average power at the cost of variable latency. The compiler must generate different execution plans for each mode; the NMC runtime exposes this as a deployment configuration flag rather than requiring model re-compilation.

Calibrating expectations for real deployments

Engineers accustomed to TFLite-Micro on Cortex-M4 will find that neuromorphic inference requires rethinking latency budgets. The familiar "inference takes X ms" mental model breaks down when execution is event-driven and T-step count is a free parameter. The correct mental model is a two-dimensional space: (T, mode) pairs that map to (latency, energy) operating points, and the application requirement determines which region of that space is acceptable.

For most industrial edge applications we've engaged with — vibration monitoring, acoustic anomaly detection, occupancy sensing, structural health monitoring — the latency requirement is 50–500 ms. Every combination of T=4–20 and either synchronous or asynchronous execution fits comfortably inside that window. The binding constraint for those applications is almost always energy, not latency. Latency becomes the binding constraint for reactive control systems (sub-10 ms requirements), and that's genuinely a harder problem for current SNN architectures.
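The budget check for synchronous mode can be sketched in a few lines. It assumes the 2 ms t_step quoted earlier for Loihi 2 and the T × t_step latency model; asynchronous latency depends on input event rate and is not modeled here.

```python
T_STEP_MS = 2.0  # synchronous t_step quoted earlier in the post

def fits_budget(timesteps: int, t_step_ms: float, budget_ms: float) -> bool:
    """True if synchronous latency T x t_step stays within the budget."""
    return timesteps * t_step_ms <= budget_ms

for T in (4, 8, 20):
    latency_ms = T * T_STEP_MS
    industrial = fits_budget(T, T_STEP_MS, budget_ms=50.0)  # tight end of 50-500 ms
    reactive = fits_budget(T, T_STEP_MS, budget_ms=10.0)    # reactive control
    print(f"T={T:2d}: {latency_ms:4.0f} ms  industrial={industrial}  reactive={reactive}")
```

Under these assumptions every T from 4 to 20 clears even the 50 ms end of the industrial window, while a 10 ms budget only leaves room for the smallest timestep counts.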