Inside the NMC Compiler: Optimization Passes for Spike-Coded Graphs

Mira Vasquez · April 1, 2026 · 16 min read

Figure: NMC compiler optimization pass pipeline visualized as a directed graph transformation.

The NMC compiler's optimization pipeline is the difference between a compiled SNN binary that uses 2.4× more energy than it should and one that approaches theoretical minimum energy for the model architecture. This post is a detailed walkthrough of the eight optimization passes that run between NMC-IR construction and backend lowering, with specific attention to the passes that have the highest energy impact and the ones that are most likely to interact incorrectly if you're extending the compiler.

An earlier post described the six mandatory passes for the initial v0.5 compiler. This post covers the extended set in the current v1.0 pipeline, including three passes added after v0.7 that collectively reduced average energy per inference by an additional 28% on the benchmark suite.

Pass ordering and why it matters

The pass pipeline is ordered, not freely reorderable. Some passes establish preconditions that later passes require; others rewrite the graph in ways that would cause an earlier pass to make incorrect decisions if the two were swapped. The ordering is:

1. BN Fold                    # must run before threshold calibration
2. Dead Neuron Elimination    # must run before fanout analysis
3. Spike-Rate Regularization Check  # diagnostic, no transform
4. Fanout Split               # must run before core allocation
5. Core Allocation            # must run before synapse packing
6. Temporal Op Fusion         # runs on allocated graph
7. Synapse Memory Packing     # target-specific, must run after allocation
8. Sleep Annotation           # runs last, consumes all previous results
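
The ordering constraints in the list above can be checked mechanically. The following is a minimal sketch, not the compiler's actual pass manager: the pass identifiers and the `ordering_violations` helper are hypothetical names introduced for illustration, and the constraint pairs are transcribed from the comments in the list.

```python
# Hypothetical pass identifiers, in the order listed above.
PIPELINE = [
    "bn_fold",
    "dead_neuron_elimination",
    "spike_rate_check",
    "fanout_split",
    "core_allocation",
    "temporal_op_fusion",
    "synapse_memory_packing",
    "sleep_annotation",
]

# (earlier, later): `earlier` must run before `later`.
MUST_PRECEDE = [
    ("dead_neuron_elimination", "fanout_split"),
    ("fanout_split", "core_allocation"),
    ("core_allocation", "temporal_op_fusion"),
    ("core_allocation", "synapse_memory_packing"),
]
# Sleep annotation runs last, so every other pass must precede it.
MUST_PRECEDE += [(name, "sleep_annotation") for name in PIPELINE[:-1]]


def ordering_violations(pipeline, constraints):
    """Return the (earlier, later) constraints that a pass order breaks."""
    position = {name: index for index, name in enumerate(pipeline)}
    return [
        (earlier, later)
        for earlier, later in constraints
        if position[earlier] >= position[later]
    ]
```

A check like this is cheap to run whenever a new pass is registered, which is when ordering mistakes are most likely to slip in.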

Pass 1: Batch normalization fold

Batch normalization layers in the source ANN must be folded into the preceding linear or convolutional layer before the graph reaches the SNN domain. The fold transforms:

# Before fold:
# Linear(W, b) → BatchNorm(γ, β, μ, σ²)

# After fold:
# Linear(W', b') where:
# W' = W × (γ / sqrt(σ² + ε))
# b' = β + (b - μ) × (γ / sqrt(σ² + ε))

# BN layer is removed from graph

This pass is algebraically exact for inference (up to float32 rounding), but it's not trivially safe in the int-weight domain. After folding, W' may have a different distribution than W, potentially producing values outside the target's weight precision range. The pass checks for overflow and emits a warning if folded weights require clipping. In practice, about 8% of models we've tested produce at least one clipped weight after BN fold. The workaround is to include a brief QAT fine-tuning step that re-centers the weight distribution after folding.
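
To make the fold concrete, here is a minimal pure-Python sketch of the formulas above plus a range check of the kind the pass performs. The function names, the per-output-channel layout of `W` (one row per output neuron), and the `w_max` range limit are assumptions made for illustration; they are not the NMC API.

```python
import math


def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mean, var) into Linear(W, b).

    W holds one row per output channel; all other arguments hold one
    value per output channel.
    """
    W_folded, b_folded = [], []
    for row, b_i, g, bt, mu, v in zip(W, b, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)           # gamma / sqrt(var + eps)
        W_folded.append([w * scale for w in row])
        b_folded.append(bt + (b_i - mu) * scale)
    return W_folded, b_folded


def linear(W, b, x):
    """Plain dense layer: y_i = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def count_out_of_range(W, w_max):
    """Count folded weights whose magnitude exceeds the assumed range."""
    return sum(1 for row in W for w in row if abs(w) > w_max)
```

A non-zero result from `count_out_of_range` corresponds to the case where the pass would warn that clipping is required.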

Pass 2: Dead neuron elimination

Dead neuron elimination removes neurons whose average firing rate on the calibration dataset falls below a configurable threshold (default: 1%). This is a pruning pass, not just a graph annotation — the neuron is removed from the NMCNeuronNode and its associated synaptic edges are deleted.

The implementation uses calibration set profiling (the same dataset used for threshold calibration). Each calibration forward pass records per-neuron spike counts; the dead neuron checker flags any neuron that fired fewer than dead_threshold × T × n_calib_samples total spikes across the full calibration set.

# Dead neuron check (simplified)
dead_mask = (neuron_spike_counts / (T * n_calib_samples)) < dead_threshold

# Layer 0 (pop_size=512): 41 dead neurons (8.0%)
# Layer 1 (pop_size=256): 23 dead neurons (9.0%)
# Layer 2 (pop_size=128): 11 dead neurons (8.6%)
# Total removed: 75 neurons, 12,190 synaptic edges eliminated

The energy impact is direct: each removed neuron reduces the SOP count in proportion to its post-synaptic fanout. For a fully-connected layer with 256 downstream neurons, removing one upstream neuron eliminates up to 256 SOPs for every spike that neuron would have emitted. Dead neuron elimination typically reduces estimated energy per inference by 6–12% on models that were trained without explicit firing-rate regularization.
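
The check and the edge arithmetic can be sketched in a few lines of plain Python. This is an illustrative version of the simplified snippet above, not the compiler's implementation; `dead_neuron_mask` and `eliminated_edges` are hypothetical helper names, and the fully-connected fanout is an assumption.

```python
def dead_neuron_mask(spike_counts, timesteps, n_calib_samples, dead_threshold=0.01):
    """Flag neurons whose average firing rate is below dead_threshold.

    spike_counts[i] is the total number of spikes neuron i emitted across
    the whole calibration set.
    """
    opportunities = timesteps * n_calib_samples
    return [count / opportunities < dead_threshold for count in spike_counts]


def eliminated_edges(mask, downstream_fanout):
    """Synaptic edges removed, assuming a fully-connected downstream layer."""
    return sum(mask) * downstream_fanout
```

For example, with T=20 and 200 calibration samples each neuron has 4,000 firing opportunities, so any neuron with fewer than 40 recorded spikes falls below the 1% default.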

The accuracy risk in dead neuron elimination

The 1% default threshold is conservative. Neurons below 1% average rate contribute negligibly to classification accuracy on the training distribution but may fire at higher rates on out-of-distribution inputs — inputs that are rare in the calibration set but important for safety-critical classification (e.g., the uncommon fault mode in an anomaly detection task). We're not saying aggressive dead neuron elimination is always wrong; we're saying that for safety-critical applications, validation on a held-out dataset that specifically includes rare-class examples is mandatory before accepting the pruning results.

Pass 3: Spike-rate regularization check (diagnostic)

This pass doesn't transform the graph — it analyzes the per-layer firing rate distribution from calibration profiling and emits structured warnings when the distribution is outside the expected efficiency range:

WARNING [nmc-opt-pass3]: Layer 1 avg_rate=38.2% — above efficiency threshold (20.0%).
  High firing rate indicates potential ANN-to-SNN conversion issues or missing
  rate regularization during training. Model will compile but energy efficiency
  will be suboptimal. Consider:
    1. Retraining with rate_regularization_lambda > 0
    2. Increasing V_th via recalibration (--recalibrate-thresholds)
    3. Reducing T to reduce accumulated rate (--timesteps=10)

The warning threshold is configurable. For research use where accuracy is the primary metric, some users disable the warning entirely (--no-rate-warnings). For energy-constrained production deployment, the warning should be treated as an error requiring model revision before deployment.
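
The two usage modes described above (warn for research, fail for production) can be sketched as a single function with a strictness switch. The function name, the `strict` parameter, and the message format are assumptions made for illustration and do not reproduce the compiler's actual output.

```python
def check_layer_rates(layer_rates, threshold=0.20, strict=False):
    """Return one warning per layer whose average firing rate exceeds threshold.

    layer_rates[i] is the average firing rate of layer i as a fraction.
    With strict=True, any offending layer raises instead of warning.
    """
    messages = [
        f"Layer {index} avg_rate={rate:.1%} above efficiency threshold ({threshold:.1%})"
        for index, rate in enumerate(layer_rates)
        if rate > threshold
    ]
    if strict and messages:
        raise ValueError("; ".join(messages))
    return messages
```

Because the pass is diagnostic only, the graph is never modified here; the only decision is whether an out-of-range layer blocks the build.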

Pass 4: Fanout split

Described in detail in the compiler architecture post. Key implementation note: the bin-packing heuristic is greedy and uses a modified first-fit-decreasing algorithm. The split introduces additional spike packets (one per replica copy per firing event), which increases routing overhead. The pass therefore computes a trade-off score:

split_overhead_pJ = split_copies × avg_spike_rate × routing_cost_pJ_per_packet
kept_single_overhead_pJ = saturation_stall_cycles × idle_core_power_pW × stall_probability

# Accept split if: split_overhead_pJ < kept_single_overhead_pJ

For Loihi 2 with an axon routing cost of ~0.4 pJ/packet, the split is beneficial when the high-fanout neuron's average firing rate is high enough that routing saturation stalls would otherwise cost more than the replication overhead. For low-rate neurons (<5%), splitting is almost never beneficial — the stall probability is too low to justify the replication cost.
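
The trade-off score can be written out as a small sketch. One assumption is made loudly here: the stall term in the formula above multiplies cycles by power, so to obtain picojoules this sketch adds a hypothetical `cycle_time_s` parameter (pW × s = pJ). That parameter and all function names are illustrative and not part of the compiler.

```python
def split_overhead_pj(split_copies, avg_spike_rate, routing_cost_pj_per_packet):
    """Extra routing energy from replicating a high-fanout neuron."""
    return split_copies * avg_spike_rate * routing_cost_pj_per_packet


def kept_single_overhead_pj(stall_cycles, idle_core_power_pw,
                            stall_probability, cycle_time_s):
    """Expected energy lost to routing-saturation stalls without a split.

    cycle_time_s is an assumed conversion factor from cycles to seconds.
    """
    stall_time_s = stall_cycles * cycle_time_s
    return stall_time_s * idle_core_power_pw * stall_probability


def should_split(split_pj, kept_single_pj):
    """Accept the split only when it costs less than the expected stalls."""
    return split_pj < kept_single_pj
```

With a stall probability near zero the right-hand side collapses to zero and the split is rejected, which matches the observation that low-rate neurons almost never benefit.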

Pass 5: Core allocation

The allocation pass maps neuron populations to physical cores using a greedy graph-coloring approach. The allocation objective is a weighted combination of two costs:

Routing distance cost: synaptic connections between populations on different cores incur inter-core routing energy proportional to the routing distance in the chip's NoC topology. Co-locating strongly-connected populations reduces this cost.
Load imbalance cost: cores with more neurons require more timestep processing time, which can create bottlenecks in synchronous execution mode. Balancing the neuron count across cores improves throughput.

These two objectives partially conflict: the population pairs with the most connections may have very different sizes, making co-location sub-optimal for load balance. The allocation pass uses a configurable weight parameter α to trade between them (default: α=0.6, weighted toward connection locality).
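
One plausible form of the weighted objective is sketched below. The exact cost definitions are assumptions: routing cost is taken as synapse count times hop distance, load imbalance as the relative excess of the busiest core over the mean, and the combination as a convex blend of the two with α on the routing term. Both inputs to `allocation_cost` are assumed to be normalized to comparable scales beforehand. None of these names come from the compiler.

```python
def routing_distance_cost(connections, placement, hop_distance):
    """Sum of synapse_count * hops over all inter-population connections.

    connections: {(src_pop, dst_pop): synapse_count}
    placement:   {pop: core_id}
    hop_distance(core_a, core_b) -> number of NoC hops (0 if same core)
    """
    return sum(
        count * hop_distance(placement[src], placement[dst])
        for (src, dst), count in connections.items()
    )


def load_imbalance(neurons_per_core):
    """Relative excess of the busiest core over the mean (0.0 if balanced)."""
    mean = sum(neurons_per_core) / len(neurons_per_core)
    return max(neurons_per_core) / mean - 1.0


def allocation_cost(routing_cost, imbalance_cost, alpha=0.6):
    """Blend the two normalized costs; alpha > 0.5 favors connection locality."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * routing_cost + (1.0 - alpha) * imbalance_cost
```

A greedy allocator would evaluate this cost for each candidate core and pick the minimum, which is why the choice of α directly shapes the resulting layout.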

A known limitation: the greedy allocator doesn't model the effect of synapse memory layout on cache behavior. Two populations allocated to adjacent cores that share many synapses might benefit from a specific synapse memory interleaving that the greedy approach doesn't discover. A planned spectral partitioning pass will improve this for models with complex multi-layer connection patterns.

Pass 6: Temporal operation fusion

This is one of the three passes added after v0.7 and has the highest single-pass energy impact. The pass identifies pairs of consecutive synaptic layers where the intermediate neuron population has low fanout (≤4 downstream connections per neuron) and fuses them into a single compound synaptic operation.

# Before fusion:
# Pop A → Synapse(W1) → Pop B (64 neurons) → Synapse(W2) → Pop C

# After fusion (when Pop B fanout ≤ 4):
# Pop A → FusedSynapse(W1 × W2) → Pop C
# Pop B is eliminated from the graph

# Energy saving: eliminates all SOPs through Pop B
# (no intermediate spike accumulation, threshold check, or output spike)

The fusion is only valid when Pop B behaves as a linear pass-through layer: it has no nonlinearity other than the LIF threshold, and the threshold is high enough that most neurons pass through without firing, so the layer functions as a linear filter most of the time. The pass checks this condition by analyzing Pop B's average firing rate from calibration profiling: if more than 40% of neurons in Pop B fire on average, fusion is skipped because the LIF nonlinearity is active enough to affect the output distribution.

On our benchmark suite, temporal op fusion eliminated an average of 12% of intermediate populations and reduced energy per inference by 14–18% on models with narrow bottleneck layers.
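
The eligibility test and the weight composition can be sketched as follows. This is an illustration under stated assumptions: weights are stored as nested lists with W1 shaped (|A| × |B|) and W2 shaped (|B| × |C|), inputs are treated as row vectors so the fused matrix is W1 × W2, and the function names are hypothetical.

```python
def can_fuse(max_fanout_b, firing_fraction_b, fanout_limit=4, firing_limit=0.40):
    """Fusion is allowed only for low-fanout, mostly-silent intermediate layers."""
    return max_fanout_b <= fanout_limit and firing_fraction_b <= firing_limit


def fuse_weights(W1, W2):
    """Compose two synaptic layers into one: returns the product W1 x W2."""
    inner = len(W2)
    if any(len(row) != inner for row in W1):
        raise ValueError("W1 columns must match W2 rows")
    n_out = len(W2[0])
    return [
        [sum(row[k] * W2[k][j] for k in range(inner)) for j in range(n_out)]
        for row in W1
    ]
```

Once the fused matrix exists, Pop B has no remaining role in the graph, which is where the SOP savings come from.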

Pass 7: Synapse memory packing

Described in the compiler architecture post. The specific detail worth adding here: the packing pass generates target-specific binary blobs that are embedded in the .nmc file as a named section. For Loihi 2, the section is loihi2.synapse_ram; for Akida, akida.weight_table. These sections are opaque to all other compiler passes and are consumed directly by the backend loader at runtime.

The packing pass also performs the final weight quantization to the target precision. Weight values that were maintained in float32 throughout the compiler pipeline are quantized here, at the final possible moment — this is intentional, because intermediate passes (fanout split, op fusion) may produce new derived weight tensors that should be quantized fresh rather than carrying quantization error from an earlier point.

Pass 8: Sleep annotation

The sleep annotation pass is the second of the three post-v0.7 additions. It analyzes the timestep execution schedule produced by Pass 5 (core allocation) and identifies time windows in which groups of cores receive no spikes. These windows are annotated with SLEEP_HINT metadata that the runtime passes to the hardware's power management controller.

# Sleep annotation output (fragment)
# Timestep T=0:  cores [0, 1, 2, 3] active; cores [4..15] SLEEP_HINT
# Timestep T=1:  cores [0, 1] active; cores [2, 3, 4..15] SLEEP_HINT
# Timestep T=2:  core [2] active; others SLEEP_HINT
# ...

# Effect: cores not needed until T=5 enter power-gated state
# Wake latency: ~400 ns from SLEEP_HINT to active (Loihi 2 specification)
# Energy saving: ~15% reduction in active-inference power for sparse networks

The wake latency (400 ns on Loihi 2) means that sleep annotation is only beneficial when the sleep window is at least 2–3 timesteps long (at 2 ms per timestep, that's 4–6 ms minimum). For very fast inference (T=4, 8 ms total), the wake latency overhead can negate the sleep benefit. The pass computes this trade-off and skips sleep annotation for windows shorter than the wake-latency break-even point.
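
The window search and the break-even filter can be sketched like this. The schedule representation (one set of active core IDs per timestep), the function names, and the `min_window` parameter expressed in timesteps are all assumptions for illustration; the real pass derives its break-even point from the target's wake latency.

```python
def idle_windows(active_by_timestep, core):
    """Return (start_timestep, length) for each maximal idle run of a core."""
    windows = []
    start = None
    for t, active in enumerate(active_by_timestep):
        if core in active:
            if start is not None:
                windows.append((start, t - start))
                start = None
        elif start is None:
            start = t
    if start is not None:  # idle run extends to the end of the schedule
        windows.append((start, len(active_by_timestep) - start))
    return windows


def sleep_hints(active_by_timestep, n_cores, min_window=2):
    """Keep only idle windows long enough to pay back the wake overhead."""
    return {
        core: [w for w in idle_windows(active_by_timestep, core)
               if w[1] >= min_window]
        for core in range(n_cores)
    }
```

Short idle gaps are dropped rather than annotated, mirroring the pass's behavior of skipping windows below the break-even length.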

The interaction between passes 2, 4, and 6

The three passes that most frequently interact incorrectly when extended are dead neuron elimination (Pass 2), fanout split (Pass 4), and temporal op fusion (Pass 6). The failure mode:

1. Pass 2 eliminates neuron X from Population B because its calibration firing rate is below threshold.
2. Pass 4 splits Population B due to high fanout, creating a replica population B' that still includes neuron X's index.
3. Pass 6 attempts to fuse the original Population B with Population C, but the graph is now inconsistent: neuron X is eliminated from the primary B but present in the replica B'.

This bug was introduced in v0.8.1 when we added the fanout split's replica-population creation logic and didn't update the dead neuron elimination to propagate eliminations to replica populations. It manifested as incorrect inference output (not a crash) — the replica population contained a dead neuron that contributed a non-zero weight to the fused synapse, producing spurious activation in the downstream population. Fixed in v0.8.4 by running a dead neuron propagation step within the fanout split pass itself, before any replica populations are created.

The lesson is that passes that modify graph topology (fanout split, op fusion) must validate their postconditions against the invariants that all prior passes were supposed to establish. We've added assertion-style postcondition checks at the end of each topology-modifying pass that catch consistency violations before they propagate to incorrect but non-crashing output — which is the hardest class of compiler bug to detect in deployment.
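
A postcondition check for the specific failure described above might look like the following sketch. The data layout (a dict of eliminated indices per population and a dict of replicas keyed by name) and the function names are hypothetical; the compiler's actual graph structures are not shown in this post.

```python
def dead_neurons_in_replicas(eliminated, replicas):
    """Find replica populations that still reference eliminated neurons.

    eliminated: {population_name: set of eliminated neuron indices}
    replicas:   {replica_name: (source_population_name, [neuron indices])}
    Returns {replica_name: sorted list of leaked indices}.
    """
    leaks = {}
    for name, (source, indices) in replicas.items():
        leaked = sorted(set(indices) & eliminated.get(source, set()))
        if leaked:
            leaks[name] = leaked
    return leaks


def assert_replicas_consistent(eliminated, replicas):
    """Postcondition for a topology-modifying pass: fail loudly on leaks."""
    leaks = dead_neurons_in_replicas(eliminated, replicas)
    assert not leaks, f"replicas contain eliminated neurons: {leaks}"
```

Running such a check at the end of the fanout split turns a silent wrong-output bug into an immediate failure at compile time.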