Building a Neuromorphic Model Compiler from a PyTorch Frontend

Mira Vasquez · February 18, 2025 · 12 min read

Figure: Compiler pipeline diagram showing a PyTorch model flowing through NMC-IR to a hardware target.

The first version of our compiler that could successfully map a PyTorch model to Loihi 2 neurocores was 2,400 lines of Python and produced executables roughly 3× larger than hand-written Lava equivalents. We're not sharing that as a badge of honor — it's context for why building a neuromorphic compiler from a standard ML frontend is harder than it looks, and what intermediate representations actually matter.

This post documents the NMC compiler architecture as it exists today: the frontend ingestion path, the NMC-IR design choices, the mandatory optimization passes, and the backend targets we currently support. It's written for engineers who have worked with LLVM or MLIR and want to understand where neuromorphic compilation diverges from standard compiler theory.

Why standard ML compilers don't compose here

TVM, XLA, and IREE all target architectures where the fundamental operation is a dense matrix multiply or convolution — executed on a grid of SIMD or systolic-array processing elements that share a memory hierarchy. The compiler's job is operator fusion, tiling, memory layout optimization, and instruction scheduling for throughput.

Neuromorphic architectures invert this. The fundamental operation is conditional spike propagation: a synaptic weight is only applied when the pre-synaptic neuron fires. Computation is event-driven and asynchronous. There is no tensor stride, no memory layout optimization for cache lines, no SIMD instruction to target. The relevant primitives are:

Neuron state update: integrate incoming spike weights into membrane potential, apply leak, check threshold
Spike routing: deliver a fired neuron's output to all post-synaptic targets across potentially multiple neurocores
Timestep management: synchronize a population of asynchronously firing neurons into coherent T-step inference windows

None of TVM's operator database, XLA's HLO, or MLIR's Linalg dialect models these primitives. This is why NMC has its own IR rather than sitting atop an existing compiler framework as a backend target.
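To make the first primitive concrete, here is a minimal sketch of an event-driven neuron state update in plain Python. It is not NMC or Lava code: the class name `LIFPopulation`, the per-step `decay` factor, and the subtract-on-reset behavior are assumptions chosen for illustration. It follows the order given above (integrate, leak, threshold check) and only touches weights for spikes that actually arrived.

```python
from dataclasses import dataclass, field


@dataclass
class LIFPopulation:
    """Hypothetical leaky integrate-and-fire population (illustration only)."""
    size: int
    threshold: float = 1.0   # V_th
    decay: float = 0.9       # assumed per-step leak factor
    v: list = field(init=False)

    def __post_init__(self):
        self.v = [0.0] * self.size

    def step(self, events):
        """events: (neuron_index, weight) pairs for spikes arriving this step."""
        # Integrate: a weight is applied only if its pre-synaptic neuron fired.
        for idx, weight in events:
            self.v[idx] += weight
        # Leak: decay every membrane potential once per timestep.
        self.v = [x * self.decay for x in self.v]
        # Threshold check with reset by subtraction.
        fired = []
        for i, x in enumerate(self.v):
            if x >= self.threshold:
                fired.append(i)
                self.v[i] = x - self.threshold
        return fired


pop = LIFPopulation(size=3)
first = pop.step([(0, 0.6)])    # below threshold after leak
second = pop.step([(0, 0.6)])   # accumulated potential crosses threshold
```

The cost of a step scales with the number of incoming events rather than with the size of a weight matrix, which is the property a dense-tensor IR has no natural way to express.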

Frontend: PyTorch model ingestion

The entry point is a standard torch.nn.Module — either a pre-trained ANN being converted via ANN-to-SNN substitution, or a natively trained SNN using our nrm.nn layer library. The first pass is a torch.fx symbolic trace:

import torch.fx as fx
import nrm.compiler as nmc

model = MyKeywordSpotter()  # torch.nn.Module with nrm.nn.LIFLayer layers
graph_module = fx.symbolic_trace(model)
nmc_graph = nmc.frontend.ingest(graph_module, input_shape=(1, 20, 40), T=20)

The ingest call performs three sub-passes:

Layer identification: classify each node as a static operator (Linear, Conv2d), an SNN operator (LIFLayer, ALIFLayer), or a passthrough (ReLU is stripped — it has no meaning in spike-coded graphs)
Temporal unrolling stub creation: insert T-step loop metadata without actually unrolling — unrolling in the IR would create T copies of every tensor, exploding graph size
Spike interface extraction: identify input encoding type (rate vs temporal) and output decoding (spike count → logits vs first-spike timing)

The result is an NMCGraph object: a directed acyclic graph of NMCNode instances with explicit spike-flow edges. This is NMC-IR.
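The layer identification sub-pass can be sketched roughly as follows. This is an illustration, not the actual `ingest` implementation: it classifies layers by class name instead of walking a real torch.fx graph, and the function names and the choice to raise on unknown layer types are assumptions.

```python
# Layer categories taken from the description above.
STATIC_OPS = {"Linear", "Conv2d"}
SNN_OPS = {"LIFLayer", "ALIFLayer"}
STRIPPED_OPS = {"ReLU"}  # no meaning in a spike-coded graph


def classify_layer(type_name):
    """Map a layer class name to its category (hypothetical helper)."""
    if type_name in STATIC_OPS:
        return "static"
    if type_name in SNN_OPS:
        return "snn"
    if type_name in STRIPPED_OPS:
        return "passthrough"
    # Assumption: unknown layers are rejected instead of silently kept.
    raise ValueError(f"unsupported layer type: {type_name}")


def identify_layers(layers):
    """layers: list of (node_name, type_name). Passthrough nodes are dropped."""
    kept = []
    for name, type_name in layers:
        kind = classify_layer(type_name)
        if kind != "passthrough":
            kept.append((name, kind))
    return kept


traced = [("fc1", "Linear"), ("act1", "ReLU"), ("lif1", "LIFLayer")]
identified = identify_layers(traced)
```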

NMC-IR design: what we got wrong first

Our first IR used a three-address code representation similar to LLVM IR. It looked clean but created a fundamental problem: temporal dependencies between timesteps require back-edges (membrane state at timestep t depends on membrane state at t-1), which makes a DAG representation impossible without either unrolling or adding mutable state objects alongside the SSA values.

We chose the second path: NMC-IR is a graph with two classes of edges — spike flow edges (forward in time, DAG-legal) and state edges (membrane potential and spike history, modeled as explicit mutable buffers attached to neuron nodes). This allows T-step temporal semantics to be modeled without unrolling, at the cost of requiring a dedicated state-buffer allocation pass that has no equivalent in standard compiler IR design.
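A toy version of the two-edge-class idea, under heavy simplification: spike-flow edges form an acyclic dependency map that is sorted once, while each node carries a mutable state value that persists across timesteps. The node names, the `run_window` function, and the use of a plain float accumulator in place of real membrane dynamics are all assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Spike-flow edges: node -> set of upstream nodes. Acyclic by construction.
SPIKE_EDGES = {"hidden": {"input"}, "output": {"hidden"}}


def run_window(T, drive):
    """Run T timesteps without unrolling the graph into T copies."""
    # The forward (spike-flow) order is computed once and reused every step.
    order = list(TopologicalSorter(SPIKE_EDGES).static_order())
    # State edges: one mutable buffer per neuron node, carried across steps.
    state = {node: 0.0 for node in SPIKE_EDGES}
    for _ in range(T):
        activity = {"input": drive}
        for node in order:
            if node not in SPIKE_EDGES:
                continue  # pure source node, no state buffer
            incoming = sum(activity[src] for src in SPIKE_EDGES[node])
            # Reads the value left by the previous step, writes this step's value.
            state[node] += incoming
            activity[node] = state[node]
    return state


final_state = run_window(T=3, drive=1.0)
```

The graph itself never grows with T; only the state buffers change, which is why their allocation needs its own pass.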

Key NMC-IR node types

from dataclasses import dataclass

# NMC-IR node descriptor (simplified)
@dataclass
class NMCNeuronNode:
    node_id: str
    pop_size: int             # number of neurons in this population
    threshold: float          # V_th
    leak_tau: float           # membrane time constant in ms
    reset_mode: str           # 'subtract' | 'zero'
    encoding: str             # 'rate' | 'temporal' | 'ttfs'
    state_buffer_bytes: int   # allocated at compile time

@dataclass
class NMCSynapseEdge:
    src_node: str
    dst_node: str
    weight_tensor_id: str
    delay_steps: int          # synaptic delay in timesteps (≥1)
    weight_precision: str     # 'int8' | 'int4' | 'binary'

The delay_steps field is worth noting. Standard ANN frameworks have no concept of synaptic delay — all connections are instantaneous within a forward pass. Neuromorphic hardware supports configurable delay lines (Loihi 2 supports up to 62-step delays per connection), which creates opportunities for temporal pattern processing that pure rate-coded networks can't exploit. The IR must represent this, even if the initial ANN-to-SNN conversion path produces all delay=1 edges.
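One way to picture what delay_steps means at runtime is a fixed-length queue per edge. The sketch below is a software analogy, not a description of how Loihi 2 implements delay lines; the `DelayLine` class is hypothetical.

```python
from collections import deque


class DelayLine:
    """Delivers each batch of spikes delay_steps timesteps after it was pushed."""

    def __init__(self, delay_steps):
        if delay_steps < 1:
            raise ValueError("delay_steps must be >= 1")
        # One slot per timestep of delay, initially empty.
        self._slots = deque([] for _ in range(delay_steps))

    def step(self, fired):
        """Push this step's spikes, return the spikes that are now due."""
        due = self._slots.popleft()
        self._slots.append(list(fired))
        return due


line = DelayLine(delay_steps=2)
delivered = [line.step(spikes) for spikes in ([1], [2], [], [])]
```

With delay_steps=1 the queue has a single slot, so a spike fired at one timestep is delivered at the next, which matches the all-delay=1 edges produced by the ANN-to-SNN path.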

Optimization passes: the mandatory six

Before backend lowering, six passes run in fixed order. Reordering them produces incorrect or inefficient output.

Pass 1: Dead neuron elimination

Neurons that trained to zero or near-zero firing rate (<1% average over the training set) are pruned from the graph. The threshold is configurable; the default of 1% reflects practical measurements showing that neurons below this threshold contribute less than 0.1% to classification accuracy on standard benchmarks while consuming routing bandwidth. Dead neuron elimination typically removes 8–25% of neurons in a well-regularized SNN.
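The pruning criterion can be sketched in a few lines. This assumes firing rate is measured as spikes per timestep, averaged over all profiled timesteps; the function name and argument layout are illustrative, not the NMC API.

```python
def find_dead_neurons(spike_counts, total_steps, min_rate=0.01):
    """Return indices of neurons whose average firing rate is below min_rate.

    spike_counts: total spikes per neuron over the profiling run.
    total_steps:  number of timesteps profiled (assumed > 0).
    """
    return [
        index
        for index, count in enumerate(spike_counts)
        if count / total_steps < min_rate
    ]


counts = [0, 5, 120, 9]  # spikes per neuron over 1,000 timesteps
dead = find_dead_neurons(counts, total_steps=1000)
survivors = [i for i in range(len(counts)) if i not in set(dead)]
```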

Pass 2: Fanout analysis and splitting

Each neurocore on Loihi 2 can address a maximum of 4,096 output axons. A neuron with post-synaptic targets spread across 6,000 downstream neurons must be split into multiple source nodes, each routing to a subset of targets. The splitting pass computes a weighted fanout graph and uses a bin-packing heuristic to minimize the number of split copies while respecting per-core axon limits.

# Example: compiler output for a high-fanout population
# Original: pop_size=512 → 8,192 downstream targets (too large for single core)
# After split: 2 × 512 replicas, each routing to 4,096 targets
# Overhead: 1 additional spike packet per firing event (routing cost: ~0.4 pJ/packet on Loihi 2)
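The arithmetic behind the split is simple to sketch. The version below uses plain sequential chunking, which is a deliberate simplification of the weighted bin-packing heuristic described above; `split_fanout` is a hypothetical helper and the 4,096 limit is the per-core axon figure quoted in this post.

```python
def split_fanout(targets, max_axons=4096):
    """Split a target list into chunks that each fit one core's axon limit.

    Each chunk corresponds to one replica of the source population.
    """
    return [
        targets[start:start + max_axons]
        for start in range(0, len(targets), max_axons)
    ]


# The 8,192-target case from the example above.
replicas = split_fanout(list(range(8192)))
replica_sizes = [len(chunk) for chunk in replicas]

# The 6,000-target case: two replicas, the second only partly filled.
uneven_sizes = [len(chunk) for chunk in split_fanout(list(range(6000)))]
```

Sequential chunking always yields the minimum replica count for a single population; the bin-packing heuristic matters once targets have different weights or several populations compete for the same cores.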

Pass 3: Core allocation via graph partitioning

The allocator maps populations to physical neurocores. Loihi 2 has 128 neurocores, each supporting up to 1,024 compartments (neurons). The allocation is a bin-packing problem with two constraints: compartment capacity and synapse memory (each core has 128 KB of synapse RAM). Populations that share many synaptic connections are co-located where possible to minimize inter-core spike routing overhead.

We use a greedy graph-coloring approach with core-load balancing. It's not optimal — optimal packing is NP-hard — but for networks up to roughly 50K neurons it produces allocations within 5% of theoretical minimum core count. Beyond 50K neurons, the greedy approach can deviate significantly; this is a known limitation we're addressing in a planned spectral partitioning pass.
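For readers who want a feel for the two-constraint packing problem, here is a first-fit-decreasing sketch. It is not the allocator described above: it ignores co-location of connected populations and load balancing, and the `Core` and `allocate` names are hypothetical. The capacities are the per-core figures quoted in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    neurons_free: int = 1024                 # compartment capacity
    synapse_bytes_free: int = 128 * 1024     # synapse RAM
    populations: list = field(default_factory=list)

    def fits(self, neurons, synapse_bytes):
        return (self.neurons_free >= neurons
                and self.synapse_bytes_free >= synapse_bytes)


def allocate(populations, max_cores=128):
    """populations: list of (name, neurons, synapse_bytes). Returns used cores."""
    cores = []
    # Place the largest populations first (first-fit decreasing).
    for name, neurons, syn in sorted(populations, key=lambda p: p[1], reverse=True):
        core = next((c for c in cores if c.fits(neurons, syn)), None)
        if core is None:
            core = Core()
            if not core.fits(neurons, syn):
                raise ValueError(f"{name} exceeds single-core capacity")
            if len(cores) == max_cores:
                raise RuntimeError("out of neurocores")
            cores.append(core)
        core.neurons_free -= neurons
        core.synapse_bytes_free -= syn
        core.populations.append(name)
    return cores


pops = [("a", 600, 50_000), ("b", 600, 50_000), ("c", 400, 60_000), ("d", 300, 90_000)]
placement = [core.populations for core in allocate(pops)]
```

In this example population "d" has room on the second core by neuron count but not by synapse memory, so it forces a third core. That interaction between the two constraints is what makes the problem harder than one-dimensional bin packing.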

Pass 4: Synapse memory packing

Synapse weights must be packed into each core's SRAM in a format that minimizes access time during spike processing. The packing format varies by hardware target: Loihi 2 uses a neuron-indexed format where each entry is (pre_index, weight, delay) sorted by pre-synaptic neuron index for sequential access during spike delivery. Akida uses a different columnar format. The memory packing pass is entirely target-specific and is the primary source of backend divergence in the compiler.
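The sorted (pre_index, weight, delay) layout can be illustrated with the standard struct module. The field widths here (16-bit index, signed 8-bit weight, 8-bit delay) are assumptions picked to make the example compact; they are not the actual Loihi 2 memory format.

```python
import struct

# Assumed entry layout: uint16 pre_index, int8 weight, uint8 delay (4 bytes).
ENTRY = struct.Struct("<HbB")


def pack_synapses(synapses):
    """Pack (pre_index, weight, delay) tuples, sorted by pre-synaptic index."""
    ordered = sorted(synapses, key=lambda entry: entry[0])
    return b"".join(ENTRY.pack(pre, weight, delay) for pre, weight, delay in ordered)


def unpack_synapses(blob):
    """Inverse of pack_synapses, returning entries in stored order."""
    return list(ENTRY.iter_unpack(blob))


blob = pack_synapses([(7, -3, 1), (2, 5, 2)])
entries = unpack_synapses(blob)
```

Sorting by pre-synaptic index means all entries for one firing neuron sit contiguously, so spike delivery becomes a single sequential scan.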

Pass 5: Timestep scheduler

The scheduler determines the execution order of neuron populations within each timestep. Because spike delivery has configurable delays, some populations can be updated in parallel within a timestep; others must wait for upstream spikes to arrive. The scheduler constructs a partial order and emits a per-timestep execution plan for the runtime.
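Turning a partial order into parallel groups can be sketched with Python's graphlib. This assumes the same-timestep dependency map has already been derived from the edge delays, which is the hard part and is not shown; `schedule_levels` and the population names are hypothetical.

```python
from graphlib import TopologicalSorter


def schedule_levels(deps):
    """deps: population -> set of populations that must be updated first.

    Returns a list of levels; populations in the same level have no
    ordering constraint between them and can be updated in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the map has a cycle
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for deterministic output
        levels.append(ready)
        sorter.done(*ready)
    return levels


deps = {
    "hidden_a": {"input"},
    "hidden_b": {"input"},
    "output": {"hidden_a", "hidden_b"},
}
plan = schedule_levels(deps)
```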

Pass 6: Sleep-mode annotation

The final pass analyzes which neurocores receive no spikes, on average, over given timestep ranges (computed from training-set statistics) and annotates them for clock gating. These annotations become runtime hints that the hardware's power management controller uses to enable wake-on-spike for idle cores.
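The range-finding step might look like the sketch below. It assumes the per-core statistic is a list of average incoming spikes per timestep, and it adds a minimum range length (`min_len`) on the guess that very short idle windows are not worth gating; both the function and that parameter are illustrative.

```python
def idle_ranges(avg_spikes_per_step, min_len=2):
    """Return [start, end) timestep ranges where a core receives no spikes."""
    ranges = []
    start = None
    for t, rate in enumerate(avg_spikes_per_step):
        if rate == 0:
            if start is None:
                start = t  # an idle run begins here
        else:
            if start is not None and t - start >= min_len:
                ranges.append((start, t))
            start = None
    # Close a run that extends to the end of the inference window.
    if start is not None and len(avg_spikes_per_step) - start >= min_len:
        ranges.append((start, len(avg_spikes_per_step)))
    return ranges


core_profile = [0, 0, 0, 1.5, 0, 2.0, 0, 0]
gating_hints = idle_ranges(core_profile)
```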

Backend targets and the HAL boundary

The compiler produces a .nmc binary — a self-describing archive containing the IR snapshot, the per-target lowered network graph, weight tensors, and runtime configuration tables. Backend lowering happens per-target via a plugin interface:

# Compile for Loihi 2
nmc compile model.pt \
    --target loihi2 \
    --timesteps 20 \
    --input-encoding rate \
    --output compiled/kws_loihi2.nmc

# Compile for BrainChip Akida AKD1500
nmc compile model.pt \
    --target akida-akd1500 \
    --timesteps 4 \
    --input-encoding temporal \
    --output compiled/kws_akida.nmc

The same NMC-IR graph produces different binaries for each target because the backend lowers IR to target-specific primitives. The Loihi 2 backend emits Lava-compatible compartment descriptors; the Akida backend emits AKD register configuration tables. From the application developer's perspective, the .nmc binary is opaque — it's consumed by the NMC runtime, which handles hardware initialization and inference loop execution.
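The plugin boundary can be pictured as a registry keyed by target name. The sketch below is a generic pattern, not the NMC plugin API: `register_backend`, `lower`, and the placeholder byte strings are all hypothetical, and the real backends emit far richer output than shown here.

```python
from typing import Callable, Dict

Lowering = Callable[[dict], bytes]
_BACKENDS: Dict[str, Lowering] = {}


def register_backend(target: str):
    """Decorator that registers a lowering function under a target name."""
    def decorator(fn: Lowering) -> Lowering:
        _BACKENDS[target] = fn
        return fn
    return decorator


@register_backend("loihi2")
def lower_loihi2(ir: dict) -> bytes:
    # Placeholder payload; a real backend would emit compartment descriptors.
    return b"LOIHI2:" + str(len(ir["nodes"])).encode()


@register_backend("akida-akd1500")
def lower_akida(ir: dict) -> bytes:
    # Placeholder payload; a real backend would emit register tables.
    return b"AKD1500:" + str(len(ir["nodes"])).encode()


def lower(ir: dict, target: str) -> bytes:
    """Dispatch one target-agnostic IR to the selected backend."""
    try:
        backend = _BACKENDS[target]
    except KeyError:
        raise ValueError(f"no backend registered for {target!r}") from None
    return backend(ir)


ir = {"nodes": ["input", "hidden", "output"]}
loihi_blob = lower(ir, "loihi2")
akida_blob = lower(ir, "akida-akd1500")
```

The optimizer only ever sees the `ir` value; everything target-specific lives behind the registered function.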

What we haven't solved yet

The compiler handles feedforward networks cleanly. Recurrent connections, where a population's output re-enters its own input at a future timestep, require the state-buffer model to track multi-timestep history, and the scheduler must break potential cycles in the execution graph. We support simple single-step recurrence (output at timestep t feeds input at t+1), but multi-hop recurrence (output at t feeds input at t+k, k>1) is not yet fully automated; it requires a manual delay annotation in the model definition.

Beyond recurrence, the graph partitioning pass doesn't yet model inter-chip routing for multi-chip deployments. A Loihi 2 board with 8 chips sharing a spike mesh requires the allocator to consider inter-chip link bandwidth as a constraint alongside intra-chip routing. This is planned for the next major compiler revision and is the blocking dependency for models above roughly 100K neurons.

Building this compiler taught us that the hard problems in neuromorphic compilation are not the algorithmic graph passes — those translate from classical compiler theory with some adaptation. The hard problem is the absence of a stable abstract machine. Each neuromorphic chip has different compartment counts, different synapse memory formats, different routing topologies, and different power management APIs. The HAL design that makes the optimizer target-agnostic while allowing the backend to be target-specific took more iteration than all the optimization passes combined.