Standard self-attention, the mechanism that made transformers ubiquitous, has O(N²) computational cost in sequence length N. For token sequences of 512 or 1024 this is manageable in data center inference. For a DVS event stream from a 346×260 pixel camera generating 1–10 million events per second, N is effectively unbounded, and O(N²) attention is not feasible within a low-power budget.
The research direction we've been pursuing treats temporal locality as a first-class computable property rather than a fixed architectural constraint like a sliding window or a fixed-length buffer. This post describes the problem formulation, the mechanism we've developed, and the practical limitations of the current approach.
The problem with fixed temporal windows in event-driven systems
Most practical event-driven neural architectures deal with the sequence-length problem by discretizing time into fixed bins: aggregate all events within, say, a 1 ms window into a single pseudo-frame, then process the pseudo-frames as a sequence. This converts the problem from continuous-time event processing back to frame-based processing, and it imports the latency of the frame boundary along with it.
Consider a DVS-based gesture recognition system with 10 ms temporal bins, and suppose the classifier needs one complete bin of gesture events before it can decide. A gesture that begins 1 ms after a bin boundary contributes only 9 ms of events to its first bin, so the first complete bin closes, and the classification decision arrives, approximately 19 ms after the gesture started. A gesture beginning 9 ms into a bin produces a classification at 11 ms. The 8 ms jitter in classification timing comes entirely from the discretization, not from the actual neural computation time. In some applications this is acceptable; in reactive control systems it's a fundamental problem.
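The latency figures above follow from a simple calculation. The sketch below reproduces it under the assumption that the classifier waits for the first complete bin after gesture onset; the function name and that decision rule are illustrative, not part of any real pipeline.

```python
def decision_latency_ms(onset_offset_ms: float, bin_ms: float = 10.0) -> float:
    """Time from gesture onset to the end of the first complete bin after onset.

    Assumption (illustrative): the partial bin containing the onset is not
    enough for a decision, so the classifier waits for one full bin.
    """
    if not 0.0 <= onset_offset_ms < bin_ms:
        raise ValueError("onset offset must lie inside the bin")
    if onset_offset_ms == 0.0:
        # Onset exactly on a boundary: the first bin is already complete.
        return bin_ms
    remaining_in_partial_bin = bin_ms - onset_offset_ms
    return remaining_in_partial_bin + bin_ms


early = decision_latency_ms(1.0)  # onset 1 ms after a bin boundary
late = decision_latency_ms(9.0)   # onset 9 ms into a bin
jitter = early - late
print(early, late, jitter)
```

The spread between the two cases depends only on where the onset falls inside the bin, which is the point of the example: no amount of faster neural computation removes it.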
Adaptive temporal windows (variable bin sizes based on event rate) reduce the worst-case jitter but introduce variable-length input sequences, which breaks fixed-architecture feedforward processing. Attention mechanisms are a natural fit because self-attention accepts input of any sequence length in principle, but its O(N²) complexity reintroduces the original problem.
Temporal locality as a structuring principle
The key observation is that for the types of signals we care about — DVS camera streams, spike-encoded audio, IMU event streams — events have strong temporal locality. Events that are temporally close together are more likely to be causally related (part of the same gesture, the same phoneme, the same vibration harmonic) than events separated by long time intervals. This is not a universal law, but it's a reliable enough regularity that exploiting it architecturally produces substantial efficiency gains on the target signal classes.
We formalize temporal locality as a computable property of the event stream. For an event stream E = {(t_i, x_i, p_i)}, where t_i is timestamp, x_i is spatial position, and p_i is polarity, define the local event density at time t as:
ρ(t, δ) = |{e ∈ E : |t_e - t| < δ}| / (2δ)
# Where δ is the locality window (a learned parameter, not a fixed hyperparameter)
High ρ indicates dense event activity; low ρ indicates sparse inter-event gaps. The temporal attention mechanism uses ρ to weight the attention computation: events in high-density regions receive full attention processing; events in low-density regions are processed with reduced attention heads (effectively, reduced representational capacity for quiescent signal regions).
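As a concrete illustration of the density definition, here is a minimal NumPy sketch. It assumes timestamps are already sorted, treats δ as a plain number instead of a learned parameter, and uses binary search for brevity where a streaming implementation would keep a running count. The function name is hypothetical.

```python
import numpy as np


def local_event_density(timestamps, query_times, delta):
    """rho(t, delta) = |{e : |t_e - t| < delta}| / (2 * delta).

    `timestamps` must be sorted in ascending order.
    """
    ts = np.asarray(timestamps, dtype=np.float64)
    q = np.asarray(query_times, dtype=np.float64)
    # Index of the first event strictly greater than t - delta.
    lo = np.searchsorted(ts, q - delta, side="right")
    # Index of the first event greater than or equal to t + delta.
    hi = np.searchsorted(ts, q + delta, side="left")
    # hi - lo counts events strictly inside (t - delta, t + delta).
    return (hi - lo) / (2.0 * delta)


# Timestamps in milliseconds: a short burst, then isolated events.
events_ms = [0.0, 1.0, 2.0, 10.0, 50.0]
rho = local_event_density(events_ms, [1.0, 2.0, 10.0, 30.0], delta=2.0)
print(rho)
```

The burst around 1 ms yields a higher density than the isolated event at 10 ms, and a query in the empty gap at 30 ms yields zero.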
The locality-weighted attention mechanism
Standard scaled dot-product attention:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Locality-weighted attention replaces the uniform attention with a density-gated version:
LocalAttn(Q, K, V, ρ) = softmax(QK^T / sqrt(d_k) + log(ρ_ij + ε)) V
# Where ρ_ij is the local event density at the midpoint between events i and j,
# and ε is a small constant to avoid log(0)
The log(ρ_ij + ε) term is added to the attention logits before softmax, which suppresses attention weight between temporally distant events in sparse regions while allowing full attention weight between events in dense activity bursts. The δ parameter controlling the locality window is learned during training via gradient descent through the ρ computation (which is differentiable with respect to δ).
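The density-gated formula can be written out as a dense reference implementation. This is a sketch only: it uses float arithmetic, a fixed δ, and the full N×N matrix, so it shows what the gate does to the attention weights and not how the mechanism runs efficiently or on spiking hardware. All function names are illustrative.

```python
import numpy as np


def density_at(ts, t, delta):
    """Count events strictly within delta of each time in t, divided by 2*delta."""
    lo = np.searchsorted(ts, t - delta, side="right")
    hi = np.searchsorted(ts, t + delta, side="left")
    return (hi - lo) / (2.0 * delta)


def local_attention(Q, K, V, ts, delta, eps=1e-6):
    """softmax(Q K^T / sqrt(d_k) + log(rho_ij + eps)) V for sorted timestamps ts."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    # rho_ij is evaluated at the temporal midpoint between events i and j.
    midpoints = 0.5 * (ts[:, None] + ts[None, :])
    rho = density_at(ts, midpoints, delta)
    logits = logits + np.log(rho + eps)
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


# Two bursts of three events each, separated by a long quiet gap (times in ms).
ts = np.array([0.0, 1.0, 2.0, 100.0, 101.0, 102.0])
Q = np.zeros((6, 8))  # zero queries and keys isolate the effect of the gate
K = np.zeros((6, 8))
V = np.arange(24, dtype=np.float64).reshape(6, 4)
out, w = local_attention(Q, K, V, ts, delta=3.0)
print(w[0].round(6))
```

With identical content logits, almost all of each event's attention weight stays inside its own burst, because the midpoint between the two bursts falls in an empty region where ρ is zero.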
This formulation has an important property: the additional computation for the ρ term scales as O(N) rather than O(N²), because ρ at each position can be computed as a running count over a sliding window rather than as a pairwise computation. The full attention matrix computation is still O(N²) in the worst case, but the density gating lets the implementation skip pairs whose gated logits fall below a threshold instead of carrying them through the softmax. In practice, for event streams with average density below 20% of maximum, the effective attention computation reduces to O(N × k), where k is the mean number of events within the locality window of each event.
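To make the running-count idea and the quantity k concrete, the sketch below counts each event's neighbours with two pointers that only ever move forward, so the total work is linear in N. It assumes sorted timestamps and a fixed δ; the function name is hypothetical.

```python
def neighbors_within(ts, delta):
    """For each event i, count events j with |t_j - t_i| < delta (including i).

    `ts` must be sorted ascending and delta must be positive. Both pointers
    only advance, so the loop does O(N) work in total.
    """
    n = len(ts)
    counts = [0] * n
    lo = 0  # first index with ts[lo] > t - delta
    hi = 0  # first index with ts[hi] >= t + delta
    for i, t in enumerate(ts):
        # lo never passes i, because ts[i] itself is inside the window.
        while ts[lo] <= t - delta:
            lo += 1
        while hi < n and ts[hi] < t + delta:
            hi += 1
        counts[i] = hi - lo
    return counts


ts = [0, 1, 2, 3, 50, 51, 200]  # timestamps in ms
counts = neighbors_within(ts, delta=2)
k = sum(counts) / len(ts)        # mean events per locality window
windowed_pairs = sum(counts)     # pairs a windowed attention would evaluate
dense_pairs = len(ts) ** 2       # pairs full attention evaluates
print(counts, k, windowed_pairs, dense_pairs)
```

On this toy stream the windowed computation touches 15 pairs against 49 for dense attention; the gap widens as the stream gets longer and sparser.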
Mapping to spiking implementation
The challenge in mapping this mechanism to neuromorphic hardware is that the attention computation requires graded dot products, at least approximately, rather than binary spike accumulations. Standard LIF neurons don't natively implement the attention kernel.
We map the attention computation using a spiking attention population: a specialized neuron population where each neuron's membrane potential integrates incoming spikes weighted by a learned key-query product. The locality density gate is implemented as a presynaptic inhibitory population that raises the effective threshold of attention neurons in low-density time windows, causing them to fire less frequently and thus produce lower-weight "soft attention" outputs.
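The link between effective threshold and output rate can be seen in a toy discrete-time LIF neuron. This is a generic textbook-style sketch with made-up constants, not the NMC neuron model: for the same input drive, a higher effective threshold produces fewer output spikes.

```python
def lif_spike_count(inputs, threshold, leak=0.9):
    """Simulate a leaky integrate-and-fire neuron and count its output spikes.

    The membrane potential decays by `leak` each step, integrates the input,
    and resets to zero whenever it reaches `threshold`.
    """
    v = 0.0
    spikes = 0
    for x in inputs:
        v = leak * v + x
        if v >= threshold:
            spikes += 1
            v = 0.0  # hard reset after a spike
    return spikes


drive = [0.3] * 100  # constant input current for 100 time steps
ungated = lif_spike_count(drive, threshold=1.0)  # no inhibition
gated = lif_spike_count(drive, threshold=2.0)    # inhibition raises the threshold
print(ungated, gated)
```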
# Spiking attention population in NMC model definition
import nrm.nn as nrm  # note: `nrm` is an alias for the `nrm.nn` submodule


class LocalSpikeAttn(nrm.Module):
    def __init__(self, d_model, n_heads, locality_tau):
        # n_heads is accepted for interface compatibility; this simplified
        # version uses a single attention population.
        # Learned synaptic projections for the query, key, and value spike trains
        self.q_proj = nrm.SynapseLayer(d_model, d_model)
        self.k_proj = nrm.SynapseLayer(d_model, d_model)
        self.v_proj = nrm.SynapseLayer(d_model, d_model)
        self.attn_lif = nrm.LIFLayer(
            pop_size=d_model,
            threshold=1.0,
            leak_tau=locality_tau,  # locality window controls threshold recovery
        )
        # Density gate computed over the same locality window
        self.density_gate = nrm.DensityGate(window=locality_tau)

    def forward(self, x_spikes, T):
        q = self.q_proj(x_spikes)
        k = self.k_proj(x_spikes)
        v = self.v_proj(x_spikes)
        gate = self.density_gate(x_spikes, T)
        # Elementwise product is nonzero only where q, k, and gate coincide
        attn_out = self.attn_lif(q * k * gate)  # simplified spike product
        return attn_out * v
The spike product q * k * gate is an approximation: strict spike multiplication yields an output only when all of its inputs are 1 simultaneously, which happens only a fraction of the time. This is a known accuracy-efficiency trade-off. On DVS-Gesture classification, the spiking attention implementation achieves 91.3% accuracy versus 94.1% for a float32 attention implementation, a gap of 2.8 percentage points that we consider acceptable given the ~15× energy reduction from the spike-based computation.
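How rarely spikes coincide is easy to estimate. The sketch below assumes two independent Bernoulli spike trains, which is an idealization chosen for illustration (real query and key trains are correlated through the shared input), and shows that the product of binary spikes is a logical AND that fires at roughly the product of the two rates.

```python
import numpy as np


def coincidence_rate(p_q, p_k, n_steps, seed=0):
    """Fraction of time steps where two independent binary spike trains both fire."""
    rng = np.random.default_rng(seed)
    q = (rng.random(n_steps) < p_q).astype(np.int8)
    k = (rng.random(n_steps) < p_k).astype(np.int8)
    product = q * k
    # For 0/1 spikes, multiplication and bitwise AND are the same operation.
    assert np.array_equal(product, q & k)
    return float(product.mean())


# Each train fires on about 20% of steps; coincidences are far rarer.
rate = coincidence_rate(p_q=0.2, p_k=0.2, n_steps=100_000)
print(rate)
```

Under the independence assumption the expected coincidence rate is p_q × p_k = 0.04, so most of the information carried by either train alone never reaches the product.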
Learned locality window: what training reveals
Allowing δ (the locality window) to be learned rather than fixed produces interesting patterns. On DVS-Gesture, trained δ values cluster around two timescales: 2–5 ms (capturing within-frame motion coherence) and 40–80 ms (capturing across-frame gesture trajectory). The network spontaneously discovers that there are two useful locality scales, which matches the physical structure of DVS gesture data — events from a moving hand are coherent within milliseconds spatially and across tens of milliseconds temporally.
On SHD (spike-encoded audio), trained δ values cluster around 1–3 ms and 15–25 ms — roughly corresponding to phoneme-internal and phoneme-to-phoneme timescales. This suggests the locality learning is capturing genuinely useful structure, not just fitting noise.
Limitations of the current formulation
We're not claiming this mechanism generalizes to all sequence modeling tasks. Several important limitations:
- Global dependencies: the locality gate suppresses long-range attention. For tasks where semantics require integrating information across long time gaps (e.g., wake-word detection where the wake word must be matched against a model learned across hundreds of examples with varying timing), the locality bias works against accurate classification.
- Training stability: learning δ jointly with the attention weights is unstable without a warm-start protocol where δ is fixed initially and only freed for optimization after the attention weights have converged. Training from scratch with free δ frequently collapses to degenerate solutions (δ → 0 or δ → ∞).
- Hardware mapping approximation: the spiking product approximation loses accuracy on tasks requiring precise attention weighting. For classification tasks with clear category boundaries, the accuracy gap is small; for regression or ranking tasks, it can be large.
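The warm-start protocol from the training-stability item can be sketched as a gated update rule. Several things here are assumptions made for illustration and are not described in the post: a fixed step count stands in for "attention weights have converged", δ is optimized in log space, and the result is clamped to a range as one possible guard against the δ → 0 and δ → ∞ collapse. All names and default values are hypothetical.

```python
import math


def update_log_delta(log_delta, grad, step, warmup_steps, lr,
                     delta_min=1e-4, delta_max=1.0):
    """One gradient step on log(delta), frozen during warm-start.

    Assumptions (illustrative): warm-start ends after a fixed number of steps,
    and delta is clamped to [delta_min, delta_max] (in seconds).
    """
    if step < warmup_steps:
        return log_delta  # warm-start phase: delta stays at its initial value
    new_value = log_delta - lr * grad
    return min(max(new_value, math.log(delta_min)), math.log(delta_max))


start = math.log(0.005)  # initial delta of 5 ms
frozen = update_log_delta(start, grad=1.0, step=10, warmup_steps=100, lr=0.1)
freed = update_log_delta(start, grad=1.0, step=100, warmup_steps=100, lr=0.1)
clamped = update_log_delta(start, grad=1e6, step=100, warmup_steps=100, lr=0.1)
print(math.exp(frozen), math.exp(freed), math.exp(clamped))
```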
The compiler support for LocalSpikeAttn layers is currently experimental — the layer type compiles for Loihi 2 but not yet for Akida, because the density gate mechanism requires programmable delay lines that aren't available on AKD1000/1500. The mechanism will be promoted to stable in NMC compiler release v0.8, pending Akida backend support.
The broader point is that event-driven temporal processing requires architectural primitives that standard sequence models don't provide — and that building those primitives in a way that maps to neuromorphic hardware constraints is a non-trivial co-design problem. Temporal locality is one such primitive; it's not the complete answer, but it's a tractable handle on the O(N²) problem that emerges from treating event density as a computable signal property rather than a fixed engineering parameter.


