Event-Driven Sensor Fusion: Combining DVS Camera and IMU on a Sub-milliwatt Budget

Combining a DVS camera with an IMU seems straightforward on paper: two sensors, both producing timestamped data, feeding into a neuromorphic inference core. In practice, temporal synchronization between the two sensors and the architectural challenge of fusing structurally different event streams into a coherent spike-coded representation dominate the engineering time. This post documents the approach we developed for a gesture recognition system that runs below 1 mW active power.

The synchronization problem

A DVS camera generates events at microsecond-resolution timestamps using the camera's internal oscillator. An IMU (MEMS 3-axis accelerometer + 3-axis gyroscope) typically samples at 100–1000 Hz with timestamps from the host microcontroller's SysTick timer. The two clocks run independently, so the two time domains drift apart. At 100 Hz IMU sampling, one IMU sample arrives every 10 ms; meanwhile, the DVS may generate anywhere from 0 to 500K events in that same 10 ms window, each with its own microsecond timestamp.

The fundamental issue is that the two sensors measure different phenomena at different temporal resolutions, and fusion requires aligning them in a common time domain. Naive approaches (aligning by wall-clock arrival time at the host) introduce jitter from USB latency (for DVS cameras on USB interfaces) or I2C/SPI bus latency (for IMU sensors). For the gesture recognition application, timing jitter above ~2 ms degrades classification accuracy because the motion direction seen by the DVS and the angular velocity reported by the IMU no longer line up in time.

Hardware timestamping as the solution

The clean solution is hardware timestamp generation from a shared oscillator. For our prototype, we used a DAVIS240C camera (which provides both DVS events and APS frames from a single sensor chip) with its built-in FPGA generating 1 µs-resolution timestamps, combined with an ICM-42688-P IMU configured to timestamp each sample using the FPGA's shared pulse-per-second (PPS) signal. Both event streams then share the same time reference to within the PPS generation jitter (~100 ns), which is negligible for millisecond-scale fusion.

Without a shared hardware timestamp, software timestamp synchronization requires a calibration procedure at startup (analogous to the IEEE 1588 Precision Time Protocol used for network time) that estimates and compensates for the fixed propagation latency between each sensor and the host timestamp capture. The NMC SDK includes a TimestampCalibrator class that performs this calibration over a 5-second initialization window and maintains a running drift estimate during deployment.
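
To make the offset-and-drift idea concrete, here is a minimal sketch of a software calibration step. It is not the TimestampCalibrator implementation; the function names are hypothetical, and it assumes you can collect pairs of timestamps for the same physical events (a sync pulse, for example) in the sensor's clock and in the host's clock. A least-squares line through those pairs gives a scale (relative drift) and an offset (fixed latency plus clock origin difference).

```python
from typing import Sequence, Tuple


def fit_clock_map(sensor_us: Sequence[float],
                  host_us: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of host_us ~= scale * sensor_us + offset.

    scale captures relative clock drift; offset captures fixed latency
    plus the difference between the two clock origins.
    """
    n = len(sensor_us)
    if n < 2 or n != len(host_us):
        raise ValueError("need at least two paired timestamps")
    mean_s = sum(sensor_us) / n
    mean_h = sum(host_us) / n
    var_s = sum((s - mean_s) ** 2 for s in sensor_us)
    if var_s == 0:
        raise ValueError("sensor timestamps must not all be equal")
    cov = sum((s - mean_s) * (h - mean_h)
              for s, h in zip(sensor_us, host_us))
    scale = cov / var_s
    offset = mean_h - scale * mean_s
    return scale, offset


def to_host_time(t_sensor_us: float, scale: float, offset: float) -> float:
    """Map a sensor timestamp into the host time domain."""
    return scale * t_sensor_us + offset
```

A line fit averages out random jitter but cannot separate a fixed one-way latency from a clock offset; that is fine for fusion, where only the mapping between the two streams matters. Re-running the fit over a sliding window is one way to track drift during deployment.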

Encoding DVS events for neuromorphic input

A DVS camera produces events of the form (x, y, t, p) — pixel coordinate, timestamp, and polarity (ON or OFF edge). The DAVIS240C has 240×180 pixels. Feeding all pixels as independent input channels to a neuromorphic core would require 240×180×2 = 86,400 input neurons — far beyond the capacity of any coin-cell-compatible neuromorphic target.

We use a spatial downsampling + local event count encoding:

# DVS event encoding for neuromorphic input
# Input: event stream (x, y, t, p) at up to 100K events/second
# Output: 16×12×2 spike channel input (384 channels)

import nrm.encoding as enc

encoder = enc.DVSPoolEncoder(
    input_resolution=(240, 180),
    output_resolution=(16, 12),
    time_bin_ms=2.0,             # 2 ms temporal bins
    polarity_channels=2,         # ON and OFF as separate channels
    threshold=3,                 # min events in spatial bin to produce a spike
)

# Each 2ms bin: produce a spike on channel (row, col, polarity) if
# >= 3 DVS events occurred in that spatial pool during the bin
# Result: sparse spike tensor (16, 12, 2) per 2ms timestep
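
For readers without the SDK, the pooling rule can be written out in plain Python. This is a sketch of the technique as described in the comments above, not the DVSPoolEncoder source: the function name is hypothetical, polarity is assumed to be encoded as 0 (OFF) or 1 (ON), and channels are indexed as (col, row, polarity) on a grid of 16 columns by 12 rows.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[int, int, int, int]    # (x, y, t_us, polarity in {0, 1})
Channel = Tuple[int, int, int]       # (col, row, polarity)


def pool_encode(events: Iterable[Event],
                in_res: Tuple[int, int] = (240, 180),
                out_res: Tuple[int, int] = (16, 12),
                bin_us: int = 2000,
                threshold: int = 3) -> Dict[int, Set[Channel]]:
    """Return, for each time bin index, the set of channels that spike."""
    pool_w = in_res[0] // out_res[0]   # 15 pixels per pooled column
    pool_h = in_res[1] // out_res[1]   # 15 pixels per pooled row
    counts: Counter = Counter()
    for x, y, t_us, p in events:
        counts[(t_us // bin_us, x // pool_w, y // pool_h, p)] += 1
    spikes: Dict[int, Set[Channel]] = {}
    for (b, col, row, p), n in counts.items():
        if n >= threshold:             # at most one spike per channel per bin
            spikes.setdefault(b, set()).add((col, row, p))
    return spikes
```

Bins with no spiking channel are simply absent from the result, which keeps the representation sparse when the scene is static.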

The threshold=3 parameter is critical. Setting it to 1 (any event triggers a spike) produces high firing rates from noise events. Setting it too high (>5) misses genuine low-contrast motion. Calibration for a given deployment environment is typically done by recording 5–10 minutes of background noise and finding the threshold that keeps the false positive event rate below 5 spikes/second across the full input channel set.
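
The calibration procedure can be sketched as a simple sweep: replay the background recording at each candidate threshold and keep the lowest one whose spike rate stays under the budget. The helper names below are hypothetical, and the sketch assumes 15×15-pixel pools and 2 ms bins to match the encoder configuration above.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple

Event = Tuple[int, int, int, int]    # (x, y, t_us, polarity)


def spikes_per_second(events: Sequence[Event], duration_s: float,
                      threshold: int, pool_px: int = 15,
                      bin_us: int = 2000) -> float:
    """Spike rate across all channels for one candidate threshold."""
    counts = Counter((t // bin_us, x // pool_px, y // pool_px, p)
                     for x, y, t, p in events)
    n_spikes = sum(1 for n in counts.values() if n >= threshold)
    return n_spikes / duration_s


def calibrate_threshold(events: Sequence[Event], duration_s: float,
                        max_rate_hz: float = 5.0,
                        candidates: Iterable[int] = range(1, 9)
                        ) -> Optional[int]:
    """Lowest threshold whose noise spike rate is below max_rate_hz."""
    for th in candidates:
        if spikes_per_second(events, duration_s, th) < max_rate_hz:
            return th
    return None    # no candidate met the budget
```

Picking the lowest passing threshold keeps as much sensitivity to low-contrast motion as the noise budget allows.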

Encoding IMU data for neuromorphic input

IMU data is continuous-valued (accelerations in m/s², angular velocities in deg/s), not natively event-driven. Converting it to spikes requires an explicit encoding scheme. We use temporal delta encoding: a spike is emitted on a channel when the IMU value on that axis changes by more than a threshold amount since the last spike on that channel.

# IMU temporal delta encoding
encoder_imu = enc.IMUDeltaEncoder(
    axes=['ax', 'ay', 'az', 'gx', 'gy', 'gz'],   # 6 axes
    delta_thresholds={
        'ax': 0.2,   # m/s² — emit spike when acceleration change > 0.2 m/s²
        'ay': 0.2,
        'az': 0.3,   # higher threshold for Z (gravity component is static)
        'gx': 5.0,   # deg/s — emit spike when rotation rate change > 5 deg/s
        'gy': 5.0,
        'gz': 5.0,
    },
    up_down_channels=True,  # separate channels for positive and negative deltas
    # → 12 input channels total for IMU
)
# IMU spike rate: typically 10-80 spikes/second across all channels during gesture
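
The delta rule itself is only a few lines. The following is a plain-Python sketch of temporal delta encoding as described above, not the IMUDeltaEncoder source; the function name and the (t_us, axis, sign) output format are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[int, Dict[str, float]]    # (t_us, {axis: value})
Spike = Tuple[int, str, int]             # (t_us, axis, +1 or -1)


def delta_encode(samples: Iterable[Sample],
                 thresholds: Dict[str, float]) -> List[Spike]:
    """Emit a spike when an axis moves more than its threshold away
    from the value it had at its last spike."""
    reference: Dict[str, float] = {}
    spikes: List[Spike] = []
    for t_us, values in samples:
        for axis, th in thresholds.items():
            v = values[axis]
            if axis not in reference:
                reference[axis] = v      # first sample sets the reference
                continue
            delta = v - reference[axis]
            if abs(delta) > th:
                spikes.append((t_us, axis, 1 if delta > 0 else -1))
                reference[axis] = v      # reset reference at the spike
    return spikes


samples = [(0, {"gx": 0.0}), (10_000, {"gx": 3.0}), (20_000, {"gx": 6.0}),
           (30_000, {"gx": 6.0}), (40_000, {"gx": 0.0})]
print(delta_encode(samples, {"gx": 5.0}))
# -> [(20000, 'gx', 1), (40000, 'gx', -1)]
```

This variant emits at most one spike per axis per sample, so a very large jump produces a single spike rather than a burst. The sign maps directly onto the separate up and down channels.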

With 12 IMU channels and 384 DVS channels, the fused input is 396 channels. That is manageable for a Loihi 2 deployment (up to 1 million neurons per chip) and tight but feasible for Xylo: with only 64 neurons in inference mode, the 396 input channels cannot be represented as neuron populations, so the input encoding relies on hardware-level event routing, which Xylo handles via its analog frontend.

Fusion architecture: cross-modal spiking attention

The fusion model must integrate two structurally different event streams — DVS events capturing motion edges in 2D space, IMU events capturing body/device orientation and acceleration. These two modalities are complementary: DVS events capture what is moving in the visual field; IMU captures how the device is moving. For hand gesture recognition, both are informative, but their relative importance varies by gesture type.

A pure concatenation approach (stack DVS and IMU channels as a single 396-dimensional input) loses the structural difference between the two modalities. We instead use a two-stream architecture with a learned cross-modal gating mechanism:

  • DVS stream: 3-layer convolutional SNN (2D spatial convolutions on the 16×12 grid, with LIF neurons)
  • IMU stream: 2-layer fully-connected SNN (on the 12-channel time-series)
  • Fusion gate: a 32-neuron gating population that learns to modulate the DVS stream's output based on IMU activity — effectively learning "weight visual events more when IMU shows low acceleration (stable device) and less when IMU shows high angular velocity (device moving faster than hand)"

The fusion gate is the design choice that most improved classification accuracy for gesture recognition at a marginal cost in neuron count. Without it, the model treats fast-device-motion frames and slow-motion frames identically, which increases false positive rates during device motion that happens to produce DVS events similar to a target gesture. With the gate, the model learns to discount visual evidence when the device's own motion could be causing the events.
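
The gating idea can be illustrated with a deliberately simplified, non-spiking sketch. The deployed gate is a 32-neuron spiking population with learned weights; the version below collapses it to a single sigmoid gate over per-window IMU spike counts, with hand-picked weights, purely to show the direction of the effect. All names and values here are hypothetical.

```python
import math
from typing import List, Sequence


def gate_value(imu_spike_counts: Sequence[float],
               weights: Sequence[float], bias: float) -> float:
    """Scalar gate in (0, 1) computed from per-channel IMU spike counts."""
    z = bias + sum(w * c for w, c in zip(weights, imu_spike_counts))
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(dvs_features: Sequence[float],
                 imu_spike_counts: Sequence[float],
                 weights: Sequence[float], bias: float) -> List[float]:
    """Scale the DVS stream output by the IMU-driven gate."""
    g = gate_value(imu_spike_counts, weights, bias)
    return [g * f for f in dvs_features]


# Illustrative values only: negative weights make the gate close as
# IMU activity rises, so visual evidence counts for less.
weights = [-0.5] * 12
bias = 2.0
stable = [0] * 12    # device held still during the window
moving = [2] * 12    # two spikes on every IMU channel
print(gate_value(stable, weights, bias) > gate_value(moving, weights, bias))
```

In the trained model the gate weights are learned rather than fixed, and nothing forces them all to be negative; the network can learn that some IMU channels should raise the weight on visual evidence.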

Measured power on Loihi 2

The full two-stream fusion model on Loihi 2 (3-layer DVS CNN + 2-layer IMU MLP + fusion gate, total ~4,200 neurons, T=10 timesteps at 2 ms bins = 20 ms inference window):

  • Active inference power: 680–920 µW (depending on gesture activity level)
  • Idle power (no events): 280 µW (neurocores in sleep, spike router active)
  • Average power during gesture session at 1 gesture/10 seconds: ~310 µW
  • Energy per classification: 4.8 µJ (gesture window: ~20 ms active at 240 µW above idle)
  • Classification accuracy on DVS-Gesture-11 (adapted for hand gesture subset): 92.4%

The sub-milliwatt target is met for the inference core: average power under the expected use-case duty cycle is ~310 µW, and active inference peaks at 920 µW during dense-gesture events. These figures cover the Loihi 2 inference core only; the full system, including the DVS camera (typically 5–15 mW for a DAVIS240C) and the IMU (0.5–2 mW for the ICM-42688-P), is above the coin-cell budget. This implementation targets a system with a larger battery (a wearable with a 100 mAh LiPo, for example).
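
A quick back-of-the-envelope check shows how the figures above relate. The sketch uses the standard duty-cycle average (idle power plus the active fraction times the active-minus-idle difference) and a nominal 3.7 V cell voltage, which is an assumption not stated in the measurements. Regulator losses, the host microcontroller, and battery derating are ignored, so the runtime is an upper bound at best.

```python
def average_power_uw(idle_uw: float, active_uw: float,
                     active_fraction: float) -> float:
    """Duty-cycle weighted average power."""
    return idle_uw + active_fraction * (active_uw - idle_uw)


def active_fraction_for(avg_uw: float, idle_uw: float,
                        active_uw: float) -> float:
    """Fraction of time spent active that yields the given average."""
    return (avg_uw - idle_uw) / (active_uw - idle_uw)


def battery_hours(capacity_mah: float, voltage_v: float,
                  load_mw: float) -> float:
    """Ideal runtime: stored energy (mWh) divided by load (mW)."""
    return capacity_mah * voltage_v / load_mw


# Active fraction implied by ~310 uW average with 280 uW idle,
# at the two ends of the measured active range.
lo_frac = active_fraction_for(310, 280, 920)
hi_frac = active_fraction_for(310, 280, 680)
print(f"implied active fraction: {lo_frac:.3f} to {hi_frac:.3f}")

# Full-system load in mW: camera + IMU + inference core (~0.31 mW).
best_case = battery_hours(100, 3.7, 5.0 + 0.5 + 0.31)
worst_case = battery_hours(100, 3.7, 15.0 + 2.0 + 0.31)
print(f"ideal runtime: {worst_case:.0f} to {best_case:.0f} hours")
```

Under these assumptions the ~310 µW average corresponds to the core being active for roughly 5–7.5% of the time, and a 100 mAh cell gives on the order of 21 to 64 hours for the full system. The sensors dominate the budget, not the inference core.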

What doesn't work well in this architecture

We're not claiming this is a solved problem. The current fusion architecture has notable failure modes:

Lighting-induced false positives: sudden changes in scene lighting (flickering LED, person walking past a window) generate DVS events across the full spatial field, which the DVS stream may misclassify as a gesture even when the IMU shows no hand motion. The fusion gate helps but doesn't eliminate this. A scene-classification pre-filter that suppresses global DVS events (events correlated across >60% of the spatial field) reduces this failure mode by roughly 70% in our testing.
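
One simple way to approximate such a pre-filter is to measure how much of the pooled grid is active in a single time bin and drop the bin when that fraction exceeds the limit. The sketch below is an assumption-laden stand-in for the filter we use: it treats "correlated across the spatial field" as "active in the same 2 ms bin", ignores polarity when counting cells, and uses hypothetical function names.

```python
from typing import Dict, Set, Tuple

Channel = Tuple[int, int, int]    # (col, row, polarity)


def is_global_event(channels: Set[Channel],
                    grid: Tuple[int, int] = (16, 12),
                    max_fraction: float = 0.6) -> bool:
    """True if more than max_fraction of pooled cells spiked in one bin."""
    active_cells = {(col, row) for col, row, _p in channels}
    return len(active_cells) / (grid[0] * grid[1]) > max_fraction


def suppress_global(bins: Dict[int, Set[Channel]],
                    grid: Tuple[int, int] = (16, 12),
                    max_fraction: float = 0.6) -> Dict[int, Set[Channel]]:
    """Drop time bins dominated by scene-wide activity."""
    return {b: ch for b, ch in bins.items()
            if not is_global_event(ch, grid, max_fraction)}
```

A per-bin test like this is cheap but blunt: a large gesture close to the camera can also light up most of the grid, so the limit trades lighting robustness against sensitivity to near-field motion.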

IMU calibration drift: the delta-encoding threshold calibration assumes a stable sensor offset. MEMS accelerometer offset drifts with temperature (typically ±0.1 m/s² over -20°C to +80°C range). At low delta thresholds, this drift produces false IMU spike events. Re-calibration requires a 2-second stillness window, which the runtime requests if it detects anomalous IMU event rates.

Event-driven fusion is a significantly more complex engineering problem than single-sensor inference, and the complexity scales non-linearly with the number of modalities. The three-sensor case (DVS + IMU + microphone) is substantially harder than two sensors due to the three-way synchronization and the three-modality fusion architecture. It's tractable, but it's not something to underestimate in a deployment timeline.