Optimization Ledger
What This File Is
This is the low-level implementation ledger for Mimir’s most likely hot paths. It is not a benchmark result. It is a map of what to test, why it might matter, and where the current code is likely to pay unnecessary cost.
Hot Path 1: Chirp-Bin Decode
Current Shape
- Build an energy trace across a window.
- Classify local candidate windows.
- Dechirp and score each bin.
- Run code-valid anchor selection.
- Fit clock and refine offset.
Bottleneck Risk
The current scalar/classic C# version is correct enough to reason about, but it still encourages repeated full-window work. Six or more audio streams at 192 kHz make that expensive quickly.
Preferred Evolution
- Streaming energy proposal ring.
- Precomputed chirp kernels by sample rate and symbol plan.
- SIMD dechirp/bin score on CPU for small channel counts.
- Batched GPU compute for many candidate windows or many remote receivers.
- Keep the codebook trellis on CPU unless profiling proves otherwise; it is branchy and small.
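The streaming energy proposal ring in the list above can be sketched as a running sum of squares over a fixed window, so each new sample costs one add and one subtract instead of a full-window pass. This is a minimal sketch, not the Mimir implementation: the class name `RunningEnergy`, the power-of-two window, and the `double` accumulator are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical streaming energy tracker (not taken from the Mimir sources).
// Keeps the sum of squares of the last `windowPow2` samples.
class RunningEnergy {
public:
    explicit RunningEnergy(std::size_t windowPow2)
        : mask_(windowPow2 - 1), squares_(windowPow2, 0.0) {
        // Assumption: power-of-two window so the ring index is a cheap mask.
        assert(windowPow2 >= 1 && (windowPow2 & mask_) == 0);
    }

    // Push one sample and return the energy of the current window.
    double push(float sample) {
        const double sq = static_cast<double>(sample) * sample;
        double& slot = squares_[writeIndex_ & mask_];
        sum_ += sq - slot;  // add the new square, drop the one leaving the window
        slot = sq;
        ++writeIndex_;
        return sum_;
    }

    // Absolute count of samples pushed so far, usable as a frame key.
    std::size_t samplesSeen() const { return writeIndex_; }

private:
    std::size_t mask_;
    std::vector<double> squares_;
    double sum_ = 0.0;
    std::size_t writeIndex_ = 0;
};
```

A long-running sum accumulates rounding error, so a real version would probably recompute the sum from the ring at some interval. Candidate windows could then be proposed whenever the returned energy crosses a threshold, keyed by `samplesSeen()`.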
Micro-Optimization Ideas
- Structure-of-arrays for sine/cosine kernels.
- Process bins in groups of 8 floats on AVX2/FMA.
- Keep hot candidate windows contiguous and aligned.
- Avoid allocating candidate arrays inside every analysis tick.
- Cache symbol reliability weights in dense arrays.
- Precompute expected event sample offsets for the active window.
- Use a ring of candidate frames keyed by absolute sample index.
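Several of the ideas above (structure-of-arrays kernels, contiguous windows, no per-tick allocation) can be combined in one scoring routine. The sketch below is hedged: `ChirpKernelBank`, `buildKernelBank`, `scoreBins`, and `bestBin` are hypothetical names, and the linear up-chirp with per-bin tone offsets is an assumed symbol plan, not the one Mimir uses. It folds the dechirp and the bin tone into one precomputed kernel per bin and correlates a real-valued window against it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical kernel bank: cosine and sine planes stored separately (SoA),
// each laid out as bins * window floats, row-major by bin.
struct ChirpKernelBank {
    std::size_t window = 0;
    std::size_t bins = 0;
    std::vector<float> cosPlane;
    std::vector<float> sinPlane;
};

// Assumed symbol plan: a linear chirp starting at `startCyclesPerSample`,
// sweeping `sweepCyclesPerSample` across the window, plus bin/window offset.
ChirpKernelBank buildKernelBank(std::size_t window, std::size_t bins,
                                double startCyclesPerSample,
                                double sweepCyclesPerSample) {
    assert(window > 0 && bins > 0);
    ChirpKernelBank bank;
    bank.window = window;
    bank.bins = bins;
    bank.cosPlane.resize(window * bins);
    bank.sinPlane.resize(window * bins);
    const double twoPi = 6.283185307179586;
    const double rate = sweepCyclesPerSample / static_cast<double>(window);
    for (std::size_t bin = 0; bin < bins; ++bin) {
        for (std::size_t n = 0; n < window; ++n) {
            const double t = static_cast<double>(n);
            const double cycles = startCyclesPerSample * t + 0.5 * rate * t * t +
                                  static_cast<double>(bin) * t / static_cast<double>(window);
            bank.cosPlane[bin * window + n] = static_cast<float>(std::cos(twoPi * cycles));
            bank.sinPlane[bin * window + n] = static_cast<float>(std::sin(twoPi * cycles));
        }
    }
    return bank;
}

// Score every bin into caller-owned storage; nothing is allocated here.
// The inner loop is unit-stride over three float arrays, which is the shape
// an 8-wide AVX2/FMA version would process.
void scoreBins(const ChirpKernelBank& bank, const float* samples, float* scores) {
    for (std::size_t bin = 0; bin < bank.bins; ++bin) {
        const float* c = bank.cosPlane.data() + bin * bank.window;
        const float* s = bank.sinPlane.data() + bin * bank.window;
        float re = 0.0f;
        float im = 0.0f;
        for (std::size_t n = 0; n < bank.window; ++n) {
            re += samples[n] * c[n];
            im += samples[n] * s[n];
        }
        scores[bin] = re * re + im * im;  // squared correlation magnitude
    }
}

// Index of the highest-scoring bin.
std::size_t bestBin(const float* scores, std::size_t bins) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < bins; ++i) {
        if (scores[i] > scores[best]) best = i;
    }
    return best;
}
```

The kernel bank would be built once per sample rate and symbol plan, and `scoreBins` called per candidate window with a reused `scores` buffer. Whether the compiler vectorizes the reduction depends on flags, so a hand-written intrinsic path remains something to measure rather than assume.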
Sample References
samples/StreamingChirpBinDecoderSketch.cs
samples/Avx2DechirpGoertzelSketch.cpp
samples/ChirpBinScore.compute.hlsl
Hot Path 2: Passive GCC-PHAT
Current Shape
- Allocate complex arrays per estimate.
- FFT reference/candidate.
- Normalize cross-spectrum.
- Inverse FFT.
- Search lags.
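The normalize step in the list above is the PHAT weighting: each cross-spectrum bin is divided by its own magnitude so only phase survives. A minimal sketch follows, assuming the two spectra already exist and the output buffer is preallocated by the caller. The function name `phatCrossSpectrum`, the `reference * conj(candidate)` sign convention, and the epsilon guard are assumptions, not details from the current code.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: PHAT-normalized cross-spectrum of two equal-length
// spectra, written into a caller-owned buffer (no allocation here).
void phatCrossSpectrum(const std::vector<std::complex<float>>& reference,
                       const std::vector<std::complex<float>>& candidate,
                       std::vector<std::complex<float>>& out,
                       float epsilon = 1e-12f) {
    assert(reference.size() == candidate.size());
    assert(out.size() == reference.size());
    for (std::size_t i = 0; i < reference.size(); ++i) {
        // Assumed convention: reference times conjugate of candidate.
        const std::complex<float> cross = reference[i] * std::conj(candidate[i]);
        const float mag = std::abs(cross);
        // Bins with no energy carry no phase information; zero them.
        out[i] = (mag > epsilon) ? cross / mag : std::complex<float>(0.0f, 0.0f);
    }
}
```

The inverse FFT of `out` would then give the correlation curve that the lag search runs over.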
Bottleneck Risk
Passive sync is cheaper than dense chirplet matching, but repeated allocations and full FFTs per source pair can still dominate if called too often.
Preferred Evolution
- Reuse FFT buffers/plans.
- Keep reference spectrum cached for the current analysis edge.
- Batch candidate transforms.
- Move to native FFTW/KissFFT/MKL/cuFFT only after measured C# FFT cost matters.
- Use passive only as confidence/drift support, not as mandatory every-tick work.
Micro-Optimization Ideas
- Window into preallocated Complex spans.
- Precompute Hann window.
- Preemphasis in one pass with mean removal.
- Limit lag search by physical/network horizon.
- Use parabolic interpolation only around plausible peaks.
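The last two ideas above, a bounded lag search and parabolic refinement around the peak, can be sketched together. This is an illustration under assumptions: `LagPeak` and `findBoundedPeak` are hypothetical names, and the correlation is assumed to be laid out so that index `zeroLagIndex` corresponds to zero lag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct LagPeak {
    double lagSamples;  // refined lag relative to zero lag, in samples
    float value;        // correlation value at the integer peak
};

// Hypothetical helper: correlation[i] holds lag (i - zeroLagIndex).
// Only lags within +/- maxLag are searched.
LagPeak findBoundedPeak(const std::vector<float>& correlation,
                        std::size_t zeroLagIndex, std::size_t maxLag) {
    assert(!correlation.empty() && zeroLagIndex < correlation.size());
    const std::size_t lo = zeroLagIndex > maxLag ? zeroLagIndex - maxLag : 0;
    const std::size_t hi = std::min(zeroLagIndex + maxLag, correlation.size() - 1);

    std::size_t best = lo;
    for (std::size_t i = lo + 1; i <= hi; ++i) {
        if (correlation[i] > correlation[best]) best = i;
    }

    // Fit a parabola through the peak and its two neighbours.
    double offset = 0.0;
    if (best > 0 && best + 1 < correlation.size()) {
        const double ym = correlation[best - 1];
        const double y0 = correlation[best];
        const double yp = correlation[best + 1];
        const double denom = ym - 2.0 * y0 + yp;
        if (denom < 0.0) {  // only refine when the three points are concave
            offset = std::clamp(0.5 * (ym - yp) / denom, -0.5, 0.5);
        }
    }
    const double lag = static_cast<double>(best) -
                       static_cast<double>(zeroLagIndex) + offset;
    return {lag, correlation[best]};
}
```

`maxLag` would come from the physical or network horizon, so a strong but implausible peak outside that range never reaches the interpolation step.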
Hot Path 3: Fractional Delay And SRO Actuator
Current Shape
- Not built.
Preferred Evolution
- Prototype native/Faust fractional delay using Farrow/Lagrange for small sub-sample corrections.
- Add a higher quality polyphase sinc path for program output.
- Drive both from smoothed delay/SRO state.
- Keep per-source state in DSP, not in UI/runtime strings.
Micro-Optimization Ideas
- Fixed filter order for predictable SIMD.
- Interleave channels only where the DSP kernel wants it.
- Separate control-rate state update from audio-rate sample processing.
- Use denormal guards.
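Since this path is not built yet, the following is only a sketch of what the small-correction actuator could look like: a fixed-order cubic Lagrange interpolator evaluated in Horner form, which is the Farrow arrangement for that filter. The class name `CubicFractionalDelay` and the convention that the total delay is one sample plus a fraction in [0, 1) are assumptions. Denormal guards and smoothing of the delay state are not shown.

```cpp
#include <cassert>

// Hypothetical fractional delay: output y[n] approximates x(n - 1 - frac).
class CubicFractionalDelay {
public:
    // Control-rate update: set the fractional part of the delay, 0 <= frac < 1.
    void setFraction(float frac) {
        assert(frac >= 0.0f && frac < 1.0f);
        mu_ = 1.0f - frac;  // interpolation position between x[n-2] and x[n-1]
    }

    // Audio-rate update: push one input sample, return one delayed sample.
    float process(float in) {
        xm_ = x0_;  // x[n-3]
        x0_ = x1_;  // x[n-2]
        x1_ = x2_;  // x[n-1]
        x2_ = in;   // x[n]
        // Cubic Lagrange coefficients as polynomials in mu (Farrow form).
        const float c0 = x0_;
        const float c1 = x1_ - xm_ / 3.0f - x0_ / 2.0f - x2_ / 6.0f;
        const float c2 = (x1_ + xm_) / 2.0f - x0_;
        const float c3 = (x2_ - xm_) / 6.0f + (x0_ - x1_) / 2.0f;
        return ((c3 * mu_ + c2) * mu_ + c1) * mu_ + c0;
    }

private:
    float xm_ = 0.0f, x0_ = 0.0f, x1_ = 0.0f, x2_ = 0.0f;
    float mu_ = 1.0f;  // default: exactly one sample of delay
};
```

The fixed four-tap order keeps the per-sample work constant, and splitting `setFraction` from `process` mirrors the control-rate versus audio-rate separation listed above. One instance per source keeps the state inside the DSP layer.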
Sample Reference
samples/FarrowFractionalDelaySketch.cpp
Hot Path 4: Native Rolling Buffers
Current Shape
- C# rolling buffers store sample envelopes.
- Rust reservoir stores one shared-edge native rolling buffer with typed views.
Bottleneck Risk
The C# buffer shape is fine for current proof state, but payload-heavy audio and video should move to native memory handles. Copies and per-sample allocations will become visible as source count rises.
Preferred Evolution
- Native SPSC rings per capture worker feeding a shared reservoir index.
- Payload handles point to native/audio/GPU memory owned by capture/DSP/Fensalir.
- Runtime stores metadata and current belief.
Micro-Optimization Ideas
- Power-of-two ring capacities.
- Single writer per capture device.
- Cache-line padded head/tail counters.
- Batch publish blocks.
- Avoid sharing mutable payload ownership across subsystems.
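The ring ideas above (power-of-two capacity, single writer, padded counters, batch publish) fit in a small single-producer single-consumer queue. This is a hedged sketch rather than the contents of the referenced sample file: `SpscRing` is a hypothetical name, the 64-byte padding is an assumed cache-line size, and the element type would in practice be a block descriptor or payload handle.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SPSC ring: exactly one producer thread and one consumer thread.
template <typename T>
class SpscRing {
public:
    explicit SpscRing(std::size_t capacityPow2)
        : mask_(capacityPow2 - 1), slots_(capacityPow2) {
        assert(capacityPow2 >= 2 && (capacityPow2 & mask_) == 0);
    }

    // Producer only: write up to `count` items, then publish them with a
    // single release store. Returns how many items were accepted.
    std::size_t pushBatch(const T* items, std::size_t count) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        const std::size_t freeSlots = slots_.size() - (head - tail);
        const std::size_t n = count < freeSlots ? count : freeSlots;
        for (std::size_t i = 0; i < n; ++i) {
            slots_[(head + i) & mask_] = items[i];
        }
        head_.store(head + n, std::memory_order_release);
        return n;
    }

    bool tryPush(const T& item) { return pushBatch(&item, 1) == 1; }

    // Consumer only: take the oldest item if one is available.
    bool tryPop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;
        out = slots_[tail & mask_];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    const std::size_t mask_;
    std::vector<T> slots_;
    // Assumed 64-byte cache lines: keep the two counters on separate lines
    // so producer and consumer do not contend on the same line.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

One ring per capture worker keeps the single-writer rule trivially true. The shared reservoir index would then consume from each ring without taking ownership of the payload memory itself.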
Sample Reference
samples/SpscAudioBlockRingSketch.cpp
Hot Path 5: Camera Capture And GPU Fusion
Current Shape
- Native probes prove direct driver access and cadence.
- Runtime direct driver seam exists.
- Fensalir fusion not wired yet.
Bottleneck Risk
At six cameras, CPU copies and process bridges stop keeping up. Leap/PS3/Kiyo sources need direct capture, stable timestamps, and GPU-friendly payloads.
Preferred Evolution
- Direct KS/libusb/vendor driver workers.
- Native payload handles into runtime/reservoir.
- Fensalir consumes current window and uploads/processes on GPU.
- Use D3D12 shared resources where possible.
Micro-Optimization Ideas
- Queue multiple async reads per camera.
- Keep camera buffers pinned/native.
- Avoid decode unless the algorithm needs decoded pixels.
- Do per-camera feature extraction in compute, then fuse compact features.
External References
- LoRa/CSS receivers repeatedly validate dechirp plus FFT/bin scoring as the natural controlled-chirp demodulator shape.
- HLSL Shader Model 6 wave intrinsics are relevant for reductions and FFT-like kernels inside D3D12 compute.
- cuFFT callbacks show a general trick: combine preprocessing with transform load/store to avoid extra memory bandwidth.
- FFTW wisdom/alignment notes matter if we move passive or chirp-bin batches to native FFT plans.
