Edge AI debugging on Pi: capture, visualize and compare model traces from AI HAT+ 2

2026-02-22

Debug edge AI on a Raspberry Pi 5 with the AI HAT+ 2: scripts to capture inference traces and perf counters, plus a small web UI to visualize time and memory.

Stop guessing where your edge model stalls

When inference on a Raspberry Pi 5 with an AI HAT+ 2 takes longer than expected, most engineers go down a rabbit hole: micro-optimizing ops that aren't the real bottleneck, or chasing memory leaks without clear evidence. You need precise, reproducible traces that show which operators, system calls, or kernel events consumed time and memory during an inference run — and a simple UI to compare runs and spot regressions.

What this guide gives you (quick)

  • Practical scripts to capture framework-level inference traces (ONNX Runtime / TFLite) and system counters on Raspberry Pi 5 with AI HAT+ 2.
  • A lightweight Python toolchain to merge traces and perf counters into a timeline-friendly JSON.
  • A minimal Flask web UI that visualizes operator timings and memory usage so you can see where time is spent.
  • Advanced tips (eBPF, automated regression checks, CI integration) and 2026 context for edge AI observability.

The 2026 landscape — why this matters now

By late 2025 and into 2026, edge AI deployments exploded into mainstream IoT and inference-heavy gateways. Two trends matter for your debugging workflow:

  • More on-device acceleration: NPU/accelerator HATs like AI HAT+ 2 pushed workloads on-device, but debugging shifted from high-level model issues to delegate/integration and system-level contention.
  • Observability tooling matured: eBPF, improved runtime profilers (ONNX Runtime JSON profiler, TFLite delegates), and compact trace viewers are now standard in edge toolchains — you can get detailed traces on-device with low overhead.

What to capture (the checklist)

When you profile an inference run, collect these four classes of telemetry:

  1. Framework traces — operator start/end, input shapes, durations (e.g., ONNX Runtime or TFLite profiler).
  2. System performance counters — CPU usage, interrupts, frequency changes, context switches (perf, /proc, mpstat).
  3. Kernel and user events — syscalls, scheduler latency, memory faults (use eBPF / bpftrace for low overhead tracing).
  4. Thermal and power — temperature sensors and voltage rails on the AI HAT+ 2 (via I2C/sysfs, if the vendor exposes them) plus CPU/GPU frequencies (see the sysfs sketch below this list).
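
For the fourth class, here is a minimal sketch that samples CPU temperature and clock frequency from the standard Linux sysfs paths; the AI HAT+ 2's own rails are vendor-specific, so they are not assumed here.

# thermal_sample.py: sample CPU temperature and clock via standard sysfs paths.
# thermal_zone0/temp reports millidegrees C; scaling_cur_freq reports kHz.
# Adjust or extend the paths if the AI HAT+ 2 vendor exposes its own sensors.
import time
from pathlib import Path

TEMP = Path('/sys/class/thermal/thermal_zone0/temp')
FREQ = Path('/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq')

for i in range(20):  # ~10 s at 0.5 s intervals
    temp_c = int(TEMP.read_text().strip()) / 1000.0
    freq_mhz = int(FREQ.read_text().strip()) / 1000.0
    print(f'{i}\t{temp_c:.1f} C\t{freq_mhz:.0f} MHz')
    time.sleep(0.5)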

Install these on your Raspberry Pi 5. Commands assume apt; adjust for your distribution.

sudo apt update
sudo apt install -y git python3 python3-venv python3-pip bpftrace linux-perf
# On Ubuntu-based images, install linux-tools-$(uname -r) instead of linux-perf to get perf
# Grab flamegraph for conversions
git clone https://github.com/brendangregg/FlameGraph.git ~/FlameGraph
# (Optional) speedscope / flamebearer-based viewers can be used in the web UI

Also install Python libs:

python3 -m venv venv
. venv/bin/activate
pip install flask pandas psutil

1) Capture framework-level traces (ONNX Runtime example)

ONNX Runtime provides a JSON profiler useful on small devices. Enable profiling from Python:

from pathlib import Path
import onnxruntime as ort

profile_dir = Path('/tmp/ort_profile')
profile_dir.mkdir(parents=True, exist_ok=True)
so = ort.SessionOptions()
so.enable_profiling = True
so.profile_file_prefix = str(profile_dir / 'ort_trace')

sess = ort.InferenceSession('model.onnx', sess_options=so)
# Run your inference loop; input_data must match the model's input name and shape
for i in range(100):
    _ = sess.run(None, {'input': input_data})
# end_profiling() flushes the trace and returns its path (ort_trace_<timestamp>.json)
profile_path = sess.end_profiling()
print('Profile written to', profile_path)

If you use TensorFlow Lite, enable profiling via the Interpreter option or call the TFLite profiling API. The key is to produce a structured trace (JSON or CSV) with operator names and durations.
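
Whichever runtime you use, a quick sanity check before reaching for the full UI is to rank operators by total time. Here is a minimal sketch, assuming the profile is Chrome-trace-style JSON (a flat list of events with name and dur in microseconds, which is what recent ONNX Runtime versions write); adapt the keys if your format differs.

# top_ops.py: rank operators by total time from a profiler JSON file.
# Assumes Chrome-trace-style events: a list of dicts with 'name' and 'dur' (microseconds).
import json
import sys
from collections import defaultdict
from pathlib import Path

data = json.loads(Path(sys.argv[1]).read_text())
events = data if isinstance(data, list) else data.get('events', [])

totals = defaultdict(int)
for evt in events:
    if isinstance(evt, dict) and evt.get('dur'):
        totals[evt.get('name', 'unknown')] += int(evt['dur'])

for name, dur_us in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:15]:
    print(f'{dur_us:>10} us  {name}')

Point it at a single profile file, e.g. python3 top_ops.py /tmp/ort_profile/ort_trace_<timestamp>.json.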

2) Collect system counters with perf and /proc

Use perf stat for aggregate counters, and perf record / perf script to capture call stacks for hot code paths. Example script collects both and writes CSV outputs.

#!/usr/bin/env bash
# start_perf.sh
OUT_DIR=/tmp/edge_trace_$(date +%s)
mkdir -p "$OUT_DIR"
# Aggregate counters while the test runs (perf stat exits when the 10 s sleep ends)
perf stat -a -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions -o "$OUT_DIR/perf_stat.txt" sleep 10 &
# Record samples to later create a flamegraph
perf record -F 99 -a -g -o "$OUT_DIR/perf.data" &
PERF_PID=$!
# Wait for your inference workload to run externally; or run it here
# ./run_inference.sh
# After the test finishes, stop perf record and let it flush perf.data
sleep 12
kill -INT "$PERF_PID"
wait "$PERF_PID" 2>/dev/null || true
perf script -i "$OUT_DIR/perf.data" > "$OUT_DIR/perf.script"
# Convert to folded stacks and a flamegraph (requires FlameGraph cloned earlier)
~/FlameGraph/stackcollapse-perf.pl "$OUT_DIR/perf.script" > "$OUT_DIR/perf.folded"
~/FlameGraph/flamegraph.pl "$OUT_DIR/perf.folded" > "$OUT_DIR/perf.svg"
echo "Perf outputs in $OUT_DIR"

Adjust the sampling frequency (-F) to the workload. On a Pi 5, 99-199 Hz is usually enough to capture hotspots without adding noticeable overhead.

3) Lightweight kernel/user tracing with bpftrace

When a specific syscall or memory-allocation pattern is suspected, use bpftrace to capture short, focused traces. Example: measure the latency of malloc calls in libc (requires a kernel with eBPF and uprobe support, plus bpftrace installed):

#!/usr/bin/env bpftrace
// malloc_trace.bt: measure per-call malloc latency in libc.
// Adjust the library path to the libc your inference process actually loads.
uprobe:/usr/lib/aarch64-linux-gnu/libc.so.6:malloc
{
  @start[tid] = nsecs;
}
uretprobe:/usr/lib/aarch64-linux-gnu/libc.so.6:malloc
/ @start[tid] /
{
  $dur = nsecs - @start[tid];
  printf("%u %s %u %d\n", pid, comm, tid, $dur);
  delete(@start[tid]);
}
// Run: sudo bpftrace malloc_trace.bt > /tmp/malloc_trace.txt &

Note: point the uprobe at the libc (or custom allocator) that your inference process actually loads; check with ldd or the process's /proc/<pid>/maps (a small sketch follows). For scheduler latency or context-switch analysis, bpftrace one-liners on the sched tracepoints can capture time spent waiting on the run queue.
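
To find that path for a live process, read its memory map. A minimal sketch (pass the PID of your inference process):

# find_libc.py: print the libc objects mapped into a running process so the
# bpftrace uprobe path matches what the process actually loaded.
# Usage: python3 find_libc.py <pid>
import sys
from pathlib import Path

maps = Path(f'/proc/{sys.argv[1]}/maps').read_text()
libs = {line.split()[-1] for line in maps.splitlines() if 'libc.so' in line}
for lib in sorted(libs):
    print(lib)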

4) Monitor memory snapshots (tracemalloc / /proc)

Python's tracemalloc is helpful for capturing interpreter-level allocations and their peaks; a short tracemalloc sketch follows the sampler below. For system-level memory, sample /proc/<pid>/smaps or /proc/meminfo periodically, or poll the process with psutil:

# mem_sample.py: sample RSS/VMS of the inference process every 0.5 s
import sys
import time
import psutil

pid = int(sys.argv[1])  # PID of the inference process
p = psutil.Process(pid)
for i in range(10):
    mem = p.memory_info()
    print(i, mem.rss, mem.vms)
    time.sleep(0.5)
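
For the interpreter-level view, here is a minimal tracemalloc sketch you can drop into your runner; inference_step() is a placeholder for your own call.

# tracemalloc sketch: compare interpreter allocations before and after a batch of
# inferences to spot source lines whose allocations keep growing.
import tracemalloc

def inference_step():
    return [0] * 1000  # placeholder: replace with your model invocation

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for _ in range(100):
    inference_step()

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
    print(stat)  # top source lines by allocation growth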

Bringing traces together — a reproducible capture job

Create a single wrapper script that starts system collectors, triggers inference, then stops collectors and archives outputs. Example: capture_inference_run.sh.

#!/usr/bin/env bash
set -euo pipefail
OUT_DIR=/tmp/edge_infer_$(date +%Y%m%d_%H%M%S)
mkdir -p "$OUT_DIR"
# Start system-wide perf stat in the background (stopped with SIGINT below so it prints)
perf stat -a -o "$OUT_DIR/perf_stat.txt" &
PERFSTAT_PID=$!
# Start perf record for flamegraphs
perf record -F 99 -a -g -o "$OUT_DIR/perf.data" &
PERFREC_PID=$!
# Start bpftrace for specific events (optional)
sudo bpftrace malloc_trace.bt > "$OUT_DIR/malloc_trace.txt" &
BPFTRACE_PID=$!
# Trigger the inference workload; replace with your runner
python3 run_inference.py --iterations 100 --profile-dir "$OUT_DIR/ort_profile"
# Stop background collectors; SIGINT lets perf flush its output files
kill -INT "$PERFREC_PID" || true
kill -INT "$PERFSTAT_PID" || true
sudo pkill -INT -f bpftrace || true
wait "$PERFREC_PID" "$PERFSTAT_PID" 2>/dev/null || true
# Convert perf data to a script dump and flamegraph
perf script -i "$OUT_DIR/perf.data" > "$OUT_DIR/perf.script"
~/FlameGraph/stackcollapse-perf.pl "$OUT_DIR/perf.script" > "$OUT_DIR/perf.folded"
~/FlameGraph/flamegraph.pl "$OUT_DIR/perf.folded" > "$OUT_DIR/perf.svg"
# Tar the archive for off-device analysis
tar -czf "$OUT_DIR.tar.gz" -C /tmp "$(basename "$OUT_DIR")"
echo "Saved archive: $OUT_DIR.tar.gz"

Merge traces into a timeline JSON

Most web timeline visualizers accept an array of events with names, start timestamps, and durations. The following simplified Python script parses the ONNX Runtime profiler JSON plus the perf counters and bpftrace output, keeps everything in microseconds, and writes a merged JSON file for the UI.

# merge_traces.py
import json
import sys
from pathlib import Path

# Point BASE at the capture directory produced by capture_inference_run.sh
BASE = Path(sys.argv[1] if len(sys.argv) > 1 else '/tmp/edge_trace')
OUT = []

# ONNX Runtime writes Chrome-trace-style events: a flat JSON list with name/ts/dur
# in microseconds. Adapt the keys below if your profiler format differs.
for p in (BASE / 'ort_profile').glob('ort_trace*.json'):
    data = json.loads(p.read_text())
    events = data if isinstance(data, list) else data.get('events', [])
    for evt in events:
        OUT.append({
            'name': evt.get('name'),
            'cat': 'onnx',
            'ts': int(evt.get('ts', 0)),    # microseconds
            'dur': int(evt.get('dur', 0)),  # microseconds
        })

# Add the perf stat summary as a single atomic event
perf_stat = BASE / 'perf_stat.txt'
if perf_stat.exists():
    OUT.append({'name': 'perf_stat_raw', 'cat': 'perf', 'ts': 0, 'dur': 0,
                'text': perf_stat.read_text()})

# Parse bpftrace output (lines of: pid comm tid dur_ns)
bpf = BASE / 'malloc_trace.txt'
if bpf.exists():
    for line in bpf.read_text().splitlines():
        parts = line.strip().split()
        if len(parts) >= 4 and parts[3].isdigit():
            pid, comm, tid, dur_ns = parts[0], parts[1], parts[2], int(parts[3])
            OUT.append({'name': f'malloc:{comm}', 'cat': 'bpf', 'ts': 0,
                        'dur': dur_ns // 1000})  # ns -> us

# Timestamps are left relative here; normalize against one clock source for real use
(BASE / 'merged_timeline.json').write_text(json.dumps(OUT))
print('Wrote', BASE / 'merged_timeline.json')

For production use, normalize timestamps precisely (use a single clock source, e.g., CLOCK_MONOTONIC since process start). Many runtimes export timestamps in microseconds — keep consistent units.
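
One way to make that practical is to write a clock anchor at the start of every capture run and have merge_traces.py (or your own merger) shift each collector's timestamps against it. A minimal sketch (the anchor file name and output directory are illustrative; how exact the alignment is depends on which clock each collector actually uses):

# clock_anchor.py: record wall-clock and monotonic time at the start of a capture run
# so collectors that report relative or boot-based timestamps can be shifted onto one
# microsecond axis during merging.
import json
import time
from pathlib import Path

out = Path('/tmp/edge_trace')
out.mkdir(parents=True, exist_ok=True)

anchor = {
    'wall_time_unix_us': int(time.time() * 1e6),
    'monotonic_us': int(time.clock_gettime(time.CLOCK_MONOTONIC) * 1e6),
}
(out / 'clock_anchor.json').write_text(json.dumps(anchor))
print(anchor)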

Simple web UI: Flask timeline viewer

This minimal app serves the merged JSON and renders a horizontal timeline using plain HTML/CSS/JS. It's intentionally tiny so you can run it on-device and iterate.

# app.py
from flask import Flask, jsonify, render_template_string
from pathlib import Path
import json

app = Flask(__name__)

TEMPLATE = '''<!doctype html>
<title>Edge AI Trace Viewer</title>
<h1>Edge AI Trace Viewer</h1>
<div id="timeline"></div>
<script>
fetch('/data').then(r => r.json()).then(events => {
  const max = Math.max(1, ...events.map(e => e.dur || 0));
  for (const e of events) {
    const bar = document.createElement('div');
    bar.style.cssText = 'background:#4a90d9;color:#fff;margin:2px;padding:2px;width:'
                        + Math.max(1, 100 * (e.dur || 0) / max) + '%';
    bar.textContent = `${e.name} (${e.dur || 0} us)`;
    document.getElementById('timeline').appendChild(bar);
  }
});
</script>'''

@app.route('/')
def index():
    return render_template_string(TEMPLATE)

@app.route('/data')
def data():
    p = Path('/tmp/edge_trace/merged_timeline.json')
    if not p.exists():
        return jsonify([])
    return jsonify(json.loads(p.read_text()))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Run python3 app.py (or FLASK_APP=app.py flask run --host=0.0.0.0 --port=8080) and open http://<pi-hostname>:8080 to see a simple timeline. Replace the renderer with a flamegraph or speedscope integration when you need deeper UX.

Interpreting the UI — what to look for

  • Long operator bars: tie those to specific layers in your model; consider operator fusion or using a different delegate (NPU delegate vs CPU).
  • Interleaved small bars + many short syscalls: likely I/O blocking or frequent memory allocations — batch inputs or pre-allocate buffers.
  • High perf counters for cycles or instructions but low NPU time: check delegate binding; the model may be running on the CPU because the delegate isn't compatible (see the provider check after this list).
  • Rising memory baseline across runs: look into persistent allocations, caches, or memory held by third-party libraries.
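
For the delegate-binding point above, ONNX Runtime can report which execution providers a session actually bound. A minimal sketch (create the session the same way your application does; the accelerator's provider name depends on the vendor's runtime package):

# provider_check.py: confirm which execution providers a session bound.
import onnxruntime as ort

print('available providers:', ort.get_available_providers())

sess = ort.InferenceSession('model.onnx',
                            providers=ort.get_available_providers())  # mirror your app's setup
print('session bound to  :', sess.get_providers())

# If the session reports only ['CPUExecutionProvider'], the accelerator's execution
# provider / delegate was not picked up and inference is running on the CPU.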

Advanced strategies & automation (CI / regression detection)

Once you have reliable trace capture:

  • Automated regressions: store merged_timeline.json artifacts for each CI build, compute total inference duration and key operator durations, and fail the build if the regression exceeds X% (a minimal comparator sketch follows this list).
  • Threshold-based alerts: Use a tiny exporter that reads merged JSON and pushes a few metrics (p95 latency, memory RSS) to Prometheus Pushgateway or an APM.
  • Golden trace diffs: Implement a comparator that finds drift in operator ordering or duration — in 2026, semantic diffing of traces became a best practice for edge AI releases.
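
A minimal comparator sketch for the first point, assuming both builds archived the merged_timeline.json produced earlier (the default threshold and the 'onnx' category filter are illustrative):

# compare_runs.py: fail CI when total traced inference time regresses.
# Usage: python3 compare_runs.py baseline.json candidate.json [max_regression_pct]
import json
import sys
from pathlib import Path

def total_us(path):
    events = json.loads(Path(path).read_text())
    return sum(int(e.get('dur', 0)) for e in events if e.get('cat') == 'onnx')

baseline = total_us(sys.argv[1])
candidate = total_us(sys.argv[2])
threshold_pct = float(sys.argv[3]) if len(sys.argv) > 3 else 10.0

regression_pct = 100.0 * (candidate - baseline) / max(baseline, 1)
print(f'baseline={baseline} us candidate={candidate} us regression={regression_pct:.1f}%')

if regression_pct > threshold_pct:
    sys.exit(f'FAIL: regression {regression_pct:.1f}% exceeds {threshold_pct:.1f}%')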

Security & performance considerations

  • Running perf and bpftrace requires elevated privileges; prefer short-lived captures and restrict which hosts and operators may run them in production.
  • Sampling overhead: keep perf sampling rates modest (50–200 Hz) and run tests offline if possible. ONNX Runtime profiling can add overhead — avoid high-frequency profiling in production.
  • Data privacy: don't log input tensors that contain PII — strip or sample inputs before archiving traces.

Real-world case studies (short)

Example 1 — Image classification latency: a manufacturer saw 3× slower-than-expected inference on Pi 5 + AI HAT+ 2. Profiling showed the model had fallen back to the CPU delegate because the HAT delegate did not support one operator (Clip). Replacing Clip with a supported fused op reduced wall time by 65%.

Example 2 — Memory leak: periodic runs increased RSS by 40MB per 100 inferences. Tracemalloc revealed a persistent cache in third-party preprocessing — switching to a generator-based pipeline removed the leak.

Future-proofing your trace pipeline (2026+)

  • Normalize on standard trace event formats (Chrome Trace, speedscope, Perfetto) so you can switch viewers without retooling your parsers (a minimal Chrome Trace export sketch follows this list).
  • Adopt eBPF-based low-overhead exporters for continuous telemetry — late 2025–2026 saw eBPF become mainstream on small Linux kernels used in Pi images.
  • Integrate with model CI: run a small battery of trace checks on every model commit to catch delegate regressions early.
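
For the first point, the merged events above are already close to the Chrome Trace Event format. Here is a minimal sketch that wraps them into a file Perfetto or chrome://tracing can open directly (complete events use ph 'X' with microsecond ts and dur; the fixed pid/tid values are placeholders):

# to_chrome_trace.py: wrap merged events into Chrome Trace Event format so they
# open in the Perfetto UI or chrome://tracing without a custom viewer.
import json
from pathlib import Path

merged = json.loads(Path('/tmp/edge_trace/merged_timeline.json').read_text())

trace_events = [
    {
        'name': e.get('name', 'unknown'),
        'cat': e.get('cat', 'misc'),
        'ph': 'X',                      # complete event: start + duration
        'ts': int(e.get('ts', 0)),      # microseconds
        'dur': int(e.get('dur', 0)),    # microseconds
        'pid': 1,
        'tid': 1,
    }
    for e in merged
]

Path('/tmp/edge_trace/chrome_trace.json').write_text(
    json.dumps({'traceEvents': trace_events, 'displayTimeUnit': 'ms'}))
print('Wrote /tmp/edge_trace/chrome_trace.json')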

Actionable checklist — run this today

  1. Install perf, bpftrace, and FlameGraph on your Pi 5.
  2. Enable ONNX Runtime or TFLite profiling and run a baseline traced inference (save JSON).
  3. Run the capture_inference_run.sh wrapper to collect system counters and eBPF traces.
  4. Merge traces with merge_traces.py and launch the Flask UI to visualize hotspots.
  5. Add a CI job that runs one traced inference per commit and compares total inference time to the baseline.

Conclusion & next steps

Edge AI debugging on Raspberry Pi 5 with AI HAT+ 2 is no longer guesswork. By combining framework profilers, system counters (perf), and targeted eBPF probes, you get a clear map of where time and memory are spent. The scripts and minimal web UI above give you a reproducible pipeline you can run on-device or in CI.

Start small: capture one traced run and visualize it — you’ll quickly surface the highest-impact optimizations. In 2026, observability and automation are the difference between a flaky edge deployment and a stable, optimized inference pipeline.

Call to action

Clone the repo with the example scripts and UI, run a baseline trace on your Pi 5 + AI HAT+ 2, and share the merged JSON in your team’s issue tracker. Want a ready-made CI template or help adapting this to your model (TFLite, PyTorch, or a custom delegate)? Reply with your runtime and model type — I’ll provide a tailored capture script and CI job example.
