Edge inference recipes: Running Llama.cpp and ONNX models on the AI HAT+ 2

codenscripts
2026-01-24 12:00:00
10 min read

Optimized commands, compile flags, and runtime scripts to run Llama.cpp and ONNX on Raspberry Pi 5 + AI HAT+ 2—practical, reproducible, 2026-ready.

Stop reinventing the wheel: run Llama.cpp and ONNX models fast on Raspberry Pi 5 + AI HAT+ 2

You bought a Raspberry Pi 5 and the AI HAT+ 2 to run local LLMs at the edge, but your experimentation is slowed by long compile cycles, poor throughput, and confusing runtime settings. This guide gives a compact, battle-tested set of build flags, shell scripts, and runtime recipes to get Llama.cpp (ggml-based) and ONNX models performing well with the AI HAT+ 2 accelerator on the Raspberry Pi 5.

Quick TL;DR — What works best in 2026

  • Llama.cpp (quantized GGUF models) = easiest, lowest-latency option on CPU. Use NEON-tuned ARM compile flags and pinned thread affinity to saturate the Pi 5's four Cortex-A76 cores.
  • ONNX + AI HAT+ 2 = best peak throughput and power efficiency for supported operators. Use the vendor-provided execution provider (EP) from the AI HAT+ 2 SDK and ONNX Runtime (2025/2026 builds add improved small-batch NPU kernels).
  • Combine per-process environment tuning (OMP/GOMP/KMP variables and CPU affinity), model quantization (int8/int4), and KV-cache reuse to unlock real-time inference.

Assumptions & prerequisites

  • You have a Raspberry Pi 5 running 64-bit Linux (Raspberry Pi OS Bookworm or Ubuntu 24.04 or newer, aarch64).
  • AI HAT+ 2 is attached and you installed the vendor SDK/drivers per the vendor docs (we’ll call the vendor-provided ONNX EP the AI_HAT_EP below).
  • Familiarity with building C/C++ repos and Python virtual environments.

The evolution in 2026 — why these patterns matter

By late 2025 and into 2026, two clear trends shaped edge inference:

  • Wider adoption of hardware-specific ONNX execution providers — vendors ship EPs tuned for small-batch LLM kernels and quantized arithmetic; that favors ONNX on edge NPUs and ties into broader edge orchestration patterns.
  • Software-first quantization and ggml tooling — projects like Llama.cpp and optimized ONNX exporters improved small model quantization (int8/int4) and fused kernels for edge CPUs.

Section 1 — Llama.cpp: compile and runtime recipes for Pi 5 CPU

Llama.cpp (built on ggml) remains a go-to for local LLMs because it requires no vendor runtime and supports heavily quantized models. On the Pi 5, your wins come from good compile flags, thread affinity, and a sensible quantization level.

Best compile flags (aarch64 / Cortex-A76)

Use these flags as a baseline tuned for the Pi 5's Cortex-A76 cores. Replace clang with gcc if needed.

export CC=clang
export CXX=clang++
export CFLAGS="-O3 -march=armv8.2-a -mcpu=cortex-a76 -mtune=cortex-a76 -ffast-math -funroll-loops -fomit-frame-pointer -fdata-sections -ffunction-sections -fPIC -g0"
export CXXFLAGS="$CFLAGS"
export LDFLAGS="-Wl,--gc-sections -static-libgcc -static-libstdc++"

Notes:

  • -march=armv8.2-a and -mcpu=cortex-a76 tune for the Pi 5's Cortex-A76 cores (a quick way to verify what the cores report is shown below). If you build on the Pi and only run there, -mcpu=native is a convenient shortcut; for binaries that must run on other ARM boards, drop the CPU-specific flags and use a generic -march.
  • -ffast-math trades strict FP compliance for speed — acceptable for inference in many LLM use cases but test for numerical stability.
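Before committing to CPU-specific flags, it helps to confirm what the SoC actually reports; both commands below are standard on Raspberry Pi OS and Ubuntu:

# Kernel-reported CPU features (look for asimd, asimddp, fphp, etc.)
grep -m1 Features /proc/cpuinfo

# Core model (the Pi 5 reports Cortex-A76)
lscpu | grep -E 'Vendor ID|Model name'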

Example build script for Llama.cpp (bash)

#!/usr/bin/env bash
set -euo pipefail

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

export CC=clang
export CXX=clang++
export CFLAGS='-O3 -march=armv8.2-a -mcpu=cortex-a76 -ffast-math -funroll-loops -fomit-frame-pointer'
export CXXFLAGS="$CFLAGS"

# Upstream builds via CMake; binary names and paths may differ if upstream changes again
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4

echo "Build finished: ./build/bin/llama-cli"
  

Runtime tuning — threads & affinity

Tune these environment variables for predictable performance. The Pi 5 has four Cortex-A76 cores; pin the inference threads explicitly and keep lower-priority tasks off the cores doing the heavy lifting.

export OMP_NUM_THREADS=4
export GOMP_CPU_AFFINITY="0-3"
# KMP_* is honored by LLVM's libomp (clang builds); GCC's libgomp uses the GOMP_* variables
export KMP_AFFINITY="granularity=fine,compact,1,0"
# Run the binary
./build/bin/llama-cli -m models/ggml-model-q4_0.gguf -p "Hello" -t 4

Tips:

  • Match the process thread count (-t / --threads) to OMP_NUM_THREADS.
  • If you run other services on the Pi, reserve a core for system tasks (for example cores 0-2 for the model and core 3 for the system; see the taskset sketch below).
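One way to enforce that split is plain taskset; a minimal sketch using the binary and model names from above:

# Pin the inference process to cores 0-2 and leave core 3 for the OS
taskset -c 0-2 ./build/bin/llama-cli -m models/ggml-model-q4_0.gguf -p "Hello" -t 3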

Quantized models and conversion

Use quantized GGUF variants (Q4_0, Q4_K_M, Q5_K_S, Q8_0) to reduce memory and improve throughput. The community maintains conversion and quantization tools; the safe pattern is:

  1. Start from an HF checkpoint or a PyTorch/PEFT export.
  2. Produce a float16 intermediate: ONNX for the ONNX Runtime / AI HAT path, or GGUF (via llama.cpp's convert_hf_to_gguf.py) for the CPU path.
  3. Quantize (int8 for ONNX, Q4/Q5 for GGUF) and validate outputs on a small test set.

Example: export to ONNX for ONNX Runtime, and convert/quantize to GGUF for llama.cpp (pseudo-commands, adapt per tooling):

# ONNX path (for ONNX Runtime / AI HAT): export from PyTorch (opset 16 recommended)
python3 export_to_onnx.py --model hf-model --opset 16 --out model.onnx

# Optional: dynamic int8 quantization with ONNX Runtime's Python API (see snippet below)

# GGUF path (for llama.cpp): convert the HF checkpoint, then quantize
python3 convert_hf_to_gguf.py ./hf-model --outtype f16 --outfile models/ggml-model-f16.gguf
./build/bin/llama-quantize models/ggml-model-f16.gguf models/ggml-model-q4_0.gguf Q4_0
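If you use ONNX Runtime for the int8 step, the supported path is the onnxruntime.quantization Python API rather than a standalone CLI module; a minimal sketch using the file names above:

# quantize_int8.py: dynamic int8 quantization with ONNX Runtime
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,  # int8 weights; activations are quantized dynamically at runtime
)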
  

Section 2 — ONNX Runtime + AI HAT+ 2: vendor EP + tuning

When you want to offload supported operators to the AI HAT+ 2, ONNX Runtime is the right integration point. The vendor usually ships an execution provider (EP) that you add to ONNX Runtime builds or load at runtime. The recipe below shows how to wire it up and tune the runtime for Pi 5.

Check available ONNX Runtime providers

python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"

After installing the AI HAT+ 2 SDK, you should see the provider name returned (we’ll use AI_HAT_EP as a placeholder).

Python example: create a session using the AI HAT EP

import numpy as np
import onnxruntime as ort

providers = ["AI_HAT_EP", "CPUExecutionProvider"]
sess = ort.InferenceSession("model_int8.onnx", providers=providers)

# Dummy input purely for illustration; match your model's real input shape and dtype
input_array = np.ones((1, 8), dtype=np.int64)

inputs = {sess.get_inputs()[0].name: input_array}
outputs = sess.run(None, inputs)
print(outputs[0].shape)

Notes:

  • Place the AI_HAT_EP before CPUExecutionProvider so supported ops run on the accelerator.
  • Check the SDK for provider-specific session options (memory pools, batch sizes, quantization preference); an example of passing such options is sketched below.
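Provider-specific options are passed as a dict next to the provider name; the keys below are placeholders, so take the real ones from the AI HAT+ 2 SDK documentation:

import onnxruntime as ort

# (provider_name, options_dict) tuples carry EP-specific settings;
# "memory_pool_mb" and "preferred_precision" are hypothetical keys; check the SDK.
providers = [
    ("AI_HAT_EP", {"memory_pool_mb": 512, "preferred_precision": "int8"}),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("model_int8.onnx", providers=providers)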

ONNX Runtime build & pip installation tips (aarch64)

If a vendor-supplied wheel is provided, prefer that. If you must build ONNX Runtime from source to include the AI HAT EP, follow the vendor SDK instructions. A typical aarch64 build that produces a Python wheel looks like:

./build.sh --config Release --build_dir build --parallel --build_wheel

Common pitfalls:

  • Mismatch between the EP binary ABI and the ORT build (both must be built for aarch64).
  • Missing vendor shared libraries (LD_LIBRARY_PATH must include the SDK lib dir; see the example below).
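A typical fix for the second pitfall, assuming the SDK installs under /opt/ai-hat (adjust the path to wherever the vendor package actually lands):

# Make the vendor EP's shared libraries visible to ONNX Runtime
export LD_LIBRARY_PATH=/opt/ai-hat/lib:${LD_LIBRARY_PATH:-}
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"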

Runtime env tuning for ONNX Runtime

These variables control OpenMP threading and CPU affinity for the parts of the stack that honor them; ONNX Runtime's own thread pools are configured per session via SessionOptions (see the snippet below). Use them as a starting point:

export OMP_NUM_THREADS=2
export GOMP_CPU_AFFINITY="0-3"
# If the vendor exposes an env var for NPU thread pools, tweak as documented
export AI_HAT_NPU_THREADS=2

python3 run_onnx_inference.py

Hint: use ORT profiling to identify slow operators. Then check whether those ops are supported by the AI_HAT_EP; missing ops fall back to CPU and often become the bottleneck.
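ONNX Runtime's thread pools and profiler are configured per session rather than through environment variables; a minimal sketch:

import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 2   # threads used inside individual operators
so.inter_op_num_threads = 1   # threads used to run independent graph nodes in parallel
so.enable_profiling = True    # writes a JSON trace with per-operator timings

sess = ort.InferenceSession("model_int8.onnx", sess_options=so,
                            providers=["AI_HAT_EP", "CPUExecutionProvider"])
# ... run a few inferences ...
print("profile written to", sess.end_profiling())  # returns the trace filename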

Benchmark script: measure latency and peak memory

#!/usr/bin/env bash
set -euo pipefail
MODEL=$1
REPS=${2:-50}

# Delegate to a small Python harness (bench_onnx.py, sketched below)
python3 bench_onnx.py --model "$MODEL" --reps "$REPS"
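A minimal bench_onnx.py to pair with the wrapper above; the dummy input shape and dtype are assumptions, so adapt them to your model's real signature:

# bench_onnx.py: report average latency and peak RSS for an ONNX model
import argparse, resource, time
import numpy as np
import onnxruntime as ort

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--reps", type=int, default=50)
args = parser.parse_args()

sess = ort.InferenceSession(args.model, providers=["AI_HAT_EP", "CPUExecutionProvider"])
name = sess.get_inputs()[0].name
dummy = np.ones((1, 8), dtype=np.int64)  # placeholder; match your model's input

latencies = []
for _ in range(args.reps):
    t0 = time.perf_counter()
    sess.run(None, {name: dummy})
    latencies.append(time.perf_counter() - t0)

print(f"avg latency: {1000 * sum(latencies) / len(latencies):.2f} ms over {args.reps} runs")
print(f"peak RSS: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.1f} MB")  # ru_maxrss is KB on Linux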

If you want more systematic comparisons, measure your inference against a cloud or host baseline (see NextStream Cloud Platform benchmarks) and record thermal/CPU counters so you can compare run-to-run.

Section 3 — Advanced strategies and trade-offs

1) Operator coverage & mixed execution

Most vendor EPs specialize in dense matmul, layernorm and fused attention ops. If an operator isn’t supported, ONNX Runtime falls back to CPU. To minimize fallback overhead:

  • Use an ONNX graph optimizer (onnx-simplifier / onnxoptimizer) to fold constants and fuse operations into supported patterns; a one-line example follows this list.
  • Prefer export flows that map attention and RMSNorm to fused kernels (check your exporter flags: use use_packed_kv, fuse_attention where available).
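The graph-cleanup step is usually a one-liner; assuming onnx-simplifier is installed (pip install onnxsim):

# Constant-fold and simplify the graph so more nodes match patterns the EP supports
onnxsim model.onnx model_sim.onnx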

2) Quantization choices

Quantization reduces memory and improves throughput but can reduce accuracy. Practical approach:

  • Start with dynamic int8 for ONNX models — minimal accuracy regression and simpler to apply.
  • If the AI HAT+ 2 EP supports int8 kernels, test int8 paths and validate end-to-end on representative prompts.
  • For Llama.cpp/GGUF, use the Q4/Q5 families; they give the best latency/memory trade-offs on the Pi 5.

3) Token streaming & caching past keys

To maintain low latency when generating tokens, keep past-key-value caches on the accelerator when possible. If the EP doesn’t support persistent per-request state, keep past keys in process memory and only run the attention update ops on the accelerator.

4) Power, cooling, and PCIe bandwidth

The AI HAT+ 2 hangs off the Pi 5's single exposed PCIe lane and shares its thermal envelope. Ensure:

  • Adequate cooling (an active cooler or equivalent); thermal throttling will cripple throughput. Quick checks are shown below.
  • A quality USB-C power supply; the official 27 W (5 V / 5 A) Raspberry Pi 5 PSU is the safe choice, especially with the HAT and other accessories attached.
  • If the HAT+ 2 exposes NVMe or other high-bandwidth storage, put swap/working files on that device, not the SD card.
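On Raspberry Pi OS, the firmware exposes quick health checks that are worth logging alongside benchmark runs:

# 0x0 means no throttling; non-zero bits indicate under-voltage or thermal throttling
vcgencmd get_throttled

# Current SoC temperature
vcgencmd measure_temp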

Section 4 — Reusable scripts & snippets

1) Start/stop wrapper (bash) that enforces affinity and logs perf

#!/usr/bin/env bash
MODEL=$1
LOG=${2:-infer.log}

export OMP_NUM_THREADS=3
export GOMP_CPU_AFFINITY="0-2"
export AI_HAT_NPU_THREADS=2

python3 run_onnx_service.py --model "$MODEL" 2>&1 | tee "$LOG"

2) Minimal FastAPI service (Python) using ONNX Runtime

from fastapi import FastAPI
import onnxruntime as ort
import numpy as np

app = FastAPI()
sess = ort.InferenceSession('model_int8.onnx', providers=['AI_HAT_EP','CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name

@app.post('/infer')
async def infer(payload: dict):
    tokens = np.array(payload['tokens'], dtype=np.int64)
    out = sess.run(None, {input_name: tokens})
    return {'output': out[0].tolist()}
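Assuming the snippet above is saved as service.py (the filename is arbitrary), it can be served locally with uvicorn:

# Install dependencies and serve the API on localhost
pip install fastapi uvicorn onnxruntime numpy
uvicorn service:app --host 127.0.0.1 --port 8000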

If you prefer a TypeScript-backed microservice or want to auto-generate a proxy, see patterns for small micro-apps and TypeScript automation (example toolchain: From ChatGPT prompt to TypeScript micro app).

3) Node.js proxy (simple wrapper around the Python backend; add SSE or chunked responses if you need true token streaming)

// Simplest fetch wrapper; Node 18+ ships a global fetch, node-fetch is only needed on older runtimes
const fetch = globalThis.fetch ?? require('node-fetch');

async function ask(tokens) {
  const resp = await fetch('http://127.0.0.1:8000/infer', {
    method: 'POST',
    headers: {'Content-Type':'application/json'},
    body: JSON.stringify({tokens})
  });
  return resp.json();
}

module.exports = { ask };

Section 5 — Troubleshooting checklist

  • No provider listed by ort.get_available_providers(): ensure the AI HAT SDK's lib folder is in LD_LIBRARY_PATH and the EP wheel matches aarch64 ORT build.
  • Large fallback ops on CPU: run ORT profiling, inspect the graph, and try ONNX graph optimization or a different exporter configuration. Also consider observability approaches for preprod inference to capture operator-level timing (modern observability).
  • Thermal throttling: check /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq and dmesg; increase cooling or decrease CPU frequency targets for stability (see electrical ops & cooling guidance).
  • Model memory OOM: use a smaller quantized model or offload to swap on NVMe, never the SD card (swapfile sketch below).
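Creating swap on the NVMe drive takes a few commands; the mount point /mnt/nvme and the 4 GB size are assumptions, so adjust for your setup:

# Create and enable a swapfile on the NVMe drive instead of the SD card
sudo fallocate -l 4G /mnt/nvme/swapfile
sudo chmod 600 /mnt/nvme/swapfile
sudo mkswap /mnt/nvme/swapfile
sudo swapon /mnt/nvme/swapfile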

Practical examples and one-minute recipes

1-minute: run a quantized ggml model

# Build llama.cpp (see the earlier build script)
# Assume ./build/bin/llama-cli exists and the model is models/ggml-model-q4_0.gguf
export OMP_NUM_THREADS=3
./build/bin/llama-cli -m models/ggml-model-q4_0.gguf -p "Translate to French: Hello" -t 3

1-minute: run ONNX model with AI HAT EP

python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
# If AI_HAT_EP is listed
python3 run_onnx_inference.py --model model_int8.onnx

2026 predictions & final architecture advice

Across 2026 I expect the following to be practical truths for edge LLMs on devices like Raspberry Pi 5 + AI HAT+ 2:

  • Vendor EPs will continue to improve operator coverage for small-batch LLMs — expect better out-of-the-box ONNX performance by late 2026.
  • Hybrid deployments will be common: small quantized ggml models on CPU for instant responses, with larger ONNX models on NPUs for heavy-duty tasks.
  • Standardization (ONNX metadata, quantization schema) will ease model portability among edge NPUs.

Takeaways — what to do next

  • Start with a quantized GGUF model in Llama.cpp for the fastest time-to-result; tune OMP/GOMP affinity and use -O3 compile flags.
  • If you need throughput/power efficiency, integrate ONNX Runtime + AI HAT EP and focus on operator coverage and graph fusion.
  • Automate builds and benchmarking: keep reproducible scripts and a perf harness that records latency, memory, and temperature (for example, tie your perf notes to a reproducible runbook and diagram your workflows for easier audits: making diagrams resilient).

Actionable: clone a clean repo, add the exact environment variables from this article to your start script, and run the benchmark script to create a baseline before further tuning.

Call-to-action

Try these recipes today: clone your preferred model, run the provided build and runtime scripts, and post your results back to the community for reproducible comparisons. If you want a ready-made toolkit, visit codenscripts.com to download a curated script library for Pi 5 + AI HAT+ 2 (Llama.cpp build scripts, ONNX run templates, and quantization helpers) — fork, adapt, and contribute your optimizations so everyone benefits.


Related Topics

#raspberry-pi #performance #scripts

codenscripts

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
