Raspberry Pi 5 + AI HAT+ 2: Quickstart projects for local generative AI
Practical Pi 5 + AI HAT+ 2 quickstarts: set up, optimize, and run local LLMs for chat, image captioning, and code help.
Hook: stop fighting latency and cloud costs — run generative AI locally on your Raspberry Pi 5
If you’re a developer or IT pro tired of shipping data to the cloud for every prompt, the Raspberry Pi 5 plus the AI HAT+ 2 is a game-changer in 2026. This combination makes it practical to run small, useful generative AI models at the edge: low-latency chatbots, image captioners that respect privacy, and local code helpers for quick iterations.
What you’ll get from this guide (fast)
- Step-by-step setup of a Raspberry Pi 5 with an AI HAT+ 2
- How to build and optimize small LLMs for ARM/NEON
- Three runnable projects with Python examples: a local chatbot, an image captioner, and a code helper
- Advanced tips (quantization, memory tricks, hardware offload) and 2026 trends to keep you forward-compatible
2026 context: why this matters now
Edge AI in 2026 is no longer experimental. Two trends make Pi 5 + AI HAT+ 2 practical:
- Smaller, quantized models and gguf/ggml tooling: most community models now ship in the gguf format with 4-bit/8-bit quantized variants built for fast, low-memory inference.
- ARM/NEON and vendor runtime optimizations: LLM runtimes like llama.cpp and onnxruntime ship ARM NEON kernels, and vendor SDKs let NPUs and accelerators (like the AI HAT+ 2) handle matrix ops efficiently.
That means you can run 3–7B-parameter models (quantized) for real tasks on the Pi 5 with an AI HAT+ 2 at latencies useful for interactive tools.
Before you start: prerequisites
- Raspberry Pi 5 (4 GB or 8 GB recommended — 8 GB is best for bigger models)
- AI HAT+ 2 installed on the Pi 5 (driver/SDK access from vendor)
- 16–64 GB microSD or NVMe storage (models and swap files)
- Basic Linux skills (apt, make, Python)
- Familiarity with model licenses — verify local use license before deploying
Part A — Hardware & OS setup (10–20 minutes)
- Flash Raspberry Pi OS (64-bit). I recommend Raspberry Pi OS 64-bit or Ubuntu 24.04/26.04 server images updated to the latest kernel.
sudo apt update && sudo apt full-upgrade -y
sudo reboot
- Verify CPU and memory so you can choose compile flags:
lscpu
free -h
- Install build essentials and Python tooling:
sudo apt install -y build-essential cmake git python3 python3-pip python3-venv libblas-dev liblapack-dev libatlas-base-dev
python3 -m venv ~/venv && source ~/venv/bin/activate
pip install --upgrade pip
- Install AI HAT+ 2 drivers and SDK.
Follow the vendor installation guide. Typical steps are adding the vendor apt repository and installing the runtime. After installing, confirm the device is visible:
# Example checks after following vendor instructions
lsusb
dmesg | tail
# or vendor CLI
aihatsdk --status
Note: the SDK lets runtimes offload certain layers or provides a system-level acceleration plugin for inference engines. Keep the SDK up to date — vendors shipped major updates in late 2025 to improve quantized matrix ops.
Part B — Build a fast local LLM runtime (llama.cpp)
In 2026 the most pragmatic path on ARM is llama.cpp (and its Python bindings) because it supports gguf quantized models and runs compactly with NEON acceleration. We’ll compile it with ARM tuning.
Steps
# clone and compile llama.cpp with NEON and ARM tuning
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# tune flags for Pi 5 (Cortex-A76); -mcpu already selects the matching architecture, so a separate -march is unnecessary. If unsure, use generic -O3
make CFLAGS="-O3 -mcpu=cortex-a76+crypto"
Install the Python binding:
pip install --upgrade pip wheel setuptools
# the binding lives in the separate llama-cpp-python project rather than inside llama.cpp itself;
# on most ARM boards pip builds it from source, so expect a few minutes of compilation
pip install llama-cpp-python
# this exposes the 'llama_cpp' Python package
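Once you have a quantized gguf model (next section), a quick smoke test confirms the binding works before you wire up a server; the model path below is a placeholder:
# minimal llama_cpp smoke test (adjust the model path to your own gguf file)
from llama_cpp import Llama

llm = Llama(model_path="/path/to/model-q4_0.gguf", n_ctx=512, n_threads=4)
out = llm.create_completion(prompt="Say hello in one short sentence.", max_tokens=16)
print(out["choices"][0]["text"])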
Model preparation
Choose a small-to-medium model that allows local usage (check license). Popular community choices in 2026:
- 7B models quantized to q4_0 or q4_k_m (good tradeoff for accuracy vs memory)
- 3B or 4B models if you need lower memory and faster response times
Convert or download a gguf model. If you have a non-gguf file, convert it with the community conversion tools, then quantize with llama.cpp’s quantize tool:
# Example: convert/quantize to q4_0
# (replace source_model with your source path; newer llama.cpp builds name this binary llama-quantize)
./quantize ./source_model.gguf ./model-q4_0.gguf q4_0
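To sanity-check both the build and the quantized model, run a one-off generation from the CLI. The binary name and flags vary between llama.cpp versions, so adjust to what your build produced:
# one-off generation to confirm the model loads and decodes
# (the binary may be ./main or ./llama-cli depending on your llama.cpp version)
./main -m ./model-q4_0.gguf -p "Hello from my Pi 5:" -n 32 -t 4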
Tip: store models on fast storage (USB3 or NVMe) and make sure swap is available for peak tokens during decode.
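If you need extra headroom, a plain swap file on the fast drive is enough; the mount point and size below are examples, so adjust them to your storage:
# create an 8 GB swap file on fast storage (example path, adjust to your mount point)
sudo fallocate -l 8G /mnt/nvme/swapfile
sudo chmod 600 /mnt/nvme/swapfile
sudo mkswap /mnt/nvme/swapfile
sudo swapon /mnt/nvme/swapfile
# persist across reboots
echo '/mnt/nvme/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab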
Project 1 — Local chatbot (pi-chat)
A simple interactive assistant that runs wholly on the Pi. Use quantized model + llama.cpp Python binding for low-latency streaming.
Install dependencies
pip install llama-cpp-python fastapi uvicorn
chat_server.py (runnable; stripped for clarity)
from llama_cpp import Llama
from fastapi import FastAPI
import uvicorn
MODEL_PATH = "/path/to/model-q4_0.gguf"
llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_threads=4)
app = FastAPI()
@app.post('/chat')
async def chat(payload: dict):
    prompt = payload.get('prompt', '')
    # Basic system + user prompt template
    template = "System: You are a helpful assistant.\nUser: {user}\nAssistant:"
    full = template.format(user=prompt)
    out = llm.create_completion(prompt=full, max_tokens=256, temperature=0.7)
    return {'response': out['choices'][0]['text']}

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=8000)
How to run:
python3 chat_server.py
Test it with curl:
curl -s -X POST -H 'Content-Type: application/json' \
-d '{"prompt":"Explain tail recursion in 3 sentences."}' http://localhost:8000/chat
Optimization tips for the chatbot
- Quantize your model to q4 variants to stay within RAM limits.
- Compile for NEON, and set n_threads to match the Pi’s cores.
- Use a short system prompt and manage history size with a sliding window to limit context tokens (see the sketch after this list).
- If available, enable AI HAT+ 2 offload in the runtime (vendor SDK instructions) to accelerate matrix ops — typical effect: 2–5x decode speed improvements for quantized models.
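The sliding-window tip above can be sketched in a few lines: keep only the most recent turns so the rendered prompt stays under a budget. The characters-per-token estimate is a rough heuristic, not a real tokenizer:
# rough sliding-window history: drop oldest turns until the prompt fits a budget
# (character-based estimate only; swap in a real tokenizer for precise counts)
MAX_PROMPT_CHARS = 6000  # ~1500 tokens at ~4 chars/token, leaving room for the reply

def build_prompt(history, user_msg, system="You are a helpful assistant."):
    turns = history + [("User", user_msg)]
    while True:
        body = "\n".join(f"{role}: {text}" for role, text in turns)
        prompt = f"System: {system}\n{body}\nAssistant:"
        if len(prompt) <= MAX_PROMPT_CHARS or len(turns) <= 1:
            return prompt
        turns = turns[1:]  # drop the oldest turn and retry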
Project 2 — Image captioner (privacy-friendly)
The goal: take an image on-device, produce a short caption locally, then optionally feed that caption to the local LLM for refinement.
Strategy
Use a lightweight ONNX image encoder (MobileNet/ViT-tiny) to get image features, and a small text decoder (GPT-2 small) to generate captions. Convert both to ONNX and run via onnxruntime with CPU/NNAPI or vendor runtime. Then, optionally, pass the caption to the LLM to expand or add context.
Install onnxruntime and pillow
pip install onnxruntime onnxruntime-tools pillow numpy
caption.py (simplified demo)
from PIL import Image
import numpy as np
import onnxruntime as ort
# Replace with your converted models
ENCODER = 'mobilenet_v2_encoder.onnx'
DECODER = 'gpt2_decoder.onnx'
sess_enc = ort.InferenceSession(ENCODER, providers=['CPUExecutionProvider'])
sess_dec = ort.InferenceSession(DECODER, providers=['CPUExecutionProvider'])
def preprocess(img_path):
    img = Image.open(img_path).convert('RGB').resize((224, 224))
    arr = np.array(img).astype('float32') / 255.0
    arr = np.transpose(arr, (2, 0, 1))[None, :, :, :]
    return arr

def encode(img_path):
    x = preprocess(img_path)
    feats = sess_enc.run(None, {'input': x})[0]
    return feats

def decode(features):
    # This demo assumes the decoder accepts features and returns token ids; real model pipelines will vary
    out = sess_dec.run(None, {'features': features})[0]
    # convert tokens to text using the decoder tokenizer mapping (omitted in this stub)
    return ''

if __name__ == '__main__':
    feats = encode('photo.jpg')
    caption = decode(feats)
    print('Caption:', caption)
Practical path in 2026: many community caption models have ONNX conversions and int8 quantized artifacts that run efficiently on-device. If converting Hugging Face models to ONNX, use the Optimum toolchain and quantize to int8 for onnxruntime int8 execution.
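A hedged sketch of that path, using placeholder model IDs and output paths:
# export a Hugging Face model to ONNX (model id and output dir are placeholders)
optimum-cli export onnx --model distilgpt2 ./onnx_decoder/
Then quantize the exported graph to int8 with onnxruntime's dynamic quantizer:
# dynamic int8 quantization (paths are examples)
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="./onnx_decoder/model.onnx",
    model_output="./onnx_decoder/model-int8.onnx",
    weight_type=QuantType.QInt8,
)
Validate the quantized model's outputs against the fp32 version on a handful of inputs before committing to it.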
Refining caption with local LLM
After generating a raw caption, you can call the local LLM to produce a more descriptive or alternative-stylized caption:
caption = 'A person riding a bicycle on a tree-lined path.'
prompt = f"Refine this image caption into a friendly two-sentence description: \n" + caption
# call llama_cpp as before to refine
out = llm.create_completion(prompt=prompt, max_tokens=50)
print(out['choices'][0]['text'])
Project 3 — Local code helper (explain, fix, refactor)
Use a model to help you understand or refactor local code snippets. This is a very practical workflow for offline/code-sensitive work.
Principles
- Restrict context to the function or file to fit smaller context windows.
- Use deterministic decoding (low temperature) for reproducible suggestions.
- Use prompt templates that include language, libraries, and target style.
code_assist.py
from llama_cpp import Llama
import sys
MODEL_PATH = '/path/to/model-q4_0.gguf'
llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_threads=4)
def explain_code(code_snippet):
    prompt = (
        "You are a concise programming assistant. Explain the following code and list potential bugs.\n"
        "Code:\n" + code_snippet + "\n\nAnswer:\n"
    )
    out = llm.create_completion(prompt=prompt, max_tokens=200, temperature=0.0)
    return out['choices'][0]['text']

if __name__ == '__main__':
    with open(sys.argv[1], 'r') as f:
        code = f.read()
    print(explain_code(code))
Run with: python3 code_assist.py path/to/file.py
Advanced: code patch generation
To get a suggested patch, include the file context and a short instruction like "Refactor to use list comprehension and add type hints." Your assistant will return a diff or suggested new file contents. Always run CI/tests before committing automatically generated patches.
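One way to structure that workflow is to ask explicitly for a unified diff and validate it with git before applying anything. A minimal sketch, reusing the llm object from code_assist.py; the file path and instruction are examples:
# ask for a unified diff and check it with git before applying
import subprocess

def suggest_patch(path, instruction):
    with open(path) as f:
        source = f.read()
    prompt = (
        f"Return only a unified diff for the file {path}.\n"
        f"Instruction: {instruction}\n\nFile contents:\n{source}\n\nDiff:\n"
    )
    out = llm.create_completion(prompt=prompt, max_tokens=400, temperature=0.0)
    return out['choices'][0]['text']

diff = suggest_patch('utils.py', 'Refactor to use a list comprehension and add type hints.')
with open('suggested.patch', 'w') as f:
    f.write(diff)
# --check validates the patch without touching the working tree
subprocess.run(['git', 'apply', '--check', 'suggested.patch'])
Small local models often produce imperfect diffs, so treat the output as a starting point and keep the test-before-commit rule above.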
Optimization checklist — squeeze every cycle out of the Pi
- Quantize models (q4_0, q4_k_m, q8_0) — reduces memory and increases throughput.
- Compile runtimes for ARM/NEON (-march/ -mcpu flags). Test several flags (microbench) to find best for your Pi build.
- Use swap on fast storage (NVMe/SSD) for big contexts: create a swap file on the fast drive and tune it for performance.
- Threads & affinity: set n_threads in runtimes to match the Pi's cores and use taskset for pinning if background processes interfere (example after this list).
- Use float16 if supported: fp16 kernels reduce memory bandwidth where hardware supports it.
- Offload to AI HAT+ 2: use the vendor SDK or runtime plugin to accelerate GEMM/MatMul; test both CPU-only and offload modes and measure latency.
- Batch requests for throughput-oriented services; use streaming for interactive apps.
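For the affinity tip above, a minimal example; leaving one core free for system housekeeping is a judgment call, so measure with and without pinning:
# pin the chat server to three cores and leave core 0 for background tasks
# (match n_threads in the script to the number of pinned cores)
taskset -c 1-3 python3 chat_server.py
# inspect the affinity of a running process
taskset -cp $(pgrep -f chat_server.py)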
Security, licensing and production readiness
- License check — always verify the model’s license before local deployment. In 2026 more models are distributed under explicit terms for local/offline use, but some still restrict commercial use.
- Data governance — running locally keeps data private, but secure the device (local firewall, disable unnecessary services) and audit any network calls from your apps.
- Monitoring and fallbacks — track inference latency and memory; when a request would blow memory, return a helpful error and fallback to a smaller model or cloud service if appropriate.
Troubleshooting & performance sanity checks
- No model load? Verify llama_cpp.Llama can open the model and paths are correct.
- OOM at decode? Reduce n_ctx, quantize further, or move the model to faster storage and increase swap.
- Slow decoding? Try fewer threads, different compile flags, or enable AI HAT+ 2 offload if installed.
- Incorrect outputs? Use lower temperature and better prompts; run a small suite of unit prompts to validate model behavior.
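That "small suite of unit prompts" can be as simple as a keyword check over a few known-answer prompts. A rough sketch, not a rigorous eval; the prompts, keywords, and model path are examples:
# tiny prompt regression suite: deterministic decoding plus keyword checks
from llama_cpp import Llama

llm = Llama(model_path="/path/to/model-q4_0.gguf", n_ctx=1024, n_threads=4)

CASES = [
    ("What is the capital of France? Answer in one word.", "paris"),
    ("What does CPU stand for?", "central"),
]

for prompt, keyword in CASES:
    out = llm.create_completion(prompt=prompt, max_tokens=32, temperature=0.0)
    text = out["choices"][0]["text"].lower()
    status = "PASS" if keyword in text else "FAIL"
    print(f"[{status}] {prompt!r} -> {text.strip()!r}")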
Benchmarks and expectations (realistic)
Exact latency depends on model size, quantization, and whether AI HAT+ 2 offload is available. As of late 2025 and into 2026, community results show:
- Quantized 3B models: sub-second token latency on Pi 5 (multi-threaded with NEON)
- Quantized 7B models: 1–3 tokens/sec on CPU-only Pi 5; with AI HAT+ 2 or vendor NPU offload, 2–5x improvements are typical
- Image encoder + small decoder pipelines: sub-1s feature extraction for small images on optimized ONNX models
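To see where your own setup lands, here is a minimal throughput check with the Python binding; it relies on the OpenAI-style usage field that llama-cpp-python returns, and the model path and prompt are placeholders:
# measure decode throughput (tokens/second) for one fixed-length generation
import time
from llama_cpp import Llama

llm = Llama(model_path="/path/to/model-q4_0.gguf", n_ctx=1024, n_threads=4)

start = time.time()
out = llm.create_completion(prompt="Write a short paragraph about edge AI.",
                            max_tokens=128, temperature=0.7)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/sec")
Run it once to warm the model into the page cache, then a few more times and average the results.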
Future-proofing: 2026 and next steps
Follow these to stay current:
- Watch for growing support for gguf and improved quantization tools (gptq updates in 2025–2026 made smaller models more accurate when quantized).
- Keep vendor AI HAT+ 2 SDKs updated — expect new kernels and reduced memory pressure in 2026 releases.
- Edge distillation and LoRA/PEFT-style adapters will let you specialize models on-device with a few MBs of parameter deltas — experiment with tiny adapters for domain-specific behavior.
- Look at WebRTC and WebGPU integrations for low-latency web UIs served from Pi-based LLMs.
Pro tip: Start with a 3B quantized model for prototyping — it gives a fast feedback loop. Move to quantized 7B only if you need the extra reasoning power and you can measure acceptable latency.
Actionable takeaways (quick checklist)
- Flash 64-bit OS and install AI HAT+ 2 SDK per vendor instructions.
- Build llama.cpp with ARM/NEON optimizations and install Python bindings.
- Download/convert a gguf model, quantize to q4 variant, and test with a simple chat script.
- For vision tasks, convert a small encoder/decoder to ONNX and run with onnxruntime + int8 quantization.
- Measure, tune threads and swap, and enable vendor offload to maximize throughput.
Closing: what to build next (ideas)
- Offline home automation assistant with local voice-to-text + Pi LLM for control
- Secure on-premise code review assistant integrated into Git hooks
- Edge camera caption + metadata pipeline for privacy-sensitive environments (deploy on-site with no cloud hops)
Call to action
Ready to build one of these projects? Try the chatbot and share your benchmark: what model, quantization level, and latency did you get on your Pi 5 + AI HAT+ 2? Post your results in the codenscripts community or subscribe for walkthroughs that include optimized build scripts, pre-quantized models, and CI-ready deployment examples.