So you’ve built GCC 8.5, survived the make -j6 wait, and now you’re eyeing llama.cpp like it’s your ticket to local AI greatness on a $99 dev board.
With a few workarounds (read: Git checkout, compiler gymnastics, and a well-placed patch), you can run a quantized LLM locally on your Jetson Nano with CUDA 10.2, GCC 8.5, and a Prayer. Will it be fast? No. Will it be absurdly cool? Absolutely.
🎯 Why not the latest llama.cpp?
Short version: it’s too spicy for the Jetson.
The latest versions depend on newer CUDA features that just don’t exist in 10.2. You’ll hit errors you can’t fix without sacrificing your weekend.
So instead, we pin the project to a known working commit that plays nice with CUDA 10.2 and the Jetson’s ARM architecture.
🧰 Method 1: Building llama.cpp from Source
This is the pure, from-scratch approach. You’ll get the raw C++ binaries that you can run directly.
Step 1: Clone and checkout a compatible version
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
Note: This specific commit (a33e6a0) is known to work with CUDA 10.2 and Jetson hardware.
Step 2: Apply patches for CUDA 10.2 compatibility
To play nice with CUDA 10.2, you’ll need to apply a few code tweaks. The patches address NVCC compiler flags and ARM-specific optimizations:
👉 Full patch details: https://gist.github.com/FlorSanders/2cf043f7161f52aa4b18fb3a1ab6022f
Quick manual fixes:
- In the Makefile, change MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80 (lines 109 and 113)
- Remove MK_CXXFLAGS += -mcpu=native (line 302)
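If you'd rather script those edits than hand-edit the Makefile, a small helper can apply both tweaks. This is a sketch that assumes the Makefile layout at commit a33e6a0 — eyeball the result before building:

```python
# Sketch: apply the two manual Makefile tweaks for CUDA 10.2 / Jetson.
# Assumes the Makefile layout at llama.cpp commit a33e6a0 -- verify the output.

def patch_makefile(text: str) -> str:
    """Return Makefile text with the CUDA 10.2 / ARM tweaks applied."""
    lines = []
    for line in text.splitlines():
        # Swap the NVCC optimization flag for a register cap.
        if line.strip() == "MK_NVCCFLAGS += -O3":
            line = line.replace("-O3", "-maxrregcount=80")
        # Drop -mcpu=native, which GCC chokes on for this ARM target.
        if "MK_CXXFLAGS += -mcpu=native" in line:
            continue
        lines.append(line)
    return "\n".join(lines)

# Demo on a two-line sample; for the real thing, read and rewrite "Makefile".
sample = "MK_NVCCFLAGS += -O3\nMK_CXXFLAGS += -mcpu=native"
print(patch_makefile(sample))
```

From the llama.cpp checkout, you'd read the Makefile, pass it through patch_makefile, and write it back.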
Step 3: Build with CUDA + GCC 8.5
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
make clean
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_53 -j6
- LLAMA_CUBLAS=1 enables GPU acceleration via cuBLAS
- CUDA_DOCKER_ARCH=sm_53 targets the Jetson Nano’s Maxwell architecture
- -j6 uses multiple cores (adjust based on your patience)
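The sm_53 value comes from the Nano’s GPU compute capability, so the arch flag changes on other Jetson modules. A quick lookup table — values taken from NVIDIA’s published compute capabilities, so double-check against your exact board:

```python
# CUDA arch ("sm") values for common Jetson modules.
# From NVIDIA's compute-capability documentation; verify for your exact module.
JETSON_SM = {
    "Jetson Nano": "sm_53",       # Maxwell
    "Jetson TX1": "sm_53",        # Maxwell
    "Jetson TX2": "sm_62",        # Pascal
    "Jetson Xavier NX": "sm_72",  # Volta
    "Jetson AGX Xavier": "sm_72", # Volta
    "Jetson Orin": "sm_87",       # Ampere
}

def arch_flag(board: str) -> str:
    """Return the CUDA_DOCKER_ARCH value for a given Jetson board."""
    return JETSON_SM[board]

print(arch_flag("Jetson Nano"))  # → sm_53
```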
Step 4: Test with a model
# Download a small model
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# Run inference
./main -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello from Jetson Nano!" -n 128 -ngl 35 --color
🐍 Method 2: Using llama-cpp-python
If you prefer Python bindings and want to build web APIs or integrate with other Python code, llama-cpp-python is your friend. This is what I’m actually using in my Docker setup.
Why llama-cpp-python?
- Python ecosystem: Easy integration with FastAPI, Flask, or Jupyter notebooks
- OpenAI-compatible API: Drop-in replacement for OpenAI’s API endpoints
- Simpler deployment: Perfect for containerized environments
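Beyond the server, you can call the bindings directly from Python. A minimal sketch — the model path is whatever GGUF file you downloaded earlier (adjust it), and the import is done lazily so the file still parses without the package installed:

```python
def generate(prompt: str,
             model_path: str = "./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
             n_gpu_layers: int = 35) -> str:
    """Run one completion through llama-cpp-python's direct bindings."""
    # Imported inside the function so this sketch loads without llama-cpp-python.
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_ctx=1024,
                n_gpu_layers=n_gpu_layers, verbose=False)
    out = llm(prompt, max_tokens=64)
    return out["choices"][0]["text"]
```

With the model on disk, generate("Hello from Jetson Nano!") returns the completion text; n_gpu_layers=35 matches the offload setting used throughout this guide.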
Building llama-cpp-python v0.2.70
# Set environment for CUDA build
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
export CUDACXX=/usr/local/cuda/bin/nvcc
export CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_DOCKER_ARCH=sm_53"
export FORCE_CMAKE=1
# Install the Python package
pip install llama-cpp-python==0.2.70 --no-cache-dir --force-reinstall --upgrade
Version note: v0.2.70 of llama-cpp-python corresponds to a compatible version of the underlying llama.cpp library that works with CUDA 10.2.
Serve via OpenAI-compatible API
python3 -m llama_cpp.server \
--model ./tinyllama-1.1b-chat.q4_K_M.gguf \
--n_ctx 1024 \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8000
Visit: 📍 http://localhost:8000/docs for the interactive API documentation.
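Once the server is up, any OpenAI-style client can talk to it. A minimal sketch using only the standard library, hitting the /v1/completions endpoint on localhost:8000 as started above:

```python
import json
from urllib import request

def completion_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style /v1/completions request body."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7}

def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the prompt to the llama_cpp.server completions endpoint."""
    body = json.dumps(completion_payload(prompt)).encode()
    req = request.Request(base_url + "/v1/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, complete("Hello from Jetson Nano!") returns the generated text; swap base_url if you bound a different host or port.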
🐳 Docker: The Lazy Person’s Paradise
Too lazy to compile GCC from source and wrangle CUDA paths? I’ve got your back.
This Docker image comes pre-baked with:
- ✅ CUDA 10.2 preconfigured
- ✅ llama-cpp-python v0.2.70 installed
- ✅ All the patches and compatibility fixes applied
- ✅ Ready-to-go OpenAI-compatible server
docker run -it \
-p 8000:8000 \
acerbetti/l4t-jetpack-llama-cpp-python:latest \
/bin/bash -c \
'python3 -m llama_cpp.server \
--model $(huggingface-downloader TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf) \
--n_ctx 1024 \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8000'
This command:
- Downloads a quantized TinyLlama model from Hugging Face
- Starts a REST API that speaks OpenAI’s language
- Runs entirely on your Jetson Nano, no cloud required
🔧 Troubleshooting Tips
OOM Issues? Create a swap file — 4GB RAM isn’t much:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
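Not sure whether you need swap? Total RAM is readable straight from /proc/meminfo. A small sketch (Linux-only; values are in kB, as the kernel reports them):

```python
def meminfo_kb(text: str, key: str = "MemTotal") -> int:
    """Parse a field (in kB) out of /proc/meminfo-style text."""
    for line in text.splitlines():
        if line.startswith(key + ":"):
            return int(line.split()[1])
    raise KeyError(key)

def total_ram_gb() -> float:
    """Total system RAM in GB, read from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        return meminfo_kb(f.read()) / (1024 * 1024)
```

On a 4GB Nano, total_ram_gb() comes back a bit under 4 once the kernel takes its cut — tight enough that the swap file above is worth having before you load a model.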
CUDA not found? Set your paths:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Hi, thanks for the detailed guide!
While following the steps, I ran into an issue with this command:
git checkout tags/v0.2.70
Git responds with:
error: pathspec 'tags/v0.2.70' did not match any file(s) known to git.
After running git fetch --all --tags and checking the available tags with git tag, it seems that v0.2.70 doesn’t exist in the llama.cpp repository anymore. Could you please confirm the correct tag to use or update the instructions accordingly?
Thanks in advance!
Good catch, that tag was just a simple mistake on my end. I’ve updated the post with the correct version now. Thanks for flagging it!