So you’ve built GCC 8.5, survived the make -j6 wait, and now you’re eyeing llama.cpp like it’s your ticket to local AI greatness on a $99 dev board.
With a few workarounds (read: Git checkout, compiler gymnastics, and a well-placed patch), you can run a quantized LLM locally on your Jetson Nano with CUDA 10.2, GCC 8.5, and a Prayer. Will it be fast? No. Will it be absurdly cool? Absolutely.
🎯 Why not the latest llama.cpp?
Short version: it’s too spicy for the Jetson.
The latest versions depend on newer CUDA features that just don’t exist in 10.2. You’ll hit errors you can’t fix without sacrificing your weekend.
So instead, we pin the project to a known working commit that plays nice with CUDA 10.2 and the Jetson’s ARM architecture.
🧰 Method 1: Building llama.cpp from Source
This is the pure, from-scratch approach. You’ll get the raw C++ binaries that you can run directly.
Step 1: Clone and checkout a compatible version
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
Note: This specific commit (a33e6a0) is known to work with CUDA 10.2 and Jetson hardware.
Step 2: Apply patches for CUDA 10.2 compatibility
To play nice with CUDA 10.2, you’ll need to apply a few code tweaks. The patches address NVCC compiler flags and ARM-specific optimizations:
👉 Full patch details: https://gist.github.com/FlorSanders/2cf043f7161f52aa4b18fb3a1ab6022f
Quick manual fixes:
- In the Makefile, change MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80 (lines 109 and 113)
- Remove MK_CXXFLAGS += -mcpu=native (line 302)
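If you'd rather script those edits than hand-edit the Makefile, a small helper can apply both tweaks. This is a sketch that assumes the Makefile layout at commit a33e6a0 — eyeball the result before building:

```python
# Sketch: apply the two manual Makefile tweaks for CUDA 10.2 / Jetson.
# Assumes the Makefile layout at llama.cpp commit a33e6a0 -- verify the output.

def patch_makefile(text: str) -> str:
    """Return Makefile text with the CUDA 10.2 / ARM tweaks applied."""
    lines = []
    for line in text.splitlines():
        # Swap the NVCC optimization flag for a register cap.
        if line.strip() == "MK_NVCCFLAGS += -O3":
            line = line.replace("-O3", "-maxrregcount=80")
        # Drop -mcpu=native, which GCC chokes on for this ARM target.
        if "MK_CXXFLAGS += -mcpu=native" in line:
            continue
        lines.append(line)
    return "\n".join(lines)

# Demo on a two-line sample; for the real thing, read and rewrite "Makefile".
sample = "MK_NVCCFLAGS += -O3\nMK_CXXFLAGS += -mcpu=native"
print(patch_makefile(sample))
```

From the llama.cpp checkout, you'd read the Makefile, pass it through patch_makefile, and write it back.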
Step 3: Build with CUDA + GCC 8.5
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
make clean
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_53 -j6
- LLAMA_CUBLAS=1 enables GPU acceleration via cuBLAS
- CUDA_DOCKER_ARCH=sm_53 targets the Jetson Nano’s Maxwell architecture
- -j6 uses multiple cores (adjust based on your patience)
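The sm_53 value comes from the Nano’s GPU compute capability, so the arch flag changes on other Jetson modules. A quick lookup table — values taken from NVIDIA’s published compute capabilities, so double-check against your exact board:

```python
# CUDA arch ("sm") values for common Jetson modules.
# From NVIDIA's compute-capability documentation; verify for your exact module.
JETSON_SM = {
    "Jetson Nano": "sm_53",       # Maxwell
    "Jetson TX1": "sm_53",        # Maxwell
    "Jetson TX2": "sm_62",        # Pascal
    "Jetson Xavier NX": "sm_72",  # Volta
    "Jetson AGX Xavier": "sm_72", # Volta
    "Jetson Orin": "sm_87",       # Ampere
}

def arch_flag(board: str) -> str:
    """Return the CUDA_DOCKER_ARCH value for a given Jetson board."""
    return JETSON_SM[board]

print(arch_flag("Jetson Nano"))  # → sm_53
```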
Step 4: Test with a model
# Download a small model
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# Run inference
./main -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello from Jetson Nano!" -n 128 -ngl 35 --color
🐍 Method 2: Using llama-cpp-python
If you prefer Python bindings and want to build web APIs or integrate with other Python code, llama-cpp-python is your friend. This is what I’m actually using in my Docker setup.
Why llama-cpp-python?
- Python ecosystem: Easy integration with FastAPI, Flask, or Jupyter notebooks
- OpenAI-compatible API: Drop-in replacement for OpenAI’s API endpoints
- Simpler deployment: Perfect for containerized environments
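Beyond the server, you can call the bindings directly from Python. A minimal sketch — the model path is whatever GGUF file you downloaded earlier (adjust it), and the import is done lazily so the file still parses without the package installed:

```python
def generate(prompt: str,
             model_path: str = "./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
             n_gpu_layers: int = 35) -> str:
    """Run one completion through llama-cpp-python's direct bindings."""
    # Imported inside the function so this sketch loads without llama-cpp-python.
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_ctx=1024,
                n_gpu_layers=n_gpu_layers, verbose=False)
    out = llm(prompt, max_tokens=64)
    return out["choices"][0]["text"]
```

With the model on disk, generate("Hello from Jetson Nano!") returns the completion text; n_gpu_layers=35 matches the offload setting used throughout this guide.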
Building llama-cpp-python v0.2.70
# Set environment for CUDA build
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
export CUDACXX=/usr/local/cuda/bin/nvcc
export CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_DOCKER_ARCH=sm_53"
export FORCE_CMAKE=1
# Install the Python package
pip install llama-cpp-python==0.2.70 --no-cache-dir --force-reinstall --upgrade
Version note: v0.2.70 of llama-cpp-python corresponds to a compatible version of the underlying llama.cpp library that works with CUDA 10.2.
Serve via OpenAI-compatible API
python3 -m llama_cpp.server \
--model ./tinyllama-1.1b-chat.q4_K_M.gguf \
--n_ctx 1024 \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8000
Visit: 📍 http://localhost:8000/docs for the interactive API documentation.
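Once the server is up, any OpenAI-style client can talk to it. A minimal sketch using only the standard library, hitting the /v1/completions endpoint on localhost:8000 as started above:

```python
import json
from urllib import request

def completion_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style /v1/completions request body."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7}

def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the prompt to the llama_cpp.server completions endpoint."""
    body = json.dumps(completion_payload(prompt)).encode()
    req = request.Request(base_url + "/v1/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, complete("Hello from Jetson Nano!") returns the generated text; swap base_url if you bound a different host or port.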
🐳 Docker: The Lazy Person’s Paradise
Too lazy to compile GCC from source and wrangle CUDA paths? I’ve got your back.
This Docker image comes pre-baked with:
- ✅ CUDA 10.2 preconfigured
- ✅ llama-cpp-python v0.2.70 installed
- ✅ All the patches and compatibility fixes applied
- ✅ Ready-to-go OpenAI-compatible server
docker run -it \
-p 8000:8000 \
acerbetti/l4t-jetpack-llama-cpp-python:latest \
/bin/bash -c \
'python3 -m llama_cpp.server \
--model $(huggingface-downloader TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf) \
--n_ctx 1024 \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8000'
This command:
- Downloads a quantized TinyLlama model from Hugging Face
- Starts a REST API that speaks OpenAI’s language
- Runs entirely on your Jetson Nano, no cloud required
🔧 Troubleshooting Tips
OOM Issues? Create a swap file — 4GB RAM isn’t much:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
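Not sure whether you need swap? Total RAM is readable straight from /proc/meminfo. A small sketch (Linux-only; values are in kB, as the kernel reports them):

```python
def meminfo_kb(text: str, key: str = "MemTotal") -> int:
    """Parse a field (in kB) out of /proc/meminfo-style text."""
    for line in text.splitlines():
        if line.startswith(key + ":"):
            return int(line.split()[1])
    raise KeyError(key)

def total_ram_gb() -> float:
    """Total system RAM in GB, read from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        return meminfo_kb(f.read()) / (1024 * 1024)
```

On a 4GB Nano, total_ram_gb() comes back a bit under 4 once the kernel takes its cut — tight enough that the swap file above is worth having before you load a model.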
CUDA not found? Set your paths:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Hi, thanks for the detailed guide!
While following the steps, I ran into an issue with this command:
git checkout tags/v0.2.70
Git responds with:
error: pathspec 'tags/v0.2.70' did not match any file(s) known to git.
After running git fetch --all --tags and checking the available tags with git tag, it seems that v0.2.70 doesn’t exist in the llama.cpp repository anymore. Could you please confirm the correct tag to use or update the instructions accordingly?
Thanks in advance!
Good catch, that tag was just a simple mistake on my end. I’ve updated the post with the correct version now. Thanks for flagging it!