Running llama.cpp on the Jetson Nano

So you’ve built GCC 8.5, survived the make -j6 wait, and now you’re eyeing llama.cpp like it’s your ticket to local AI greatness on a $99 dev board.
With a few workarounds (read: Git checkout, compiler gymnastics, and a well-placed patch), you can run a quantized LLM locally on your Jetson Nano with CUDA 10.2, GCC 8.5, and a Prayer. Will it be fast? No. Will it be absurdly cool? Absolutely.

🎯 Why not the latest llama.cpp?

Short version: it’s too spicy for the Jetson.

The latest versions depend on newer CUDA features that just don’t exist in 10.2. You’ll hit errors you can’t fix without sacrificing your weekend.

So instead, we pin the project to a known working version: v0.2.70. It’s modern enough to run GGUF models, and old enough to compile without a meltdown.

🧰 Step 1: Clone a compatible version

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout tags/v0.2.70

⚙️ Step 2: Apply patches and build with CUDA + GCC 8.5

To play nice with CUDA 10.2, you’ll need to apply a few code tweaks. Thankfully, someone already paved the way:

👉 Patch gist:
https://gist.github.com/FlorSanders/2cf043f7161f52aa4b18fb3a1ab6022f
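
The gist spells out the exact changes. If you end up with them as a unified diff, the generic git apply workflow looks like this (jetson-cuda102.patch is just a placeholder name, not a file the gist ships):

git apply --check jetson-cuda102.patch   # dry run: make sure it applies cleanly
git apply jetson-cuda102.patch
git diff --stat                          # quick look at which files were touched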

Once that’s done, use your shiny GCC 8.5 install:

export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++

make clean
make LLAMA_CUBLAS=1

LLAMA_CUBLAS=1 enables GPU acceleration via cuBLAS. Works with CUDA 10.2 if it’s installed correctly.
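
Before you kick off the build, it's worth a quick sanity check that the right toolchain is being picked up, and afterwards, that the binary really linked against cuBLAS. These are generic checks, nothing specific to this guide:

$CC --version                  # should report GCC 8.5
nvcc --version                 # should report CUDA 10.2
ldd ./main | grep -i cublas    # after the build: expect something like libcublas.so.10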

📥 Step 3: Download a quantized model

You’re not running 70B on a Jetson — go small. Here’s a solid choice:

mkdir models
cd models

wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

🧪 Step 4: Run it!

./main -m ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello from Jetson Nano!" -n 128 --n-gpu-layers 35 --color

Options:

  • -p: prompt
  • -n: number of tokens to generate
  • --n-gpu-layers: offload this many layers to the GPU (tune based on your memory; see the example below)
  • --color: ANSI rainbow ✨

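If 35 layers plus the KV cache won't fit in 4GB, dial things back: fewer GPU layers and a smaller context window. The values below are just a starting point to experiment with, not magic numbers:

./main -m ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -p "Hello from Jetson Nano!" \
  -n 128 \
  -c 512 \
  --n-gpu-layers 20 \
  --color
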
🧠 Tip: If you hit OOM issues, create a swap file — 4GB RAM isn’t much.
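
Creating one is standard Ubuntu fare; a 4GB swap file looks roughly like this (an attached SSD is kinder to it than the SD card):

# Create and enable a 4GB swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make it survive reboots, then confirm it's active
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
free -h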

🌐 Bonus: Serve it via OpenAI API

python3 -m llama_cpp.server \
  --model ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --n_ctx 1024 \
  --n_gpu_layers 35 \
  --host 0.0.0.0 \
  --port 8000

Open this in your browser (use the Nano's IP instead of localhost if you're browsing from another machine):
📍 http://localhost:8000/docs
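
The interactive docs are nice, but a quick curl proves the OpenAI-compatible endpoint actually answers. This assumes llama-cpp-python's standard /v1/completions route:

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Hello from Jetson Nano!",
        "max_tokens": 64,
        "temperature": 0.7
      }'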


🐳 Too lazy to build? Just run the Docker image (with llama-cpp-python baked in)

If you’d rather skip the whole “compile GCC from source and patch llama.cpp” journey, I’ve got your back.

This Docker image comes with:

  • ✅ CUDA 10.2 preconfigured
  • ✅ llama.cpp pinned to v0.2.70
  • ✅ Python bindings (llama-cpp-python) installed
  • ✅ Ready-to-go OpenAI-compatible server

docker run -it \
  -p 8000:8000 \
  acerbetti/l4t-jetpack-llama-cpp-python:latest \
  /bin/bash -c \
    'python3 -m llama_cpp.server \
      --model $(huggingface-downloader TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf) \
      --n_ctx 1024 \
      --n_gpu_layers 35 \
      --host 0.0.0.0 \
      --port 8000'

This command:

  • Downloads a quantized TinyLlama model from Hugging Face
  • Starts a REST API that speaks OpenAI’s language
  • Runs entirely on your Jetson Nano, no cloud required
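
Once the container is up, you talk to it exactly like you would the OpenAI API. Here's a minimal chat request, assuming the standard /v1/chat/completions route that llama-cpp-python exposes (run it on the Nano, or swap in its IP from another machine):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Tell me a short story about a robot and a cat."}
        ],
        "max_tokens": 128
      }'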

🧪 Benchmarking the Beast (Okay, the Nano)

What’s a blog post without some raw, delicious numbers? To satisfy your nerdy curiosity (and mine), I built a simple command-line benchmarking tool that hits the llama-cpp-python server using curl and prints out latency, tokens per second, and optionally, the actual text from the model.

You can grab the script here and try it yourself:
👉 benchmark.sh Gist
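
The full script has a few more bells and whistles, but the core loop is nothing exotic: time each request, pull the completion token count out of the response, and divide. Here's a stripped-down sketch of that idea (it assumes jq and bc are installed, and that the server reports usage.completion_tokens on /v1/completions, which llama-cpp-python does):

#!/bin/bash
# Minimal benchmark sketch; the real script in the gist does more.
URL="http://localhost:8000/v1/completions"
PROMPT="Tell me a short story about a robot and a cat."
RUNS=5

for i in $(seq 1 $RUNS); do
  START=$(date +%s.%N)
  RESPONSE=$(curl -s "$URL" \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"$PROMPT\", \"max_tokens\": 200}")
  END=$(date +%s.%N)

  ELAPSED=$(echo "$END - $START" | bc)
  TOKENS=$(echo "$RESPONSE" | jq '.usage.completion_tokens')
  TPS=$(echo "scale=2; $TOKENS / $ELAPSED" | bc)
  printf "Run %s... Time: %.2fs, Tokens: %s, TPS: %s\n" "$i" "$ELAPSED" "$TOKENS" "$TPS"
done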

Here’s what went down when I ran the following prompt through TinyLlama 1.1B Q5_K_M on my Jetson Nano:

Prompt: “Tell me a short story about a robot and a cat.”

💻 Hardware

  • Device: Jetson Nano (4GB RAM model)
  • Model: TinyLlama-1.1B-Chat (Q5_K_M)
  • Server: llama-cpp-python running with n_gpu_layers=35, context size 1024

📊 Benchmark Results

Run 1... Time: 74.39s, Tokens: 138, TPS: 1.86
Run 2... Time: 6.80s, Tokens: 14, TPS: 2.06
Run 3... Time: 26.32s, Tokens: 72, TPS: 2.74
Run 4... Time: 56.60s, Tokens: 159, TPS: 2.81
Run 5... Time: 56.67s, Tokens: 154, TPS: 2.72

--- Benchmark Summary ---
Average latency: 44.16s
Average tokens/sec: 2.43
Total time: 220.78s over 5 runs

🤔 Thoughts

  • Jetson Nano did its best, okay? Generating ~2.4 tokens per second is pretty impressive for a 4GB edge device chugging through a quantized 1.1B parameter model.
  • Run 1 looks like it needed a coffee. The huge latency spike suggests cold start or cache warming issues. Later runs were way snappier.
  • Performance is surprisingly usable if you’re fine with waiting a few seconds for a short response. Don’t expect real-time chat—unless your cat is really, really patient.

🥊 Compared to Big Boys

Let’s not pretend this competes with a 3090 or a beefy M2 Mac running the same model at 20–60 tokens/sec. But that’s the whole point: Jetson Nano is doing inference without a fan. It’s an edge device. And now it’s doing AI stuff like a tiny wizard.


🧠 Best Lightweight LLMs for Your Jetson Nano (4GB RAM)

So, you’ve got a Jetson Nano with only 4GB of RAM, and you’re dreaming of running massive language models like ChatGPT? Well, wake up, buddy! While you can’t exactly run GPT-4 on this thing (unless you want to time travel to the year 2050), you can run some snappy and efficient models that are perfect for quick tasks, coding help, or chatting with your robot sidekick.

Here’s a quick rundown of the best lightweight LLMs for Jetson Nano. These models won’t make your device sweat (too much), and they won’t require you to take out a second mortgage for extra RAM.

✅ 1. TinyLlama-1.1B-Chat

  • Why you’ll love it: It’s small, fast, and packs a punch. At only 1.1 billion parameters, this little guy is perfect for those “quick chat” moments when you need some AI banter. It runs great with Q5_K_M quantization — meaning it fits nicely into your 4GB RAM without breaking a sweat.
  • Use case: Building chatbots, quick Q&A, or just asking your Nano to tell you a joke (and yes, it’ll try its best).
  • Get it here: TinyLlama GGUF

✅ 2. Phi-2 by Microsoft (Q4_0)

  • Why it works: Don’t let the fancy name fool you. Phi-2 is small enough to fit on your Jetson Nano (with the right quantization). It might not be the flashiest model out there, but it’s got brains — solid at math, logic, and educational tasks.
  • Use case: Need help explaining quantum physics in simple terms? Phi-2’s got you.
  • Use it here: Phi-2 GGUF

✅ 3. Falcon-RW-1B

  • Why it works: This 1.3B model is like the sneaky underdog of the LLM world. It’s a lean machine that runs effortlessly on your Nano. Plus, it’s open-source! If you’re into performance, style, and not having to constantly baby your device — this is your jam.
  • Use case: Low-key general knowledge tasks, quick summaries, and text generation for small-scale projects.
  • Snag it here: Falcon-RW-1B GGUF

⚠️ 4. Mistral-7B (Q2_K)

  • Why you’ll try it (and regret it): Look, Mistral-7B is a beast. In the right setup, it’s a powerhouse. But trying to squeeze a 7B model into 4GB of RAM is like trying to fit a giraffe into a Smart car. It’s not pretty, and it’s slow.
  • Use case: If you’ve got no patience and a ton of RAM (or just love living on the edge), then go for it. But for the average Jetson Nano user? Stick to the smaller guys.
  • See it here: Mistral-7B GGUF
