Tuning AI for The Edge

February 2026 - April 2026

Edge compute sits at an interesting intersection of two optimization domains: compute and power. Here's how I dragged state-of-the-art models into performing well on it.

Why?

Because I can, and because I did benchmarks in the process.

Background

Specs:

With the CPU's low 25W TDP, this node is a great target for pushing the limits of a problem known for being power hungry.


What matters in a model?

By reducing parameter precision, a model can achieve higher throughput at the cost of some accuracy. Quantization nomenclature typically looks like Q4_K_M, which stands for 4-bit k-quant (block-wise) quantization, medium variant. Later on, I used Unsloth's dynamic quantization methods, such as Q4_K_XL, which quantize different layers of the model at different precisions.

What matters in compute?


Benchmarking

I use llama-bench to benchmark models. It tests "reading" and "writing" via the pp512 and tg128 tests, respectively: pp512 measures the speed of processing a prompt, and tg128 measures the speed of generating a response. To see more about the tests, see here.
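A typical invocation looks something like the sketch below (the model path is a placeholder, not a file from my setup):

```shell
# pp512: prompt processing with a 512-token prompt ("reading")
# tg128: generating 128 tokens ("writing")
# These are llama-bench's default tests; the flags are shown for clarity.
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -p 512 -n 128
```

llama-bench prints a table of tokens/second for each test, which is where the numbers below come from.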

I tested 3 common models with this workflow:

  1. LFM2.5-1.2B-Thinking
  2. DeepSeek-R1-Distill-Qwen-1.5B
  3. Qwen3-0.6B

Three other common models were tested, but not on all builds; their results are included where available.

  1. Qwen3-1.7B: Larger size, good for testing memory footprint on the same Qwen3 architecture
  2. Qwen3.5-0.8B: High-density knowledge and image-processing capabilities
  3. Llama3.2-1B: Common in academia

All of these models are designed for edge deployment, featuring a low parameter count to fit into smaller memory spaces. This low param count also means higher throughput (tokens/sec) due to less memory having to be accessed per token.


Tuning Llama.cpp

I took the following steps, in chronological order, to tune llama.cpp for the Halo node:

1. Build llama.cpp for the CPU (BLAS) and benchmark (Feb. 4)

| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- | --- |
| LFM2.5-1.2B | Q6_K_M | 915.96 MiB | 38.64 | 2.90 |
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 39.44 | 10.91 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 23.88 | 7.71 |
| Qwen3-0.6B | Q4_K_M | 372.65 MiB | 47.09 | 5.73 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 47.26 | 17.12 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 23.61 | 7.05 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 35.18 | 9.79 |

This yielded poor performance: CPUs offer far less parallelism than GPUs, which isn't ideal for the large matrix multiplications behind LLM inference. And although quantization saves memory, it costs extra compute to dequantize weights back to higher precision on the fly.
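For reference, the CPU build was along these lines (the OpenBLAS vendor choice is my assumption; any BLAS implementation llama.cpp supports works the same way):

```shell
# CPU-only build with BLAS acceleration for prompt processing
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```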

2. Apply relevant flags (Feb. 7)

After getting a baseline, I added some runtime flags: flash attention enabled, and the thread count pinned to 6. These two flags alone brought a speedup of around 10%.

| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- | --- |
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 40.87 | 11.76 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 28.42 | 8.67 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 49.44 | 15.55 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 38.79 | 11.05 |

3. Rebuild llama.cpp with new flags (Feb. 10/11)

Llama.cpp can be rebuilt with flags such as -DGGML_NATIVE, which enables hardware-specific instruction sets and improved CPU performance by about 20%. Interestingly, Qwen3 saw a much larger speedup of around 60% from this.
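A sketch of the rebuild, keeping the assumed BLAS options from before:

```shell
# GGML_NATIVE compiles for the host CPU's instruction sets
# (AVX2/AVX-512 etc., whatever the chip supports)
cmake -B build -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```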

| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- | --- |
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 49.21 | 13.31 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 36.24 | 10.04 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 74.45 | 24.48 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 32.95 | 8.95 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 49.10 | 12.51 |

4. Change backend from BLAS to Vulkan (Feb. 15/16)

This CPU has integrated graphics, which Vulkan supports well. With this in mind, I (finally) switched from CPU to iGPU by building llama.cpp with Vulkan. On the BLAS (CPU) backend, quantized models carried a dequantization overhead, but the high parallelism of the iGPU makes quantization worthwhile: quantized models shrink the memory footprint and, by extension, increase throughput.
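A sketch of the Vulkan build (it assumes the Vulkan SDK and drivers are already installed):

```shell
# Build llama.cpp against the Vulkan backend
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Sanity check: the iGPU should show up as a Vulkan device
vulkaninfo --summary
```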

| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- | --- |
| LFM2.5-1.2B | Q6_K_XL | 946.96 MiB | 313.64 | 16.26 |
| DeepSeek-R1-1.5B | Q4_K_M | 1.04 GiB | 252.35 | 16.06 |
| DeepSeek-R1-1.5B | Q6_K_XL | 1.36 GiB | 228.56 | 12.72 |
| DeepSeek-R1-1.5B | Q8_K_XL | 1.76 GiB | 282.24 | 10.33 |
| Qwen3-0.6B | Q5_K_XL | 420.03 MiB | 540.11 | 31.95 |
| Qwen3-0.6B | Q6_K_XL | 544.09 MiB | 555.25 | 26.51 |
| Qwen3-0.6B | Q8_K_XL | 799.50 MiB | 588.67 | 19.16 |
| Qwen3-1.7B | Q4_K_XL | 1.05 GiB | 227.10 | 13.84 |
| Qwen3-1.7B | Q5_K_XL | 1.17 GiB | 224.54 | 12.61 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 251.80 | 9.22 |
| Qwen3.5-0.8B | Q5_K_XL | 568.03 MiB | 303.12 | 14.26 |

Switching to Vulkan brought an enormous speedup in the pp512 test, accelerating "reading" (prompt prefill) by about 6x. The Qwen3 quants also show a tradeoff: smaller quants give up a little prefill speed in exchange for noticeably faster text generation.

5. Add a new stick of RAM! (Mar. 4)

In order to run these models, the Halo node needs high-bandwidth access to memory. Because the GPU is integrated, it relies on system RAM speeds rather than the dedicated memory a discrete GPU brings along. That makes RAM bandwidth a large bottleneck, and I only had one stick. I scrapped a broken laptop for its RAM and added that old stick to this node, enabling dual-channel access. This can effectively double memory bandwidth and, as a consequence, my LLMs' throughput.
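The back-of-envelope math: peak bandwidth is roughly transfer rate × 8 bytes per transfer × channel count. The DDR4-3200 figure below is purely an assumption for illustration; the post doesn't state the actual RAM speed.

```shell
# Theoretical peak bandwidth = MT/s * 8 bytes/transfer * channels
# DDR4-3200 is an assumed speed, not the node's measured spec.
mts=3200
single=$(awk -v m="$mts" 'BEGIN { printf "%.1f", m * 8 * 1 / 1000 }')
dual=$(awk -v m="$mts" 'BEGIN { printf "%.1f", m * 8 * 2 / 1000 }')
echo "single-channel: ${single} GB/s, dual-channel: ${dual} GB/s"
```

Since token generation is memory-bandwidth-bound, a ~2x bandwidth bump translating to a near-2x tg128 bump is exactly what you'd hope for.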

| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- | --- |
| LFM2.5-1.2B | Q6_K_XL | 946.96 MiB | 351.94 | 30.90 |
| DeepSeek-R1-1.5B | Q4_K_XL | 1.10 GiB | 271.71 | 26.58 |
| Qwen3-0.6B | Q5_K_XL | 420.03 MiB | 590.80 | 51.66 |
| Qwen3-1.7B | Q5_K_XL | 1.17 GiB | 238.56 | 23.08 |
| Qwen3.5-0.8B | Q3_K_XL | 458.96 MiB | 496.00 | 31.65 |
| Qwen3.5-0.8B | Q5_K_XL | 568.03 MiB | 353.34 | 22.77 |
| Llama3.2-1B | Q4_K_XL | 788.09 MiB | 367.88 | 31.56 |

As expected, this nearly doubled the text generation speed of all models, with a more significant impact on the larger ones.

Unsloth ran some benchmarks on their quantized Qwen3.5 models, showing minimal loss at Q3_K_XL, so I decided to include it in my benchmarks.

(Apr. 4): To handle longer contexts, I quantized the K and V caches to 4 bits instead of 16. Alongside tuning how many tokens are batched together, this drastically reduced the memory footprint.
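A sketch of what that looks like when serving (model path and context/batch sizes are illustrative, and flash attention is needed for a quantized V cache; exact flag spellings can vary between llama.cpp versions):

```shell
# 4-bit K/V caches cut KV memory to ~1/4 of f16,
# -b sets the batch size, -c the context length
./llama-server -m models/Qwen3-0.6B-Q5_K_XL.gguf \
  -c 8192 -b 512 -fa on -ctk q4_0 -ctv q4_0
```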


Observed Speedup

Speedup is calculated as after / before, giving a multiplier on throughput. In this case study, I sped up prompt processing by roughly 9-12.5x across models. The text generation speedup is lower but still significant, averaging around 3.16x.

| Model | pp512 speedup | tg128 speedup |
| --- | --- | --- |
| LFM2.5-1.2B | 351.94 / 39.44 = 8.92 | 30.90 / 10.91 = 2.83 |
| DeepSeek-R1-1.5B | 271.71 / 23.88 = 11.38 | 26.58 / 7.71 = 3.45 |
| Qwen3-0.6B | 590.80 / 47.26 = 12.50 | 51.66 / 17.12 = 3.02 |
| Qwen3-1.7B | 238.56 / 23.61 = 10.10 | 23.08 / 7.05 = 3.27 |
| Llama3.2-1B | 367.88 / 35.18 = 10.46 | 31.56 / 9.79 = 3.22 |
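The arithmetic above can be reproduced directly, e.g. for Qwen3-0.6B's Q8_0 baseline versus its final dual-channel Vulkan result:

```shell
# speedup multiplier = after / before
pp_speedup=$(awk 'BEGIN { printf "%.2f", 590.80 / 47.26 }')
tg_speedup=$(awk 'BEGIN { printf "%.2f", 51.66 / 17.12 }')
echo "pp512: ${pp_speedup}x, tg128: ${tg_speedup}x"
```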

Conclusions

Deploying LLMs on edge devices was a fun way for me to learn how to create slop as fast as possible while tuning both hardware and software to fit the task. A modern API layer like Vulkan added a tiny setup overhead in exchange for strong results on integrated graphics. For those using laptops, mini PCs, or other devices with integrated graphics, I strongly recommend trying Vulkan for its flexibility and performance.