February 2026 - April 2026
Edge compute sits at an interesting intersection of two optimization problems: compute and power. Here's how I dragged state-of-the-art models into performing well on it.
Because I can, and because I did benchmarks in the process.
Specs:
With the CPU's low 25W TDP, this is a great target for pushing the edge of a problem known for being power hungry.
By reducing the precision of each parameter, a quantized model achieves higher throughput at the cost of some accuracy.
Quantization nomenclature typically looks like Q4_K_M: 4-bit weights using llama.cpp's K-quant scheme (block-wise quantization with per-block scales), where the M suffix denotes the medium size/quality variant.
Later on, I switched to Unsloth's dynamic quantization methods, such as Q4_K_XL, which quantize different layers of the model to different bit widths instead of applying one size uniformly.
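The size numbers in the tables below roughly follow from bits-per-weight. Here's a back-of-envelope sketch (the function name and 1.2B figure are illustrative; real GGUF files run slightly larger because of per-block scale metadata):

```python
# Approximate weight storage for a model at different quantization levels.
# Real GGUF quants carry per-block scale metadata, so actual files
# (e.g. the Q6 and Q8 sizes in the tables below) are somewhat larger.

def approx_size_mib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in MiB: params * bits / 8 bytes each."""
    return n_params * bits_per_weight / 8 / (1024 ** 2)

params = 1.2e9  # a 1.2B-parameter model, like LFM2.5-1.2B
for bits in (16, 8, 4):
    print(f"{bits}-bit: {approx_size_mib(params, bits):.0f} MiB")
```

Halving the bit width halves the bytes that must be streamed from memory per token, which is where the throughput gain comes from.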
I use llama-bench to benchmark models. Its pp512 test measures "reading" (how fast a 512-token prompt is processed), and its tg128 test measures "writing" (how fast 128 response tokens are generated). For more detail on the tests, see the llama-bench README in the llama.cpp repository.
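A baseline run looks something like the following (the model path is a placeholder, not one of my actual files):

```shell
# llama-bench defaults to a 512-token prompt test (pp512) and a
# 128-token generation test (tg128); the flags make that explicit.
# Substitute your own GGUF file for the placeholder path.
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -p 512 -n 128
```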
I tested 3 common models with this workflow:
Three other common models were tested as well, though not on every build; their results appear in the tables where available.
All of these models are designed for edge deployment, featuring a low parameter count that fits into smaller memory spaces. A low parameter count also means higher throughput (tokens/sec), since less memory has to be read per token.
I took the following steps to tune Llama.cpp for the Halo node in chronological order:
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q6_K_M | 915.96 MiB | 38.64 | 2.90 |
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 39.44 | 10.91 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 23.88 | 7.71 |
| Qwen3-0.6B | Q4_K_M | 372.65 MiB | 47.09 | 5.73 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 47.26 | 17.12 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 23.61 | 7.05 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 35.18 | 9.79 |
This yielded poor performance: a CPU offers far less parallelism than a GPU, which isn't ideal for the matrix-heavy workload of LLM inference. And although quantized weights save space, the CPU must spend extra compute dequantizing them back to higher precision before use.
After getting a baseline, I added some modification flags to the runtime:

- `-fa` enables flash attention, speeding up attention over previous tokens when inferring the next
- `-t` sets the number of threads to use, changing the degree of parallelism for compute

With flash attention enabled and the thread count tuned to 6, these two flags alone brought a speedup of around 10%.
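Applied to the baseline command, that looks like this (model path again a placeholder; in llama-bench, flash attention is passed as `-fa 1`):

```shell
# Re-run the benchmark with flash attention enabled and 6 threads.
# -fa 1 turns flash attention on; -t pins the CPU thread count.
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -fa 1 -t 6
```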
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 40.87 | 11.76 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 28.42 | 8.67 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 49.44 | 15.55 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 38.79 | 11.05 |
Llama.cpp can also be rebuilt with flags such as `-DGGML_NATIVE=ON`, which compiles for the host CPU's native instruction set and improved CPU performance by about 20%.
Interestingly, Qwen3 saw a large speedup of around 60% from this.
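A rebuild along these lines (a sketch, assuming a standard CMake checkout of llama.cpp):

```shell
# Rebuild llama.cpp with CPU-native optimizations.
# GGML_NATIVE compiles with -march=native, enabling whatever SIMD
# extensions (AVX2, AVX-512, ...) the host CPU actually supports.
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j
```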
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 49.21 | 13.31 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 36.24 | 10.04 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 74.45 | 24.48 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 32.95 | 8.95 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 49.10 | 12.51 |
This CPU has integrated graphics, which is well supported by Vulkan. With this in mind, I (finally) switched from CPU to iGPU by building llama.cpp with the Vulkan backend. On the CPU (BLAS) build, quantized models carried a dequantization overhead, but the iGPU's high parallelism made quantization worthwhile: quantized models shrink the memory footprint and, by extension, increase throughput.
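The switch amounts to a different build flag plus offloading layers at run time (a sketch; `-ngl 99` simply requests that all layers go to the GPU):

```shell
# Rebuild with the Vulkan backend to target the iGPU.
# Requires the Vulkan SDK/headers to be installed first.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# -ngl offloads model layers to the (i)GPU at benchmark time:
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -ngl 99
```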
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q6_K_XL | 946.96 MiB | 313.64 | 16.26 |
| DeepSeek-R1-1.5B | Q4_K_M | 1.04 GiB | 252.35 | 16.06 |
| DeepSeek-R1-1.5B | Q6_K_XL | 1.36 GiB | 228.56 | 12.72 |
| DeepSeek-R1-1.5B | Q8_K_XL | 1.76 GiB | 282.24 | 10.33 |
| Qwen3-0.6B | Q5_K_XL | 420.03 MiB | 540.11 | 31.95 |
| Qwen3-0.6B | Q6_K_XL | 544.09 MiB | 555.25 | 26.51 |
| Qwen3-0.6B | Q8_K_XL | 799.50 MiB | 588.67 | 19.16 |
| Qwen3-1.7B | Q4_K_XL | 1.05 GiB | 227.10 | 13.84 |
| Qwen3-1.7B | Q5_K_XL | 1.17 GiB | 224.54 | 12.61 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 251.80 | 9.22 |
| Qwen3.5-0.8B | Q5_K_XL | 568.03 MiB | 303.12 | 14.26 |
Switching to Vulkan brought an enormous gain in the pp512 test, speeding up "reading" (prompt prefill) by about 6x.
The Qwen3 quants also show a clear tradeoff: the smaller quants give up some prefill speed in exchange for higher text generation speed.
In order to run these models, the Halo node needs high-frequency access to memory. Because the GPU is integrated, it relies on system RAM rather than the dedicated VRAM a discrete GPU would bring, and I only had one stick. This made RAM speed a large bottleneck. I scrapped a broken laptop for its RAM and added that old stick to this node, enabling dual-channel access, which can effectively double memory bandwidth and, as a consequence, my LLMs' throughput.
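The expected doubling falls out of the peak-bandwidth formula. A rough sketch, assuming DDR4-3200 since the post doesn't state the actual RAM spec:

```python
# Rough peak-bandwidth math for single vs dual channel.
# DDR4-3200 is an assumed spec, not the node's confirmed hardware.

def peak_bandwidth_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s: transfers/s * 8-byte bus width * channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

single = peak_bandwidth_gbs(3200, channels=1)  # 25.6 GB/s
dual = peak_bandwidth_gbs(3200, channels=2)    # 51.2 GB/s

# Token generation is memory-bound: each token reads every weight once,
# so tokens/s is bounded by roughly bandwidth / model size.
model_gb = 1.17 * 1.074  # Qwen3-1.7B Q5_K_XL, GiB converted to GB
print(dual / model_gb)   # crude upper bound on tg tokens/s
```

The measured 23.08 t/s for that model sits comfortably under this crude bound, which is expected once activations, KV cache traffic, and real-world efficiency are accounted for.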
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q6_K_XL | 946.96 MiB | 351.94 | 30.90 |
| DeepSeek-R1-1.5B | Q4_K_XL | 1.10 GiB | 271.71 | 26.58 |
| Qwen3-0.6B | Q5_K_XL | 420.03 MiB | 590.80 | 51.66 |
| Qwen3-1.7B | Q5_K_XL | 1.17 GiB | 238.56 | 23.08 |
| Qwen3.5-0.8B | Q3_K_XL | 458.96 MiB | 496.00 | 31.65 |
| Qwen3.5-0.8B | Q5_K_XL | 568.03 MiB | 353.34 | 22.77 |
| Llama3.2-1B | Q4_K_XL | 788.09 MiB | 367.88 | 31.56 |
As expected, this nearly doubled the text generation speed of all models, with a more significant impact on the larger ones.
Unsloth ran some benchmarks on their quantized Qwen3.5 models, showing minimal loss at Q3_K_XL, so I decided to include it in my benchmarks.
(Apr. 4): To handle longer contexts, I quantized the K and V caches to 4 bits instead of 16. Alongside tuning how many tokens are batched together, this drastically reduced the memory footprint.
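In llama.cpp terms, that's the cache-type and batch flags (a sketch; the batch sizes shown are illustrative, not the exact values I settled on):

```shell
# Quantize the KV cache to 4 bits and tune batching.
# -ctk/-ctv set the K and V cache types; V-cache quantization
# requires flash attention, hence -fa 1. -b/-ub set batch sizes.
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -ngl 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 -b 256 -ub 256
```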
Speedup is calculated as after / before, giving a multiplier on throughput. In this case study, I sped up prompt processing by roughly 10x across all models. The text generation speedup is lower but still significant, averaging around 3.16x.
| Model | pp512 speedup | tg128 speedup |
|---|---|---|
| LFM2.5-1.2B | 351.94/39.44 = 8.92 | 30.90/10.91 = 2.83 |
| DeepSeek-R1-1.5B | 271.71/23.88 = 11.38 | 26.58/7.71 = 3.45 |
| Qwen3-0.6B | 590.80/47.26 = 12.50 | 51.66/17.12 = 3.02 |
| Qwen3-1.7B | 238.56/23.61 = 10.10 | 23.08/7.05 = 3.27 |
| Llama3.2-1B | 367.88/35.18 = 10.46 | 31.56/9.79 = 3.22 |
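The averages above can be sanity-checked directly from the before/after numbers in the table:

```python
# Recompute the speedup multipliers (after / before) from the
# baseline and final benchmark numbers quoted in the tables.
pp = {"LFM2.5-1.2B": (39.44, 351.94), "DeepSeek-R1-1.5B": (23.88, 271.71),
      "Qwen3-0.6B": (47.26, 590.80), "Qwen3-1.7B": (23.61, 238.56),
      "Llama3.2-1B": (35.18, 367.88)}
tg = {"LFM2.5-1.2B": (10.91, 30.90), "DeepSeek-R1-1.5B": (7.71, 26.58),
      "Qwen3-0.6B": (17.12, 51.66), "Qwen3-1.7B": (7.05, 23.08),
      "Llama3.2-1B": (9.79, 31.56)}

pp_speedups = [after / before for before, after in pp.values()]
tg_speedups = [after / before for before, after in tg.values()]
print(sum(pp_speedups) / len(pp_speedups))  # ~10.7x
print(sum(tg_speedups) / len(tg_speedups))  # ~3.16x
```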
Deploying LLMs on edge devices was a fun way for me to learn how to create slop as fast as possible while tuning both hardware and software to fit the task. A modern software layer like Vulkan added a tiny amount of setup overhead in exchange for strong results on integrated graphics. For those using laptops, mini PCs, or other devices with integrated graphics, I strongly recommend trying Vulkan for its flexibility and performance.