February 2026 - April 2026
Edge compute sits at an interesting intersection of two optimization problems: compute and power. Here's how I dragged state-of-the-art models into performing well on it.
Because I can, and because I did benchmarks in the process.
Specs:
With the CPU's low 25W TDP, this is a great target for pushing the edge of a problem known for being power hungry.
By reducing the precision of each parameter, a quantized model achieves higher throughput at the cost of some accuracy.
Quantization nomenclature typically looks like Q4_K_M: 4-bit weights using llama.cpp's K-quant scheme (block-wise quantization with per-block scales), where the M suffix denotes the medium size/quality variant.
Later on, I switched to Unsloth's dynamic quantization methods, such as Q4_K_XL, which quantize different layers of the model to different bit widths instead of applying one size uniformly.
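The size numbers in the tables below roughly follow from bits-per-weight. Here's a back-of-envelope sketch (the function name and 1.2B figure are illustrative; real GGUF files run slightly larger because of per-block scale metadata):

```python
# Approximate weight storage for a model at different quantization levels.
# Real GGUF quants carry per-block scale metadata, so actual files
# (e.g. the Q6 and Q8 sizes in the tables below) are somewhat larger.

def approx_size_mib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in MiB: params * bits / 8 bytes each."""
    return n_params * bits_per_weight / 8 / (1024 ** 2)

params = 1.2e9  # a 1.2B-parameter model, like LFM2.5-1.2B
for bits in (16, 8, 4):
    print(f"{bits}-bit: {approx_size_mib(params, bits):.0f} MiB")
```

Halving the bit width halves the bytes that must be streamed from memory per token, which is where the throughput gain comes from.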
I use llama-bench to benchmark models. Its pp512 test measures "reading" (how fast a 512-token prompt is processed), and its tg128 test measures "writing" (how fast 128 response tokens are generated). For more detail on the tests, see the llama-bench README in the llama.cpp repository.
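A baseline run looks something like the following (the model path is a placeholder, not one of my actual files):

```shell
# llama-bench defaults to a 512-token prompt test (pp512) and a
# 128-token generation test (tg128); the flags make that explicit.
# Substitute your own GGUF file for the placeholder path.
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -p 512 -n 128
```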
I tested 3 common models with this workflow:
Three other common models were tested as well, though not on every build; their results appear in the tables where available.
All of these models are designed for edge deployment, featuring a low parameter count that fits into smaller memory spaces. A low parameter count also means higher throughput (tokens/sec), since less memory has to be read per token.
I took the following steps to tune Llama.cpp for the Halo node in chronological order:
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q6_K_M | 915.96 MiB | 38.64 | 2.90 |
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 39.44 | 10.91 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 23.88 | 7.71 |
| Qwen3-0.6B | Q4_K_M | 372.65 MiB | 47.09 | 5.73 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 47.26 | 17.12 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 23.61 | 7.05 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 35.18 | 9.79 |
This yielded poor performance: a CPU offers far less parallelism than a GPU, which isn't ideal for the matrix-heavy workload of LLM inference. And although quantized weights save space, the CPU must spend extra compute dequantizing them back to higher precision before use.
After getting a baseline, I added some modification flags to the runtime:

- `-fa` enables flash attention, speeding up attention over previous tokens when inferring the next
- `-t` sets the number of threads to use, changing the degree of parallelism for compute

With flash attention enabled and the thread count tuned to 6, these two flags alone brought a speedup of around 10%.
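Applied to the baseline command, that looks like this (model path again a placeholder; in llama-bench, flash attention is passed as `-fa 1`):

```shell
# Re-run the benchmark with flash attention enabled and 6 threads.
# -fa 1 turns flash attention on; -t pins the CPU thread count.
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -fa 1 -t 6
```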
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 40.87 | 11.76 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 28.42 | 8.67 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 49.44 | 15.55 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 38.79 | 11.05 |
Llama.cpp can also be rebuilt with flags such as `-DGGML_NATIVE=ON`, which compiles for the host CPU's native instruction set and improved CPU performance by about 20%.
Interestingly, Qwen3 saw a large speedup of around 60% from this.
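A rebuild along these lines (a sketch, assuming a standard CMake checkout of llama.cpp):

```shell
# Rebuild llama.cpp with CPU-native optimizations.
# GGML_NATIVE compiles with -march=native, enabling whatever SIMD
# extensions (AVX2, AVX-512, ...) the host CPU actually supports.
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j
```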
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q8_0 | 1.16 GiB | 49.21 | 13.31 |
| DeepSeek-R1-1.5B | Q8_0 | 1.76 GiB | 36.24 | 10.04 |
| Qwen3-0.6B | Q8_0 | 604.15 MiB | 74.45 | 24.48 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 32.95 | 8.95 |
| Llama3.2-1B | Q8_0 | 1.22 GiB | 49.10 | 12.51 |
This CPU has integrated graphics, which is well supported by Vulkan. With this in mind, I (finally) switched from CPU to iGPU by building llama.cpp with the Vulkan backend. On the CPU (BLAS) build, quantized models carried a dequantization overhead, but the iGPU's high parallelism made quantization worthwhile: quantized models shrink the memory footprint and, by extension, increase throughput.
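The switch amounts to a different build flag plus offloading layers at run time (a sketch; `-ngl 99` simply requests that all layers go to the GPU):

```shell
# Rebuild with the Vulkan backend to target the iGPU.
# Requires the Vulkan SDK/headers to be installed first.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# -ngl offloads model layers to the (i)GPU at benchmark time:
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -ngl 99
```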
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q6_K_XL | 946.96 MiB | 313.64 | 16.26 |
| DeepSeek-R1-1.5B | Q4_K_M | 1.04 GiB | 252.35 | 16.06 |
| DeepSeek-R1-1.5B | Q6_K_XL | 1.36 GiB | 228.56 | 12.72 |
| DeepSeek-R1-1.5B | Q8_K_XL | 1.76 GiB | 282.24 | 10.33 |
| Qwen3-0.6B | Q5_K_XL | 420.03 MiB | 540.11 | 31.95 |
| Qwen3-0.6B | Q6_K_XL | 544.09 MiB | 555.25 | 26.51 |
| Qwen3-0.6B | Q8_K_XL | 799.50 MiB | 588.67 | 19.16 |
| Qwen3-1.7B | Q4_K_XL | 1.05 GiB | 227.10 | 13.84 |
| Qwen3-1.7B | Q5_K_XL | 1.17 GiB | 224.54 | 12.61 |
| Qwen3-1.7B | Q8_0 | 1.70 GiB | 251.80 | 9.22 |
| Qwen3.5-0.8B | Q5_K_XL | 568.03 MiB | 303.12 | 14.26 |
Switching to Vulkan brought an enormous gain in the pp512 test, speeding up "reading" (prompt prefill) by about 6x.
The Qwen3 quants also show a clear tradeoff: the smaller quants give up some prefill speed in exchange for higher text generation speed.
In order to run these models, the Halo node needs high-frequency access to memory. Because the GPU is integrated, it relies on system RAM rather than the dedicated VRAM a discrete GPU would bring, and I only had one stick. This made RAM speed a large bottleneck. I scrapped a broken laptop for its RAM and added that old stick to this node, enabling dual-channel access, which can effectively double memory bandwidth and, as a consequence, my LLMs' throughput.
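The expected doubling falls out of the peak-bandwidth formula. A rough sketch, assuming DDR4-3200 since the post doesn't state the actual RAM spec:

```python
# Rough peak-bandwidth math for single vs dual channel.
# DDR4-3200 is an assumed spec, not the node's confirmed hardware.

def peak_bandwidth_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s: transfers/s * 8-byte bus width * channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

single = peak_bandwidth_gbs(3200, channels=1)  # 25.6 GB/s
dual = peak_bandwidth_gbs(3200, channels=2)    # 51.2 GB/s

# Token generation is memory-bound: each token reads every weight once,
# so tokens/s is bounded by roughly bandwidth / model size.
model_gb = 1.17 * 1.074  # Qwen3-1.7B Q5_K_XL, GiB converted to GB
print(dual / model_gb)   # crude upper bound on tg tokens/s
```

The measured 23.08 t/s for that model sits comfortably under this crude bound, which is expected once activations, KV cache traffic, and real-world efficiency are accounted for.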
| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| LFM2.5-1.2B | Q6_K_XL | 946.96 MiB | 351.94 | 30.90 |
| DeepSeek-R1-1.5B | Q4_K_XL | 1.10 GiB | 271.71 | 26.58 |
| Qwen3-0.6B | Q5_K_XL | 420.03 MiB | 590.80 | 51.66 |
| Qwen3-1.7B | Q5_K_XL | 1.17 GiB | 238.56 | 23.08 |
| Qwen3.5-0.8B | Q3_K_XL | 458.96 MiB | 496.00 | 31.65 |
| Qwen3.5-0.8B | Q5_K_XL | 568.03 MiB | 353.34 | 22.77 |
| Llama3.2-1B | Q4_K_XL | 788.09 MiB | 367.88 | 31.56 |
As expected, this nearly doubled the text generation speed of all models, with a more significant impact on the larger ones.
Unsloth ran some benchmarks on their quantized Qwen3.5 models, showing minimal loss at Q3_K_XL, so I decided to include it in my benchmarks.
(Apr. 4): To handle longer contexts, I quantized the K and V caches to 4 bits instead of 16. Alongside tuning how many tokens are batched together, this drastically reduced the memory footprint.
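In llama.cpp terms, that's the cache-type and batch flags (a sketch; the batch sizes shown are illustrative, not the exact values I settled on):

```shell
# Quantize the KV cache to 4 bits and tune batching.
# -ctk/-ctv set the K and V cache types; V-cache quantization
# requires flash attention, hence -fa 1. -b/-ub set batch sizes.
./llama-bench -m models/Qwen3-0.6B-Q8_0.gguf -ngl 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 -b 256 -ub 256
```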
Speedup is calculated as after / before, giving a multiplier on throughput. In this case study, I sped up prompt processing by roughly 10x across all models. The text generation speedup is lower but still significant, averaging around 3.16x.
| Model | pp512 speedup | tg128 speedup |
|---|---|---|
| LFM2.5-1.2B | 351.94/39.44 = 8.92 | 30.90/10.91 = 2.83 |
| DeepSeek-R1-1.5B | 271.71/23.88 = 11.38 | 26.58/7.71 = 3.45 |
| Qwen3-0.6B | 590.80/47.26 = 12.50 | 51.66/17.12 = 3.02 |
| Qwen3-1.7B | 238.56/23.61 = 10.10 | 23.08/7.05 = 3.27 |
| Llama3.2-1B | 367.88/35.18 = 10.46 | 31.56/9.79 = 3.22 |
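The averages above can be sanity-checked directly from the before/after numbers in the table:

```python
# Recompute the speedup multipliers (after / before) from the
# baseline and final benchmark numbers quoted in the tables.
pp = {"LFM2.5-1.2B": (39.44, 351.94), "DeepSeek-R1-1.5B": (23.88, 271.71),
      "Qwen3-0.6B": (47.26, 590.80), "Qwen3-1.7B": (23.61, 238.56),
      "Llama3.2-1B": (35.18, 367.88)}
tg = {"LFM2.5-1.2B": (10.91, 30.90), "DeepSeek-R1-1.5B": (7.71, 26.58),
      "Qwen3-0.6B": (17.12, 51.66), "Qwen3-1.7B": (7.05, 23.08),
      "Llama3.2-1B": (9.79, 31.56)}

pp_speedups = [after / before for before, after in pp.values()]
tg_speedups = [after / before for before, after in tg.values()]
print(sum(pp_speedups) / len(pp_speedups))  # ~10.7x
print(sum(tg_speedups) / len(tg_speedups))  # ~3.16x
```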
Deploying LLMs on edge devices was a fun way for me to learn how to create slop as fast as possible while tuning both hardware and software to fit the task. A modern software layer like Vulkan added a tiny amount of setup overhead in exchange for strong results on integrated graphics. For those using laptops, mini PCs, or other devices with integrated graphics, I strongly recommend trying Vulkan for its flexibility and performance.