February 2026 - April 2026
Obtaining State-of-the-Art (SOTA) performance on edge devices. I've recently been tuning LLMs for the edge with great speedups on smaller models. Does it run fast? No. But it does answer better.
On artificialanalysis.ai, there are leaderboards for aggregate intelligence across multiple benchmarks such as Humanity's Last Exam. These benchmarks are designed to test reasoning, instruction following, tool usage, and other LLM abilities. My tuned models rank in the "tiny" size class, while the models behind common chat sites like ChatGPT and Gemini land in the "large" class. This leaves a large intelligence gap between Halo, my edge device, and the services I've been wanting to replace.
Using the leaderboards, I can see performance scores for proprietary models and services.
Anyone on ChatGPT can use GPT-4 Turbo for free as much as they want, with limited access to GPT-5.3 after making an account. For comparison's sake, here are the scores of the "target" models:
| Model | Score |
|---|---|
| Gemini 3.1 Pro | 57 |
| GPT-5.3 Codex | 54 |
| Claude Opus 4.6 | 53 |
| Gemini 3.0 Flash | 46 |
| DeepSeek V3.2 | 42 |
| Claude 4.5 Haiku | 37 |
| Gemini 2.5 Pro | 35 |
| GPT-4o | 19 |
| GPT-4 Turbo | 14 |
And here are the scores of my models from Tuning LLMs for The Edge:
| Model | Score |
|---|---|
| Qwen3.5-0.8B | 11 |
| DeepSeek-R1-1.5B | 9 |
| Qwen3-1.7B | 8 |
| LFM2.5-1.2B-Thinking | 8 |
Qwen3.5 can compete with GPT-4 Turbo on both speed and intelligence, but my models are lacking in top-end intelligence.
The recent release of Qwen3.5 and Gemma 4 allows for more intelligence in the same small footprint.
| Model | Score |
|---|---|
| Qwen3.5-4B | 27 |
| Gemma4-E4B | 19 |
| Qwen3.5-2B | 16 |
| Qwen3.5-0.8B | 11 |
Qwen3.5's 2B-parameter counterpart scores 16, outscoring GPT-4 Turbo at the cost of roughly 40% lower generation throughput (19.08 t/s vs. 31.65 t/s).
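The throughput tradeoff works out as follows (a quick sanity check using the generation figures quoted above):

```python
# Relative generation throughput of Qwen3.5-2B vs. Qwen3.5-0.8B on Halo,
# using the tg (tokens/second) figures quoted in the text.
small_tps = 31.65   # Qwen3.5-0.8B
larger_tps = 19.08  # Qwen3.5-2B

slowdown = 1 - larger_tps / small_tps
print(f"Throughput drop: {slowdown:.0%}")  # → Throughput drop: 40%
```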
There are also reasoning-distilled models available, such as Jackrong's Qwopus3.5 v3. Based on the previous reasoning version, it was trained on the chains of thought (CoT) of SOTA models such as GPT-4.5-Pro and Claude Opus 4.6. This model is likely slightly smarter than the original Qwen3.5-4B, but still a far cry from the scores shown at the top end.
Mixture of Experts (MoE) is a technique that routes a prompt through a set of internal "experts." These models have a larger total parameter count, but only a fraction of those parameters are activated at a time. As a consequence, the processor only has to work on the "active" parameters, while the rest of the experts sit in RAM. This allows for smarter models with a small active footprint, at the cost of needing more RAM.
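As a back-of-envelope sketch of that split (assuming ~4.5 bits per weight for a Q4-class quantization; real GGUF file sizes vary), a model like Qwen3.5-35B-A3B stores ~35B parameters but only touches ~3B per token:

```python
# Rough memory math for an MoE model: total weights must fit in RAM,
# but compute only touches the "active" subset per token.
BITS_PER_WEIGHT = 4.5  # assumed Q4-class quantization; actual GGUFs vary

def weight_gb(params_billion: float) -> float:
    """Approximate weight storage in GB at the assumed quantization."""
    return params_billion * 1e9 * BITS_PER_WEIGHT / 8 / 1e9

total_gb = weight_gb(35)   # everything that must sit in RAM
active_gb = weight_gb(3)   # what the compute units touch per token

print(f"stored: ~{total_gb:.1f} GB, active per token: ~{active_gb:.1f} GB")
# → stored: ~19.7 GB, active per token: ~1.7 GB
```

At ~19.7 GB of weights, a model this size only just squeezes into the available memory.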
Thankfully, I now have ~22GB of RAM to work with on my Halo node. Here are some of the competitors in the MoE space that fit on Halo:
| Model | Score | tg128 (t/s) |
|---|---|---|
| Qwen3.5-35B-A3B | 37 | 7.48 |
| gpt-oss-20B | 24 | 8.27 |
| Qwen3-30B-A3B | 22 | 13.25 |
There are two hyperparameters that I have yet to explore: -b and -ub.
-b is the logical batch size: the maximum number of tokens submitted to the model in a single pass.
-ub is the physical batch size: the maximum number of tokens the hardware actually processes at once; a logical batch is split into chunks of this size.
Having too large a batch size can use up too much of the 4GB "real" VRAM space and throttle the iGPU.
Too small a batch size can leave the available compute underused.
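The relationship between the two flags can be sketched like this (my understanding of how llama.cpp chunks a prompt, not its exact internals; typically -ub ≤ -b):

```python
import math

def micro_batches(prompt_tokens: int, b: int, ub: int) -> int:
    """Count how many physical micro-batches a prompt needs, given
    logical batch size b (-b) and physical batch size ub (-ub)."""
    total = 0
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, b)       # one logical batch
        total += math.ceil(chunk / ub)  # split onto hardware
        remaining -= chunk
    return total

# A pp512 prompt with -b 2048 -ub 256: one logical batch, two hardware passes.
print(micro_batches(512, 2048, 256))  # → 2
```

Fewer, larger physical batches mean fewer kernel launches (better prompt processing throughput), but each pass needs more VRAM for activations.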
Here are the results of testing Qwen3.5-35B-A3B with different batch sizes:
| b | ub | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| 2048 | 256 | 28.43 | 7.52 |
| 2048 | 512 | 28.81 | 6.02 |
| 2048 | 1024 | 33.29 | 7.50 |
| 2048 | 2048 | 34.53 | 7.53 |
| 1024 | 256 | 28.42 | 7.48 |
| 1024 | 512 | 33.98 | 7.48 |
| 1024 | 1024 | 34.42 | 7.46 |
Considering that the tests are relatively small in token count, I've chosen a smaller batch size to reduce the VRAM footprint. This makes multimodal usage (i.e. image viewing) more efficient without a drop in performance.
For chatbots and agents, the priorities are speed and long-context memory for tool use; you wouldn't want a model to die halfway through a conversation or while scraping 10 webpages. For this use case, I've selected Qwen3.5-2B for its speed and small memory footprint, as well as its tool-usage and instruction-following capabilities.
For reasoning and high-level thinking, I've selected Qwen3.5-35B-A3B for the same reasons as Qwen3.5-2B, but bigger. Its score of 37 on the leaderboard is equivalent to Claude 4.5 Haiku, which was released only 4 months earlier. It competes with the older Gemini 2.5 Pro and the newer DeepSeek V3.2, both flagship models. I don't mind waiting a little longer for stronger reasoning, so this model is more for sending off bulk work and coming back to it later.