Towards SOTA on the Edge

February 2026 - April 2026

Obtaining State-of-the-Art (SOTA) performance on edge devices. I've recently been tuning LLMs for the edge, with great speedups on smaller models. Do these models run fast? No. But they answer better.


Background

On artificialanalysis.ai, there are leaderboards for aggregate intelligence across multiple tests, such as Humanity's Last Exam. These tests measure reasoning, instruction following, tool usage, and other LLM abilities. My tuned models rank in the "tiny" size class, while common chat services like ChatGPT and Gemini land in the "large" class. This leaves a large intelligence gap between Halo, my edge device, and the services I've been wanting to replace.

The Target

Using the leaderboards, I can see performance scores for proprietary models and services.

Anyone can use GPT-4 Turbo on ChatGPT for free, as much as they want, with limited access to GPT-5.3 after making an account. For comparison's sake, here are the scores of the "target" models:

Model Score
Gemini 3.1 Pro 57
GPT-5.3 Codex 54
Claude Opus 4.6 53
Gemini 3.0 Flash 46
DeepSeek V3.2 42
Claude 4.5 Haiku 37
Gemini 2.5 Pro 35
GPT-4o 19
GPT-4 Turbo 14

And here are the scores of my models from Tuning LLMs for The Edge:

Model Score
Qwen3.5-0.8B 11
DeepSeek-R1-1.5B 9
Qwen3-1.7B 8
LFM2.5-1.2B-Thinking 8

Qwen3.5 can compete with GPT-4 Turbo in both speed and intelligence, but my models are still lacking in top-end intelligence.

Tiny Challengers

The recent release of Qwen3.5 and Gemma 4 allows for more intelligence in the same small footprint.

Model Score
Qwen3.5-4B 27
Gemma4-E4B 19
Qwen3.5-2B 16
Qwen3.5-0.8B 11

Qwen3.5's 2B-parameter counterpart scores 16, outscoring GPT-4 Turbo at the cost of roughly 40% of the 0.8B's generation throughput (19.08 t/s vs. 31.65 t/s).

Distillations

There are more reasoning-distilled models available, such as Jackrong's Qwopus3.5 v3. Based on the previous reasoning version, it was trained on the chains of thought (CoT) of SOTA models such as GPT-4.5-Pro and Claude Opus 4.6. This model is likely slightly smarter than the original Qwen3.5-4B, but still a far cry from the scores at the top end.
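As a rough illustration of what that training data looks like, here's a minimal sketch of packing a teacher model's chain of thought into a chat-style training record. The function name, field layout, and the <think>-tag convention (the style many Qwen reasoning models emit) are assumptions on my part, not the actual distillation pipeline:

```python
# Minimal sketch of assembling one reasoning-distillation example.
# The prompt, CoT, and answer strings are placeholders, not real
# teacher output; the <think> wrapper is a Qwen-style convention.

def make_distill_example(prompt: str, teacher_cot: str, teacher_answer: str) -> dict:
    """Wrap a teacher's chain of thought and final answer into a
    chat-style training record, keeping the CoT inside <think> tags."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {
                "role": "assistant",
                "content": f"<think>\n{teacher_cot}\n</think>\n{teacher_answer}",
            },
        ]
    }

example = make_distill_example(
    prompt="What is 17 * 24?",
    teacher_cot="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    teacher_answer="408",
)
print(example["messages"][1]["content"])
```

The student model is then fine-tuned on these records, so it learns to imitate the teacher's reasoning trace before the final answer.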


Mixture of Experts

Mixture of Experts (MoE) is a technique that routes each token through a subset of internal "experts". These models have a larger total parameter count, but only a fraction of those parameters are active at any one time. As a consequence, the computer only has to work on the "active" parameters, while the remaining experts sit in RAM. This allows for smarter models in a smaller active footprint, at the cost of needing more RAM.
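The "fraction of parameters active" idea boils down to a gating step: a small router scores all experts and only the top few run. Here's a toy top-k router to make that concrete (not any particular model's implementation):

```python
import math

def top_k_gate(logits: list[float], k: int = 2) -> dict[int, float]:
    """Toy MoE router: pick the k experts with the highest gate logits
    and softmax-normalize their weights. Every other expert stays idle
    for this token (and can stay parked in RAM)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    total = sum(exps.values())
    return {i: w / total for i, w in exps.items()}

# 8 experts, only 2 active per token:
weights = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(sorted(weights))  # [1, 4] -> only experts 1 and 4 do any work
```

With something like a 35B-A3B model, this is why per-token compute tracks the ~3B active parameters rather than the full 35B.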

Thankfully, I now have ~22GB of RAM to work with on my Halo node. Here are some of the competitors in the MoE space that fit on Halo:

Model Score tg128 (t/s)
Qwen3.5-35B-A3B 37 7.48
gpt-oss-20B 24 8.27
Qwen3-30B-A3B 22 13.25
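As a sanity check on what fits in ~22GB, here's a back-of-the-envelope weight-size estimate. The ~4.5 bits/weight figure is a ballpark for a Q4_K_M-style quant, not a measured GGUF size, and the estimate ignores KV cache and runtime overhead, so treat it as a lower bound:

```python
def approx_weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Rough quantized weight size in GB: params * bits / 8.
    Ignores KV cache and runtime buffers, so it's a lower bound."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Qwen3.5-35B-A3B at ~4.5 bits/weight (Q4_K_M-ish ballpark):
print(round(approx_weight_gb(35, 4.5), 1))  # 19.7 -> tight, but under ~22GB
```

Note that for an MoE model all 35B parameters must fit in memory even though only ~3B are active per token, which is exactly the RAM-for-compute trade described above.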

Batching Balancing Act

There are two hyperparameters that I have yet to explore: -b and -ub. -b is the logical batch size: the number of tokens the model processes at once. -ub is the physical batch size: the number of tokens actually processed on hardware per pass. Too large a batch size can eat up too much of the 4GB "real" VRAM space and throttle the iGPU; too small a batch size fails to use the full extent of compute available.
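Sweeping these two flags by hand gets tedious, so here's a small sketch that prints llama-bench invocations for a -b/-ub grid. The model path is a placeholder; -p 512 and -n 128 match the pp512/tg128 tests used throughout:

```python
import itertools
import shlex

MODEL = "models/qwen3.5-35b-a3b.gguf"  # placeholder path

# Build llama-bench invocations for a (-b, -ub) sweep. The physical
# batch (-ub) can't usefully exceed the logical batch (-b), since -b
# is split into -ub-sized chunks, so those combinations are skipped.
cmds = []
for b, ub in itertools.product([1024, 2048], [256, 512, 1024, 2048]):
    if ub > b:
        continue
    cmds.append(shlex.join([
        "llama-bench", "-m", MODEL,
        "-b", str(b), "-ub", str(ub),
        "-p", "512", "-n", "128",
    ]))

for cmd in cmds:
    print(cmd)
```

Piping these commands to a shell one at a time (rather than scripting them inline) makes it easy to drop the model between runs and keep VRAM measurements clean.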

Here are the results of testing Qwen3.5-35B-A3B with different batch sizes:

b ub pp512 (t/s) tg128 (t/s)
2048 256 28.43 7.52
2048 512 28.81 6.02
2048 1024 33.29 7.50
2048 2048 34.53 7.53
1024 256 28.42 7.48
1024 512 33.98 7.48
1024 1024 34.42 7.46

Considering that the tests are relatively small in token count, I've chosen a smaller batch size to reduce VRAM footprint. This makes multimodal usage (i.e. image viewing) more efficient without dropping performance.


Use Cases

Chatbot / Agent

For chatbots and agents, the priorities are speed and long-context memory for tool use; for example, you wouldn't want a model to die halfway through a conversation or through scraping 10 webpages. For this use case, I've selected Qwen3.5-2B for its speed and small memory footprint, as well as its tool-usage and instruction-following capabilities.

Reasoning

For reasoning and high-level thinking, I've selected Qwen3.5-35B-A3B, for the same reasons as Qwen3.5-2B but bigger. Its score of 37 on the leaderboard matches Claude 4.5 Haiku, which was released only 4 months earlier. It competes with the old Gemini 2.5 Pro and the new DeepSeek V3.2, both flagship models. I don't mind waiting a little longer for stronger reasoning, so this model is more for sending off bulk work and coming back to it later.