Towards SOTA on the Edge

February 2026 - April 2026

Obtaining State-of-the-Art (SOTA) performance on edge devices. I've recently been tuning LLMs for the edge, with great speedups on smaller models. Do these models run fast? No. But they answer better.


Background

On artificialanalysis.ai, there are leaderboards for aggregate intelligence across multiple tests, such as Humanity's Last Exam. These tests measure reasoning, instruction following, tool usage, and other LLM abilities. My tuned models rank in the "tiny" size class, while common chat services like ChatGPT and Gemini land in the "large" class. This leaves a large intelligence gap between Halo, my edge device, and the services I've been wanting to replace.

The Target

Using the leaderboards, I can see performance scores for proprietary models and services.

Anyone can use GPT-4 Turbo on ChatGPT for free, as much as they want, with limited access to GPT-5.3 after making an account. For comparison's sake, here are the scores of the "target" models:

Model Score
Gemini 3.1 Pro 57
GPT-5.3 Codex 54
Claude Opus 4.6 53
Gemini 3.0 Flash 46
DeepSeek V3.2 42
Claude 4.5 Haiku 37
Gemini 2.5 Pro 35
GPT-4o 19
GPT-4 Turbo 14

And here are the scores of my models from Tuning LLMs for The Edge:

Model Score
Qwen3.5-0.8B 11
DeepSeek-R1-1.5B 9
Qwen3-1.7B 8
LFM2.5-1.2B-Thinking 8

Qwen3.5 can compete with GPT-4 Turbo in both speed and intelligence, but my models are still lacking in top-end intelligence.

Tiny Challengers

The recent release of Qwen3.5 and Gemma 4 allows for more intelligence in the same small footprint.

Model Score
Qwen3.5-4B 27
Gemma4-E4B 19
Qwen3.5-2B 16
Qwen3.5-0.8B 11

Qwen3.5's 2B-parameter counterpart scores 16, outscoring GPT-4 Turbo at the cost of roughly 40% of the 0.8B's generation throughput (19.08 t/s vs. 31.65 t/s).

Distillations

There are more reasoning-distilled models available, such as Jackrong's Qwopus3.5 v3. Based on the previous reasoning version, it was trained on the chains of thought (CoT) of SOTA models such as GPT-4.5-Pro and Claude Opus 4.6. This model is likely slightly smarter than the original Qwen3.5-4B, but still a far cry from the scores at the top end.
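As a rough illustration of what that training data looks like, here's a minimal sketch of packing a teacher model's chain of thought into a chat-style training record. The function name, field layout, and the <think>-tag convention (the style many Qwen reasoning models emit) are assumptions on my part, not the actual distillation pipeline:

```python
# Minimal sketch of assembling one reasoning-distillation example.
# The prompt, CoT, and answer strings are placeholders, not real
# teacher output; the <think> wrapper is a Qwen-style convention.

def make_distill_example(prompt: str, teacher_cot: str, teacher_answer: str) -> dict:
    """Wrap a teacher's chain of thought and final answer into a
    chat-style training record, keeping the CoT inside <think> tags."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {
                "role": "assistant",
                "content": f"<think>\n{teacher_cot}\n</think>\n{teacher_answer}",
            },
        ]
    }

example = make_distill_example(
    prompt="What is 17 * 24?",
    teacher_cot="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    teacher_answer="408",
)
print(example["messages"][1]["content"])
```

The student model is then fine-tuned on these records, so it learns to imitate the teacher's reasoning trace before the final answer.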


Mixture of Experts

Mixture of Experts (MoE) is a technique that routes each token through a subset of internal "experts". These models have a larger total parameter count, but only a fraction of those parameters are active at any one time. As a consequence, the computer only has to work on the "active" parameters, while the remaining experts sit in RAM. This allows for smarter models in a smaller active footprint, at the cost of needing more RAM.
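The "fraction of parameters active" idea boils down to a gating step: a small router scores all experts and only the top few run. Here's a toy top-k router to make that concrete (not any particular model's implementation):

```python
import math

def top_k_gate(logits: list[float], k: int = 2) -> dict[int, float]:
    """Toy MoE router: pick the k experts with the highest gate logits
    and softmax-normalize their weights. Every other expert stays idle
    for this token (and can stay parked in RAM)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    total = sum(exps.values())
    return {i: w / total for i, w in exps.items()}

# 8 experts, only 2 active per token:
weights = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(sorted(weights))  # [1, 4] -> only experts 1 and 4 do any work
```

With something like a 35B-A3B model, this is why per-token compute tracks the ~3B active parameters rather than the full 35B.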

Thankfully, I now have ~22GB of RAM to work with on my Halo node. Here are some of the competitors in the MoE space that fit on Halo:

Model Score tg128 (t/s)
Qwen3.5-35B-A3B 37 7.48
gpt-oss-20B 24 8.27
Qwen3-30B-A3B 22 13.25
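As a sanity check on what fits in ~22GB, here's a back-of-the-envelope weight-size estimate. The ~4.5 bits/weight figure is a ballpark for a Q4_K_M-style quant, not a measured GGUF size, and the estimate ignores KV cache and runtime overhead, so treat it as a lower bound:

```python
def approx_weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Rough quantized weight size in GB: params * bits / 8.
    Ignores KV cache and runtime buffers, so it's a lower bound."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Qwen3.5-35B-A3B at ~4.5 bits/weight (Q4_K_M-ish ballpark):
print(round(approx_weight_gb(35, 4.5), 1))  # 19.7 -> tight, but under ~22GB
```

Note that for an MoE model all 35B parameters must fit in memory even though only ~3B are active per token, which is exactly the RAM-for-compute trade described above.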

Batching Balancing Act

There are two hyperparameters that I have yet to explore: -b and -ub. -b is the logical batch size: the number of tokens the model processes at once. -ub is the physical batch size: the number of tokens actually processed on hardware per pass. Too large a batch size can eat up too much of the 4GB "real" VRAM space and throttle the iGPU; too small a batch size fails to use the full extent of compute available.
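Sweeping these two flags by hand gets tedious, so here's a small sketch that prints llama-bench invocations for a -b/-ub grid. The model path is a placeholder; -p 512 and -n 128 match the pp512/tg128 tests used throughout:

```python
import itertools
import shlex

MODEL = "models/qwen3.5-35b-a3b.gguf"  # placeholder path

# Build llama-bench invocations for a (-b, -ub) sweep. The physical
# batch (-ub) can't usefully exceed the logical batch (-b), since -b
# is split into -ub-sized chunks, so those combinations are skipped.
cmds = []
for b, ub in itertools.product([1024, 2048], [256, 512, 1024, 2048]):
    if ub > b:
        continue
    cmds.append(shlex.join([
        "llama-bench", "-m", MODEL,
        "-b", str(b), "-ub", str(ub),
        "-p", "512", "-n", "128",
    ]))

for cmd in cmds:
    print(cmd)
```

Piping these commands to a shell one at a time (rather than scripting them inline) makes it easy to drop the model between runs and keep VRAM measurements clean.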

Here are the results of testing Qwen3.5-35B-A3B with different batch sizes:

b ub pp512 (t/s) tg128 (t/s)
2048 256 28.43 7.52
2048 512 28.81 6.02
2048 1024 33.29 7.50
2048 2048 34.53 7.53
1024 256 28.42 7.48
1024 512 33.98 7.48
1024 1024 34.42 7.46

Considering that the tests are relatively small in token count, I've chosen a smaller batch size to reduce VRAM footprint. This makes multimodal usage (i.e. image viewing) more efficient without dropping performance.


Use Cases

Chatbot / Agent

For chatbots and agents, the priorities are speed and long-context memory for tool use; for example, you wouldn't want a model to die halfway through a conversation or through scraping 10 webpages. For this use case, I've selected Qwen3.5-2B for its speed and small memory footprint, as well as its tool-usage and instruction-following capabilities.

Reasoning

For reasoning and high-level thinking, I've selected Qwen3.5-35B-A3B, for the same reasons as Qwen3.5-2B but bigger. Its score of 37 on the leaderboard matches Claude 4.5 Haiku, which was released only 4 months earlier. It competes with the old Gemini 2.5 Pro and the new DeepSeek V3.2, both flagship models. I don't mind waiting a little longer for stronger reasoning, so this model is more for sending off bulk work and coming back to it later.