m5-infer — Extraordinary speed, extraordinary quality

Metric指標	Ollama	mlx_lm.server	m5-infer
Decode tok/s · long_gen 512 tokens (higher = better)Decode tok/s · long_gen 512 token (高いほど良い)	8.9	17.0	40.0
Decode tok/s · thinking ON (higher = better)Decode tok/s · thinking ON (高いほど良い)	11.2	18.6	28.8
Long-context needle retrieval (6 positions · higher = better)長 context needle retrieval (6 位置 · 高いほど良い)	2 / 6	0 / 6 *	6 / 6
Thinking-ON Short QA · 3 factual (higher = better)Thinking-ON 短答 QA · 3 問 (高いほど良い)	0 / 3	0 / 3 *	3 / 3
Output score · same model · graded by Opus-4.7 (10-task avg · higher = better)出力スコア · 同一モデル · 採点者 Opus-4.7 (10 タスク平均 · 高いほど良い)	5.28 / 10	—	5.85 / 10

Layered on top of mlx-lm. No weight changes, no kernel fork. Every item measured end-to-end or byte-equivalent to greedy decoding. The first three are approaches we have not seen in any other open inference engine. mlx-lm の上に積層。重み変更なし、kernel fork なし。全項目が end-to-end で実測済、もしくは greedy decode と byte 単位で等価。冒頭の 3 件は他の open な推論エンジンで見かけないアプローチです。

NOVEL · 01 新技術 · 01

Hybrid-aware speculative decoding with O(1) GDN state restore GDN 状態を O(1) 復元する hybrid 対応 speculative decoding

A speculative-decoding implementation that correctly handles Qwen 3.5's 24 GatedDeltaNet + 8 Full-Attention hybrid. We have not seen this approach in another open inference engine we tested. Qwen 3.5 の GDN 24 層 + FA 8 層 hybrid 構成を、黙って誤った出力に落とさずに正しく扱える speculative-decoding 実装。私たちが検証した範囲では、このアプローチを採用している open な推論エンジンを他に見ていません。

The problem 問題

Standard speculative decoding (Leviathan 2023, Medusa) targets pure transformers — on reject, you truncate the KV cache by N entries and continue. GDN layers carry a recurrent state and convolutional buffer that have already advanced through the entire draft window. Truncating only KV leaves GDN state corrupted, and the model silently produces divergent output. This is the implementation difficulty that, in the engines we checked, has kept speculative decoding off by default for Qwen 3.5 hybrid. 通常の speculative decoding (Leviathan 2023、Medusa) は pure transformer が前提 — reject 時に KV cache を N entry 切り詰めるだけで続行できる。しかし GDN 層は recurrent state と convolutional buffer が draft window 全体を進んでしまっている。KV だけ truncate しても GDN state は壊れたまま残り、モデルは clash せず静かに誤った出力を続ける。私たちが確認した engine では、この実装上の難しさゆえに Qwen 3.5 hybrid 用の speculative decoding はデフォルト無効となっています。

Our approach 私たちの方法

Before each verify call, snapshot every GDN layer's (recurrent_state, conv_buf) pair into a pre-allocated tensor pool. On rejection, restore from snapshot in O(1) — zero allocations in the hot path. GDN state is ~tens of KB per layer; 24 layers' snapshot completes well under 1 ms per verify. No additional MLX graph nodes, no memory fragmentation. 各 verify 呼び出しの前に、全 GDN 層の (recurrent_state, conv_buf) ペアを 事前確保した tensor pool に snapshot。reject 時は snapshot から O(1) で復元 — hot path で alloc は発生しない。GDN state は layer あたり数十 KB、24 層合計の snapshot でも verify あたり 1 ms 未満。MLX graph に追加ノードは不要、メモリ断片化なし。

byte = greedy Output equivalence vs mlx_lm.generate mlx_lm.generate との出力等価性

~70% Acceptance rate · 4-token draft Acceptance rate · 4 token draft

< 1 ms 24-layer snapshot cost per verify 24 層分 snapshot コスト / verify

Verify in sourceソースで確認→ app/innovation/speculative/draft_speculative.py

NOVEL · 02 新技術 · 02

CTRSP — disk-persisted recurrent state for agent workloads CTRSP — エージェント向けの disk 永続化 recurrent state

Prefix caching that covers not just KV pages but the full GDN recurrent + convolutional buffer — keyed by token-bytes hash, persisted to disk across process restarts. mlx_lm.server has a faster absolute warm total latency for in-process runs; CTRSP's value is state that survives restarts and sessions. KV ページだけでなく GDN recurrent + convolutional buffer も含めて prefix cache する。トークン列のバイト列ハッシュを key に、プロセス再起動を越えてディスクに永続化。in-process での warm 総レイテンシ絶対値は mlx_lm.server のほうが速く、CTRSP の価値はプロセス / セッションを跨いで state が残ることにあります。

The problem 問題

Agent workloads re-send the same 12 K+ tool schema on every turn. RadixAttention (SGLang) and AutomatedPrefixCache (vLLM) cache KV pages only. For a hybrid model, replaying KV without the matching GDN state gives a warm prefix but wrong output. In the engines we checked, we have not found one that persists GDN recurrent state across sessions or process restarts. エージェント用途では 12 K+ の tool schema が毎ターン同じものとして再送信される。RadixAttention (SGLang) や AutomatedPrefixCache (vLLM) は KV page のみをキャッシュする。hybrid モデルでは KV だけ再現しても GDN state が不整合だと prefill は warm になるが 出力は誤る。私たちが確認した engine の中には、GDN recurrent state をセッション・プロセス再起動を越えて保存するものは見つかっていません。

Our approach 私たちの方法

Serialize the full state (quantized KV + GDN recurrent/conv buffers) to disk on generation end, keyed by SHA-256 of the raw prompt-prefix tokens. Hash is over bytes, not decoded text — system prompts and tool schemas match exactly even when chat turns are appended. LRU evicts at 32 entries (~3 GB cap by default) with atomic writes + fingerprint verification on load. 生成終了時にモデルの完全状態 (quantized KV + GDN recurrent/conv buffer) を disk に serialize、prompt prefix のトークン列のバイト列の SHA-256 を key とする。デコード後のテキストではなくバイト列でハッシュするので、system prompt や tool schema は chat turn 追加後も完全一致で検出。LRU は 32 entry 上限 (デフォルト ~3 GB 上限) で eviction、load 時に atomic write + fingerprint 検証。

69 s → 11 s 12 K tool schema · cold → warm (6× from CTRSP hit) 12 K tool schema · cold → warm (CTRSP hit で 6×)

29 s → 7.5 s 5-turn session · turn 1 → turn 5 (3.9× within session) 5 ターン session · turn 1 → turn 5 (session 内で 3.9×)

survives restart State persists across process restarts (others lose it) プロセス再起動後も state が残る (他 engine は失われる)

Verify in sourceソースで確認→ app/innovation/n1_ctrsp/state_persistence.py

NOVEL · 03 新技術 · 03

Task-aware escape hint — rescue the model from think-loops Task-aware escape hint — 思考ループからの救出注入

When Qwen 3.5 gets stuck in a Wait, let me re-check… spiral, we inject a typed transition like **Final JSON:** so the model resumes in the exact format the user asked for. One string template. Same weights. +36% on the 10-task output score graded by Opus-4.7 — engine-attributable, not a model-quality claim. Qwen 3.5 が Wait, let me re-check… ループに入ったとき、**Final JSON:** のような タスク別の遷移を注入する。ユーザーが要求した形式で答えを再開させる。文字列テンプレート 1 つで、同一モデルのまま Opus-4.7 が採点した 10 タスクの出力スコアが +36% (エンジン起因の向上であり、モデル品質が上がったわけではありません)。

The problem 問題

Qwen 3.5 in thinking mode occasionally falls into repetitive spirals that never emit </think>. Existing engines either truncate (losing the answer), pass the raw thinking (breaking downstream agents), or inject a bare </think> close — which often leaves the model continuing the same stuck pattern in the answer phase. Qwen 3.5 の thinking モードでは、稀に </think> を閉じないまま repetitive spiral に入る。既存の engine は 切り詰める (答えを失う)、生の thinking を返す (下流エージェントを壊す)、または単純な </think> のみ注入 (モデルが answer phase でも同じパターンを続けてしまう) のいずれかしかできない。

Our approach 私たちの方法

N-gram loop detector scoped to the think block. On trigger, we inspect the user prompt and inject a task-typed transition: </think>\n\n**Final JSON:**\n\n\`\`\`json\n for JSON tasks, **Final Code:** for code, **Final Translation:** for translation, and so on. The typed hint anchors the model in the required output format; the stuck pattern dissolves. think ブロック内限定の n-gram ループ検出器が発火したとき、user prompt を検査して タスク別の遷移を注入する: JSON タスクなら </think>\n\n**Final JSON:**\n\n\`\`\`json\n、コードなら **Final Code:**、翻訳なら **Final Translation:**、など。型付きヒントがモデルを要求された出力形式に固定し、stuck pattern が解ける。

+461% extract_01 extracted JSON (same model · 1.40 → 7.85) extract_01 の抽出 JSON (同一モデル · 1.40 → 7.85)

+36% 10-task output score (same model · graded by Opus-4.7 · 4.29 → 5.85) 10 タスク出力スコア (同一モデル · 採点者 Opus-4.7 · 4.29 → 5.85)

< 1 ms Runtime cost per decode step Runtime コスト / decode step

Verify in sourceソースで確認→ app/backend/custom_generate.py::_build_escape_hint

Extraordinary speed,
extraordinary quality. 圧倒的な速度、
圧倒的な品質。

Up to 2.4× the decode speed of the MLX reference. MLX リファレンス比で最大 2.4 倍の decode 速度。

Five metrics. Three engines. 5 指標、3 エンジン。

Decode speedDecode 速度

Warm total latencyWarm 総レイテンシ

Thinking-mode qualityThinking モード品質

Output score · graded by Opus-4.7 (same model)出力スコア · 採点者 Opus-4.7 (同一モデル)

Three novel contributions, plus system optimizations. 3 つの新技術と、複数のシステム最適化。

Hybrid-aware speculative decoding with O(1) GDN state restore GDN 状態を O(1) 復元する hybrid 対応 speculative decoding

CTRSP — disk-persisted recurrent state for agent workloads CTRSP — エージェント向けの disk 永続化 recurrent state

Task-aware escape hint — rescue the model from think-loops Task-aware escape hint — 思考ループからの救出注入

System optimizations システム最適化

Needle-retrieval heuristicNeedle-retrieval heuristic

N3 SSEE — self-speculative early exitN3 SSEE — 自己投機的 early exit

N4 ALS — adaptive layer skippingN4 ALS — 適応的 layer skip

N6 PES — parallel expert schedulingN6 PES — Expert path 並列スケジューリング

X5-R compiled forward + wired memoryX5-R コンパイル済 forward + wired memory

Hardware-aware auto-tuneハードウェア検出 auto-tune

Model-family abstractionモデル family 抽象化

TPC — token prefix compilerTPC — token prefix compiler

17 tok/s → 40 tok/s.

Recent updates 最近の更新

v1.1.4 released — CLI subcommands + streaming pull + 35B A3B fits on 24 GB v1.1.4 公開 — CLI サブコマンド + ストリーミング pull + 24 GB Mac で 35B A3B fit

v1.1.0 released on PyPI — quality-neutral optimization stack + T14-OIRC v1.1.0 を PyPI で公開 — 品質中立の最適化スタック + T14-OIRC

Python 3.14 support — CI coverage on 3.11 / 3.12 / 3.13 / 3.14 Python 3.14 対応 — 3.11 / 3.12 / 3.13 / 3.14 で CI マトリクス化

Earlier: Python 3.13 added + bench-script privacy pass 前段: Python 3.13 追加 + ベンチスクリプトの公開整理

Honest bench labels: "Warm TTFT" → "Warm total latency" ベンチ表記の誠実化: "Warm TTFT" → "Warm 総レイテンシ"

Speedup claims qualified with "up to" / "最大" 速度倍率表記に "up to" / "最大" を付与

v1.0.0 public release — on PyPI, on GitHub v1.0.0 公開リリース — PyPI と GitHub で公開

One command. OpenAI-compatible on port 11436. コマンド 1 行。port 11436 で OpenAI 互換 API 起動。

We stand on the shoulders of excellent work. 優れた仕事の積み重ねの上に築かせていただきました。

Extraordinary speed,extraordinary quality. 圧倒的な速度、圧倒的な品質。

Up to 2.4× the decode speed of the MLX reference. MLX リファレンス比で最大 2.4 倍の decode 速度。

Five metrics. Three engines. 5 指標、3 エンジン。

Decode speedDecode 速度

Warm total latencyWarm 総レイテンシ

Thinking-mode qualityThinking モード品質

Output score · graded by Opus-4.7 (same model)出力スコア · 採点者 Opus-4.7 (同一モデル)

Three novel contributions, plus system optimizations. 3 つの新技術と、複数のシステム最適化。

System optimizations システム最適化

Needle-retrieval heuristicNeedle-retrieval heuristic

N3 SSEE — self-speculative early exitN3 SSEE — 自己投機的 early exit

N4 ALS — adaptive layer skippingN4 ALS — 適応的 layer skip

N6 PES — parallel expert schedulingN6 PES — Expert path 並列スケジューリング

X5-R compiled forward + wired memoryX5-R コンパイル済 forward + wired memory

Hardware-aware auto-tuneハードウェア検出 auto-tune

Model-family abstractionモデル family 抽象化

TPC — token prefix compilerTPC — token prefix compiler

17 tok/s → 40 tok/s.

Recent updates 最近の更新

v1.1.4 released — CLI subcommands + streaming pull + 35B A3B fits on 24 GB v1.1.4 公開 — CLI サブコマンド + ストリーミング pull + 24 GB Mac で 35B A3B fit

v1.1.0 released on PyPI — quality-neutral optimization stack + T14-OIRC v1.1.0 を PyPI で公開 — 品質中立の最適化スタック + T14-OIRC

Python 3.14 support — CI coverage on 3.11 / 3.12 / 3.13 / 3.14 Python 3.14 対応 — 3.11 / 3.12 / 3.13 / 3.14 で CI マトリクス化

Earlier: Python 3.13 added + bench-script privacy pass 前段: Python 3.13 追加 + ベンチスクリプトの公開整理

Honest bench labels: "Warm TTFT" → "Warm total latency" ベンチ表記の誠実化: "Warm TTFT" → "Warm 総レイテンシ"

Speedup claims qualified with "up to" / "最大" 速度倍率表記に "up to" / "最大" を付与

v1.0.0 public release — on PyPI, on GitHub v1.0.0 公開リリース — PyPI と GitHub で公開

One command. OpenAI-compatible on port 11436. コマンド 1 行。port 11436 で OpenAI 互換 API 起動。

We stand on the shoulders of excellent work. 優れた仕事の積み重ねの上に築かせていただきました。

Extraordinary speed,
extraordinary quality. 圧倒的な速度、
圧倒的な品質。