Show HN: Arch-Router – 1.5B model for LLM routing by preferences, not benchmarks
13 points • by adilhafeez • 6 months ago
Hi HN, we're the team behind Arch (<https://github.com/katanemo/archgw>), an open-source proxy for LLMs written in Rust. Today we're releasing Arch-Router (<https://huggingface.co/katanemo/Arch-Router-1.5B>), a 1.5B router model for preference-based routing, now integrated into the proxy. As teams integrate multiple LLMs, each with different strengths, styles, or cost/latency profiles, routing the right prompt to the right model becomes a critical part of application design. But it's still an open problem. Most routing systems fall into two camps:
* Embedding-based routers use intent classifiers: label a prompt as "support," "SQL," or "math," then route it to a matching model. This works for simple tasks but breaks down in real conversations: users shift topics mid-conversation, task boundaries blur, and product changes require retraining the classifier (see the sketch after this list).
* Performance-based routers pick models based on benchmarks like MMLU or MT-Bench, or on latency and cost curves. But benchmarks often miss what matters in production: domain-specific quality, or subjective preferences like "Will legal accept this clause?"
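To make the first failure mode concrete, here is a minimal sketch of the intent-classifier pattern. It is not taken from any particular product: the label set, the model names, and the route() helper are all hypothetical.

    # Hypothetical intent-classifier router: embed the prompt, pick the
    # nearest intent label, route to a hard-coded model for that label.
    from sentence_transformers import SentenceTransformer, util

    INTENTS = {                    # label -> model; every product change
        "support": "gpt-4o-mini",  # here means relabeling and retraining
        "sql": "gpt-4o",
        "math": "gemini-flash",
    }

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    intent_vecs = {label: encoder.encode(label) for label in INTENTS}

    def route(prompt: str) -> str:
        v = encoder.encode(prompt)
        # nearest label wins; a mid-conversation topic shift or a prompt
        # that straddles two intents silently lands on the wrong model
        best = max(INTENTS, key=lambda lab: float(util.cos_sim(v, intent_vecs[lab])))
        return INTENTS[best]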
Arch-Router takes a different approach: route by preferences written in plain language. You write rules like "contract clauses → GPT-4o" or "quick travel tips → Gemini Flash." The router maps the prompt (and conversation context) to those rules using a lightweight 1.5B autoregressive model. No retraining, no fragile if/else chains. We built this with input from teams at Twilio and Atlassian. It handles intent drift, supports multi-turn conversations, and lets you swap models in or out with a one-line change to the routing policy. Full details are in our paper (<https://arxiv.org/abs/2506.16655>), but here's a snapshot:
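At inference time the idea looks roughly like this. A hedged sketch, not the proxy's actual integration: the prompt layout and route names below are placeholders (the Hugging Face model card documents the real template the model was trained on), and it assumes the transformers library.

    # Sketch: ask the router model which plain-language route a turn matches.
    # NOTE: this prompt layout is illustrative only; see the model card for
    # katanemo/Arch-Router-1.5B for the actual template.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "katanemo/Arch-Router-1.5B"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    routes = ('{"contract_clauses": "drafting or reviewing contract clauses", '
              '"travel_tips": "quick travel tips and itineraries"}')
    conversation = "user: Can you tighten the indemnification clause below?"

    prompt = f"Routes: {routes}\nConversation: {conversation}\nRoute:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16)
    # the model autoregressively emits the name of the matching route, e.g.
    # "contract_clauses"; the proxy then looks up which LLM that route maps to
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))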
Specs:
* 1.5B params: runs on a single GPU (or CPU for testing)
* No retraining needed: point it at any mix of LLMs (see the policy sketch after this list)
* Cost and latency aware: route heavy tasks to expensive models and light tasks to faster/cheaper ones
* Outperforms larger closed-source models on our conversational routing benchmarks (details in the paper)
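To show what a "one-line change" to the routing policy means in practice, here is a hypothetical preference policy; the field names are ours for illustration, not archgw's actual config schema (the repo has the real one).

    # Hypothetical preference policy: each route pairs a plain-language
    # description (what the router model matches against) with a target model.
    POLICY = {
        "contract_clauses": {
            "description": "drafting or reviewing contract clauses",
            "model": "gpt-4o",        # swap models by editing this one line
        },
        "travel_tips": {
            "description": "quick travel tips and itineraries",
            "model": "gemini-flash",  # light task -> faster/cheaper model
        },
    }

    def dispatch(route_name: str) -> str:
        # the router picked route_name; the policy decides which LLM serves it
        return POLICY[route_name]["model"]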
Links:
* Arch proxy (open source): <https://github.com/katanemo/archgw>
* Model + code: <https://huggingface.co/katanemo/Arch-Router-1.5B>
* Paper: <https://arxiv.org/abs/2506.16655>