HackerNews中文版

大家好，我是来自 General Instinct 的 Guanming 和 Bill（<a href="https://general-instinct.com/">https://general-instinct.com/</a>）。在机器人领域工作多年后，我们一直遇到同一个问题：最优秀的模型总是无法适配我们实际拥有的硬件。性能最好的模型通常是基于数据中心环境设计的：拥有大型 GPU、充足的内存带宽和可靠的网络连接。然而，大多数物理系统面临的则是相反的限制。这促使我们开始探索如何最大程度地保留前沿模型的能力，同时使其能够在边缘硬件上实际运行。作为这项工作的一部分，我们最近开源了 InstinctRazor（<a href="https://github.com/General-Instinct/InstinctRazor">https://github.com/General-Instinct/InstinctRazor</a>）。我们激动地宣布，我们已将一个约 245 GB 的 BF16 MoE 模型 Qwen3.5-122B-A10B 压缩成了一个 48 GiB 的 GGUF 模型。这个模型实际上比 Gemma-4-26B-A4B 更小，但在 MMLU-Pro 和 GPQA-D 等基准测试中的表现却优于后者。我们保留了始终活跃的部分（如路由器、归一化层、Gated-DeltaNet/SSM 层、视觉通路等），并对路由过的专家进行了更激进的量化。然后，我们使用 on-policy distillation 来恢复量化过程中损失的能力。该模型还可以配置为“小型 GPU”模式运行，此时专家模型将从系统内存中流式传输。在 8k 上下文窗口下，峰值显存使用量约为 7.6–8 GB。如果您对技术细节感兴趣，我们在此处详细介绍了我们的方法（<a href="https://general-instinct.com/blog/frontier-moe-sub-4-bit">https://general-instinct.com/blog/frontier-moe-sub-4-bit</a>）。我们特别希望听到那些将模型部署到机器人或其他边缘设备上的用户的声音。您目前正在尝试在本地运行哪些模型？在将它们投入生产的过程中，最大的瓶颈是什么？

查看原文

Hey HN, Guanming and Bill here from General Instinct (<a href="https://general-instinct.com/">https://general-instinct.com/</a>).After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available.The models that performed best were usually designed around datacenter assumptions: large GPUs, lots of memory bandwidth, and reliable network access. But most physical systems have the opposite constraints.That led us down the path of figuring out how much of a frontier model could be preserved while still making it practical to run on edge hardware.As part of that work, we recently open sourced InstinctRazor (<a href="https://github.com/General-Instinct/InstinctRazor" rel="nofollow">https://github.com/General-Instinct/InstinctRazor</a>)One result we're excited about is compressing Qwen3.5-122B-A10B, a roughly 245 GB BF16 MoE model, into a 48 GiB GGUF. The resulting model is actually smaller than Gemma-4-26B-A4B while outperforming it on benchmarks like MMLU-Pro and GPQA-D etc. we preserve the parts that are always active (router, norms, Gated-DeltaNet/SSM layers, vision pathway, etc.) and quantize the routed experts much more aggressively. We then use on-policy distillation to recover capability lost during quantization.The model can also run in a "small GPU" configuration where experts are streamed from system RAM. With an 8k context window, peak VRAM usage is around 7.6–8 GB.If you're interested in the technical details, we wrote up the approach here (<a href="https://general-instinct.com/blog/frontier-moe-sub-4-bit">https://general-instinct.com/blog/frontier-moe-sub-4-bit</a>)We're especially interested in hearing from people deploying models onto robots or other edge devices. What models are you trying to run locally today? What has been the biggest bottleneck in getting them into production?

Launch HN：General Instinct (YC P26) – 边缘设备上的前沿模型