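Show HN: Why is ML inference still so ad hoc in practice?
4 points • by krish678 • 6 months ago
Every place I've seen run more than a couple of ML models in production ends up with a mess of bespoke inference services: different APIs, different auth, different logging, half-working dashboards, and tribal knowledge holding it all together.
I've been building a small side project that tries to standardize just the serving part: a single gateway in front of heterogeneous models (local, managed cloud, different teams) that handles inference APIs, versioning/rollback, auth, basic metrics, and health checks. No training, no AutoML, no "end-to-end MLOps platform".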
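For concreteness, here is a minimal sketch of what such a gateway could look like. It is not the actual project: the `Gateway` class, its method names, the API-key check, and the in-memory counters are all hypothetical, and each backend is assumed to be a plain callable standing in for a local or remote model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Assumption: a backend is any callable mapping a request payload to a result.
Backend = Callable[[dict], dict]


@dataclass
class ModelEntry:
    versions: Dict[str, Backend] = field(default_factory=dict)
    # Activation order; the last element is the live version.
    history: List[str] = field(default_factory=list)


class Gateway:
    """Hypothetical single entry point in front of heterogeneous models."""

    def __init__(self, api_keys: Set[str]):
        self.api_keys = set(api_keys)
        self.models: Dict[str, ModelEntry] = {}
        self.metrics = {"requests": 0, "errors": 0}

    def deploy(self, name: str, version: str, backend: Backend) -> None:
        # Register a version and make it live.
        entry = self.models.setdefault(name, ModelEntry())
        entry.versions[version] = backend
        entry.history.append(version)

    def rollback(self, name: str) -> str:
        # Drop the live version and return the one that is live afterwards.
        history = self.models[name].history
        if len(history) < 2:
            raise ValueError("no earlier version to roll back to")
        history.pop()
        return history[-1]

    def predict(self, api_key: str, name: str, payload: dict) -> dict:
        # Auth first, then count the request and route to the live version.
        if api_key not in self.api_keys:
            raise PermissionError("unknown API key")
        self.metrics["requests"] += 1
        entry = self.models[name]
        version = entry.history[-1]
        try:
            result = entry.versions[version](payload)
        except Exception:
            self.metrics["errors"] += 1
            raise
        return {"model": name, "version": version, "result": result}

    def health(self) -> Dict[str, bool]:
        # A model counts as healthy here if it has a live version.
        return {name: bool(e.history) for name, e in self.models.items()}
```

Callers would only ever see `predict`, so swapping or rolling back a version stays invisible to them. A real gateway would also need network transport, persistent state, and per-backend health probes, which this sketch leaves out.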
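Before I sink more time into it, I'm trying to figure out whether this is:
* a real gap people quietly paper over with internal glue, or
* something that sounds useful but collapses under real-world constraints.
For people actually running ML in prod:
* Do you already have an internal inference layer like this?
* Where does inference usually go wrong (deployments, versioning, debugging, compliance)?
* At what scale does it stop being worth abstracting at all?
Not announcing anything. I'm genuinely curious whether this resonates or if I'm just rediscovering why everyone rolls their own.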