Hacker News (Chinese edition)

Why is Apple’s voice transcription so hilariously bad?

Even 2–3 years ago, OpenAI’s Whisper models delivered better, near-instant voice transcription offline, and the model was only about 500 MB. With that context, it’s hard to understand how Apple’s transcription, which runs online on powerful servers, performs so poorly today.

Here are real examples from using the iOS native app just now:

- “BigQuery update” → “bakery update”
- “GitHub” → “get her”
- “CI build” → “CI bill”
- “GitHub support” → “get her support”

These aren’t obscure terms. They’re extremely common words in software, spoken clearly in casual contexts. The accuracy gap feels especially stark compared to what was already possible years ago, even fully offline.

Is this primarily a model-quality issue, a streaming/segmentation problem, aggressive post-processing, or something architectural in Apple’s speech stack? What are the real technical limitations, and why hasn’t it improved despite modern hardware and cloud processing?

Ask HN: Why is Apple’s voice transcription so hilariously bad?