Ask HN: Why is Apple's voice transcription so hilariously bad?
3 points • by keepamovin • 6 months ago
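Why is Apple's voice transcription so hilariously bad?
Even two or three years ago, OpenAI's Whisper models delivered better, near-instant voice transcription fully offline, and the model was only about 500 MB. Given that, it's hard to understand why Apple's transcription, which runs online on powerful servers, performs so poorly today.
Here are real examples from using the iOS native app just now: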
- “BigQuery update” → “bakery update”
- “GitHub” → “get her”
- “CI build” → “CI bill”
- “GitHub support” → “get her support”
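None of these are obscure terms. They're extremely common words in software, spoken clearly in casual contexts. The accuracy gap feels especially stark compared to what was already possible years ago, even fully offline.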
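To put a rough number on the examples above, one common metric is word error rate (WER): the word-level edit distance between what was said and what was transcribed, divided by the number of words actually spoken. The following is only a minimal sketch, assuming plain whitespace tokenization and case-insensitive matching. The `wer` and `word_edit_distance` helpers are illustrative names, not part of any Apple or OpenAI tooling, and four phrases are far too few to say anything statistically meaningful.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference phrase."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    return word_edit_distance(ref, hyp) / len(ref)


# (spoken, transcribed) pairs taken from the examples in the post.
PAIRS = [
    ("BigQuery update", "bakery update"),
    ("GitHub", "get her"),
    ("CI build", "CI bill"),
    ("GitHub support", "get her support"),
]

for spoken, transcribed in PAIRS:
    print(f"{spoken!r} -> {transcribed!r}: WER = {wer(spoken, transcribed):.2f}")
```

Note that WER can exceed 1.0 when the transcript contains more words than were spoken, as with “GitHub” becoming “get her”.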
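Is this primarily a model-quality issue, a streaming/segmentation problem, aggressive post-processing, or something architectural in Apple's speech stack? What are the real technical limitations, and why hasn't it improved despite modern hardware and cloud processing?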