Launch HN: Uplift (YC S25) – Voice models for underserved languages

15 points by zaidqureshi 9 months ago
Hi HN, we are Zaid, Muhammad and Hammad, the co-founders of Uplift AI (<a href="https://upliftai.org">https://upliftai.org</a>). We build models that speak underserved languages. Today that means Urdu, Sindhi, and Balochi.
<p>A billion people worldwide can't read. In countries like Pakistan (the 5th most populous country), 42% of adults are illiterate. This holds back the entire economy: patients can't read medical reports, parents can't help with homework, banks can't go fully digital, farmers can't research best practices, and people memorize button sequences to use smartphone apps. Voice AI interfaces can fix all of this, and we think this may turn out to be one of the great benefits of modern AI.
<p>Right now, existing voice models barely work for these languages, and big tech is moving slowly.
<p>Uplift AI was originally a side project to make datasets for translation and voice models. For us it was a "cool side-thing" to work on, not an "important full-time thing". With some initial data we hacked together an Urdu voice bot on WhatsApp and gave it to one domestic worker. Within two days, 800 people were using it. When we dug deeper into understanding the users, we learned that text interfaces don't work for so many people. So we started working on Uplift AI full-time to solve this problem.
<p>The most challenging part is that all the building blocks needed for great voice models are broken for these languages. For example, if you are creating a speech synthesis model, you would scrape a lot of data from YouTube and auto-label it using a transcription model. That is all very easy to do in English, but it doesn't work in underserved languages because the transcription models are not accurate.
<p>There are many other challenges. For example, when you hire human transcribers to label the data, they often don't have spell checkers for their languages. This creates a lot of noise in the data, which makes it hard to train models with little data. There are many more challenges in phonemes, silence detection, diacritization, etc.
<p>We solve these problems by building great internal tooling to help with data labeling. We also source our own data instead of buying it. This is counterintuitive, but it is a big advantage over companies that buy data and then train on it. By sourcing our own data we create the right data distributions and get much better models with much less data. By doing the entire thing in-house (data, labeling, training, deployment), we are able to make much faster progress.
<p>Today we publicly offer text-to-speech APIs for Urdu, Sindhi, and Balochi. Here's a video demo: <a href="https://www.loom.com/share/dcd5020967444c228e9c127151e7a9f5" rel="nofollow">https://www.loom.com/share/dcd5020967444c228e9c127151e7a9f5</a>.
<p>Khan Academy is using our tech to dub videos into Urdu (<a href="https://ur.khanacademy.org" rel="nofollow">https://ur.khanacademy.org</a>).
<p>Our models excel at informational use cases (like AI bots) but need more work on emotive use cases like poetry.
<p>We have been giving a lot of people private access in beta mode, and today we are launching our models publicly. We believe this will be the fastest way for us to learn which areas are not performing well so we can fix them quickly.
<p>We'd love to hear from all of you, especially about your experiences with underserved languages (not just the Pakistani ones we're starting with), and your comments in general.
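<p>To make the auto-labeling point above concrete, here is a minimal sketch of how one might check whether a transcription model is accurate enough to label scraped audio. It is a generic illustration, not a description of Uplift AI's tooling. It assumes you have a small hand-checked sample of clips with reference transcripts plus the model's output for the same clips, and it computes word error rate (WER) over that sample. The function names and the sample strings are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i ref words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total word errors / total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        hyp = hypothesis.split()
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("need at least one reference word")
    return errors / ref_words


if __name__ == "__main__":
    # Hypothetical hand-checked sample: (human transcript, model transcript).
    sample = [
        ("the clinic opens at nine", "the clinic opens at nine"),
        ("take one tablet after food", "take on tablet food"),
    ]
    print(f"sample WER: {corpus_wer(sample):.2f}")
```

<p>If the WER on the hand-checked sample is high, every auto-labeled clip inherits that error rate, so the resulting training set is noisy before any model has been trained on it.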