Show HN: LemonSlice – Put a face on your voice agents
5 points • by lcolucci • 17 days ago
Hey HN, we're the co-founders of LemonSlice (<a href="https://lemonslice.com">https://lemonslice.com</a>). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: <a href="https://www.loom.com/share/941577113141418e80d2834c83a5a0a9" rel="nofollow">https://www.loom.com/share/941577113141418e80d2834c83a5a0a9</a>
Chatbots are everywhere. Voice AI has recently taken off. But we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real time is hard, and overcoming the uncanny valley is even harder.
We haven't broken the uncanny valley yet. Nobody has. But we're getting close, and our photorealistic avatars are currently best-in-class (judge for yourself: <a href="https://lemonslice.com/try/taylor">https://lemonslice.com/try/taylor</a>). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: <a href="https://lemonslice.com/try/alien">https://lemonslice.com/try/alien</a>. Warning! Talking to this little guy may improve your mood.
Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
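To see why real time is a hard constraint, a quick back-of-the-envelope on the per-frame budget helps. The step count below is a hypothetical illustration, not Lemon Slice 2's actual configuration:

```python
# Rough real-time budget for a streaming video model.
fps = 20
frame_budget_ms = 1000 / fps                    # 50 ms to produce each frame
denoise_steps = 4                               # hypothetical few-step count after distillation
per_step_ms = frame_budget_ms / denoise_steps   # 12.5 ms per denoising step

# At the 40 steps typical of an undistilled model, the budget would be
# 1.25 ms per step - far out of reach - which is why cutting the step
# count (and everything else described below) matters so much.
```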
How did we get a video diffusion model to run in real time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
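A minimal sketch of that difference, using plain NumPy over toy per-frame tokens (not the actual model code): with a causal mask, a frame's output never depends on future frames, which is exactly what makes streaming possible.

```python
import numpy as np

def attention(q, k, v, mask):
    """Scaled dot-product attention over a sequence of frame tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

T, d = 6, 8                      # 6 frames, 8-dim toy features
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((T, d))

bidirectional = np.ones((T, T), dtype=bool)    # every frame sees past AND future
causal = np.tril(bidirectional)                # frame t sees only frames <= t

out = attention(q, k, v, causal)
# Frame 0's output depends on frame 0 alone, so frames can be emitted
# one at a time instead of waiting for the whole clip to be denoised.
```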
From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
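The memory win from sliding-window attention can be sketched as a mask shape (again a toy illustration, not the production kernel):

```python
import numpy as np

def sliding_window_mask(T, window):
    """Causal sliding-window mask: frame i attends to frames (i-window, i]."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# Row 5 attends to frames 3, 4, 5 only: per-frame attention cost and
# cached state are O(window), not O(video length), so "infinite-length"
# video stops blowing up memory as the call goes on.
```

Because the window is both causal and fixed-size, the attention state at inference can live in a constant-size buffer, which is presumably what makes a rolling KV cache (mentioned below) workable.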
And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real time.
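For readers unfamiliar with the RoPE change: rotary position embeddings are often written as complex multiplication, but the same rotation can be done with purely real arithmetic. A sketch of the equivalence (my illustration of the general technique; the post doesn't spell out their exact formulation, and the practical benefit - avoiding complex dtypes and ops in fused kernels - is my reading):

```python
import numpy as np

def rope_complex(x, theta):
    """RoPE as complex multiplication: pair adjacent channels into complex numbers."""
    z = x[..., 0::2] + 1j * x[..., 1::2]
    z = z * np.exp(1j * theta)
    out = np.empty_like(x)
    out[..., 0::2], out[..., 1::2] = z.real, z.imag
    return out

def rope_real(x, theta):
    """The same per-pair rotation with purely real arithmetic (no complex dtype)."""
    cos, sin = np.cos(theta), np.sin(theta)
    x0, x1 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin
    out[..., 1::2] = x0 * sin + x1 * cos
    return out

x = np.random.default_rng(1).standard_normal(8)
theta = np.arange(4) * 0.1
assert np.allclose(rope_complex(x, theta), rope_real(x, theta))
```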
We set up a guest playground for HN so you can create and talk to characters without logging in: www.lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we're pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: <a href="https://lemonslice.com/docs">https://lemonslice.com/docs</a>. Pricing is usage-based at $0.12-0.20/min for video generation.
Looking forward to your feedback! And we'd love to see any cool characters you make - please share their links in the comments.
*We did a Show HN last year for our V1 model: <a href="https://news.ycombinator.com/item?id=43785044">https://news.ycombinator.com/item?id=43785044</a>. It was technically impressive but so bad compared to what we have today.