Launch HN: Webhound (YC S23) – Research assistant that builds datasets from the web
6 points • by mfkhalil • 8 months ago
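We're the team behind Webhound (<https://webhound.ai>), an AI agent that builds datasets from the web based on natural language prompts. You describe what you're trying to find. The agent figures out how to structure the data and where to look, then searches, extracts the results, and outputs everything as a CSV you can export.
We've set up a special no-signup version for the HN community at <https://hn.webhound.ai>. Just click "Continue as Guest" to try it.
Here's a demo: <https://youtu.be/fGaRfPdK1Sk>
We started building it after getting tired of doing this kind of research manually: open 50 tabs, copy everything into a spreadsheet, realize it's inconsistent, start over. It felt like something an LLM should be able to handle.
Some examples of how people have used it in the past month: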
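Competitor analysis: "Create a comparison table of internal tooling platforms (Retool, Appsmith, Superblocks, UI Bakery, Budibase, etc.) with their free plan limits, pricing tiers, onboarding experience, integrations, and how they position themselves on their landing pages." (<https://www.webhound.ai/dataset/c67c96a6-9d17-4c91-b9a0-ff6927c44f80>)
Lead generation: "Find Shopify stores launched recently that sell skincare products. I want the store URLs, founder names, emails, Instagram handles, and product categories." (<https://www.webhound.ai/dataset/b63d148a-8895-4aab-ac34-455e341c67c8>)
Pricing tracking: "Track how the free and paid plans of note-taking apps have changed over the past 6 months using official sites and changelogs. List each app with a timeline of changes and the source for each." (<https://www.webhound.ai/dataset/c17e6033-5d00-4e54-baf6-8deab09e85d7>)
Investor mapping: "Find VCs who led or participated in pre-seed or seed rounds for browser-based devtools startups in the past year. Include the VC name, relevant partners, contact info, and portfolio links for context." (<https://www.webhound.ai/dataset/1480c053-d86b-40ce-a620-37fda3444340>)
Research collection: "Get a list of recent arXiv papers on weak supervision in NLP. For each, include the abstract, citation count, publication date, and a GitHub repo if available." (<https://www.webhound.ai/dataset/e274ca26-0513-4296-85a5-2b7b7c423ce2>)
Hypothesis testing: "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site and show the most relevant posts with timestamps and engagement metrics." (<https://www.webhound.ai/dataset/42b2de49-acbf-4851-bbb7-080b66e845cd>)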
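The first version of Webhound was a single agent running on Claude 4 Sonnet. It worked, but sessions routinely cost over $1100 and it would often get stuck in infinite loops. We knew that wasn't sustainable, so we started building around smaller models.
That meant adding more structure. We introduced a multi-agent system to keep it reliable and accurate. There's a main agent, a set of search agents that run subtasks in parallel, a critic agent that keeps things on track, and a validator that double-checks extracted data before saving it. We also gave it a notepad for long-term memory, which helps avoid duplicates and keeps track of what it's already seen.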
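For readers who want a concrete picture, here is a minimal sketch of how a main agent, parallel search agents, a validator, and a notepad could fit together. It is not Webhound's code: every name (`Notepad`, `search_agent`, `validator`, `main_agent`) is hypothetical, the search agent returns canned rows instead of calling a model, and the critic agent is left out to keep it short.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Notepad:
    """Hypothetical long-term memory: remembers entities already saved."""
    seen: set = field(default_factory=set)

    def is_new(self, key):
        normalized = key.strip().lower()
        if normalized in self.seen:
            return False
        self.seen.add(normalized)
        return True


def search_agent(subtask):
    # Stand-in for a model-driven search; returns canned demo rows.
    canned = {
        "query-a": [
            {"name": "Alpha", "url": "https://example.com/alpha"},
            {"name": "Beta", "url": ""},  # missing URL, should be rejected
        ],
        "query-b": [
            {"name": " alpha", "url": "https://example.com/alpha"},  # duplicate
        ],
    }
    return canned.get(subtask, [])


def validator(row, required):
    # Accept a row only if every required attribute is present and non-empty.
    return all(row.get(attr) for attr in required)


def main_agent(subtasks, required):
    notepad = Notepad()
    # Run the search subtasks in parallel; map() keeps results in input order.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(search_agent, subtasks))
    dataset = []
    for batch in batches:
        for row in batch:
            if validator(row, required) and notepad.is_new(row["name"]):
                dataset.append(row)
    return dataset


rows = main_agent(["query-a", "query-b"], required=["name", "url"])
```

In this toy run only the first "Alpha" row survives: "Beta" fails validation because its URL is empty, and the second "alpha" is dropped by the notepad as a duplicate.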
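After switching to Gemini 2.5 Flash and layering in the agent system, we were able to cut costs by more than 30x while also improving speed and output quality.
The system runs in two phases. First is planning, where it decides the schema, how to search, what sources to use, and how to know when it's done. Then comes extraction, where it executes the plan and gathers the data.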
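The two phases could be sketched roughly as below. This is an illustration under assumptions, not the actual implementation: the `Plan` fields, the fixed plan returned by `plan_phase`, the `fetch` callback, and the "stop after `target_rows` rows" completion rule are all made up for the example. A real planner would ask a model for these decisions.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Plan:
    schema: list      # column names of the dataset
    queries: list     # how to search
    sources: list     # where to look
    target_rows: int  # assumed completion rule: stop at this many rows


def plan_phase(prompt):
    # Hypothetical planner: returns a fixed plan instead of calling a model.
    return Plan(
        schema=["app", "plan", "price_usd"],
        queries=[f"{prompt} pricing", f"{prompt} changelog"],
        sources=["official sites", "changelogs"],
        target_rows=2,
    )


def extract_phase(plan, fetch):
    # Execute the plan: run each query and keep only the planned columns.
    rows = []
    for query in plan.queries:
        for raw in fetch(query):
            rows.append({col: raw.get(col, "") for col in plan.schema})
            if len(rows) >= plan.target_rows:
                return rows
    return rows


def to_csv(plan, rows):
    # Serialize the extracted rows using the schema chosen during planning.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=plan.schema)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_fetch(query):
    # Stand-in for real search and extraction.
    return [{"app": "NoteA", "plan": "free", "price_usd": 0, "extra": "ignored"}]


plan = plan_phase("note-taking apps")
csv_text = to_csv(plan, extract_phase(plan, fake_fetch))
```

Fixing the schema and the stopping rule before any extraction happens is what lets the second phase stay mechanical: it only fills in columns that were already agreed on.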
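It uses a text-based browser we built that renders pages as markdown and extracts content directly. We tried full browser use, but it was slower and less reliable. Plain text still works better for this kind of task.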
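To give a feel for what "rendering a page as markdown" can mean, here is a small sketch built on Python's standard `html.parser`. It is not Webhound's browser: `MarkdownRenderer` and `render_markdown` are hypothetical names, and the sketch only handles headings, paragraphs, list items, and links while skipping scripts and styles.

```python
import re
from html.parser import HTMLParser


class MarkdownRenderer(HTMLParser):
    """Convert a small subset of HTML into markdown-like text."""

    SKIP = {"script", "style"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.href:
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Ignore script/style bodies; collapse whitespace elsewhere.
        if not self.skip_depth:
            self.parts.append(re.sub(r"\s+", " ", data))


def render_markdown(html):
    renderer = MarkdownRenderer()
    renderer.feed(html)
    renderer.close()
    lines = [line.strip() for line in "".join(renderer.parts).splitlines()]
    kept = []
    for line in lines:
        # Drop leading blank lines and collapse runs of blank lines.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()


page = (
    "<html><head><style>p{}</style></head><body><h1>Pricing</h1>"
    "<p>See <a href='https://example.com/plans'>plans</a> now.</p>"
    "<ul><li>Free</li><li>Pro</li></ul><script>var x=1;</script></body></html>"
)
markdown = render_markdown(page)
```

The output keeps the heading, the link target, and the list items as plain text, which is far cheaper to pass to a model than a rendered page or a screenshot.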
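We also built scheduled refreshes to keep datasets up to date, and an API so you can integrate the data directly into your workflows.
Right now, everything stays in the agent's context during a run. It starts to break down around 1000-5000 rows, depending on the number of attributes. We're working on a better architecture for scaling past that.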
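A back-of-the-envelope way to see why the row limit depends on attribute count: if every cell stays in context, the budget is spent per cell, so more columns means fewer rows. The numbers below (a 1,000,000-token window, 20 tokens per cell) are illustrative assumptions, not measurements from Webhound, and `max_rows_in_context` is a hypothetical helper.

```python
def max_rows_in_context(context_tokens, attributes, tokens_per_cell, reserved_tokens=0):
    """Estimate how many rows fit when every row stays in the context window."""
    if attributes <= 0 or tokens_per_cell <= 0:
        raise ValueError("attributes and tokens_per_cell must be positive")
    # Tokens left after reserving space for prompts, plans, and tool output.
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // (attributes * tokens_per_cell)


# Assumed figures for illustration only.
narrow = max_rows_in_context(1_000_000, attributes=10, tokens_per_cell=20)
wide = max_rows_in_context(1_000_000, attributes=50, tokens_per_cell=20)
```

Under these assumed figures a 10-column dataset tops out at 5000 rows and a 50-column one at 1000. Quality may well degrade before the hard limit is reached, so treat this as an upper bound on what fits, not a prediction of where results stay good.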
We'd love feedback, especially from anyone who's tried solving this problem or built similar tools. Happy to answer anything in the thread.
Thanks!
Moe