


Hi HN, we’re Abhinav, Andy, and Jeremy, and we’re building Lucidic AI (<a href="https://dashboard.lucidic.ai">https://dashboard.lucidic.ai</a>), an AI agent interpretability tool to help observe/debug AI agents.

Here is a demo: <a href="https://youtu.be/Zvoh1QUMhXQ" rel="nofollow">https://youtu.be/Zvoh1QUMhXQ</a>.

Getting started is easy with just one line of code. You just call lai.init() in your agent code and log into the dashboard. You can see traces of each run, cumulative trends across sessions, built-in or custom evals, and grouped failure modes. Call lai.create_step() with any metadata you want (memory snapshots, tool outputs, stateful info), and we’ll index it for debugging.

We did NLP research at Stanford AI Lab (SAIL), where we worked on creating an AI agent (w/ fine-tuned models and DSPy) to solve math olympiad problems (focusing on AIME/USAMO), and we realized debugging these agents was hard. But the last straw was when we built an e-commerce agent that could buy items online. It kept failing at checkout, and every one-line change (tweaking a prompt, switching to Llama, adjusting tool logic) meant another 10-minute rerun just to see if we hit the same checkout page.

At this point, we were all like, this sucks, so we improved agent interpretability with better debugging, monitoring, and evals.

We started by listening to users who told us traditional LLM observability platforms don’t capture the complexity of agents. Agents have tools, memories, and events, not just input/output pairs. So we automatically transform OTel (and/or regular) agent logs into interactive graph visualizations that cluster similar states based on memory and action patterns.

We heard that people wanted to test small changes even with the graphs, so we created “time traveling,” where you can modify any state (memory contents, tool outputs, context), then re-simulate 30–40 times to see outcome distributions. We embed the responses, cluster by similarity, and show which modifications lead to stable vs. divergent behaviors.

Then we saw people running their agent 10 times on the same task, watching each run individually, and wasting hours looking at mostly repeated states. So we built trajectory clustering on similar state embeddings (like similar tools or memories) to surface behavioral patterns across mass simulations.

We then use that to create a force-directed layout that automatically groups similar paths your agent took, displaying states as nodes, actions as edges, and failure probability as color intensity. The clusters make failure patterns obvious; you see trends across hundreds of runs, not individual traces.

Finally, when people saw our observability features, they naturally wanted evaluation capabilities. So we developed a way for people to make their own evals, called “rubrics,” which lets you define specific criteria, assign weights to each criterion, and set score definitions, giving you a structured way to measure agent performance against your exact requirements.

To evaluate these criteria, we used our own platform to build an investigator agent that reviews your criteria and evaluates performance much more effectively than traditional LLM-as-a-judge approaches.

To get started, visit dashboard.lucidic.ai and <a href="https://docs.lucidic.ai/getting-started/quickstart">https://docs.lucidic.ai/getting-started/quickstart</a>. You can use it for free for 1,000 event and step creations.

Look forward to your thoughts! And don’t hesitate to reach out at team@lucidic.ai.
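
For readers who want a concrete picture of the rubric idea (criteria, weights, score definitions), here is a minimal sketch in plain Python. It is only an illustration under assumptions: it does not use the Lucidic SDK, and every name in it (`Criterion`, `rubric_score`, the example criteria) is hypothetical and not part of Lucidic's API. It combines per-criterion scores into one number using a weight-normalized average, which is one plausible way to aggregate and not necessarily how the platform does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line item (hypothetical structure, not Lucidic's schema)."""
    name: str
    weight: float                      # relative importance; need not sum to 1
    score_definitions: dict[int, str]  # allowed score -> what that score means


def rubric_score(criteria: list[Criterion], scores: dict[str, int]) -> float:
    """Return the weight-normalized average of the per-criterion scores."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    weighted = 0.0
    for c in criteria:
        score = scores[c.name]
        # Reject scores the rubric author never defined.
        if score not in c.score_definitions:
            raise ValueError(f"{c.name}: score {score} has no definition")
        weighted += c.weight * score
    return weighted / total_weight


# Example rubric for a shopping agent; criteria and weights are made up.
checkout_rubric = [
    Criterion("reached_checkout", 3.0,
              {0: "never reached checkout", 1: "reached checkout"}),
    Criterion("correct_item", 1.0,
              {0: "wrong item in cart", 1: "right item in cart"}),
]

# Scores for one run, e.g. as assigned by a human or an LLM judge.
result = rubric_score(checkout_rubric,
                      {"reached_checkout": 1, "correct_item": 0})
print(result)
```

In this sketch a run that reaches checkout with the wrong item scores (3·1 + 1·0) / 4, so the heavier criterion dominates; changing the weights changes what counts as a "good" run without touching the scoring code.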

Launch HN: Lucidic (YC W25) – Debug, test, and evaluate AI agents in production