Show HN: Semantic geometry visual grounding demo for AI web agents (Amazon)
1 point • by tonyww • 6 months ago
Hi HN,
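I'm a solo founder working on SentienceAPI, a perception and execution layer that helps LLM agents act reliably on real websites.
LLMs are good at planning steps, but they often fail when actually interacting with the web. Vision-only agents are expensive and unstable, and DOM-based automation breaks easily on modern pages with overlays, dynamic layouts, and lots of noise.
My approach is semantic geometry-based visual grounding.
Instead of giving the model raw HTML (huge context) or a screenshot (imprecise) and asking it to guess, the API first reduces a webpage into a small, grounded action space made only of elements that are actually visible and interactable. Each element includes geometry plus lightweight visual cues, so the model can decide what to do without guessing.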
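As a rough illustration of what "reducing a page to an action space" could mean, here is a minimal sketch. The `Element` schema, its field names, and `build_action_space` are hypothetical and chosen only for this example; they are not the SentienceAPI format, which the post does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One candidate target on the page (hypothetical schema for illustration)."""
    id: int
    role: str            # e.g. "button", "link"
    text: str
    x: float             # top-left corner, in CSS pixels
    y: float
    width: float
    height: float
    visible: bool
    interactable: bool
    is_primary: bool = False  # lightweight visual cue


def build_action_space(elements, viewport_w, viewport_h):
    """Keep only elements that are visible, interactable, and inside the viewport."""
    kept = []
    for el in elements:
        on_screen = (
            el.width > 0 and el.height > 0
            and el.x < viewport_w and el.y < viewport_h
            and el.x + el.width > 0 and el.y + el.height > 0
        )
        if el.visible and el.interactable and on_screen:
            kept.append(el)
    return kept


raw = [
    Element(1, "button", "Add to cart", 900, 400, 180, 40, True, True, True),
    Element(2, "link", "Hidden promo", 0, 0, 0, 0, False, True),
    Element(3, "div", "Decorative banner", 0, 0, 1280, 120, True, False),
    Element(4, "link", "Footer link", 40, 5000, 100, 20, True, True),
]
action_space = build_action_space(raw, viewport_w=1280, viewport_h=800)
print([el.text for el in action_space])  # ['Add to cart']
```

Only the first element survives: the others are hidden, non-interactable, or far below the viewport.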
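I built a reference app called MotionDocs on top of this. The demo below shows the system navigating Amazon Best Sellers, opening a product, and clicking "Add to cart" using grounded coordinates (no scripted clicks).
Demo video (Add to Cart):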
[https://youtu.be/1DlIeHvhOg4](https://youtu.be/1DlIeHvhOg4)
How the agent sees the page (map mode wireframe):
[https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.com/hn_wireframe.png](https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.com/hn_wireframe.png)
This wireframe shows the reduced action space surfaced to the LLM. Each box corresponds to a visible, interactable element.
Code excerpt (simplified):
```python
from sentienceapi_sdk import SentienceApiClient
from motiondocs import generate_video

# Start at Amazon Best Sellers and follow the natural-language instructions.
video = generate_video(
    url="https://www.amazon.com/gp/bestsellers/",
    instructions="Open a product and add it to cart",
    sentience_client=SentienceApiClient(api_key="your-api-key-here"),
)

# Write the resulting video to disk.
video.save("demo.mp4")
```
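How it works (high level):
The execution layer treats the browser as a black box and exposes three modes:
* Map: identify interactable elements with geometry and visual cues
* Visual: align geometry with screenshots for grounding
* Read: extract clean, LLM-ready text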
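The post does not say how the Visual mode aligns geometry with screenshots. One common way to do this kind of alignment, shown here purely as an assumption, is to convert a page-space box in CSS pixels into viewport screenshot pixels using the scroll offset and the device pixel ratio, then click the center of the box. The function names below are hypothetical.

```python
def css_box_to_screenshot(box, device_pixel_ratio, scroll_x=0.0, scroll_y=0.0):
    """Convert a page-space box (x, y, w, h) in CSS pixels to screenshot pixels.

    Assumes the screenshot covers only the current viewport.
    """
    x, y, w, h = box
    return (
        round((x - scroll_x) * device_pixel_ratio),
        round((y - scroll_y) * device_pixel_ratio),
        round(w * device_pixel_ratio),
        round(h * device_pixel_ratio),
    )


def click_point(box):
    """Return the center of a box (x, y, w, h) as the point to click."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


page_box = (900.0, 1400.0, 180.0, 40.0)  # made-up "Add to cart" box
shot_box = css_box_to_screenshot(page_box, device_pixel_ratio=2.0, scroll_y=1000.0)
print(shot_box)               # (1800, 800, 360, 80)
print(click_point(shot_box))  # (1980.0, 840.0)
```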
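The key insight is visual cues, especially a simple is_primary signal. Humans don't read every pixel; we scan for visual hierarchy. Encoding that directly lets the agent prioritize the right actions without processing raw pixels or a noisy DOM.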
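To make the prioritization idea concrete, here is a small sketch of ranking candidates by an is_primary flag. Only the `is_primary` name comes from the post; the dict layout, the tie-breakers (larger area first, then higher on the page), and `rank_candidates` are assumptions for illustration.

```python
def rank_candidates(elements):
    """Sort so primary elements come first, then larger ones, then those higher up."""
    return sorted(
        elements,
        key=lambda el: (
            not el["is_primary"],             # False sorts before True
            -(el["width"] * el["height"]),    # larger area first
            el["y"],                          # smaller y means higher on the page
        ),
    )


candidates = [
    {"text": "See similar items", "is_primary": False, "width": 160, "height": 30, "y": 520},
    {"text": "Add to cart", "is_primary": True, "width": 180, "height": 40, "y": 400},
    {"text": "Buy Now", "is_primary": False, "width": 180, "height": 40, "y": 450},
]
ranked = rank_candidates(candidates)
print([c["text"] for c in ranked])
# ['Add to cart', 'Buy Now', 'See similar items']
```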
Why this matters:
* Smaller action space → fewer hallucinations
* Deterministic geometry → reproducible execution
* Cheaper than vision-only approaches
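For a sense of how compact such an action space can be when handed to a model, the sketch below renders each element as one short line of text. The line format and the `serialize_action_space` helper are made up for this example and are not what SentienceAPI emits.

```python
def serialize_action_space(elements):
    """Render one short line per element: id, role, text, box, and primary flag."""
    lines = []
    for el in elements:
        x, y, w, h = el["box"]
        flag = " primary" if el.get("is_primary") else ""
        lines.append(
            f'[{el["id"]}] {el["role"]} "{el["text"]}" @({x},{y},{w},{h}){flag}'
        )
    return "\n".join(lines)


elements = [
    {"id": 1, "role": "button", "text": "Add to cart", "box": (900, 400, 180, 40), "is_primary": True},
    {"id": 2, "role": "button", "text": "Buy Now", "box": (900, 450, 180, 40)},
]
prompt_block = serialize_action_space(elements)
print(prompt_block)
# [1] button "Add to cart" @(900,400,180,40) primary
# [2] button "Buy Now" @(900,450,180,40)
```

The model can then answer with an element id, and the executor resolves that id back to coordinates.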
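TL;DR: I'm building a semantic geometry grounding layer that turns web pages into a compact, visually grounded action space for LLM agents. It gives the model a cheat sheet instead of asking it to solve a vision puzzle.
This is early work, not launched yet. I'd love feedback or skepticism, especially from people building agents, RPA, QA automation, or dev tools.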
— Tony W