Launch HN: Pulse (YC S24) – Unstructured document extraction for production

6 points by sidmanchkanti21 about 22 hours ago
Hi HN, we’re Sid and Ritvik, co-founders of Pulse. Pulse is a document extraction system that creates LLM-ready text. We built Pulse after realizing that although modern vision language models are very good at producing plausible text, that same fluency makes them risky for OCR and data ingestion at scale.

When we started working on document extraction, we assumed the same thing many teams do today: foundation models were improving quickly, multimodal systems appeared to read documents well, and for small or clean inputs that assumption often held. The limitations showed up once we began processing real documents in volume. Long PDFs, dense tables, mixed layouts, low-fidelity scans, and financial or operational data exposed errors that were subtle, hard to detect, and expensive to correct. Outputs often looked reasonable while containing small but meaningful mistakes, especially in tables and numeric fields.

A lot of our work since then has been applied research. We run controlled evaluations on complex documents, fine-tune vision models, and build labeled datasets where ground truth actually matters. There have been many nights where our team stayed up hand-annotating pages, drawing bounding boxes around tables, labeling charts point by point, or debating whether a number was unreadable or simply poorly scanned. That process shaped our intuition far more than benchmarks alone.

One thing became clear quickly: the core challenge was not extraction itself, but confidence. Vision language models embed document images into high-dimensional representations optimized for semantic understanding rather than precise transcription. That process is inherently lossy. When uncertainty appears, models tend to resolve it using learned priors instead of surfacing ambiguity. This behavior can be helpful in consumer settings. In production pipelines, it creates verification problems that do not scale well.

Pulse grew out of trying to address this gap through system design rather than prompting alone. Instead of treating document understanding as a single generative step, the system separates layout analysis from language modeling. Documents are normalized into structured representations that preserve hierarchy and tables before schema mapping occurs. Extraction is constrained by schemas defined ahead of time, and extracted values are tied back to source locations so uncertainty can be inspected rather than guessed away. In practice, this results in a hybrid approach that combines traditional computer vision techniques, layout models, and vision language models, because no single approach handled these cases reliably on its own.
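To make that design concrete, here is a minimal sketch of what schema-constrained extraction with source grounding can look like. Everything in it (`ExtractedValue`, `SCHEMA`, `route_for_review`) is a hypothetical illustration, not Pulse’s actual API; it only shows the pattern of validating values against a predeclared schema and attaching a page location and confidence to each value so low-certainty fields can be routed to review instead of silently guessed.

```python
# Hypothetical sketch of schema-constrained extraction with source grounding.
# These names are NOT Pulse's real API; they illustrate the pattern described
# above: values validated against a predeclared schema, each tied back to a
# source location and confidence so ambiguity is surfaced, not guessed away.
from dataclasses import dataclass

@dataclass
class ExtractedValue:
    value: str                                # transcribed text for the field
    page: int                                 # page the value was read from
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) on that page
    confidence: float                         # confidence in the transcription

# Schema declared ahead of time: extraction may only produce these fields.
SCHEMA = {
    "net_revenue": {"type": "number", "required": True},
    "fiscal_year": {"type": "integer", "required": True},
}

def route_for_review(results: dict[str, ExtractedValue],
                     threshold: float = 0.90) -> list[str]:
    """Flag low-confidence fields for human inspection at their source."""
    flagged = []
    for field, ev in results.items():
        if ev.confidence < threshold:
            flagged.append(field)
            print(f"REVIEW {field}: {ev.value!r} on p.{ev.page} at {ev.bbox}")
    return flagged
```

The point of the pattern is that provenance travels with the value, so a reviewer can jump straight to the region of the scan that produced a suspect number rather than re-reading the whole document.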
We are intentionally sharing a few documents that reflect the types of inputs that motivated this work. These are representative of cases where we saw generic OCR or VLM-based pipelines struggle.

Here is a financial 10-K: https://platform.runpulse.com/dashboard/examples/example1

Here is a newspaper: https://platform.runpulse.com/dashboard/examples/example2

Here is a rent roll: https://platform.runpulse.com/dashboard/examples/example3

Pulse is not perfect, particularly on highly degraded scans or uncommon handwriting, and there is still room for improvement. The goal is not to eliminate errors entirely, but to make them visible, auditable, and easier to reason about.

Pulse is available via usage-based access to the API and platform. You can try it here and access the API docs here.

Demo link: https://video.runpulse.com/video/pulse-platform-walkthrough-69f9

We’re interested in hearing how others here evaluate correctness for document extraction, which failure modes you have seen in practice, and what signals you rely on to decide whether an output can be trusted. We will be around to answer questions and are happy to run additional documents if people want to share examples.