Show HN: Free API for extracting data from PDFs

3 points by leftnode 8 months ago
Hi HN,

Like everyone, I'm working on a product that uses LLMs to extract data from photos and documents. Part of the processing pipeline is extracting data from PDFs as raw text or a raster image.

As part of our leadgen strategy, we've opened our REST API that lets you process pages of a PDF. The API is completely free to use anonymously, but is rate limited to 1 page per 30 seconds. Creating a free account removes this restriction.

The two endpoints are:

- https://extract.dev/api/pages/extract/raster - Rasterize a page of a PDF

- https://extract.dev/api/pages/extract/text - Extract text from a page of a PDF

Both have the same request format:

    { "file": "https://assets.extract-cdn.com/data/hd-receipt.pdf", "page": 1 }

I've outlined more of the documentation here: https://extract.dev/docs

Under the hood, the API uses Poppler to extract text and rasterize pages. Note that the text extraction functionality extracts the actual text encoded in the PDF and does not employ an OCR model, so a scanned page with no embedded text layer will not yield text. Give it a spin; I'm interested in your feedback on whether this is useful or not.
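
For anyone who wants to try the endpoints from a script, here is a minimal Python sketch of how a call might be put together. The endpoint URLs and the `file`/`page` body come from the post; everything else is an assumption: that the method is POST, that the body is sent as `application/json`, and that no auth header is needed for anonymous use. The helper names (`build_request`, `send`, `extract_pages`) are hypothetical, and the response is returned as raw bytes because the post does not describe the response format (see https://extract.dev/docs for that).

```python
import json
import time
import urllib.request

# Endpoint URLs as quoted in the post.
BASE_URL = "https://extract.dev/api/pages/extract"
ENDPOINTS = {"text": f"{BASE_URL}/text", "raster": f"{BASE_URL}/raster"}


def build_request(kind, file_url, page=1):
    """Build (but do not send) a request for one page of a PDF.

    Assumptions not stated in the post: the HTTP method is POST and
    the body is sent with Content-Type application/json.
    """
    if kind not in ENDPOINTS:
        raise ValueError(f"kind must be one of {sorted(ENDPOINTS)}")
    body = json.dumps({"file": file_url, "page": page}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINTS[kind],
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send the request and return the raw response body, uninterpreted."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def extract_pages(kind, file_url, pages, interval_s=30.0, sender=send):
    """Process several pages in turn, pausing between calls.

    The 30 second default mirrors the anonymous rate limit quoted in
    the post (1 page per 30 seconds). `sender` is injectable so the
    loop can be exercised without touching the network.
    """
    results = {}
    for index, page in enumerate(pages):
        if index > 0:
            time.sleep(interval_s)  # wait before every call after the first
        results[page] = sender(build_request(kind, file_url, page))
    return results
```

Calling `send(build_request("text", "https://assets.extract-cdn.com/data/hd-receipt.pdf", 1))` would perform the actual request for the sample file from the post. With a free account the pause in `extract_pages` should be unnecessary, but how an account is identified in a request is not covered in the post, so that part is left out.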