Hacker News (Chinese edition)

Hi HN,

Like many people, I was frustrated that the released Epstein/Maxwell court documents were mostly scanned images (PDFs) with no text layer. This made them impossible to Ctrl+F or analyze programmatically.

I built a pipeline to fix this using Python, Tesseract, and OpenSearch.

The Site: https://epsteinfilez.com

The Stack:
- Ingestion: Python workers using ocrmypdf (Tesseract) to perform parallel OCR on raw files.
- Search: OpenSearch for indexing the extracted text.
- Frontend: Next.js (SSR) for the UI.
- Infrastructure: Self-hosted Docker Swarm.

Features:
- Sub-second full-text search across ~15,000 pages.
- Highlights search terms directly on the PDF page.
- Deep linking to specific pages/documents.

This is a transparency tool, not a political one. I wanted to make the raw primary sources accessible to researchers and journalists.

Feedback on the search relevance or indexing pipeline is welcome!
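
For readers wondering how the ingestion and search steps could fit together, here is a minimal sketch. It is not the site's actual code. The index name `court-pages`, the field names (`doc_id`, `page`, `text`), the `_id` scheme, and all function names are assumptions made for illustration. It also assumes the ocrmypdf sidecar text file separates pages with form feed characters, and it only builds the command and the request bodies rather than calling ocrmypdf or OpenSearch.

```python
from __future__ import annotations

import json


def ocr_command(src: str, dst: str, sidecar: str, jobs: int = 4) -> list[str]:
    """Build an ocrmypdf invocation; pass the result to subprocess.run()."""
    # --sidecar writes the recognized text to a plain-text file,
    # --jobs sets how many pages ocrmypdf processes in parallel.
    return ["ocrmypdf", "--jobs", str(jobs), "--sidecar", sidecar, src, dst]


def split_pages(sidecar_text: str) -> list[str]:
    """Split sidecar text into per-page strings (assumes form feed separators)."""
    pages = sidecar_text.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty chunk left by a trailing form feed
    return [page.strip() for page in pages]


def bulk_body(doc_id: str, pages: list[str], index: str = "court-pages") -> str:
    """Build an NDJSON body for the OpenSearch _bulk endpoint, one doc per page."""
    lines = []
    for number, text in enumerate(pages, start=1):
        # Action line, then the document itself. The per-page _id is a
        # hypothetical scheme that would make deep links easy to resolve.
        lines.append(json.dumps({"index": {"_index": index, "_id": f"{doc_id}-p{number}"}}))
        lines.append(json.dumps({"doc_id": doc_id, "page": number, "text": text}))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


def search_body(query: str) -> dict:
    """Build a full-text query that also asks for highlighted snippets."""
    return {
        "query": {"match": {"text": query}},
        "highlight": {"fields": {"text": {}}},
        "_source": ["doc_id", "page"],
    }


if __name__ == "__main__":
    sample = "first page text\fsecond page text\f"
    print(bulk_body("exhibit-1", split_pages(sample)), end="")
    print(json.dumps(search_body("page text"), indent=2))
```

Indexing one document per page, rather than one per PDF, means each hit already carries the page number needed for a deep link, and highlight snippets stay scoped to a single page.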

Show HN: Full-text search engine for the Epstein documents (OCR and OpenSearch)