


I'm tasked with building a private AI assistant for a corpus of 10 million text documents (living in PostgreSQL). The goal is semantic search and chat, with a requirement for regular incremental updates.

I'm trying to decide between:

- Bleeding edge: Implementing something like LightRAG or GraphRAG.
- Proven stack: Standard Hybrid Search (Weaviate/Elastic + Reranking) orchestrated by tools like Dify.

For those who have built RAG at this scale:

- What is your preferred stack for 2025?
- Is the complexity of Graph/LightRAG worth it over standard chunking/retrieval for this volume?
- How do you handle maintenance and updates efficiently?

Looking for architectural advice and war stories.
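For anyone comparing the options, here is a rough sketch of the fusion step in the "proven stack" route: merging a keyword ranking and a vector ranking before a reranker sees the candidates. It uses reciprocal rank fusion, which is one common way to do this and not necessarily what Weaviate, Elastic, or Dify do by default. The function name, the document IDs, and the constant `k=60` are illustrative assumptions, not part of any of those tools' APIs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    k=60 is a conventional default and is an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword (BM25) query and a vector query.
keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc9", "doc3"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

In a setup like this, the fused list would typically be cut to the top few dozen candidates and passed to the reranker, so the expensive model only scores a small set per query.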

Ask HN: How would you architect a RAG system for 10M+ documents today?