Ask HN: Has Common Crawl been used in its entirety by any search engine?
1 point • by n1xis10t • 7 months ago
The Common Crawl has about 300 billion pages in it, and if you downloaded all of it in extracted text format it would only take up about 816 TB compressed. If someone were to make a search engine with this I think it would be more comprehensive than Bing, and possibly pretty similar to Google. The only CC based search engines that I know of use a tiny fraction of what they have available. Do you know of any that use the whole thing?
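For a sense of scale, here is a rough back-of-envelope on the two figures quoted above. It is only a sketch: both numbers are the approximate ones from the post, decimal terabytes are an assumption, and the function name is made up for illustration.

```python
# Back-of-envelope arithmetic on the figures quoted in the post.
PAGES = 300e9          # ~300 billion pages (approximate figure from the post)
COMPRESSED_TB = 816    # ~816 TB of compressed extracted text (figure from the post)
BYTES_PER_TB = 10**12  # assumption: decimal terabytes, not tebibytes


def avg_compressed_bytes_per_page(pages=PAGES, tb=COMPRESSED_TB):
    """Average compressed size of one page's extracted text, in bytes."""
    return tb * BYTES_PER_TB / pages


if __name__ == "__main__":
    print(f"{avg_compressed_bytes_per_page():.0f} bytes per page")
```

Under those assumptions the average works out to roughly 2.7 KB of compressed text per page. That covers only the raw text, not an index built on top of it.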