Accelerating Spark queries

1 point | by hkverma | 23 days ago
In queries, joins are always painful, and sometimes the better approach is to create multi-dimensional indices inside the data itself. So in my spare time I built LitenDB, an open-source project that extends Spark with data and indices stored in Delta Lake to reshape data into fast, distributed tensors using Arrow: https://github.com/hkverma/litendb

It speeds up join-heavy and analytic queries, simplifies plans, and can deliver 10–100× performance improvements. You can try the Colab notebook here to see how it works: https://github.com/hkverma/litendb/blob/main/py/notebooks/LitenTpchQ5Q6.ipynb

Would love to hear feedback from the community and explore collaborations.

Thanks,
HK
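To make the join-elimination idea concrete, here is a minimal plain-Python sketch of the general technique — pre-encoding a join key as an integer index into the dimension data, so a query-time hash join becomes a vectorized gather. This is a hypothetical illustration of the concept only, not LitenDB's actual API; all table and variable names are invented.

```python
# Dimension "table", stored column-wise: region -> tax rate (in percent).
region_names = ["EU", "US", "APAC"]
region_tax_pct = [20, 7, 10]

# Build the index once, at write time -- not at query time.
region_index = {name: i for i, name in enumerate(region_names)}

# Fact "table": each sale carries the precomputed integer index of its
# region instead of (or alongside) the string join key.
sales_region_idx = [region_index[r] for r in ["US", "EU", "US", "APAC"]]
sales_amount = [100, 50, 30, 70]

# Query: tax owed per sale. No hash join, no shuffle -- each row resolves
# its dimension attributes with a direct array lookup, the kind of gather
# that maps naturally onto Arrow-backed columnar tensors.
tax_per_sale = [amt * region_tax_pct[i] / 100
                for amt, i in zip(sales_amount, sales_region_idx)]
print(tax_per_sale)  # → [7.0, 10.0, 2.1, 7.0]
```

The same lookup generalizes to multiple dimensions: with one integer index per dimension, a fact row addresses a cell in a multi-dimensional tensor, which is roughly what "multi-dimensional indices inside the data itself" buys you.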