How best to annotate large Parquet LLM logs without a full rewrite?
1 point • by platypii • 6 months ago
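I asked this on the Apache mailing list but haven't found a good solution yet. Wondering if anyone has ideas for how to engineer this?
Here's my problem: I have gigabytes of LLM conversation logs stored as Parquet in S3. I want to add per-row annotations (llm-as-a-judge scores), ideally without touching the original text data.
So for a given dataset, I want to add a new column. This seemed like a perfect use case for Iceberg. Iceberg does let you evolve the table schema, including adding a column. But you can only add a column with a default value. If I want to fill in that column with annotations, Iceberg makes me rewrite every row. So despite being built on Parquet, a column-oriented format, I need to rewrite the entire source text data (gigabytes of it) just to add ~1 MB of annotations. This feels wildly inefficient.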
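I considered just storing the column in its own table and then joining the two. This does work, but the joins are annoying to work with, and I suspect query engines do not optimize a "join on row_number" operation well.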
I've been exploring little-known features of Parquet, like the file_path field, to store column data in external files. But literally zero Parquet clients support this.
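I'm running out of ideas for how to work with this data efficiently. It's bad enough that I'm considering building my own table format if I can't find a solution. Anyone have suggestions?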