We built the world's fastest data replication tool in Go
1 point • by Cappybara12 • 8 months ago
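Hey people!
At OLake, our team has been building a high-throughput data replication tool in Go for a while now. The more we push real workloads, the clearer it becomes that Go is a fantastic fit for data engineering: simple concurrency, predictable deploys, tiny containers, and great performance without a JVM.
As part of that journey, we've been contributing upstream to the Apache Iceberg Go ecosystem. This week, our PR to enable writing into partitioned tables got merged (https://github.com/apache/iceberg-go/pull/524).
That may sound niche, but it unlocks a very practical path for Go services to write straight to Iceberg (no Spark/Flink detour) and be query-ready in Trino/Spark/DuckDB right away.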
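What we added:
* A partitioned fan-out writer that splits data across multiple partitions, with each partition getting its own rolling data writer
* Efficient Parquet flush/roll once the target file size is reached
* Support for all the usual Iceberg transforms: identity, bucket, truncate, year/month/day/hour
* Arrow-based writes for stable memory use and fast columnar handling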
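Why are we bullish on Go for building our platform, OLake?
* The runtime's concurrency model makes it straightforward to coordinate partition writers, batching, and backpressure.
* Small static binaries → easy to ship edge and sidecar ingestors.
* A great ops story (observability, profiling, and sane resource usage), which is a big deal when you're replicating at high rates.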
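As a rough illustration of the concurrency point, the sketch below runs one goroutine per partition behind a bounded channel. When a partition's queue is full, the producer's send blocks, which is the backpressure; each worker groups records into fixed-size batches before "flushing". This is an assumption about how such a pipeline could be wired, not OLake's or iceberg-go's actual code, and `record`, `runPipeline`, `queueSize`, and `batchSize` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a hypothetical unit of work tagged with its partition.
type record struct {
	Partition string
	Value     int
}

// runPipeline routes records to one worker goroutine per partition.
// It returns, for each partition, the sizes of the batches that were flushed.
func runPipeline(records []record, queueSize, batchSize int) map[string][]int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		flushed = map[string][]int{}
		queues  = map[string]chan record{}
	)

	worker := func(name string, in <-chan record) {
		defer wg.Done()
		batch := make([]record, 0, batchSize)
		flush := func() {
			if len(batch) == 0 {
				return
			}
			mu.Lock()
			flushed[name] = append(flushed[name], len(batch))
			mu.Unlock()
			batch = batch[:0]
		}
		for rec := range in {
			batch = append(batch, rec)
			if len(batch) == batchSize {
				flush()
			}
		}
		flush() // flush the final partial batch once the queue is closed
	}

	for _, r := range records {
		q, ok := queues[r.Partition]
		if !ok {
			q = make(chan record, queueSize) // bounded queue per partition
			queues[r.Partition] = q
			wg.Add(1)
			go worker(r.Partition, q)
		}
		q <- r // blocks while the queue is full: backpressure on the producer
	}

	for _, q := range queues {
		close(q)
	}
	wg.Wait()
	return flushed
}

func main() {
	var records []record
	for i := 0; i < 5; i++ {
		records = append(records, record{"a", i})
	}
	records = append(records, record{"b", 99})
	fmt.Println(runPipeline(records, 2, 2))
}
```

Because each partition has exactly one worker reading its queue in order, batch boundaries per partition are deterministic even though the workers run concurrently.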
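Where this helps right now:
* Building micro-ingestors in Go that stream changes from databases to Iceberg.
* Edge or on-prem capture where you don't want a big JVM stack.
* Teams that want cleaner tables (fewer tiny files) without a separate compaction job for every write path.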
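For data teams still worried about Go, we have a case study that may help: check the benchmarks we're hitting thanks to the language's lightweight model. See the numbers here: https://olake.io/docs/benchmarks
If you're experimenting with Go + Iceberg, we'd love to collaborate, as we believe in open source :)
Repo: https://github.com/datazip-inc/olake/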