We built the world's fastest data replication tool in Go.

1 point | by Cappybara12 | 8 months ago
Hey people! At OLake, our team has been building a high-throughput data replication tool in Go for a while now. The more we push real workloads, the clearer it gets that Go is a fantastic fit for data engineering: simple concurrency, predictable deploys, tiny containers, and great perf without a JVM.

As part of that journey, we've been contributing upstream to the Apache Iceberg Go ecosystem. This week, our PR to enable writing into partitioned tables got merged (https://github.com/apache/iceberg-go/pull/524).

That may sound niche, but it unlocks a very practical path for Go services to write straight to Iceberg (no Spark/Flink detour) and be query-ready in Trino/Spark/DuckDB right away.

What we added:
* A partitioned fan-out writer that splits data into multiple partitions, with each partition having its own rolling data writer
* Efficient Parquet flush/roll as the target file size is reached
* Support for all the usual Iceberg transforms: identity, bucket, truncate, year/month/day/hour
* Arrow-based writes for stable memory and fast columnar handling

Why we're bullish on Go for building our platform, OLake:
* The runtime's concurrency model makes it straightforward to coordinate partition writers, batching, and backpressure.
* Small static binaries → easy to ship edge and sidecar ingestors.
* A great ops story (observability, profiling, and sane resource usage), which is a big deal when you're replicating at high rates.

Where this helps right now:
* Building micro-ingestors that stream changes from DBs to Iceberg in Go.
* Edge or on-prem capture where you don't want a big JVM stack.
* Teams that want cleaner tables (fewer tiny files) without a separate compaction job for every write path.

For data teams still worried about Go, we have a case study that may help: check the benchmarks we're hitting thanks to the language's lightweight model. See the numbers here: https://olake.io/docs/benchmarks

If you're experimenting with Go + Iceberg, we'd love to collaborate, as we believe in open source :)

Repo: https://github.com/datazip-inc/olake/