Hey people! At OLake, our team has been building a high-throughput data replication tool in Go for a while now. The more we push real workloads, the clearer it gets that Go is a fantastic fit for data engineering: simple concurrency, predictable deploys, tiny containers, and great perf without a JVM.

As part of that journey, we've been contributing upstream to the Apache Iceberg Go ecosystem. This week, our PR to enable writing into partitioned tables got merged (https://github.com/apache/iceberg-go/pull/524).

That may sound niche, but it unlocks a very practical path for Go services to write straight to Iceberg (no Spark/Flink detour) and be query-ready in Trino/Spark/DuckDB right away.

What we added:
* A partitioned fan-out writer that splits data into multiple partitions, with each partition having its own rolling data writer
* Efficient Parquet flush/roll as the target file size is reached
* Support for all the usual Iceberg transforms: identity, bucket, truncate, year/month/day/hour
* Arrow-based writes for stable memory & fast columnar handling

Why we're bullish on Go for building our platform, OLake:
* The runtime's concurrency model makes it straightforward to coordinate partition writers, batching, and backpressure.
* Small static binaries → easy to ship edge and sidecar ingestors.
* Great ops story (observability, profiling, and sane resource usage), which is a big deal when you're replicating at high rates.

Where this helps right now:
* Building micro-ingestors that stream changes from DBs to Iceberg in Go.
* Edge or on-prem capture where you don't want a big JVM stack.
* Teams that want cleaner tables (fewer tiny files) without a separate compaction job for every write path.

For data teams still worried about Go, we have a case study that may help: check the benchmarks we're hitting thanks to the language's lightweight model. See numbers here: https://olake.io/docs/benchmarks

If you're experimenting with Go + Iceberg, we'd love to collaborate, as we believe in open source :)

Repo: https://github.com/datazip-inc/olake/

We built the world's fastest data replication tool in Go.