Show HN: PyTorch K-Means: GPU-friendly, single-file, hierarchical, with resampling

1 point by hassonofer, 9 months ago
I built a small, self-contained K-Means implementation in pure PyTorch: https://gitlab.com/hassonofer/pt_kmeans

I was working on dataset sampling and approximate nearest neighbor search, and tried several existing libraries for large-scale K-Means. I couldn't find something that was fast, simple, and would run comfortably on my own workstation without hitting memory limits. Maybe I missed an existing solution, but I ended up writing one that fit my needs.

The core insight: keep your data on CPU (where you have more RAM) and intelligently move only the necessary chunks to GPU for computation during the iterative steps. Results always come back to CPU for easy post-processing. (Note: for K-Means++ initialization when computing on GPU, the full dataset still needs to fit on the GPU.)

It offers a few practical features:

  - Chunked Computations: Memory-efficient processing of large datasets by only moving necessary data chunks to the GPU, preventing Out-Of-Memory errors
  - Cluster Splitting: Refine existing clusters by splitting a single cluster into multiple sub-clusters
  - Zero Dependencies: Single file, only requires PyTorch. Copy-paste into any project
  - Advanced Clustering: Hierarchical K-Means with optional resampling (following recent research), cluster splitting utilities
  - Device Flexibility: Explicit device control. Data can live anywhere, computation happens where you specify (any accelerator PyTorch supports)

Future plans:

  - Add support for memory-mapped files to handle even bigger datasets
  - Explore PyTorch distributed for multi-node K-Means

The implementation handles both L2 and cosine distances and includes K-Means++ initialization.

Available on PyPI (`pip install pt_kmeans`), and the full implementation is at: https://gitlab.com/hassonofer/pt_kmeans

Would love feedback on the approach and any use cases I might have missed!
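
For readers who want a feel for the chunking idea described above, here is a minimal sketch. It is not the pt_kmeans code or API: the function names (`assign_chunked`, `update_centers`, `kmeans_chunked`), the `chunk_size` parameter, and the plain random initialization (used here instead of K-Means++) are all assumptions made for illustration. The dataset stays on CPU, one chunk at a time is copied to the compute device for the distance calculation, and the labels come back to CPU.

```python
import torch


def assign_chunked(x, centers, chunk_size=4096, device="cpu"):
    """Label each row of x with the index of its nearest center (L2 distance).

    x is assumed to live on CPU; only one chunk at a time is copied to `device`.
    """
    centers = centers.to(device)
    labels = torch.empty(x.shape[0], dtype=torch.long)  # result stays on CPU
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size].to(device)
        dists = torch.cdist(chunk, centers)  # shape: (chunk rows, k)
        labels[start:start + chunk_size] = dists.argmin(dim=1).cpu()
    return labels


def update_centers(x, labels, centers):
    """Recompute each center as the mean of its assigned points, on CPU."""
    k, dim = centers.shape
    sums = torch.zeros(k, dim, dtype=x.dtype)
    sums.index_add_(0, labels, x)  # per-cluster sum of points
    counts = torch.bincount(labels, minlength=k).to(x.dtype).unsqueeze(1)
    # A cluster with no points keeps its previous center.
    return torch.where(counts > 0, sums / counts.clamp(min=1), centers.cpu())


def kmeans_chunked(x, k, iters=10, chunk_size=4096, device="cpu", seed=0):
    """Lloyd iterations with chunked assignment; random init (an assumption)."""
    gen = torch.Generator().manual_seed(seed)
    centers = x[torch.randperm(x.shape[0], generator=gen)[:k]].clone()
    for _ in range(iters):
        labels = assign_chunked(x, centers, chunk_size, device)
        centers = update_centers(x, labels, centers)
    # Final pass so the returned labels match the returned centers.
    labels = assign_chunked(x, centers, chunk_size, device)
    return centers, labels
```

Passing `device="cuda"` would run the distance step on a GPU while `x` stays in host memory. In this sketch the peak device memory is roughly one chunk plus the centers and one chunk-by-k distance matrix, so `chunk_size` is the knob that trades speed against memory.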