Hey HN, we’re Ismaeel, Eren, Yafet and Nikodem. We built Expanse (<a href="https://expanse.sh/">https://expanse.sh/</a>) to increase the effective capacity of your HPC/GPU clusters running schedulers/orchestrators like Kubernetes and SLURM. We read the source code, the job submission script, and the hardware a workload is about to run on to predict what the job actually needs before the cluster sees it. We also flag failures we think are about to happen and surface line-level optimisations the researcher can apply themselves.

The problem: Datacenters run at roughly 30% to 40% effective utilisation. Users request more resources than they actually need because the risk is asymmetric: over-requesting is expensive and wastes capacity that someone else could have used, but under-requesting kills your job mid-run and you lose days of work. So everyone over-requests by two to three times.

We measured one national-scale HPC cluster for a month: across 122k jobs, 59% of the compute was wasted. At on-demand cloud rates for the same hardware, that’s roughly $8.5M of compute wasted in one month on one cluster. The pattern is similar in other large-scale compute industries, such as quant funds, AI labs, and manufacturing.

The four of us ran HPC and GPU training workloads at the largest quant funds and HPC facilities. Ismaeel did research at EPCC (Edinburgh’s Parallel Computing Centre, the UK’s national HPC site) under Adrian Jackson, where he built the first multimodal HPC resource predictor: a model that ingests job source code, submission scripts, hardware telemetry and cluster metadata to figure out how much compute will actually be needed. On a dataset of real workloads on EPCC’s own clusters it scored 34% better than any baseline, and outperformed frontier general-purpose LLMs prompted on the same prediction task by roughly 8x.
These results convinced us the problem was solvable with software.

Expanse installs on every node and hooks into SLURM (or the K8s scheduler). It ingests live hardware telemetry from your cluster (DCGM, CUPTI, cgroups, network/IO monitoring) and builds a custom embedding of how your hardware performs. We scan every workload about to be submitted through SLURM/K8s (plugging into the job life cycle, so you don't have to change how you submit things) and feed this into our deep learning models to give researchers accurate resource recommendations, failure detection, and optimisation suggestions at submission time. We fine-tune cluster-specific models that get sharper over time as you run more workloads. Our models are trained to over-provision rather than under-provision because of the asymmetric cost of a job crashing. We also provide uncertainty estimates and p90 values so users can choose their own risk tolerance.

We surface three capabilities to users of the cluster:

(1) Resource prediction at submit time. We predict the GPU VRAM, utilisation, memory, CPUs and walltime the job actually needs, with a confidence interval. From these predictions we also surface failure predictions for OOMs and other memory-related issues, and code-line-level optimisations to increase the job's utilisation of the hardware.

(2) Live observability. While the job runs, we show the telemetry we are collecting in a dashboard that gives an intuitive view of what's going on in the hardware and where your workload is in terms of code stack profiling. We dynamically profile workloads to keep overhead at a low single-digit percentage while staying informative.

(3) Failure diagnosis. If a workload fails, we take all the data we collected and correlate the stack profiling with the hardware telemetry to surface solution-oriented logs.
These are one- or two-line logs telling you not only what happened when the job failed, but why, and how to fix it with code-line-level suggestions.

What’s different about our approach: The state of the art for most clusters is one of three things: per-user historical averages from sacct (SLURM's accounting data), hand-written rules/heuristics, or frontier LLM coding agents. With per-user historical averages from sacct, once a new type of workload is submitted to the cluster or code-level changes are made, the estimate becomes wildly inaccurate. For the LLM baseline, we gave each model the submission script and source code of the workload being run, plus the full capabilities of its coding harness on the cluster, and it performed quite poorly. We benchmarked Expanse against the state of the art at the time (Gemini 3.5 pro, Claude Opus 4.8, GPT 5.5, Codex 5.3) and outperformed them by 8x.

You might be thinking that as these models scale and get better, they could beat us on this task; however, we saw no correlation between model size or iteration and accuracy. Claude Haiku actually performed better than Opus on a lot of workloads, and previous iterations of models had the same, if not slightly better, accuracy. Even coding-specific models such as Codex 5.3 performed poorly (matching the accuracy of GPT 5.5). These models reason in a vacuum: without native support for modal inputs such as source code (to understand the underlying data flow and computational patterns) and hardware telemetry and topology (to understand the performance patterns of the cluster), they cannot accurately predict the resources a workload needs. Additionally, Expanse continuously updates its internal models so that our predictions get more accurate as more workloads run on your cluster, making it well suited to new hardware or changing workload patterns. LLMs are very good at writing code and hyperparameter sweeps, but they need Expanse to complete the full agentic loop for auto research.
It's super easy to plug our tools into these agents: we have made our CLI tools LLM-friendly. For more details on our LLM eval, check out: <a href="https://x.com/ismaeel_bashir_/status/2059683849404383283" rel="nofollow">https://x.com/ismaeel_bashir_/status/2059683849404383283</a>

We’re currently onboarding customers as paid pilots. Pricing is determined per cluster. We offer a two-week measurement window where we install, ingest, and report recoverable capacity to datacenter operators, followed by a paid pilot deployment in one department at a fixed monthly fee, renewing at the same rate unless the scope expands.

If you run an HPC/GPU cluster (SLURM or K8s, 100+ GPUs), we'd love to talk. We’ll install on a section of your cluster for a week, send a written report of what’s recoverable, and you decide whether to keep going. If you’ve tried something like this and it didn’t work, we’d really like to hear why. And if there’s a failure mode you’d want predicted that the post doesn’t mention, drop it in this thread and we’ll write back with whether the model already catches it or what it would take to add.

I never thought I’d be on the other side of Launch HN :). Even if you don’t run a cluster, we’d still love to hear from you. Any thoughts on our approach, your experiences running workloads on clusters, or anywhere you think we’re wrong: we'd love to hear it.

Tally Ho!

Launch HN: Expanse (YC P26) – Freeing up wasted GPU compute