Necessary tool? Async LoRA for distributed systems

2 points by jfileto, 9 months ago
I've been building something I call Async LoRA to scratch an itch I kept running into: training on cheap GPUs (Salad, RunPod, spot instances, etc.) is a nightmare for long jobs. One random node dies and suddenly hours of training are gone. Most schedulers just restart the whole container, which doesn't really help. What I've put together so far:

• Aggregator/worker setup where the aggregator hands out small "leases" of work (sized in tokens, not time slices).

• Async checkpointing so progress gets saved continuously without pausing training.

• Preemption handling: when a worker dies, whatever it managed to do still counts, and the remaining work just gets reassigned.

• Training-aware logic (steps, tokens, loss) instead of treating jobs like black-box containers.

• Out-of-the-box hooks for PyTorch/DeepSpeed so you don't have to glue it all together yourself.

My goal is to make sketchy clusters behave more like reliable ones.

I'd love feedback from people here:

• If you run training on spot/preemptible GPUs, how do you usually handle checkpoints/failures?

• What would make this easier to drop into an existing pipeline (Airflow, K8s, Ray, etc.)?

• For monitoring, would you rather see native training metrics (loss, tokens, staleness), or should it just surface logs/events and let you plug into your own stack?
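To make the lease and preemption bullets above concrete, here is a minimal single-process sketch of how token-sized leases with partial credit could work. It is not the actual Async LoRA code: `Aggregator`, `Lease`, `acquire`, `report`, `reap`, and the `ttl` heartbeat timeout are all hypothetical names chosen for illustration, and the sketch assumes a worker only reports tokens whose results it has already delivered.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    lease_id: int
    worker: str
    start: int          # first token offset covered by this lease
    end: int            # one past the last token offset
    expires_at: float   # deadline for the next heartbeat
    done: int = 0       # tokens the worker has reported as finished


class Aggregator:
    """Hypothetical aggregator: leases are sized in tokens, not time."""

    def __init__(self, total_tokens, lease_tokens, ttl=30.0):
        self.lease_tokens = lease_tokens
        self.ttl = ttl
        self.pending = [(0, total_tokens)]  # unassigned [start, end) ranges
        self.active = {}                    # lease_id -> Lease
        self.completed_tokens = 0
        self._next_id = 0

    def acquire(self, worker, now=None):
        """Give the worker the next chunk of tokens, or None if none is left."""
        now = time.monotonic() if now is None else now
        self.reap(now)
        if not self.pending:
            return None
        start, end = self.pending.pop(0)
        cut = min(start + self.lease_tokens, end)
        if cut < end:
            self.pending.insert(0, (cut, end))  # keep the rest for later
        lease = Lease(self._next_id, worker, start, cut, now + self.ttl)
        self._next_id += 1
        self.active[lease.lease_id] = lease
        return lease

    def report(self, lease_id, tokens_done, now=None):
        """Heartbeat: record progress and extend the lease deadline."""
        now = time.monotonic() if now is None else now
        lease = self.active.get(lease_id)
        if lease is None:
            return False  # lease already expired and was reassigned
        size = lease.end - lease.start
        new_done = min(max(tokens_done, lease.done), size)
        self.completed_tokens += new_done - lease.done
        lease.done = new_done
        lease.expires_at = now + self.ttl
        if lease.done == size:
            del self.active[lease_id]  # lease fully finished
        return True

    def reap(self, now=None):
        """Expire silent leases; requeue only the unfinished remainder."""
        now = time.monotonic() if now is None else now
        for lease_id, lease in list(self.active.items()):
            if lease.expires_at <= now:
                del self.active[lease_id]
                resume_at = lease.start + lease.done
                if resume_at < lease.end:
                    self.pending.insert(0, (resume_at, lease.end))
```

The point of sizing leases in tokens is that a preempted worker's reported progress stays credited, and only the range from `lease.start + lease.done` onward goes back into the queue. A real version would also need persistence for the aggregator itself and some way to verify that reported work actually arrived.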