KernelEvolve: Agentic kernel coding for heterogeneous AI accelerators (Meta)

1 point by gangliao, 6 months ago
We’re sharing KernelEvolve, an agentic system we built at Meta to automatically generate and evolve high-performance kernels across heterogeneous AI accelerators.

The core motivation is that modern AI stacks increasingly depend on hand-optimized kernels (GEMM, attention, reductions, fused ops), but writing and tuning them for each hardware target (NVIDIA GPUs, AMD GPUs, custom accelerators like MTIA) does not scale.

KernelEvolve treats kernel programming as a search + evolution problem:

• An LLM generates candidate kernels (e.g., Triton-like code)
• Kernels are compiled, benchmarked, and validated on real hardware
• Performance feedback is used to evolve better variants over many iterations
• The system scales evaluation across large fleets and multiple accelerator types

Unlike one-shot code generation, KernelEvolve continuously improves kernels using closed-loop, hardware-in-the-loop feedback, and can discover non-obvious optimizations that rival or exceed expert-written code.

In the paper we describe:

• The agent architecture and search space design
• How we scale kernel evaluation efficiently across heterogeneous accelerators
• Case studies showing performance gains over hand-tuned baselines
• Practical lessons from deploying this system in production ML workloads

Paper (arXiv): https://arxiv.org/abs/2512.23236 (66 pages)

LinkedIn: https://www.linkedin.com/posts/gangliao_excited-to-share-our-recent-work-on-kernelevolve-activity-7411781675740897280-AQth?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAzsrfsBRed-BvPAGqq9FgvVZ-v6F-sG4SM

We’d love feedback from folks working on compilers, kernels, ML systems, or agentic approaches to code generation.
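For readers who want a concrete picture of the generate, validate, benchmark, evolve loop described above, here is a toy sketch in Python. It is not KernelEvolve's implementation, and every name in it is hypothetical: `Candidate` stands in for a generated kernel, `propose` replaces the LLM with a random mutation, `is_correct` replaces validation against a reference with a made-up resource limit, and `benchmark` replaces timing on real hardware with a synthetic latency formula. Only the overall shape of the loop is taken from the post.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for a generated kernel: two tuning knobs."""
    block_size: int
    unroll: int


def propose(parent: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for the LLM step: change one knob of the parent at random."""
    if rng.random() < 0.5:
        return Candidate(rng.choice([16, 32, 64, 128, 256]), parent.unroll)
    return Candidate(parent.block_size, rng.choice([1, 2, 4, 8]))


def is_correct(c: Candidate) -> bool:
    """Stand-in for validation: reject configs over a made-up resource limit."""
    return c.block_size * c.unroll <= 1024


def benchmark(c: Candidate) -> float:
    """Stand-in for hardware timing: synthetic latency, lower is better."""
    return 1.0 + abs(c.block_size - 128) / 128 + abs(c.unroll - 4) / 4


def evolve(seed: Candidate, generations: int = 30, population: int = 4,
           children: int = 8, rng_seed: int = 0) -> Candidate:
    """Closed loop: propose variants, drop invalid ones, keep the fastest."""
    rng = random.Random(rng_seed)
    pool = [seed]
    for _ in range(generations):
        kids = [propose(rng.choice(pool), rng) for _ in range(children)]
        valid = [k for k in kids if is_correct(k)]
        # Survivors always include the previous pool, so the best never regresses.
        pool = sorted(set(pool + valid), key=benchmark)[:population]
    return pool[0]


if __name__ == "__main__":
    start = Candidate(block_size=16, unroll=1)
    best = evolve(start)
    print("seed latency:", benchmark(start))
    print("best found:", best, "latency:", benchmark(best))
```

Because the previous pool is carried into each selection step, the best candidate can only stay the same or improve from one generation to the next. In a real system the expensive parts are exactly the ones stubbed out here: generating plausible kernels and measuring them on actual accelerators.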