HackerNews中文版

大家好，我整理了一个代码库，演示了如何从头开始直接在单个 TSPLIB 实例（lin318）上训练 PPO，无需预训练或 GPU。代码库：https://github.com/jivaprime/TSP 1. 实验设置问题：TSPLIB lin318 (最优解：42,029) & rd400 硬件：Google Colab（仅 CPU）模型：单实例 PPO 策略 + 价值网络。从随机初始化开始。局部搜索：训练期间使用轻量级 2-opt，评估时使用 Numba 加速的 3-opt。核心概念：该策略并非设计为“稳定的平均误差最小化器”，而是作为一种高方差的探索者。目标不是保持平均差距较低，而是偶尔“激发出”局部搜索可以优化的极低误差巡回路线。 2. 结果：lin318 最佳结果：42,064（差距 ≈ +0.08%）时间：在 Colab CPU 上大约 20 分钟内达到。根据日志（包含在代码库中），低于 0.1% 差距的结果出现在 elapsed=0:19:49 左右。虽然平均误差在 3-4% 左右波动，但该策略成功地找到了一个 3-opt 可以利用的深盆地。 3. 扩展实验：Smart ILS & rd400 我使用“Smart ILS”（迭代局部搜索）后处理扩展了流程，以查看我们是否可以达到确切的最优解。 A. lin318 + ILS 将 PPO 生成的巡回路线（0.08% 差距）作为种子。运行 Smart ILS 大约 20 分钟。结果：达到了确切的最优解（42,029）。 B. rd400 + ILS PPO 阶段：在 CPU 上大约 2 小时。产生了大约 1.9% 差距的巡回路线。 ILS 阶段：使用 PPO 巡回路线作为种子。运行大约 40 分钟。结果：达到了 0.079% 的差距（成本 15,293 vs 最优解 15,281）。总结该工作流程有效地分离了关注点： PPO：将搜索引向高质量的盆地（1-2% 的差距）。 ILS：在该盆地内深入挖掘以找到最优解。如果您对实例级别的强化学习、基于 CPU 的优化或与 ML-TSP 基线（POMO、AM、NeuroLKH）进行比较感兴趣，请随时查看代码。欢迎建设性的反馈！

查看原文

Hi allI’ve put together a repo demonstrating how to train PPO directly on a single TSPLIB instance (lin318) from scratch—without pre-training or GPUs.Repo:https://github.com/jivaprime/TSP1. Experiment SetupProblem: TSPLIB lin318 (Opt: 42,029) & rd400Hardware: Google Colab (CPU only)Model: Single-instance PPO policy + Value network. Starts from random initialization.Local Search: Light 2-opt during training, Numba-accelerated 3-opt for evaluation.Core Concept: Instead of a "stable average-error minimizer," this policy is designed as a high-variance explorer. The goal isn't to keep the average gap low, but to occasionally "spike" very low-error tours that local search can polish.2. Results: lin318Best Shot: 42,064 (Gap ≈ +0.08%)Time: Reached within ~20 minutes on Colab CPU.According to the logs (included in the repo), the sub-0.1% shot appeared around elapsed=0:19:49. While the average error oscillates around 3–4%, the policy successfully locates a deep basin that 3-opt can exploit.3. Extended Experiment: Smart ILS & rd400I extended the pipeline with "Smart ILS" (Iterated Local Search) post-processing to see if we could hit the exact optimum.A. lin318 + ILSTook the PPO-generated tour (0.08% gap) as a seed.Ran Smart ILS for ~20 mins.Result: Reached the exact optimal (42,029).B. rd400 + ILSPPO Phase: ~2 hours on CPU. Produced tours with ~1.9% gap.ILS Phase: Used PPO tours as seeds. Ran for ~40 mins.Result: Reached 0.079% gap (Cost 15,293 vs Opt 15,281).SummaryThe workflow separates concerns effectively:PPO: Drives the search into a high-quality basin (1–2% gap).ILS: Digs deep within that basin to find the optimum.If you are interested in instance-wise RL, CPU-based optimization, or comparing against ML-TSP baselines (POMO, AM, NeuroLKH), feel free to check out the code.Constructive feedback is welcome!

纯 CPU PPO 算法在 20 分钟内求解 TSPLIB lin318 问题 (误差 0.08%)