Ask HN: How do you catch cron jobs that run "successfully" but produce wrong results?
1 point • by BlackPearl02 • 5 days ago
I've been dealing with a frustrating problem: my cron jobs return exit code 0, but the results are wrong.
Examples:
* Backup script completes successfully but creates empty backup files
* Data processing job finishes but only processes 10% of records
* Report generator runs without errors but outputs incomplete data
* Database sync completes but the counts don't match
The logs show "success" — exit code 0, no exceptions — but the actual results are wrong. The errors might be buried in the logs, but I'm not proactively checking logs every day.
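To make that first failure mode concrete, here is a stripped-down sketch (the paths and the dump command are made up, not my actual script) of how a backup job can exit 0 while writing an empty file:

    #!/usr/bin/env python3
    """Illustrative only: a backup wrapper that "succeeds" with an empty file."""
    import subprocess

    OUTFILE = "/var/backups/db-latest.sql.gz"

    def main():
        # If pg_dump fails (bad credentials, network blip), the redirection still
        # creates OUTFILE -- it is just empty or truncated. The pipeline's exit
        # status is gzip's, and without check=True nothing raises either way.
        subprocess.run(f"pg_dump mydb | gzip > {OUTFILE}", shell=True)
        # Falls through and exits 0, so cron reports success.

    if __name__ == "__main__":
        main()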
I've tried:
* Adding validation checks in scripts (e.g., if count < 100: exit 1) — works, but you have to modify every script, and changing thresholds requires code changes
* Webhook alerts — requires writing connectors for every script
* Error monitoring tools (Sentry, etc.) — they catch exceptions, not wrong results
* Manual spot checks — not scalable
The validation-in-script approach works for simple cases, but it's not flexible. What if you need to change the threshold? What if the file exists but is from yesterday? What if you need to check multiple conditions? You end up mixing monitoring logic with business logic.
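For concreteness, this is roughly what that bolted-on validation ends up looking like (a sketch with made-up paths and thresholds), and why every threshold tweak turns into a code change:

    #!/usr/bin/env python3
    """Post-job validation bolted onto a backup script (illustrative only)."""
    import os
    import sys
    import time

    OUTFILE = "/var/backups/db-latest.sql.gz"
    MIN_SIZE_BYTES = 10 * 1024 * 1024   # hard-coded threshold: changing it means a deploy
    MAX_AGE_SECONDS = 24 * 60 * 60      # the "file exists but is from yesterday" check

    def validate() -> None:
        if not os.path.exists(OUTFILE):
            sys.exit("backup file missing")          # exits non-zero so cron notices
        st = os.stat(OUTFILE)
        if st.st_size < MIN_SIZE_BYTES:
            sys.exit(f"backup too small: {st.st_size} bytes")
        if time.time() - st.st_mtime > MAX_AGE_SECONDS:
            sys.exit("backup file is stale")

    if __name__ == "__main__":
        validate()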
I built a simple monitoring tool that watches job results instead of just execution status. You send it the actual results (file size, record count, status, etc.) and it alerts if something's off. No digging through logs, and you can adjust thresholds without deploying code.
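From the job's side, the idea is just that it reports its actual results rather than only an exit code, roughly like this sketch (the endpoint URL and payload fields are placeholders, not the tool's real API):

    #!/usr/bin/env python3
    """Report job results (not just exit status) to an external monitor."""
    import json
    import os
    import urllib.request

    MONITOR_URL = "https://monitor.example.com/api/checkins/nightly-backup"  # placeholder
    OUTFILE = "/var/backups/db-latest.sql.gz"

    def report(record_count: int) -> None:
        payload = {
            "status": "completed",
            "file_size_bytes": os.path.getsize(OUTFILE),
            "record_count": record_count,
        }
        req = urllib.request.Request(
            MONITOR_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req, timeout=10)

    if __name__ == "__main__":
        report(record_count=12345)  # thresholds live in the monitor, not in the script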
How do you handle this? Are you adding validation to every script, proactively checking logs, or using something that alerts when results don't match expectations? What's your approach to catching these "silent failures"?