Show HN: Open-source synthetic bank statements for testing parsers
2 points • by Maesh • 5 days ago
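I open-sourced a dataset of 5 synthetic bank and credit card statement PDFs designed for testing extraction/parsing accuracy. Each PDF uses a fictional bank with realistic formatting from a different country.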
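I've been building a bank statement converter (Bankstatemently) and kept discovering edge cases across different banks. At some point I started cataloging them as "quirks", and I'm currently at 36 documented challenges and counting (for example: dates without years that span a year boundary, credit card charges shown as positive instead of negative, and dates hidden inside description text).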
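To make the first two quirks concrete, here is a minimal Python sketch. It is not taken from Bankstatemently or the benchmark repo; the function names, the availability of a statement period, and the "outflows are negative" convention are all assumptions for illustration.

```python
from datetime import date
from decimal import Decimal


def resolve_yearless_date(month: int, day: int,
                          period_start: date, period_end: date) -> date:
    """Pick the year that places a yearless (month, day) inside the statement period.

    Hypothetical helper: assumes the parser already extracted the statement period.
    """
    for year in range(period_start.year, period_end.year + 1):
        try:
            candidate = date(year, month, day)
        except ValueError:
            # e.g. Feb 29 in a non-leap year; try the next candidate year
            continue
        if period_start <= candidate <= period_end:
            return candidate
    raise ValueError(f"{month:02d}-{day:02d} does not fall inside the statement period")


def to_signed_amount(printed: Decimal, charges_printed_positive: bool) -> Decimal:
    """Normalize to an assumed convention where money leaving the account is negative.

    On statements that print charges as positive numbers, flip the sign of every row.
    """
    return -printed if charges_printed_positive else printed
```

For a statement covering 2023-12-15 through 2024-01-14, a row printed as "12/28" resolves to 2023 and a row printed as "01/03" resolves to 2024. A parser that stamps every row with the statement's closing year would get the December rows wrong.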
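Real bank data is private, so there's no shared dataset to test parsers against. Once I had these quirks cataloged, I realized I could use them to reconstruct statements that deliberately include these challenges, so more people can use them.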
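There's also a free evaluation API: submit your parsed JSON and get field-level accuracy scores back. Ground truth is held server-side, but that's not necessarily bulletproof against overfitting.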
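For readers unfamiliar with the idea, field-level accuracy can be sketched locally as below. This is not the evaluation API's actual scoring or schema, neither of which is described in this post; the field names (`date`, `description`, `amount`) and the row-by-row positional matching are assumptions chosen for illustration.

```python
def field_accuracy(predicted, truth, fields=("date", "description", "amount")):
    """Return, per field, the fraction of ground-truth rows the parser got exactly right.

    Rows are compared by position (an assumption); missing predicted rows count as wrong.
    """
    scores = {}
    for field in fields:
        correct = sum(
            1
            for i, expected in enumerate(truth)
            if i < len(predicted) and predicted[i].get(field) == expected.get(field)
        )
        scores[field] = correct / len(truth) if truth else 0.0
    return scores
```

Scoring per field rather than per row shows which quirk a parser trips on: a sign error drags down only `amount`, while a wrong year drags down only `date`.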
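I'd appreciate feedback on which edge cases are missing. I'm planning to make the next 10 statements a bit harder (scanned PDFs, multiple currencies across multiple tables, Buddhist era dates).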
[https://github.com/bankstatemently/bank-statement-parsing-benchmark](https://github.com/bankstatemently/bank-statement-parsing-benchmark)
You can browse all of the quirks here with real-world examples: [https://bankstatemently.com/benchmark/challenges](https://bankstatemently.com/benchmark/challenges)