Show HN: Fixing one pointer bug unlocked million-row JSON parsing on Windows
3 points • by hilti • 7 months ago
I've been building a cross-platform JSONL viewer app that handles multi-GB files. It worked perfectly on macOS (my development machine), but consistently crashed on Windows at exactly 2,650 KB. Here's the debugging journey and the tiny fix that made all the difference.<p>The Problem<p>- macOS: Handles 5GB+ files effortlessly
- Windows: Crashes at 2,650 KB every time
- Same codebase, cross-compiled from an Apple Silicon Mac to Windows using MinGW<p>The Investigation<p>I added detailed logging to track execution. The crash happened during string interning, after ~6,000 rows had parsed successfully: not during parsing, not during file I/O, but during the merge phase.<p>The Root Cause<p>My StringPool class used std::unordered_map<std::string_view, uint32_t> to deduplicate strings. The string_views pointed into the elements of a std::vector<std::string>.<p>When the vector grew and reallocated, its std::string elements moved to a new buffer, and the string_view keys that still pointed into the old one became dangling pointers. The hash map was full of invalid references.<p>Why did it work on macOS? Reading through a dangling string_view is undefined behavior, so it can appear to work. Different memory allocator behavior, different default stack sizes (8MB vs 1MB), different reallocation patterns.<p>The Fix<p>Before (broken):<p><pre><code> uint32_t intern(std::string_view str) {
    auto it = indices_.find(str);
    if (it != indices_.end()) return it->second;
    uint32_t idx = strings_.size();
    strings_.push_back(std::string(str)); // may reallocate and move every element
    indices_[std::string_view(strings_.back())] = idx; // DANGER: earlier keys may now dangle
    return idx;
}
</code></pre>
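A hedged aside on why reallocation hurts here: in the common standard library implementations, short strings keep their characters inside the std::string object itself (the small-string optimization), so a string_view of such an element points into the vector's buffer and goes stale once that buffer is replaced. The sketch below only compares addresses and never reads stale memory. storedInline and countInline are hypothetical helpers for illustration, not part of the original StringPool.<p>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: true when the characters of `s` live inside the
// std::string object itself rather than in a separate heap allocation.
inline bool storedInline(const std::string& s) {
    const char* objBegin = reinterpret_cast<const char*>(&s);
    const char* objEnd = objBegin + sizeof(std::string);
    // std::less_equal / std::less give a well-defined order even for
    // pointers that do not belong to the same object.
    return std::less_equal<const char*>{}(objBegin, s.data()) &&
           std::less<const char*>{}(s.data(), objEnd);
}

// Counts the elements whose string_view would point into the vector's own
// buffer, i.e. the views that a reallocation would leave dangling.
inline std::size_t countInline(const std::vector<std::string>& strings) {
    std::size_t n = 0;
    for (const auto& s : strings) {
        if (storedInline(s)) ++n;
    }
    return n;
}
```
Running countInline over a sample of real keys gives a rough idea of how many map entries are at risk. Whether a given short string is stored inline depends on the standard library, so treat the number as a diagnostic rather than a guarantee.<p>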
After (fixed):<p><pre><code> uint32_t intern(const std::string& str) {
    auto it = indices_.find(std::string_view(str));
    if (it != indices_.end()) return it->second;
    // Preemptively rebuild if we're about to reallocate
    if (strings_.size() >= strings_.capacity()) {
        strings_.reserve(strings_.capacity() * 2);
        rebuildIndices(); // Fix all string_views!
    }
    uint32_t idx = static_cast<uint32_t>(strings_.size());
    // No existing view is invalidated here: there is spare capacity, or the pool is empty
    strings_.push_back(str);
    indices_[std::string_view(strings_.back())] = idx;
    return idx;
}

void rebuildIndices() {
    indices_.clear();
    for (size_t i = 0; i < strings_.size(); i++) {
        indices_[std::string_view(strings_[i])] = static_cast<uint32_t>(i);
    }
}
</code></pre>
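To put a number on the rebuild cost, here is an instrumented copy of the fixed function. It is a sketch: the class name CountingStringPool and the two counters are hypothetical additions for measurement, and the rest mirrors the code above. Because each rebuild happens only after the capacity has at least doubled, the total number of re-inserted keys should stay below twice the number of distinct strings.<p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Instrumented copy of the fixed pool. rebuilds_ and reinserted_ are
// hypothetical counters added only to measure the rebuild work.
class CountingStringPool {
public:
    std::uint32_t intern(const std::string& str) {
        auto it = indices_.find(std::string_view(str));
        if (it != indices_.end()) return it->second;
        if (strings_.size() >= strings_.capacity()) {
            strings_.reserve(strings_.capacity() * 2);
            rebuildIndices();
        }
        // Either the pool is empty (no views exist yet) or there is room
        // for one more element without reallocating.
        assert(strings_.empty() || strings_.size() < strings_.capacity());
        const auto idx = static_cast<std::uint32_t>(strings_.size());
        strings_.push_back(str);
        indices_[std::string_view(strings_.back())] = idx;
        return idx;
    }

    std::size_t size() const { return strings_.size(); }
    std::size_t rebuilds() const { return rebuilds_; }
    std::size_t reinserted() const { return reinserted_; }

private:
    void rebuildIndices() {
        ++rebuilds_;
        indices_.clear();
        for (std::size_t i = 0; i < strings_.size(); ++i) {
            indices_[std::string_view(strings_[i])] = static_cast<std::uint32_t>(i);
            ++reinserted_;
        }
    }

    std::vector<std::string> strings_;
    std::unordered_map<std::string_view, std::uint32_t> indices_;
    std::size_t rebuilds_ = 0;
    std::size_t reinserted_ = 0;
};
```
Interning N distinct keys and then comparing reinserted() against N shows how much extra hashing the rebuilds cost on a given workload.<p>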
The Result<p>- 1 million rows: 6 seconds on Windows
- Multi-GB files: No crashes
- ~166,000 rows/second throughput
- Cross-platform stability<p>Lessons Learned<p>1. std::string_view is powerful but dangerous - It's a non-owning reference. When the underlying storage moves, you're holding garbage.<p>2. Cross-platform testing is essential - The bug was invisible on macOS due to different allocator behavior and larger default stack sizes.<p>3. Structured logging beats debuggers for cross-compilation - I was cross-compiling from Mac to Windows. Adding timestamped logging to a file made the crash point obvious immediately.<p>4. Small changes, huge impact - One function, ~15 lines of code, turned "crashes at ~2.6MB" into "handles 5GB+ files".<p>5. Performance stayed excellent - The rebuild only happens when the vector is about to reallocate, and the capacity doubles each time, so the amortized cost per insert is negligible.<p>The Tech Stack<p>- simdjson (v4.2.2) for parsing
- Multi-threaded parsing (20 threads on my test machine)
- Columnar storage for memory efficiency
- C++17, cross-compiled with MinGW-w64<p>This was a humbling reminder that the most critical bugs are often the simplest ones, hiding in plain sight behind platform differences.<p>Happy to discuss the implementation details, simdjson usage, or cross-platform C++ debugging techniques!
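<p>Following up on lesson 1, one alternative to rebuilding the index is to store the strings in a container that never relocates existing elements. std::deque guarantees that push_back and emplace_back leave references to existing elements valid, so views into its strings stay usable as long as the pool only appends. This is a sketch of that idea, not the fix used above; DequeStringPool and get are hypothetical names.<p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <string_view>
#include <unordered_map>

// Append-only string pool. Elements of a std::deque are never moved by
// emplace_back, so the string_view keys never need to be rebuilt.
class DequeStringPool {
public:
    DequeStringPool() = default;
    // Copying would leave the copy's keys pointing at the source's strings.
    DequeStringPool(const DequeStringPool&) = delete;
    DequeStringPool& operator=(const DequeStringPool&) = delete;

    std::uint32_t intern(std::string_view str) {
        auto it = indices_.find(str);
        if (it != indices_.end()) return it->second;
        const auto idx = static_cast<std::uint32_t>(strings_.size());
        strings_.emplace_back(str);  // owns a copy of the characters
        indices_.emplace(std::string_view(strings_.back()), idx);
        return idx;
    }

    std::string_view get(std::uint32_t idx) const {
        assert(idx < strings_.size());
        return strings_[idx];
    }

    std::size_t size() const { return strings_.size(); }

private:
    std::deque<std::string> strings_;
    std::unordered_map<std::string_view, std::uint32_t> indices_;
};
```
The trade-off is that a deque is not contiguous, so indexed access is slightly more expensive than with a vector, and the pool must never erase or reorder its strings.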