Teaching Rust to Learn SQL
1 point • by rustic-indian • 7 months ago
When I was younger, I used to wonder how a computer could "just understand" a new programming language. Compilers were a vague idea at best, and interpreted languages felt like magic: you typed something in and the machine somehow knew what to do.

SQL is not compiled straight to machine code, but building a SQL engine forces you to face a similar question: how do you get one language (Rust, in my case) to understand and execute another language (SQL) that its own compiler knows nothing about?

A database does not grow out of a pile of existing code. It grows out of a set of concepts: rows, columns, expressions, joins, aggregates, indexes, transactions.

The starting point for my engine is a generic SQL dialect heavily influenced by SQLite and DuckDB, because they both provide a large corpus of tests in a common format called SQLLogicTest. My theory was that an engine that could pass those tests would have a solid baseline of compatibility. My goal is not to invent a new flavor of SQL, but to support enough of what those systems already run that I can reuse their test suites and their ecosystem of queries.

I did not write my own parser. I use `sqlparser`, which turns SQL into an abstract syntax tree (AST), a tree-shaped in-memory representation of the query. From there I map that AST to my own set of Rust enums and data structures. That is the point where SQL stops being text and starts being a set of concepts the engine can reason about.

Instead of starting with an optimizer, I started with a test harness. I wired it up to run the SQLLogicTest corpus that SQLite publishes and added some support for DuckDB-style tests. Today all of the SQLite-provided tests I run are passing. That does not mean the engine is bulletproof; this is still early greenfield code. It does mean that the behavior of the engine is anchored to something outside my own head: when a test fails, it is my implementation vs. a known result. That suite has become the de facto specification; any change to the planner or executor has to keep it passing.

At the beginning I thought I would write my own compute kernels for every data type I wanted to support. When you are inexperienced in a domain, your mind plays tricks on you and convinces you that reinventing is the only way to get it "right." I eventually moved the engine onto Apache Arrow's memory model. Arrow gives me a columnar representation and a set of kernels for common operations. Instead of hand-writing and maintaining all of that, I can focus on planning queries and mapping SQL semantics onto Arrow arrays.

This project is a solo effort. To compensate, I lean on LLMs. On any given day I bounce between Claude Sonnet 4.5 and GPT-5.x Codex to help sketch out new features, reason about lifetime issues, or explore alternative designs. They are not reliable enough to trust blindly, but they are fast enough to act as noisy, context-aware pair programmers.

This creates a balancing act: I have a relatively large test corpus that must keep passing, I want new features to be efficient, and I do not want subtle correctness regressions hiding behind "clever" LLM-generated code. The tests form a guardrail. If a suggested optimization breaks behavior, it gets thrown away. If it passes but looks fragile, I refactor it.

I have never written a compiler. This project is the closest I have come. It sits somewhere between a query engine, an interpreter, and a lesson in how far you can get by standing on the shoulders of other systems: SQLite's tests, DuckDB's style of queries, Apache Arrow's memory model, and LLMs that sometimes suggest the right idea in the wrong shape.

If nothing else, it has answered my childhood question. Computers do not "just understand" new languages. Someone has to build the bridge.

(If you like this story, search for "rust-llkv" and give it a star.)
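
For readers wondering what "mapping the AST to my own Rust enums" can look like in practice, here is a minimal sketch using only the standard library. It is not taken from rust-llkv or from `sqlparser`: the `Value`, `Expr`, and `eval` names are hypothetical, the tree is built by hand rather than lowered from a real parser AST, and values are evaluated one row at a time instead of over Arrow arrays. It only illustrates the idea that once a query is an enum, the engine can pattern-match on it and apply SQL rules such as NULL propagation.

```rust
use std::collections::HashMap;

/// A runtime value. Hypothetical and heavily simplified.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Bool(bool),
}

/// Engine-side expression tree, the kind of enum a parser AST
/// might be lowered into. Hypothetical, not rust-llkv's types.
#[derive(Debug, Clone)]
enum Expr {
    Literal(Value),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
}

/// Evaluate an expression against a single row.
/// Any operation with a NULL operand yields NULL, as in SQL.
fn eval(expr: &Expr, row: &HashMap<String, Value>) -> Result<Value, String> {
    match expr {
        Expr::Literal(v) => Ok(v.clone()),
        Expr::Column(name) => row
            .get(name)
            .cloned()
            .ok_or_else(|| format!("unknown column: {name}")),
        Expr::Add(l, r) => match (eval(l, row)?, eval(r, row)?) {
            (Value::Null, _) | (_, Value::Null) => Ok(Value::Null),
            (Value::Int(a), Value::Int(b)) => a
                .checked_add(b)
                .map(Value::Int)
                .ok_or_else(|| "integer overflow".to_string()),
            (a, b) => Err(format!("cannot add {a:?} and {b:?}")),
        },
        Expr::Gt(l, r) => match (eval(l, row)?, eval(r, row)?) {
            (Value::Null, _) | (_, Value::Null) => Ok(Value::Null),
            (Value::Int(a), Value::Int(b)) => Ok(Value::Bool(a > b)),
            (a, b) => Err(format!("cannot compare {a:?} and {b:?}")),
        },
    }
}

fn main() {
    // Hand-built tree for the predicate `price + 1 > 10`.
    let predicate = Expr::Gt(
        Box::new(Expr::Add(
            Box::new(Expr::Column("price".to_string())),
            Box::new(Expr::Literal(Value::Int(1))),
        )),
        Box::new(Expr::Literal(Value::Int(10))),
    );

    let mut row = HashMap::new();
    row.insert("price".to_string(), Value::Int(10));

    // Prints: Ok(Bool(true))
    println!("{:?}", eval(&predicate, &row));
}
```

A real engine would evaluate whole columns at once rather than one row at a time, which is where a columnar memory model like Arrow's comes in, but the enum-plus-`match` shape is the same.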