SQL access to crypto market data, not just JSON

Author: knazim, 2 days ago
Hi HN,

I'm Nazim, founder of Koinju.io, and I wanted to share an exploratory option we opened very recently: access to our database, which contains all cryptocurrency market data, via SQL. REST gives access for direct retrieval, but we increasingly think that SQL access to a unified crypto market data layer could matter for analytical work, because of LLMs.

This was partly triggered by a recent essay by Didier Lopes, CEO of OpenBB, on financial firms owning the infrastructure where financial work happens (https://www.linkedin.com/pulse/how-did-we-end-up-here-didier-rodrigues-lopes-hgeqe/), especially the runtime where workflows execute and AI inference happens.

Most data APIs were designed for software that already knows what it wants: call an endpoint, get JSON, parse it, compute somewhere else. That model worked great and still does. But I'm not sure it maps well to LLM-driven workflows, especially with big data.

A language model can call APIs, read JSON, or write Python to do so (Claude Code can force JSON output). But that does not mean the model is efficient at ingesting, reshaping, joining, aggregating, validating, or reasoning over large structured datasets through tokenized rows. At small scale, the data fits within the context limit. At large scale, it becomes complex, and small details may disappear silently, as if they were outliers...

So the thesis we are testing is: for big datasets, the AI-facing primitive should switch from "return JSON" to "execute a bounded, inspectable operation over the dataset", something you can plan, replay, and even trace precisely. In that case, the LLM takes on the role of a planner/controller.
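To make the "bounded, inspectable operation" idea concrete, here is a minimal sketch in Python using sqlite3 as a stand-in engine. The `trades` table, column names, and the `run_bounded` guardrails are all illustrative assumptions, not our actual schema or API: the point is only that the model submits one read-only statement, the engine computes, and the model sees a small typed result instead of paginated rows.

```python
import sqlite3

# Hypothetical in-memory market-data table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL, qty REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", [
    ("BTC-USD", 67000.0, 0.5),
    ("BTC-USD", 67100.0, 0.2),
    ("ETH-USD", 3500.0, 2.0),
])

MAX_ROWS = 1000  # provider-enforced bound on the unit of work

def run_bounded(sql: str):
    """Accept a single read-only SELECT, then cap the result size."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    if ";" in sql.rstrip().rstrip(";"):
        raise ValueError("multiple statements are not allowed")
    cur = conn.execute(sql)
    rows = cur.fetchmany(MAX_ROWS)          # bounded result set
    cols = [d[0] for d in cur.description]  # compact, named output
    return cols, rows

# The LLM plans the query; the engine aggregates; the model only
# reasons over two small rows, not every tokenized trade.
cols, rows = run_bounded(
    "SELECT symbol, SUM(price * qty) / SUM(qty) AS vwap "
    "FROM trades GROUP BY symbol ORDER BY symbol"
)
```

A real deployment would validate against a parsed AST rather than string prefixes, but the shape of the primitive is the same.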
It should be able to inspect schemas, understand constraints, express an operation, check limits or even ASTs, run the computation through an execution layer, and then reason over a compact, typed result.

So SQL is our current attempt at that layer.

This is really not new :-) and not even magically "AI-native". But it is explicit, inspectable, composable, and executable close to the data. REST still makes sense for simple retrieval. But for analytical questions over large market datasets, JSON pagination feels like the wrong unit of work.

There is also a governance question here: in the financial sector, many firms do not want their entire workflow to move into a vendor's black-box interface. That seems right. Internal context, permissions, model policy, audit logs, and decision workflows should probably live in the firm's environment, of course. But that does not necessarily mean every external dataset should be copied locally before any question can be asked.

Maybe the better boundary is:

- the firm owns the workflow and inference runtime
- the data provider exposes a controlled execution surface
- the LLM issues bounded operations
- the query engine performs the actual computation
- the result comes back

I'm interested in any feedback from people working on things like this: market data, quant research, analytics... The questions I'm trying to answer:

- What is the right interface today for an LLM working with big data?
- Should the model operate on raw data, JSON, schemas, SQL, typed tools, semantic layers, or something else?
- Where should the boundary be between a customer-owned runtime and provider-side data execution?
- How should query limits, cost previews, dry runs, permissions, and audit logs work when the caller might be an agent?

I'm not looking only for validation. If the answer is "don't invent a new AI category; just provide clean data, stable schemas, SQL, docs, and predictable limits", that would also be useful.
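On the dry-run and audit-log question, one possible shape is to let an agent preview the engine's plan before committing, and to record every executed statement. This is a hedged sketch only: the `candles` table, the `agent-42` caller id, and the in-memory audit list are hypothetical, and SQLite's `EXPLAIN QUERY PLAN` stands in for whatever cost preview a real engine exposes.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candles (symbol TEXT, ts INTEGER, close REAL)")
conn.execute("CREATE INDEX idx_sym ON candles(symbol, ts)")

audit_log = []  # in a real system: an append-only store the firm can inspect

def dry_run(sql: str):
    """Cost preview: return the engine's plan without reading row data."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in plan]  # plan steps as readable strings

def execute_audited(sql: str, caller: str):
    """Run the query and record who asked what, and when."""
    audit_log.append({"caller": caller, "sql": sql, "ts": time.time()})
    return conn.execute(sql).fetchall()

query = "SELECT MAX(close) FROM candles WHERE symbol = 'BTC-USD'"
steps = dry_run(query)                            # agent inspects the plan,
rows = execute_audited(query, caller="agent-42")  # then commits on the record
```

The same two-phase pattern (preview, then audited execute) could carry per-caller permissions and spend limits without the model ever touching raw data directly.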