I built a market knowledge graph with ~200k edges to filter out false dip-buying signals
1 point • by gano • 6 months ago
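I've been experimenting with a graph-based approach to a classic trading problem: most dip-buying strategies can't tell a temporary overreaction from a genuine structural collapse.
Most systems treat a −5% move the same regardless of context. My hypothesis was that where a company sits in the market's structure matters more than the price move itself.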
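The engineering idea
I built a knowledge graph of the U.S. public markets with ~207k edges across ~21 relationship types, organized into four layers:
Operational: supply-chain relationships (SUPPLIES_TO, PRODUCES)
Flow: ETF and institutional ownership plumbing
Social: board interlocks (SHARES_DIRECTOR_WITH)
Environmental: geography / competition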
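To make the layering concrete, here is a minimal sketch of how typed edges could be grouped by layer before any scoring. The `Edge` class, the `LAYER_OF` mapping, and the tickers are hypothetical names made up for illustration. Only the three relationship types named above are mapped; the Flow and Environmental types aren't listed in the post, so they are left out rather than guessed.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical mapping from relationship type to layer. Only the three
# relationship names mentioned in the post are included here.
LAYER_OF = {
    "SUPPLIES_TO": "operational",
    "PRODUCES": "operational",
    "SHARES_DIRECTOR_WITH": "social",
}


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    rel_type: str


def split_by_layer(edges):
    """Group edges by layer so each layer can be scored on its own."""
    layers = defaultdict(list)
    for edge in edges:
        layers[LAYER_OF[edge.rel_type]].append(edge)
    return dict(layers)


# Toy edges with made-up node names.
edges = [
    Edge("SUPPLIER_A", "MAKER_B", "SUPPLIES_TO"),
    Edge("MAKER_B", "WIDGET", "PRODUCES"),
    Edge("MAKER_B", "BANK_C", "SHARES_DIRECTOR_WITH"),
]
layers = split_by_layer(edges)
```

Keeping the type-to-layer mapping in one table means an edge type can be moved to a different layer without touching the edge data itself.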
For each layer, I compute centrality scores using PageRank-style methods (with inverse-degree weighting to avoid ETF super-nodes dominating).
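One way to read "inverse-degree weighting" is sketched below: a plain power-iteration PageRank where an edge into a node is weighted by 1 / (total degree of that node), so a hub touching thousands of edges receives less from each neighbor. That weighting rule is an assumption for illustration, not necessarily the exact scheme used here. The function name and the 0.85 damping default are also illustrative choices.

```python
def inverse_degree_pagerank(edges, damping=0.85, iters=100, tol=1e-10):
    """PageRank over (source, target) pairs for a single layer.

    Assumption: each edge u -> v gets weight 1 / degree(v), where degree
    counts both incoming and outgoing edges of v.
    """
    edges = list(edges)
    nodes = sorted({n for pair in edges for n in pair})
    degree = {x: 0 for x in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Weighted outgoing adjacency; parallel edges accumulate weight.
    out = {x: {} for x in nodes}
    for u, v in edges:
        out[u][v] = out[u].get(v, 0.0) + 1.0 / degree[v]

    n = len(nodes)
    rank = {x: 1.0 / n for x in nodes}
    for _ in range(iters):
        # Nodes without outgoing edges spread their mass uniformly.
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {x: (1.0 - damping) / n + damping * dangling / n for x in nodes}
        for u, targets in out.items():
            total = sum(targets.values())
            for v, w in targets.items():
                new[v] += damping * rank[u] * w / total
        delta = sum(abs(new[x] - rank[x]) for x in nodes)
        rank = new
        if delta < tol:
            break
    return rank
```

Running it once per layer yields one centrality score per node per layer, and the scores within a layer sum to 1.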
These structural features are then combined with basic price/volume context and fed into a tree-based model (XGBoost) to rank stocks after sharp drawdowns.
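A rough sketch of the feature-assembly step, with model training left out. Everything in it is an assumption for illustration: the function names, treating a "sharp drawdown" as a one-day close-to-close drop of 5% or more, the two price/volume features, and the lookback length. Each resulting row is the kind of input a tree-based model would consume.

```python
def drawdown_events(closes, threshold=-0.05):
    """Indices t where the one-day return into closes[t] is <= threshold."""
    return [
        t for t in range(1, len(closes))
        if closes[t] / closes[t - 1] - 1.0 <= threshold
    ]


def feature_row(ticker, closes, volumes, t, centrality_by_layer, lookback=20):
    """Build one feature dict for a drawdown at index t.

    Volume context uses only bars strictly before t, so the trailing
    average does not include the event day itself.
    """
    start = max(0, t - lookback)
    past_volumes = volumes[start:t]
    avg_volume = sum(past_volumes) / len(past_volumes)
    row = {
        "ret_1d": closes[t] / closes[t - 1] - 1.0,
        "rel_volume": volumes[t] / avg_volume,
    }
    # One structural feature per graph layer; nodes missing from a layer
    # get 0.0 (an illustrative default).
    for layer, scores in centrality_by_layer.items():
        row[f"centrality_{layer}"] = scores.get(ticker, 0.0)
    return row
```

Keeping the per-layer centralities as separate columns, rather than blending them into one score, lets the model decide how much each layer matters.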
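What surprised me
When I validated the rankings out-of-sample (2024–2025, using Alphalens to avoid look-ahead issues):
* Operational and Flow edges provided most of the lift
* Social edges (board interlocks) added much less than I expected
* Graph features roughly doubled ranking quality versus price-only baselines
This wasn't obvious to me going in; I expected "social" connections to matter more.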
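For anyone wanting to reproduce this kind of comparison: one common measure of ranking quality is the rank information coefficient, i.e. the Spearman correlation between model scores and subsequent returns, which is also what Alphalens reports as IC. Whether that is the metric behind "ranking quality" above is an assumption. A dependency-free sketch with illustrative function names:

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def rank_ic(scores, forward_returns):
    """Spearman rank correlation between model scores and later returns."""
    return pearson(average_ranks(scores), average_ranks(forward_returns))
```

Computing `rank_ic` per date for the price-only model and for the graph-augmented model, then comparing the averages over the out-of-sample window, gives a like-for-like view of the lift.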
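Why I'm posting
I'm in the process of turning this from a research notebook into a production dashboard, and before I lock in the graph schema I'd love feedback from people who've built large graphs in other domains.
In particular:
* Have you seen board-interlock / social edges be predictive elsewhere?
* Are there graph normalization tricks you've found essential at this scale?
* Any pitfalls you've hit when mixing heterogeneous edge types?
Happy to answer questions about the graph construction, centrality calculations, or validation setup.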