I have been working on running BitNet b1.58 inside DRAM by intentionally violating DDR4 timing rules. I also made a visual explainer: <a href="https://pcdeni.github.io/CaSA/explainer/" rel="nofollow">https://pcdeni.github.io/CaSA/explainer/</a> This has been tested and works in commercial off-the-shelf memory, driven by a custom memory controller on an FPGA. The underlying effect is well characterized in academic papers (CMU SAFARI, SiMRA, DRAM Bender, etc.). In the process of getting this to work I also made a previously undocumented discovery about DDR behaviour: <a href="https://pcdeni.github.io/CaSA/explainer/xor-spread.html" rel="nofollow">https://pcdeni.github.io/CaSA/explainer/xor-spread.html</a> Overall it is a bit slow, since data has to be moved in full rows even when all that is actually needed is the count of '1' bits (popcount). Making it competitive would require changes to the memory die, though nothing as drastic as merging compute and memory onto one piece of silicon. Such a design would then avoid the memory wall issue the industry is currently facing.
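
To make the popcount remark concrete, here is a minimal host-side sketch of how a ternary (BitNet b1.58 style) dot product can be reduced to bitwise ANDs plus popcounts. It is an illustration under stated assumptions, not the project's actual code: the two-mask weight encoding, the bit-serial handling of unsigned activations, and all function names (`pack_bits`, `encode_ternary`, `ternary_dot`) are hypothetical, and the AND is done here with ordinary Python integers as a stand-in for whatever bulk bitwise operation the DRAM rows perform.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one integer, element i at bit i."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word


def encode_ternary(weights):
    """Split weights in {-1, 0, +1} into a (+1 mask, -1 mask) pair.

    Assumed encoding for this sketch only.
    """
    for w in weights:
        if w not in (-1, 0, 1):
            raise ValueError("weights must be ternary")
    pos = pack_bits([w == 1 for w in weights])
    neg = pack_bits([w == -1 for w in weights])
    return pos, neg


def popcount(x):
    """Number of '1' bits in a non-negative integer."""
    return bin(x).count("1")


def ternary_dot(activations, weights, n_bits=8):
    """Dot product of unsigned n_bits activations with ternary weights.

    Each activation bit plane is ANDed with the two weight masks; only the
    two popcounts per plane are needed to build the result.
    """
    pos, neg = encode_ternary(weights)
    total = 0
    for k in range(n_bits):
        plane = pack_bits([(a >> k) & 1 for a in activations])
        partial = popcount(plane & pos) - popcount(plane & neg)
        total += partial << k  # weight the plane by 2**k
    return total


weights = [1, -1, 0, 1, -1, 0, 1, 1]
activations = [3, 1, 2, 0, 2, 1, 1, 3]
reference = sum(a * w for a, w in zip(activations, weights))
result = ternary_dot(activations, weights)
```

In this sketch each `plane & pos` produces a full-width word, yet only its popcount feeds the sum. That is the same imbalance described above: a whole row has to be moved to obtain one small integer.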

Show HN: Running BitNet b1.58 inside DRAM by breaking DDR4 timing rules