1bit.systems (no typos in this one).
-=bong-water-water-bong=- the full arc: how the 1bit.systems ternary stack got from broken to 76.7 tok/s.
(Posting in this sub, so I'll skip the hardware tour — you all know what you bought. This is about what to do with it.)
TL;DR. Three iterations, eight months, 22 research papers actually read and benched against, a kernel author named Claude Opus 4.7 (1M context) running in Claude Code, one box in a closet, and a hand-tuned ternary GEMV kernel currently running at 92% of memory-bandwidth peak. We're gonna need a bigger boat. Or are we?
Where this started
When this box arrived, the cloud-AI ecosystem had nothing native for it. MLX is Apple Silicon. CUDA is NVIDIA. ROCm support for the iGPU was nascent. llama.cpp's HIP backend scalar-faulted on Q1_0 at 1.78 t/s, 2700× slower than its own Vulkan backend on the exact same model. The whole stack was either "it works on a dGPU" or "it works on a phone." Our box sat between those categories with no native code path.
So we wrote one.
Iteration 1: MLX panic (early 2026)
First instinct: try MLX-on-ROCm. MLX has no ternary mode on the AMD path. Warmup blew up with a kernel-level panic. Game over, man. Game over. I posted a write-up to Reddit anyway because I thought "look, it almost works" was interesting. It wasn't. Took the post down. Refunded the attention.
Iteration 2: 28 crates of Rust (late winter 2026)
Rewrote the whole orchestrator in Rust + axum + tokio + every async crate in the index. It worked. Inconceivable! It was also a 28-crate workspace I couldn't finish, the kernels were calling into hipBLAS through FFI (which I'd promised myself I wasn't going to do), and the bench numbers I posted were prompt-cached, not steady-state. Pulled that post too. Two for two.
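For anyone wondering what "steady-state, not prompt-cached" looks like in a harness, here is a minimal sketch. It assumes a callable that decodes exactly one token; `decode_one_token`, `BenchResult`, and `bench_steady_state` are hypothetical names for illustration, not anything from the actual stack. The idea is just to throw away a warmup window before starting the clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical harness: `decode_one_token` stands in for whatever call
// produces one token in your runtime. It is not part of any real API.
struct BenchResult {
    std::size_t measured_tokens;  // tokens counted toward the result
    double seconds;               // wall time for the measured window only
    double tokens_per_second;     // measured_tokens / seconds
};

inline BenchResult bench_steady_state(const std::function<void()>& decode_one_token,
                                      std::size_t warmup_tokens,
                                      std::size_t measured_tokens) {
    assert(measured_tokens > 0);

    // Warmup: run and discard, so one-time setup costs and cache effects
    // do not leak into the reported number.
    for (std::size_t i = 0; i < warmup_tokens; ++i) {
        decode_one_token();
    }

    // Measured window: a monotonic clock, so system time changes cannot skew it.
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < measured_tokens; ++i) {
        decode_one_token();
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double rate = seconds > 0.0 ? static_cast<double>(measured_tokens) / seconds : 0.0;
    return BenchResult{measured_tokens, seconds, rate};
}
```

Warmup alone does not defeat a prompt cache. A fair run would also feed prompts the server has not seen before, which this sketch leaves to the caller.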
Iteration 3: C++ end to end (spring 2026, what's live now)
Strip everything. Three rules:
Rule A: no Python at runtime. Python is fine for dev-box scripts (requantizers, analysis notebooks). Never inside a systemd unit, never on an HTTP serving path.
Rule B: C++20 default for everything that runs on the box. HIP kernels stay in rocm-cpp/. The orchestrator went C++ too.
Rule C: hipBLAS is banned in the runtime path. Native Tensile kernels only. If you reach for hipBLAS, port the kernel.
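If you do end up porting a kernel, a scalar CPU reference is handy to check the GPU output against. Below is a rough sketch of what a ternary GEMV computes. The 2-bit packing (four weights per byte, `00` = 0, `01` = +1, `10` = -1) and the single per-tensor `scale` are assumptions made for illustration, not the on-disk format or kernel layout used by this stack, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed encoding (illustrative only): 2 bits per weight, 4 weights per byte,
// lowest bits first. 0b00 -> 0, 0b01 -> +1, 0b10 -> -1, 0b11 unused (read as 0).
inline int decode_ternary(std::uint8_t packed_byte, std::size_t slot) {
    const unsigned code = (packed_byte >> (2 * slot)) & 0x3u;
    return code == 1u ? 1 : (code == 2u ? -1 : 0);
}

// Pack a row-major matrix of {-1, 0, +1} values. cols must be a multiple of 4
// so that every row starts on a byte boundary.
inline std::vector<std::uint8_t> pack_ternary(const std::vector<int>& w,
                                              std::size_t rows, std::size_t cols) {
    assert(cols % 4 == 0 && w.size() == rows * cols);
    std::vector<std::uint8_t> packed(rows * cols / 4, 0);
    for (std::size_t i = 0; i < w.size(); ++i) {
        assert(w[i] >= -1 && w[i] <= 1);
        const unsigned code = w[i] == 1 ? 1u : (w[i] == -1 ? 2u : 0u);
        packed[i / 4] |= static_cast<std::uint8_t>(code << (2 * (i % 4)));
    }
    return packed;
}

// Reference GEMV: y = scale * (W x), with W stored in the packed form above.
// The inner loop only adds or subtracts activations; it never multiplies.
inline std::vector<float> ternary_gemv_ref(const std::vector<std::uint8_t>& packed,
                                           std::size_t rows, std::size_t cols,
                                           float scale, const std::vector<float>& x) {
    assert(cols % 4 == 0 && packed.size() == rows * cols / 4 && x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            const std::size_t idx = r * cols + c;
            const int w = decode_ternary(packed[idx / 4], idx % 4);
            if (w > 0) acc += x[c];
            else if (w < 0) acc -= x[c];
        }
        y[r] = scale * acc;
    }
    return y;
}
```

With no multiplies in the inner loop, the arithmetic is cheap relative to the bytes streamed in. That is consistent with judging a kernel like this by its fraction of memory-bandwidth peak.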
Built around the AMD lemonade-sdk stack: we kept their recipe schema, their HTTP surface (/v1/*, /api/v1/*, OpenAI / Ollama / Anthropic compat), and their config layout. The one Critical Invariant we deliberately broke is their "backends run as subprocesses" rule. We added an in-process ternary backend that calls our HIP Engine directly from lemond for the perf path. Everything else stays subprocess.
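The in-process versus subprocess split could be sketched roughly like this. Every name below (`Backend`, `InProcessTernaryBackend`, `SubprocessBackend`, `make_backend`, the `"ternary"` recipe string) is hypothetical and only illustrates the shape of the idea; none of it is taken from lemonade-sdk or from lemond, and the bodies are placeholders rather than real engine or process calls.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface: one abstraction, two ways of reaching a model.
class Backend {
public:
    virtual ~Backend() = default;
    virtual std::string generate(const std::string& prompt) = 0;
    virtual bool in_process() const = 0;
};

// Perf path: the server would call the engine directly, with no process hop.
class InProcessTernaryBackend final : public Backend {
public:
    std::string generate(const std::string& prompt) override {
        // Placeholder for a direct call into a GPU engine.
        return "[in-process] " + prompt;
    }
    bool in_process() const override { return true; }
};

// Default path: the backend lives in a child process behind some transport.
class SubprocessBackend final : public Backend {
public:
    explicit SubprocessBackend(std::string command) : command_(std::move(command)) {}
    std::string generate(const std::string& prompt) override {
        // Placeholder: a real version would talk to the child process here.
        return "[subprocess " + command_ + "] " + prompt;
    }
    bool in_process() const override { return false; }
private:
    std::string command_;
};

// Only the (assumed) "ternary" recipe gets the in-process path;
// every other recipe keeps the subprocess rule.
inline std::unique_ptr<Backend> make_backend(const std::string& recipe) {
    assert(!recipe.empty());
    if (recipe == "ternary") {
        return std::make_unique<InProcessTernaryBackend>();
    }
    return std::make_unique<SubprocessBackend>(recipe);
}
```

Keeping both behind one interface means the HTTP layer does not need to know which path a request takes. The cost is that a crash in the in-process backend takes the server down with it, which is presumably why the subprocess rule existed in the first place.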
This is what's running today. I know kung fu.
The papers we read, what they claimed, and what we did to prove it on this box
Not "I read the abstract." For each one: the paper's own words, then the concrete path we took to put it on disk and bench it. Mess with the best, die like the rest.
1. BitNet 1.58 — arXiv 2402.17764 (Microsoft Research) "BitNet…