June 3, 2026 · The C+ team

Porting llama.cpp's ggml core from C to C+

We have been porting part of the C core of a llama.cpp fork, targeted at macOS, Apple Metal, Q4_K, and Qwen-3.5, from C to C+. It is worth being precise about what this is meant to prove, because the obvious reading is the wrong one.

The claim is not "C+ can replace ggml." It is narrower, and more useful:

C+ can enter an existing C/C++ system at object-file granularity, and the surrounding system does not need to know or care.

Our ray tracer proves standalone expressiveness: a whole program, written from scratch, that holds its own. This port proves the other half, and for adoption it is the more important half: incremental embeddability. Most real codebases are never rewritten. They are infiltrated function by function, file by file. A C-adjacent language earns its place only if it can compile into ABI-compatible pieces that existing C and C++ code links against without noticing.

So the port has two goals, and the second is where most of the value landed:

1. Embed C+ into a live C/C++ build. Compile a .cplus translation unit to an object file, link it into the reference binary, and have the existing C and C++ consume it as if it were an ordinary compiled object. The build mixes .c, .cpp, and C+ outputs, and the binary still loads the model and generates correct tokens on Metal.
2. Harden the compiler. Every weakness the port exposes becomes an upstream compiler or stdlib fix. The port is the test bench, and the language grows under load.

Interop by invisibility

Targeting that one configuration collapses a 102k-line tree into a bounded job. The compiled-.c set is exactly six ggml files, about 22k lines. Everything else (the llama layer, the Metal host glue, the GPU shaders, the CPU op implementations in C++) stays exactly as it is. None of that was ever in scope, and that is the point: the surrounding project does not migrate, does not get wrapped, and does not change architecture.

Completion is not the metric. Two of the six files are fully ported, four are partial, and all of the auxiliary C shim and glue is gone. Whether the count is two files or six, the property under test is the same, and these are the questions that actually matter:

Does C+ expose the same symbols the C did?
Do C and C++ link against them normally?
Does the ABI match, field for field and call for call?
Can the build swap one object file with no wider changes?
Do the existing tests still pass?
Does performance stay in the same envelope?
Do debugging and profiling still behave normally?
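The "field for field" question can be turned into a build-time check on the C side of the boundary. The following is a minimal sketch in C, not code from the port: `demo_block` is a hypothetical stand-in for a quant block (a 16-bit scale followed by 32 signed bytes), and the expected numbers are what that layout gives on mainstream ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a quant block; not copied from the ggml headers. */
typedef struct {
    uint16_t d;       /* scale, stored as raw half-precision bits */
    int8_t   qs[32];  /* quantized values */
} demo_block;

/* Compile-time pins: if the layout drifts, the build fails
   instead of the model loading garbage. */
static_assert(sizeof(demo_block) == 34, "demo_block size changed");
static_assert(offsetof(demo_block, d) == 0, "demo_block.d moved");
static_assert(offsetof(demo_block, qs) == 2, "demo_block.qs moved");

/* Runtime version of the same check, handy inside an existing test suite. */
int demo_layout_ok(void) {
    return sizeof(demo_block) == 34
        && offsetof(demo_block, d) == 0
        && offsetof(demo_block, qs) == 2;
}
```

The same constants would have to be asserted from the other language as well; a check on only one side says nothing about agreement.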

When the answer to all of these is yes, you have not proven that the rewrite is done. You have proven a migration model: replace one hot or ugly C island with C+, while the ocean around it stays C and C++. That is the adoption path a C-adjacent language actually needs, and it does not depend on finishing.

The method: a dogfood loop

For each function or cluster, the loop is the same:

1. Read the C as the specification.
2. Write the C+ equivalent.
3. Compile. If something cannot be expressed, stop, file the gap, patch the compiler, and resume. That is the entire point of the exercise.
4. Validate (the part that taught us the most).
5. Integrate via swap-and-link: compile the C+ to an object file, link it into the reference build, and disable the matching C definition. The C+ symbol wins, and the change is reversible.

The swap-and-link step is the key trick. It keeps the binary runnable at every single step, so a regression is caught the moment it is introduced rather than at the end.
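One way to "disable the matching C definition" reversibly is a preprocessor guard around the C body. This is a sketch under assumptions: `DEMO_USE_CPLUS_ROW_SIZE` is a hypothetical build flag and `demo_row_size` a hypothetical function, neither taken from the actual build.

```c
#include <assert.h>
#include <stddef.h>

/* The declaration stays visible to every caller in both configurations. */
size_t demo_row_size(size_t n_elems, size_t block_elems, size_t block_bytes);

/* DEMO_USE_CPLUS_ROW_SIZE is a hypothetical flag. When it is defined, the
   C body is compiled out and the linker must find demo_row_size in
   another object file (here, the one compiled from C+). */
#ifndef DEMO_USE_CPLUS_ROW_SIZE
size_t demo_row_size(size_t n_elems, size_t block_elems, size_t block_bytes) {
    assert(block_elems != 0 && n_elems % block_elems == 0);
    return n_elems / block_elems * block_bytes;
}
#endif
```

Dropping the flag restores the C definition, which is what makes each step reversible. Callers never change, because they only ever saw the declaration.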

What the port grew in the language

About 45 gaps were filed. The largest arc by far was SIMD.

The ggml quant kernels are built on integer-widening NEON operations that the original float-centric SIMD could not express. We grew it in tiers:

In the compiler: 64-bit lane types, low / high / combine, reinterpret, from_int / from_float, widen / narrow, a runtime table lookup, and round-to-nearest-even for quantizers. Plus a codegen crash fix (a non-literal swizzle index used to panic; now it is a clean diagnostic) and a new lint, W0001, that closes a silent-wrong-math footgun: i8x16.mul().sum() wraps at 8 bits and used to compile clean. Now it warns and points at the fix.
In a package: simd gained dot_i32, a widening int8 dot product composed entirely from the compiler primitives. No fused-dot builtin was needed.

The result is that real ggml kernels (the q8_0, q4_K, q5_K, and q6_K dot products, and the q8_0 and q8_K quantizers) are now pure C+ with no NEON intrinsic shims, and each one is bit-exact against an independent reference. The full surface is documented on the SIMD types page.
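The footgun behind W0001 is easy to reproduce in scalar C. The sketch below is an illustration, not the C+ implementation: one function models 8-bit lanes that keep only the low 8 bits of every product and sum, the other widens to 32 bits before accumulating. All function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Models 8-bit lanes: each product and the running sum keep only 8 bits. */
uint8_t dot_i8_wrapping(const int8_t *a, const int8_t *b, int n) {
    assert(n >= 0);
    uint8_t acc = 0;
    for (int i = 0; i < n; i++) {
        uint8_t prod = (uint8_t)((uint8_t)a[i] * (uint8_t)b[i]);
        acc = (uint8_t)(acc + prod);
    }
    return acc;
}

/* Widens each operand to 32 bits before multiplying and accumulating. */
int32_t dot_i8_widening(const int8_t *a, const int8_t *b, int n) {
    assert(n >= 0);
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* 16 lanes of 100 * 2: the true dot product is 3200. */
static void demo_fill(int8_t *a, int8_t *b) {
    for (int i = 0; i < 16; i++) { a[i] = 100; b[i] = 2; }
}

int32_t demo_widening(void) {
    int8_t a[16], b[16];
    demo_fill(a, b);
    return dot_i8_widening(a, b, 16);
}

uint8_t demo_wrapping(void) {
    int8_t a[16], b[16];
    demo_fill(a, b);
    return dot_i8_wrapping(a, b, 16);
}
```

On this input the widening path returns 3200 while the 8-bit path returns 128 (3200 modulo 256). The narrow version produces a plausible-looking number and no error, which is why it warrants a lint.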

The other load-bearing addition was the f16 scalar type. ggml is saturated with half-precision floats and C+ had none. We added it with from_bits / to_bits and hardware-exact conversions to and from the wider floats. That is what let the last C shim disappear. See Primitives. Along the way, array-literal typing, indexed writes to a static array, and a handful of earlier gaps (module-level const and static, raw-pointer to integer casts, is_null(), #addr_of((*p).field), zero-fill primitives) were all closed.
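For readers who have not handled half-precision bits directly, here is a software sketch in C of what a from_bits followed by a widening conversion has to compute. It is not the C+ implementation (the post describes hardware conversions), and `f16_bits_to_f32` is a name chosen for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Converts raw IEEE 754 binary16 bits to a float. Every half value is
   exactly representable as a float, so no rounding is involved. */
float f16_bits_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                 /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {              /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {              /* signed zero */
        bits = sign;
    } else {                            /* subnormal: renormalize */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        assert(shift >= 1 && shift <= 10);
        bits = sign | ((113u - shift) << 23) | ((man & 0x3FFu) << 13);
    }

    float out;
    memcpy(&out, &bits, sizeof out);    /* bit copy, no aliasing violation */
    return out;
}
```

For example, the bit pattern 0x3C00 converts to 1.0f and 0x7BFF to 65504.0f, the largest finite half value.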

A few requests were rejected by design, and the workaround is the answer: format! is a macro and C+ does not transform the AST, so use "${x}" interpolation or extern printf; virtual dispatch becomes a tagged enum plus match; a template<bool> becomes two named functions. These document the language's spine, not its gaps.
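The "tagged enum plus match" replacement for virtual dispatch has a direct C analogue, a tagged union with a switch. The sketch below shows the shape of the pattern in C rather than C+ syntax, and every name in it is hypothetical.

```c
#include <assert.h>

typedef enum { DEMO_OP_SCALE, DEMO_OP_CLAMP } demo_op_tag;

/* One closed set of variants instead of an open class hierarchy. */
typedef struct {
    demo_op_tag tag;
    union {
        struct { float factor; } scale;
        struct { float lo, hi; } clamp;
    } as;
} demo_op;

demo_op demo_make_scale(float factor) {
    demo_op op = { .tag = DEMO_OP_SCALE };
    op.as.scale.factor = factor;
    return op;
}

demo_op demo_make_clamp(float lo, float hi) {
    assert(lo <= hi);
    demo_op op = { .tag = DEMO_OP_CLAMP };
    op.as.clamp.lo = lo;
    op.as.clamp.hi = hi;
    return op;
}

/* The switch plays the role of the vtable call. */
float demo_apply(const demo_op *op, float x) {
    switch (op->tag) {
    case DEMO_OP_SCALE:
        return x * op->as.scale.factor;
    case DEMO_OP_CLAMP:
        if (x < op->as.clamp.lo) return op->as.clamp.lo;
        if (x > op->as.clamp.hi) return op->as.clamp.hi;
        return x;
    }
    assert(0 && "unknown tag");
    return x;
}
```

The trade is the usual one: adding a variant means touching every switch, but all the dispatch is visible at the call site and there is no hidden pointer in the object layout.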

The hard-won lessons

These cost real time and are the most transferable takeaways.

Metal hides the CPU kernels. The default run offloads every layer to the GPU, so the CPU dot-product and quantize kernels are never executed. A broken kernel left the output byte-identical, which we proved by ablation. Coherent Metal output validates integration, not the kernel. CPU compute kernels have to be validated CPU-only (with -ngl 0) plus an ablation, or with a bit-exact unit test. This is the single most important validation rule we found.

A C static inline header function has no linkable symbol. You cannot bind to it with extern fn, because each translation unit that includes the header gets its own private copy. When the C file that used to emit an out-of-line copy gets ported, the symbol vanishes. The fix is not a language feature; it is to reimplement the function natively. That is how the last shim died.
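The difference is visible in a few lines of C. In this sketch both function names are hypothetical; the first definition stands for a header helper, the second for the kind of out-of-line definition another object file can actually link against.

```c
#include <assert.h>

/* As it would appear in a header: internal linkage, so each translation
   unit gets its own private copy and no external symbol is emitted. */
static inline float demo_halve(float x) {
    return x * 0.5f;
}

/* An ordinary external definition: this name does appear in the object
   file's symbol table and can be bound from another language. */
float demo_halve_exported(float x) {
    return demo_halve(x);
}
```

Listing the object file's symbols (for example with nm) would be expected to show `demo_halve_exported` as a global symbol and `demo_halve` either as a local symbol or not at all, depending on whether the compiler inlined it.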

The danger in this kind of port is not C's sharp edges; it is C's implicit freebies. The canonical bug: a C compound literal zero-fills every field you do not mention, and the explicit C+ port has to redo that with #zero::[T]() or write_zeroed(). We shipped a real bug here: a free() over pointers that were never initialized, which only appeared to work because the memory happened to sit on zero-filled pages.
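The C half of that bug class looks like this. It is a minimal sketch with a hypothetical `demo_buf` struct, showing the zero-fill that C provides for free and the explicit step a field-by-field port has to add back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct for illustration. */
typedef struct {
    void  *data;
    void  *scratch;
    size_t n;
} demo_buf;

/* Implicit: fields not named in the compound literal are zero-initialized. */
demo_buf demo_buf_implicit(size_t n) {
    return (demo_buf){ .n = n };
}

/* Explicit: what a port must do when the target language does not
   zero-fill for it. All-bits-zero is a null pointer on mainstream
   platforms. */
demo_buf demo_buf_explicit(size_t n) {
    demo_buf b;
    memset(&b, 0, sizeof b);
    b.n = n;
    return b;
}

/* free(NULL) is a no-op; free() on an uninitialized pointer is undefined
   behavior, which is the bug described above. */
void demo_buf_free(demo_buf *b) {
    assert(b != NULL);
    free(b->data);
    free(b->scratch);
    b->data = NULL;
    b->scratch = NULL;
}
```

If the memset is left out, `demo_buf_free` only works when the stack or heap memory happens to contain zeros.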

Verify table data, do not hand-type it. A 42-entry table of block and type sizes drives tensor allocation, and a wrong value silently corrupts loading. We extracted the exact values with a probe that evaluates the real sizeof against the headers. Never eyeball ABI data.
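A probe of this kind can be very small. The sketch below is not our actual probe: the two block structs are hypothetical stand-ins declared inline, where a real probe would include the project's headers, and the output format is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; a real probe would #include the real headers. */
typedef struct { uint16_t d; uint8_t qs[16]; } demo_block_q4; /* 32 4-bit values */
typedef struct { uint16_t d; int8_t  qs[32]; } demo_block_q8; /* 32 8-bit values */

typedef struct {
    const char *name;
    size_t      block_elems;  /* elements per block */
    size_t      type_size;    /* bytes per block */
} demo_type_info;

/* Sizes come from sizeof, never from a typed-in literal. */
static const demo_type_info demo_types[] = {
    { "f32",     1,  sizeof(float)         },
    { "demo_q4", 32, sizeof(demo_block_q4) },
    { "demo_q8", 32, sizeof(demo_block_q8) },
};

/* Prints one line per entry and returns the number of entries. */
size_t demo_probe(FILE *out) {
    assert(out != NULL);
    size_t n = sizeof demo_types / sizeof demo_types[0];
    for (size_t i = 0; i < n; i++) {
        fprintf(out, "%s block_elems=%zu type_size=%zu\n",
                demo_types[i].name,
                demo_types[i].block_elems,
                demo_types[i].type_size);
    }
    return n;
}
```

The printed lines can then be diffed against the ported table, so a mismatch shows up as a text diff rather than as a corrupted tensor.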

Validate against an independent reference. For each quant kernel, the cross-check used a different decode path than the port copied, so a shared bug cannot hide. Several were bit-exact, reldiff 0.0, including on overflow-prone inputs.
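To make "a different decode path" concrete, here is a toy sketch in C. It assumes a simplified packed layout (low nibbles first, then high nibbles, each with an offset of 8) and an assumed definition of reldiff, since the post does not spell either out. The two functions reach the same dot product by different arithmetic, so a decoding mistake copied into one does not automatically appear in the other.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: |a - b| scaled by the larger magnitude. */
double reldiff(double a, double b) {
    double aa = a < 0 ? -a : a;
    double ab = b < 0 ? -b : b;
    double d  = a - b;
    if (d < 0) d = -d;
    double scale = aa > ab ? aa : ab;
    return scale == 0.0 ? 0.0 : d / scale;
}

enum { DEMO_BYTES = 4 };  /* 4 bytes hold 8 packed 4-bit values */

/* Path A: decode each nibble to a signed value, then multiply. */
int32_t dot_unpack_first(const uint8_t *q, const int8_t *y) {
    int32_t acc = 0;
    for (int i = 0; i < DEMO_BYTES; i++) {
        int lo = (q[i] & 0x0F) - 8;
        int hi = (q[i] >> 4) - 8;
        acc += lo * y[i] + hi * y[i + DEMO_BYTES];
    }
    return acc;
}

/* Path B: dot the raw nibbles, then remove the offset in one step,
   using sum((q - 8) * y) = sum(q * y) - 8 * sum(y). */
int32_t dot_offset_last(const uint8_t *q, const int8_t *y) {
    int32_t raw = 0, ysum = 0;
    for (int i = 0; i < DEMO_BYTES; i++) {
        raw  += (q[i] & 0x0F) * y[i] + (q[i] >> 4) * y[i + DEMO_BYTES];
        ysum += y[i] + y[i + DEMO_BYTES];
    }
    return raw - 8 * ysum;
}

/* Cross-check on inputs that include the extreme int8 values. */
double demo_cross_check(void) {
    const uint8_t q[DEMO_BYTES]     = { 0x0F, 0xF0, 0x88, 0x37 };
    const int8_t  y[2 * DEMO_BYTES] = { 127, -128, 5, -7, 127, -128, 0, 1 };
    assert(dot_unpack_first(q, y) == dot_offset_last(q, y));
    return reldiff(dot_unpack_first(q, y), dot_offset_last(q, y));
}
```

Because both paths are exact integer arithmetic, the expected reldiff here is 0.0; for float kernels the comparison would need to account for summation order.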

A C to C+ cheat-sheet

The patterns that came up on almost every port site:

C pattern	C+
`(T){.a=1}` compound literal	`#zero::[T]()`, then set fields (the rest do not auto-zero)
`static inline` header helper	reimplement natively; you cannot link to it
`if (p == NULL)`	`p.is_null()`
`&s->field`	`#addr_of((*s).field)`
`x += y` (signed)	`x +% y` (wrapping, to match what the compiled C does in practice and avoid debug overflow traps; signed overflow is formally undefined in C)
widening int8 dot	`simd::dot_i32`; never `mul().sum()`, which wraps (W0001)
`(float)(__fp16)h`	`f16::from_bits(h) as f32`
`throw`	`Result[T, E]` (no exceptions)
virtual dispatch	tagged `enum` plus `match`

At every port site the rhythm is the same: read through (*ptr).field directly (the raw dereference is the marker; there is no unsafe wrapper), annotate types explicitly, and reach for the wrapping operators on integers.

Where this leaves things

The targeted claim, a llama.cpp fork for macOS and Metal with these C files reimplemented in C+ generating correct tokens, is reachable and de-risked. But "done" was never the interesting result. The interesting result is that the migration model holds: C+ object files drop into a live C/C++ build, export the symbols the C did, match the ABI, link without ceremony, and keep the binary correct and inside the same performance envelope at every step. The remaining files are bounded, concrete work with no further interop to design.

The deeper result is the second goal. The language is measurably sharper than when we started: a whole integer-SIMD subsystem, an f16 type, array-literal and static fixes, a footgun-closing lint, and a codegen crash fix. Each one was paid for by a real kernel that would not port without it. That is the point of building a language against a real workload instead of a wish list.
