June 3, 2026 · The C+ team

Porting llama.cpp's ggml core from C to C+

We have been porting the C core of a llama.cpp fork, targeted at macOS, Apple Metal, Q4_K, and Qwen-3.5, from C to C+. The project has two goals, weighted equally:

  1. Replace the C. Every compiled .c translation unit becomes .cplus, and the binary still loads the model and generates correct tokens on Metal.
  2. Harden the compiler. Every weakness the port exposes becomes an upstream compiler or stdlib fix. The port is the test bench, and the language grows under load.

This is a writeup of what that taught us, because the second goal is where most of the value landed.

What "replace the C" actually means here

Targeting collapses a 102k-line tree into a bounded job. The compiled-.c set is exactly six ggml files, about 22k lines. Everything else (the llama layer, the Metal host glue, the GPU shaders, the CPU op implementations in C++) stays as it is. None of it was ever in-scope C.

An honest status: we are not C-free yet. Two of the six files are fully ported, four are partial, and the milestone that prompted this post is that all of the auxiliary C shim and glue is gone. The only C left is the six ggml target files themselves.

The method: a dogfood loop

For each function or cluster, the loop is the same:

  1. Read the C as the specification.
  2. Write the C+ equivalent.
  3. Compile. If something cannot be expressed, stop, file the gap, patch the compiler, and resume. That is the entire point of the exercise.
  4. Validate (the part that taught us the most).
  5. Integrate via swap-and-link: compile the C+ to an object file, link it into the reference build, and disable the matching C definition. The C+ symbol wins, and the change is reversible.

The swap-and-link step is the key trick. It keeps the binary runnable at every single step, so a regression is caught the moment it is introduced rather than at the end.
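
The "disable the matching C definition" half of step 5 can be as small as a preprocessor guard. A minimal sketch in C, assuming a hypothetical per-function macro (DEMO_PORTED_VEC_SUM) and a hypothetical function (demo_vec_sum_f32); neither name comes from ggml or from the port's actual build:

```c
#include <assert.h>
#include <stddef.h>

/* The declaration stays visible to every caller in both configurations. */
float demo_vec_sum_f32(const float *x, size_t n);

/* Hypothetical switch: pass -DDEMO_PORTED_VEC_SUM once the C+ object file
 * that defines demo_vec_sum_f32 is on the link line. The C body below then
 * drops out, and the linker resolves callers to the C+ symbol instead. */
#ifndef DEMO_PORTED_VEC_SUM
float demo_vec_sum_f32(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}
#endif
```

Dropping the flag restores the C definition, which is what keeps each swap reversible.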

What the port grew in the language

About 45 gaps were filed. The largest arc by far was SIMD.

The ggml quant kernels are built on integer-widening NEON operations that the original float-centric SIMD could not express. We grew it in tiers:

The result is that real ggml kernels (the q8_0, q4_K, q5_K, and q6_K dot products, and the q8_0 and q8_K quantizers) are now pure C+ with no NEON intrinsic shims, and each one is bit-exact against an independent reference. The full surface is documented on the SIMD types page.
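
To make "integer-widening" concrete, here is a scalar C sketch of the arithmetic such a kernel has to preserve. It is not the NEON code or the C+ SIMD API; the function names are hypothetical, and the narrow variant assumes the failure mode is truncation at 8-bit lane width:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widening reference: every int8 x int8 product is formed in int32
 * before it is accumulated, so no product can overflow its lane. */
int32_t demo_dot_i8_widening(const int8_t *a, const int8_t *b, size_t n) {
    assert(a != NULL && b != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Anti-pattern: each product is cut back to 8 bits before the sum,
 * which is what a same-width lane multiply followed by a sum does. */
int32_t demo_dot_i8_narrow(const int8_t *a, const int8_t *b, size_t n) {
    assert(a != NULL && b != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Keep only the low 8 bits of the product. */
        uint8_t low = (uint8_t)((int32_t)a[i] * (int32_t)b[i]);
        /* Reinterpret those 8 bits as a signed int8 value. */
        acc += (low >= 128) ? (int32_t)low - 256 : (int32_t)low;
    }
    return acc;
}
```

With 32 lanes of 127 x 127, the widening version returns 516128 and the narrow one returns 32, because 16129 leaves only 1 in its low 8 bits.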

The other load-bearing addition was the f16 scalar type. ggml is saturated with half-precision floats and C+ had none. We added it with from_bits / to_bits and hardware-exact conversions to and from the wider floats. That is what let the last C shim disappear. See Primitives. Along the way, array-literal typing, indexed writes to a static mut array, and a handful of earlier gaps (module-level const and static, raw-pointer to integer casts, is_null(), #addr_of((*p).field), zero-fill primitives) were all closed.
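
For readers who have not worked with raw half-precision bits, this is roughly what a from_bits followed by a widening conversion has to compute. A software sketch in C, with a hypothetical function name; it is not the C+ implementation. Every f16 value is exactly representable in f32, so a correct software path and a hardware path agree on all finite inputs:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Convert IEEE 754 binary16 bits to a binary32 float, exactly. */
float demo_f16_bits_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t man  = (uint32_t)h & 0x3FFu;
    uint32_t bits;

    if (exp == 0 && man == 0) {
        bits = sign;                                   /* signed zero */
    } else if (exp == 0) {
        /* Subnormal: shift the mantissa until its leading 1 reaches bit 10. */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            shift++;
        }
        man &= 0x3FFu;                                 /* drop the implicit 1 */
        bits = sign | ((113u - shift) << 23) | (man << 13);
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);       /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* rebias 15 -> 127 */
    }

    float out;
    memcpy(&out, &bits, sizeof out);                   /* reinterpret, no cast */
    return out;
}
```

For example, 0x3C00 maps to 1.0, 0x7BFF to 65504.0 (the largest finite half), and 0x0001 to 2^-24 (the smallest subnormal).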

A few requests were rejected by design, and the workaround is the answer: format! is a macro and C+ does not transform the AST, so use "${x}" interpolation or extern printf; virtual dispatch becomes a tagged enum plus match; a template<bool> becomes two named functions. These document the language's spine, not its gaps.
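
The "tagged enum plus match" shape has a direct analogue in plain C, shown here as a sketch because the C+ syntax is not reproduced in this post. All type and function names are hypothetical:

```c
#include <assert.h>

/* One tag per former subclass. */
typedef enum { DEMO_OP_SCALE, DEMO_OP_OFFSET } demo_op_tag;

/* The payload lives next to the tag instead of behind a vtable pointer. */
typedef struct {
    demo_op_tag tag;
    union {
        float factor;  /* valid when tag == DEMO_OP_SCALE  */
        float delta;   /* valid when tag == DEMO_OP_OFFSET */
    } as;
} demo_op;

/* The former virtual call becomes one switch over the tag. */
float demo_apply(demo_op op, float x) {
    switch (op.tag) {
        case DEMO_OP_SCALE:  return x * op.as.factor;
        case DEMO_OP_OFFSET: return x + op.as.delta;
    }
    assert(!"unknown demo_op tag");
    return x;
}
```

Every variant is visible at the dispatch site, which is the property the enum-plus-match form is trading virtual dispatch for.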

The hard-won lessons

These cost real time and are the most transferable takeaways.

Metal hides the CPU kernels. The default run offloads every layer to the GPU, so the CPU dot-product and quantize kernels are never executed. A broken kernel left the output byte-identical, which we proved by ablation. Coherent Metal output validates integration, not the kernel. CPU compute kernels have to be validated CPU-only (with -ngl 0) plus an ablation, or with a bit-exact unit test. This is the single most important validation rule we found.
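
The ablation idea fits in a few lines. The following C toy is an illustration only: the "pipeline" and its offload flag are hypothetical stand-ins for a real run, and the fixed value 32.0f stands in for a GPU result. It swaps a deliberately broken kernel in and checks whether the output moves:

```c
#include <assert.h>
#include <stddef.h>

typedef float (*demo_dot_fn)(const float *, const float *, size_t);

float demo_dot_ok(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Ablated kernel: obviously wrong on purpose. */
float demo_dot_broken(const float *a, const float *b, size_t n) {
    (void)a; (void)b; (void)n;
    return 0.0f;
}

/* Toy pipeline. When offloaded is nonzero the CPU kernel is never called,
 * mimicking a run where every layer goes to the GPU. */
float demo_run(demo_dot_fn cpu_dot, int offloaded) {
    static const float a[3] = {1.0f, 2.0f, 3.0f};
    static const float b[3] = {4.0f, 5.0f, 6.0f};
    assert(cpu_dot != NULL);
    if (offloaded) {
        return 32.0f;  /* stand-in for the GPU path's result */
    }
    return cpu_dot(a, b, 3);
}

/* Returns 1 only if breaking the kernel changes the output. */
int demo_kernel_is_exercised(int offloaded) {
    return demo_run(demo_dot_ok, offloaded) != demo_run(demo_dot_broken, offloaded);
}
```

In the offloaded configuration the check returns 0: the output is identical with a broken kernel, so that run says nothing about the kernel's correctness.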

A C static inline header function has no symbol. You cannot bind to it with extern fn, because each translation unit that includes the header compiles its own private copy and exports nothing. When the C file that used to provide an out-of-line copy gets ported, the symbol vanishes. The fix is not a language feature; it is to reimplement the function natively. That is how the last shim died.
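
A small C sketch of the linkage situation, assuming the shim was an ordinary out-of-line wrapper (the post does not show it). Both function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Header-style helper. `static` gives it internal linkage, so no object
 * file exports a symbol called demo_pad_to for another language to bind. */
static inline uint32_t demo_pad_to(uint32_t n, uint32_t multiple) {
    assert(multiple != 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* An out-of-line wrapper does have external linkage, so it can be bound
 * from outside. It exists only as long as the C file that defines it is
 * still compiled; port that file and this symbol goes with it. */
uint32_t demo_pad_to_shim(uint32_t n, uint32_t multiple) {
    return demo_pad_to(n, multiple);
}
```

Reimplementing demo_pad_to on the other side of the boundary removes the need for the wrapper entirely.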

The danger in this kind of port is not C's sharp edges; it is C's implicit freebies. The canonical bug: a C compound literal zero-fills every field you do not mention, and the explicit C+ port has to redo that with #zero::[T]() or write_zeroed(). We shipped a real bug here: a free() over uninitialized pointers, which only worked when they happened to land on zero-filled pages.
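
The freebie is easy to miss because nothing in the C source spells it out. A minimal C sketch with a hypothetical struct:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_bufs {
    size_t n;
    void  *data;
    void  *scratch;
};

/* Only .n is mentioned. C initializes the remaining members as if they
 * had static storage, so data and scratch both start out as NULL. */
struct demo_bufs demo_bufs_init(size_t n) {
    return (struct demo_bufs){ .n = n };
}

/* Safe only because of that implicit zero-fill: free(NULL) is a no-op,
 * while free() on an uninitialized pointer is undefined behavior. */
void demo_bufs_free(struct demo_bufs *b) {
    assert(b != NULL);
    free(b->data);
    free(b->scratch);
    b->data = NULL;
    b->scratch = NULL;
}
```

A port that fills in only the mentioned fields keeps demo_bufs_free looking correct while removing the guarantee it depends on.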

Verify table data, do not hand-type it. A 42-entry table of block and type sizes drives tensor allocation, and a wrong value silently corrupts loading. We extracted the exact values with a probe that evaluates the real sizeof against the headers. Never eyeball ABI data.
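
A probe of this kind can be very small. The sketch below uses a stand-in struct so that it is self-contained; a real probe would include the actual headers instead. The struct, its field names, and the output format are all hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a block type that would normally come from the real headers. */
#define DEMO_BLOCK_ELEMS 32
typedef struct {
    uint16_t scale_bits;               /* half-precision scale, as raw bits */
    int8_t   quants[DEMO_BLOCK_ELEMS]; /* quantized values */
} demo_block;

/* The compiler, not a human, supplies the number. */
size_t demo_block_type_size(void) {
    return sizeof(demo_block);
}

/* Emit one table row in a form that can be pasted or diffed. */
int demo_probe_print(FILE *out) {
    assert(out != NULL);
    return fprintf(out, "demo_block block_elems=%d type_size=%zu\n",
                   DEMO_BLOCK_ELEMS, demo_block_type_size());
}
```

Running the probe for every entry and diffing its output against the ported table turns a silent corruption risk into a visible mismatch.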

Validate against an independent reference. For each quant kernel, the cross-check used a different decode path than the one the port copied, so a shared bug cannot hide. Several were bit-exact, reldiff 0.0, including on overflow-prone inputs.
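
Two comparison helpers are enough for this kind of cross-check. A C sketch with hypothetical names; the reldiff formula here is an assumption, since the post does not define the one it used:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Bit-exact: compares representations, so +0.0 and -0.0 differ and two
 * NaNs with the same payload compare equal. */
int demo_same_bits(float a, float b) {
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}

/* Relative difference |got - want| / |want|, falling back to the absolute
 * difference when want is zero. (Assumed definition.) */
float demo_reldiff(float got, float want) {
    float diff = got - want;
    if (diff < 0.0f) {
        diff = -diff;
    }
    float mag = (want < 0.0f) ? -want : want;
    return (mag > 0.0f) ? diff / mag : diff;
}
```

Under this definition a reldiff of 0.0 is slightly weaker than bit-identical output: +0.0 against -0.0 gives reldiff 0.0 but fails the bit comparison, so it can be worth reporting both.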

A C to C+ cheat-sheet

The patterns that came up on almost every port site:

C pattern                      C+
-----------------------------  ----------------------------------------------------------
(T){.a=1} compound literal     #zero::[T](), then set fields (the rest do not auto-zero)
static inline header helper    reimplement natively; you cannot link to it
if (p == NULL)                 p.is_null()
&s->field                      #addr_of((*s).field)
x += y (signed)                x +% y (wrapping, to match C and avoid debug overflow traps)
widening int8 dot              simd::dot_i32; never mul().sum(), which wraps (W0001)
(float)(__fp16)h               f16::from_bits(h) as f32
throw                          Result[T, E] (no exceptions)
virtual dispatch               tagged enum plus match

At every port site the rhythm is the same: read through (*ptr).field inside unsafe, annotate types explicitly, and reach for the wrapping operators on integers.
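
On the wrapping operators: strictly, ISO C leaves signed overflow undefined, so the behavior a port matches is what the compiled C does in practice. For comparison, a portable way to spell an explicit wrapping add in C goes through unsigned arithmetic. A sketch with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two's-complement wrapping add with no undefined behavior. */
int32_t demo_wrapping_add_i32(int32_t x, int32_t y) {
    /* Unsigned arithmetic is defined to wrap modulo 2^32. */
    uint32_t sum = (uint32_t)x + (uint32_t)y;
    /* Reinterpret the bits; int32_t is two's complement by definition. */
    int32_t out;
    memcpy(&out, &sum, sizeof out);
    return out;
}

/* Self-check covering both overflow directions. */
void demo_wrapping_add_selfcheck(void) {
    assert(demo_wrapping_add_i32(INT32_MAX, 1) == INT32_MIN);
    assert(demo_wrapping_add_i32(INT32_MIN, -1) == INT32_MAX);
    assert(demo_wrapping_add_i32(-1, 1) == 0);
}
```

An explicit wrapping operator makes the same intent visible at the call site instead of relying on how a particular compiler treats overflow.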

Where this leaves things

The targeted claim, a llama.cpp fork for macOS and Metal with the C reimplemented in C+ generating correct tokens, is reachable and de-risked, not done. The methodology, the SIMD and compiler work, the ABI and linking, and the validation discipline are all proven end to end. The remaining work is bounded and concrete, and it is all pure C with no further interop to design.

The deeper result is the second goal. The language is measurably sharper than when we started: a whole integer-SIMD subsystem, an f16 type, array-literal and static fixes, a footgun-closing lint, and a codegen crash fix. Each one was paid for by a real kernel that would not port without it. That is the point of building a language against a real workload instead of a wish list.

