HeadlinesBriefing.com

Zero‑Copy Wasm GPU Inference on Apple Silicon

Hacker News

On Apple Silicon, a WebAssembly module can share its linear memory directly with the GPU, eliminating copies and serialization. The CPU and GPU read the same physical bytes, so a Wasm guest can fill a matrix, have Metal compute on it in place, and read the results back through the original pointer. This zero‑copy path depends on the platform’s Unified Memory Architecture.

Developer Driftwood chains three mechanisms to make this work. First, mmap on ARM64 macOS returns memory aligned to 16 KB pages, which satisfies Metal’s page‑alignment requirement for no‑copy buffers. Second, Metal’s makeBuffer(bytesNoCopy:) wraps that same pointer as a GPU buffer; this was verified by MTLBuffer.contents() returning the identical address and by a negligible 0.03 MB RSS delta. Third, Wasmtime’s MemoryCreator supplies the mmap region as the Wasm linear memory, so pointer identity is preserved from guest to GPU.
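The first link, and the pointer-identity check that runs through all three, can be illustrated without Metal or Wasmtime. The following is a minimal sketch in Python, not the author's code: it assumes an anonymous mmap stands in for the shared region, a memoryview plays the role of the Wasm guest, and a ctypes array plays the role of the GPU buffer. The names round_up_to_page, base_address, guest, and device are hypothetical.

```python
import ctypes
import mmap

# Page size reported by the OS: 16384 on Apple Silicon macOS, often 4096 elsewhere.
PAGE = mmap.PAGESIZE


def round_up_to_page(nbytes, page=PAGE):
    """Round a byte count up to a whole number of pages."""
    return (nbytes + page - 1) // page * page


def base_address(buf):
    """Return the address of the first byte of a writable buffer, without copying."""
    view = ctypes.c_char.from_buffer(buf)
    addr = ctypes.addressof(view)
    del view  # drop the temporary view so the mapping is not pinned by it
    return addr


# Anonymous mapping sized for one 128x128 float32 matrix, rounded up to whole pages.
length = round_up_to_page(128 * 128 * 4)
region = mmap.mmap(-1, length)

# Two no-copy views over the same bytes (stand-ins, not real Wasm or Metal objects).
guest = memoryview(region).cast("f")                           # "Wasm guest" view
device = (ctypes.c_float * (length // 4)).from_buffer(region)  # "GPU buffer" view

guest[0] = 1.5   # a write through the guest view...
device[1] = 2.5  # ...and a write through the device view land in the same memory

print("page-aligned:", base_address(region) % PAGE == 0)
print("same address:", ctypes.addressof(device) == base_address(region))
print("shared values:", device[0], guest[1])
```

On a real Apple Silicon setup the equivalent check is the one the summary describes: comparing the mmap pointer with the address returned by MTLBuffer.contents().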

Using this pipeline, the author ran a 128×128 GEMM inside a Wasm actor and measured about 6.75 ms, the same latency as a copy‑based path, with zero errors across all 16,384 output elements. Scaling up to a 4‑bit‑quantized Llama 3.2 1B model, prefill took 106 ms and per‑token generation about 9 ms on an M1 MacBook Pro. The shared memory also enables fast KV‑cache serialization: restoring a 24‑token cache took 1.4 ms, a 5.45× speedup over recomputing it.
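Why a shared region makes KV-cache serialization cheap can be sketched as follows. This is an illustration under stated assumptions, not the author's format: it assumes the cache occupies a contiguous prefix of the mapped region, and the header layout, the 2048 bytes per token, and the names snapshot_kv and restore_kv are all hypothetical. Under those assumptions, saving and restoring reduce to flat byte copies with no per-tensor encoding.

```python
import mmap
import struct

# Hypothetical snapshot header: token count and bytes per token, little-endian.
HEADER = struct.Struct("<II")


def snapshot_kv(region, n_tokens, bytes_per_token):
    """Serialize the used prefix of the region as header + raw bytes."""
    used = n_tokens * bytes_per_token
    return HEADER.pack(n_tokens, bytes_per_token) + region[:used]


def restore_kv(region, blob):
    """Copy a snapshot back into the region and return its token count."""
    n_tokens, bytes_per_token = HEADER.unpack_from(blob)
    used = n_tokens * bytes_per_token
    if HEADER.size + used != len(blob):
        raise ValueError("snapshot length does not match its header")
    if used > len(region):
        raise ValueError("snapshot does not fit in the region")
    region[:used] = blob[HEADER.size:]
    return n_tokens


# Demo with made-up sizes: 24 tokens at 2048 bytes each in a 1 MiB region.
BYTES_PER_TOKEN = 2048
region = mmap.mmap(-1, 1 << 20)
original = bytes(range(256)) * (24 * BYTES_PER_TOKEN // 256)
region[:len(original)] = original

blob = snapshot_kv(region, 24, BYTES_PER_TOKEN)
region[:len(original)] = bytes(len(original))  # simulate losing the cache
restored_tokens = restore_kv(region, blob)

print("tokens restored:", restored_tokens)
print("bytes match:", region[:len(original)] == original)
```

As a sanity check on the quoted figures, a 1.4 ms restore at a 5.45× speedup implies roughly 1.4 × 5.45 ≈ 7.6 ms to recompute the same 24 tokens.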