HeadlinesBriefing.com

Manticore's 14× Faster ONNX Embeddings Redefine Search Speed

Hacker News

Manticore Search 27.1.5 delivers a 14× speed boost for Auto Embeddings by rewriting its ONNX path. Previously, Auto Embeddings ran Sentence Transformers models through Candle, which capped throughput at 5–11 docs/sec. The new ONNX Runtime backend averages 70–230 docs/sec on the same hardware. The gain comes from disabling intra_op_spinning and from dropping document batching inside workers. There are no API changes: to switch models, users add a column with a new ONNX model, rebuild embeddings, and drop the old column.

The technical shift hinges on the capabilities of ONNX Runtime (ORT). Microsoft's ORT engine optimizes models via graph fusion and kernel autotuning, and it is already widely used to serve Hugging Face-hosted embedding models such as MiniLM and BGE. Manticore adapted ORT to run without per-call locking, relying on thread-safe APIs on Linux and macOS. A key decision was disabling intra_op_spinning, which stops worker threads from busy-waiting between operations and frees those CPU cycles for other work. On a 16-core server, single-threaded inserts hit 72 docs/sec, 7× faster than Candle. Concurrent loads scale to 233 docs/sec with batch sizes up to 64. The old path suffered from lock contention and thread parking; the new design parallelizes operations internally, reducing coordination overhead.
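
As a rough illustration of the setting described above, the sketch below uses ONNX Runtime's Python bindings, not Manticore's own code. It assumes that the article's intra_op_spinning corresponds to ORT's `session.intra_op.allow_spinning` session config key; the model path, the thread count, and the helper names are hypothetical.

```python
# Sketch only: ONNX Runtime's Python bindings, not Manticore's implementation.
# Assumption: "intra_op_spinning" maps to ORT's "session.intra_op.allow_spinning" key.
SESSION_CONFIG = {
    "session.intra_op.allow_spinning": "0",  # "0" = wait without busy-spinning
}


def build_session_options(intra_op_threads=4):
    """Return SessionOptions with intra-op spinning disabled.

    The thread count is illustrative, not a recommended value.
    """
    import onnxruntime as ort  # imported lazily so this module loads without ORT

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = intra_op_threads
    for key, value in SESSION_CONFIG.items():
        opts.add_session_config_entry(key, value)
    return opts


def load_shared_session(model_path):
    """Create one InferenceSession intended to be shared by all worker threads.

    ORT allows concurrent run() calls on the same session, so callers
    do not need to wrap each call in their own lock.
    """
    import onnxruntime as ort

    return ort.InferenceSession(
        model_path,  # hypothetical path, e.g. "model.onnx"
        sess_options=build_session_options(),
        providers=["CPUExecutionProvider"],
    )
```

With a single shared session, each worker thread can call `session.run(...)` directly; whether this matches Manticore's internal design beyond what the summary states is not confirmed here.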

This matters for auto-embedding workflows, where embeddings are generated on insert and therefore set the indexing speed. Slow embeddings bottleneck ingestion, and the ONNX path raises the floor by an order of magnitude. Users also gain tuning options: batch size and concurrency now meaningfully affect throughput. The backend runs models such as MiniLM-L12-v2 without quality loss, with 14 ms single-insert latency and 56 ms under 8-way load. Manticore made this path the default for ONNX-capable models, simplifying adoption. Switching models isn't trivial, but it avoids a full table rebuild. The change reflects a shift from correctness-focused Candle to performance-optimized ORT, balancing speed and accuracy. For developers, it underscores ONNX's role in production inference beyond academic settings.
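
The latency figures above can be related to throughput with simple arithmetic. This is a minimal sketch, assuming each of N concurrent clients issues one single-document insert at a time, so that throughput is roughly N divided by latency. The function name is illustrative, and batched inserts are not modeled.

```python
def docs_per_sec(latency_ms, concurrency=1):
    """Steady-state throughput when `concurrency` clients each wait `latency_ms` per insert."""
    return concurrency * 1000.0 / latency_ms


single = docs_per_sec(14)       # one client, 14 ms per insert
loaded = docs_per_sec(56, 8)    # eight clients, 56 ms per insert

print(f"single-threaded: {single:.1f} docs/sec")
print(f"8-way load:      {loaded:.1f} docs/sec")
```

Under these assumptions, 14 ms per insert works out to roughly 71 docs/sec, in line with the 72 docs/sec single-threaded figure, and 8-way load at 56 ms gives roughly 143 docs/sec. The 233 docs/sec peak is reported with batching, which this formula does not cover.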