Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
Nii Osae Osae Dade, Tony Morri, Moinul Hossain Rahat, Sayandip Pal

TL;DR
Litespark-Inference introduces custom SIMD kernels for ternary neural networks, enabling efficient CPU inference that runs faster and uses less memory than standard framework inference.
Contribution
The paper develops and integrates custom SIMD kernels tailored for ternary models, optimizing CPU inference by replacing floating-point multiplications with simple integer additions and subtractions.
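The idea of replacing multiplication with addition and subtraction can be illustrated with a minimal sketch in plain Python. This is not the Litespark-Inference kernel code, which the paper describes as SIMD code targeting integer dot product instructions; the function name `ternary_matvec` and the example values are hypothetical and chosen only for illustration.

```python
from typing import List


def ternary_matvec(weights: List[List[int]], x: List[int]) -> List[int]:
    """Compute weights @ x where every weight is -1, 0, or +1.

    No multiplication is performed: a +1 weight adds the activation,
    a -1 weight subtracts it, and a 0 weight skips it.
    (Hypothetical helper for illustration, not the paper's kernel.)
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing to the accumulator
        out.append(acc)
    return out


# Example values are assumptions for illustration only.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [3, 5, 2]
print(ternary_matvec(W, x))  # [1, 4]
```

A scalar loop like this only shows the arithmetic; a practical kernel would process many activations per instruction, which is where the SIMD integer instructions mentioned in the abstract come in.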
Findings
Achieves 9.2x faster time-to-first-token on Apple Silicon.
Provides 52x higher throughput compared to standard PyTorch.
Reduces memory usage by 14x.
Abstract
Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face, achieving 9.2x faster…
