
llama.cpp b8261 extends Apple Silicon GPU optimizations to more model formats

llama.cpp build b8261 adds optimized Metal GPU kernels for BF16, Q2_K, and Q3_K quantization types on Apple Silicon, improving inference performance for small batch sizes that were previously slower than other formats. Anthropic's Claude Opus 4.6 is credited as co-author on the pull request.

VERIFIED · Confidence: 80%

llama.cpp, the leading open-source tool for running AI models locally on personal computers, released build b8261 on March 10, 2026, extending optimized Apple Silicon GPU performance to three additional model compression formats: BF16, Q2_K, and Q3_K. Previously, these formats ran more slowly on Macs because they lacked the specialized small-batch Metal kernels that other formats already used. This release brings them in line with the optimized formats.

llama.cpp lets anyone run large language models, the technology behind chatbots like ChatGPT, directly on their own hardware without an internet connection or cloud fees. On Apple Silicon Macs (M1 through M4), the software uses Metal, Apple's framework for programming the GPU, to accelerate inference, the process of generating a response. Quantization compresses an AI model to reduce its memory footprint and speed up responses, with different formats trading accuracy against size. Until this release, the Q2_K and Q3_K formats, among the most aggressive compressions and a way to fit larger models onto devices with less memory, lacked the small-batch kernel optimizations already available to Q4_K, Q5_K, and Q6_K. Users running small batches of requests, typical for personal use, will see the most benefit, so this update could meaningfully improve the experience for Mac users who rely on highly compressed models to squeeze larger AI systems onto limited hardware.
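To give a sense of the size trade-off the article describes, here is a minimal sketch estimating the weight-storage footprint of a 7-billion-parameter model under several formats. The bits-per-weight figures are illustrative approximations (llama.cpp's K-quants store per-block scales, so real file sizes differ somewhat), and the helper function is hypothetical, not part of llama.cpp.

```python
# Illustrative sketch: approximate weight-storage footprint of a 7B-parameter
# model under different llama.cpp quantization formats. Bits-per-weight values
# are rough approximations, not exact llama.cpp file sizes.
APPROX_BITS_PER_WEIGHT = {
    "BF16": 16.0,    # 16-bit brain float, uncompressed weights
    "Q6_K": 6.5625,  # approximate effective bits per weight
    "Q4_K": 4.5,
    "Q3_K": 3.4375,
    "Q2_K": 2.5625,  # most aggressive of the formats listed here
}

def model_size_gib(n_params: float, fmt: str) -> float:
    """Approximate weight storage in GiB for a model with n_params parameters."""
    bits = APPROX_BITS_PER_WEIGHT[fmt]
    return n_params * bits / 8 / 1024**3

for fmt in APPROX_BITS_PER_WEIGHT:
    print(f"{fmt}: {model_size_gib(7e9, fmt):.1f} GiB")
```

Under these assumptions, a 7B model drops from roughly 13 GiB in BF16 to around 2 GiB in Q2_K, which is why aggressive quantization matters so much on memory-constrained Macs.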
