Gemma 4 QAT lands on-device, plus a 3B multi-agent economy that actually ships
Google's Gemma 4 QAT models are out and runnable on phones and laptops. Separately, a hackathon team shipped a working multi-agent economy on a 3B model — proof small models can do real work.
Google released Gemma 4 QAT (Quantization-Aware Training) models this week, targeting phones and laptops rather than data centers. The 12B instruction-tuned version is already downloadable via Hugging Face — both from Google's own repo and an Unsloth GGUF conversion — and llama.cpp build b9544 can run it locally today. If you've been waiting for a capable open model that fits on consumer hardware, this is the most concrete option right now.
New models
Gemma 4 QAT is Google's answer to the question 'how do you get a 12B model onto a laptop without wrecking quality?' Quantization-aware training bakes compression into the training process rather than squashing a finished model after the fact. The result, according to Google's blog, is better quality at the same file size compared to post-training quantization. The GGUF files are live on Hugging Face right now — google/gemma-4-12B-it-qat-q4_0-gguf and unsloth/gemma-4-12B-it-qat-GGUF are both ranked in the top 50 trending models this week.
Google also published a May 2026 AI updates roundup covering announcements across Gemini, DeepMind, and other products. Nothing in it changes the Gemma 4 QAT story, but it's a useful single-page summary if you've missed the last few weeks of Google news.
Open-source releases
llama.cpp hit build b9544 this week. No dramatic feature announcement, but if you're running Gemma 4 QAT locally, you want a current build. Grab it from the ggml-org/llama.cpp repo.
Weaviate v1.38.0 also shipped. Weaviate is a vector database used in retrieval-augmented generation setups. If you're building anything that searches over documents, v1.38.0 is worth checking for changelog items.
Cline v3.88.1 is out. Cline is a VS Code extension that runs AI coding agents inside your editor. The 3.88.x line has been iterating fast — check the release notes if you use it daily.
Worth reading: small models doing real work
Two Hugging Face hackathon write-ups this week are worth your time, especially if you think multi-agent systems require giant models. 'Thousand Token Wood' is a post-mortem on shipping a multi-agent economic simulation — think competing AI traders — on a 3B parameter model. The follow-up, 'Five labs, five minds,' documents a team running a multi-model finance drama using several small models playing different roles. Both are concrete engineering stories, not demos. The key takeaway: orchestration and prompt design matter more than raw model size for many agent tasks.
What builders can do this week
1. Download google/gemma-4-12B-it-qat-q4_0-gguf from Hugging Face, load it in llama.cpp b9544, and run a simple Q&A loop over a local document. This is a real on-device AI setup with no API costs.
2. Read the Thousand Token Wood write-up and sketch a two-agent setup where one agent generates product ideas and a second agent critiques them — using a small model like Gemma 4 12B or Phi-3 Mini. You don't need GPT-4 for this.
3. If you're already using Weaviate for document search, upgrade to v1.38.0 and check whether any new filtering or indexing options apply to your use case.
// what we actually tested
What we can and can't confirm
Confirmed: Gemma 4 QAT models are live and downloadable on Hugging Face as of this week — both google/gemma-4-12B-it-qat-q4_0-gguf and unsloth/gemma-4-12B-it-qat-GGUF are real, public repos.
Confirmed: llama.cpp build b9544 is a real tagged release on the ggml-org/llama.cpp GitHub repo.
Not independently verified by CBW: We have not benchmarked Gemma 4 QAT quality against post-training quantized equivalents. Google's claim that QAT preserves more quality is plausible but we haven't run the numbers ourselves.
Worth noting: The Harness Engineering / OpenAI Codex post (id=15358) was flagged in our signals but the title contains 'harnesses' and 'leverages' — classic PR language. We skipped it because the concrete builder takeaway was thin.
Worth noting: The Anthropic 'Project Glasswing' URL in our signals resolves to an election safeguards update page. The mismatch between label and URL means we can't confirm what was actually announced — we left it out of today's digest.
Source: Google blog: Gemma 4 QAT launch — https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/