Gemma 4 QAT lands on laptops, plus a 3B multi-agent economy demo
Google released quantization-aware training (QAT) versions of Gemma 4, making the models small enough to run on phones and laptops. Plus: a hackathon team shipped a working multi-agent economy on a 3B model.
Google dropped Gemma 4 QAT models this week — quantization-aware training versions tuned to run efficiently on mobile and laptop hardware. If you've been waiting for a capable open model that doesn't need a cloud GPU, this is worth your attention right now.
New models
Gemma 4 QAT is Google's answer to the 'great model, terrible laptop performance' problem. Quantization-aware training bakes compression into the training process rather than squashing the model after the fact, which usually means better quality at the same file size compared to post-training quantization. The 12B instruction-tuned GGUF version is already on Hugging Face via Unsloth, so you can pull it into llama.cpp today without waiting for an official Google release page.
Google's May 2026 AI roundup also covered a range of other updates across its AI stack, though the Gemma 4 QAT work is the most immediately actionable item for local builders.
Open-source releases
The Hugging Face Build Small hackathon produced something worth studying: Thousand Token Wood, a multi-agent economy simulation running on a 3B model. A team built multiple agents — traders, producers, resource managers — coordinating inside a simulated economy, all on a model small enough to run on consumer hardware. The write-up is detailed and the code is public. If you've been curious whether small models can handle real agent-to-agent coordination, this is a concrete existence proof.
llama.cpp hit build b9538 this week. No single headline feature, but steady inference improvements keep rolling in. If you're running local models, pull the update. Cline (the VS Code AI coding agent) also shipped v3.88.0, and Weaviate (vector database) released v1.38.0 — both routine maintenance releases with no announced breaking changes.
Tools
Two free models appeared on OpenRouter this week from Sourceful: Riverflow V2.5 Fast and Riverflow V2.5 Pro. Both are listed as free tier. Sourceful is not a household name, and CBW hasn't tested these — see the honest note below.
Research worth reading
A blog post making rounds on Hacker News argues that programmers will write careful documentation for Claude that they'd never bother writing for human teammates. The author's point: AI assistants are changing what documentation gets written and why, not just who reads it. Worth 10 minutes if you're thinking about how your team's knowledge management should evolve.
What builders can do this week
1. Download the Gemma 4 12B QAT GGUF from Unsloth on Hugging Face, load it in llama.cpp, and benchmark it against whatever local model you're currently using. Concrete test: run your usual summarization or coding prompt and compare output quality and tokens-per-second.
2. Read the Thousand Token Wood write-up on the Hugging Face blog and fork the repo. Try swapping in a different 3B model (Phi-3 Mini, Qwen2.5-3B) to see how agent coordination quality changes — you'll learn more about small-model limits in an afternoon than in a week of reading papers.
3. Try the Riverflow V2.5 Fast model on OpenRouter (it's free) for a quick task like customer email drafting or product description writing. Compare it to GPT-4o Mini on the same prompt. Free models on OpenRouter are a low-friction way to audit new names before committing to them.
// what we actually tested
What's confirmed vs. what we're not sure about
Confirmed: Google published Gemma 4 QAT models targeting mobile and laptop efficiency, cross-verified across Hacker News, Reddit r/LocalLLaMA, and Hugging Face model listings.
Confirmed: The Unsloth gemma-4-12B-it-qat-GGUF is live on Hugging Face and downloadable today.
Confirmed: Thousand Token Wood shipped as a public hackathon entry on the Hugging Face blog with a working multi-agent demo on a 3B model.
Not independently verified by CBW: We have not benchmarked Gemma 4 QAT ourselves or compared it head-to-head against Gemma 4 non-QAT on real tasks.
Not independently verified by CBW: Sourceful's Riverflow V2.5 Fast and Pro models are new on OpenRouter and listed as free, but CBW has not tested output quality, rate limits, or context window size. Treat as unvetted until you run your own tests.
Source: Hugging Face blog — Thousand Token Wood hackathon write-up — https://huggingface.co/blog/build-small-hackathon/thousand-token-wood-sim
Source: Google blog — Gemma 4 QAT quantization-aware training — https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/
Source: Google blog — AI updates May 2026 — https://blog.google/innovation-and-ai/technology/ai/google-ai-updates-may-2026/