● LIVEReading: NewsUpdated: 10 min agoSubscribers: 23,400● LIVEReading: NewsUpdated: 10 min agoSubscribers: 23,400

// COOKEDBAKEDWORKED.COM

ModelsSun, Jun 7, 2026· 5 min read

Gemma 4 QAT lands on-device, plus a 3B multi-agent economy that actually ships

Google's Gemma 4 QAT models are out and runnable on phones and laptops. Separately, a hackathon team shipped a working multi-agent economy on a 3B model — proof small models can do real work.

Google released Gemma 4 QAT (Quantization-Aware Training) models this week, targeting phones and laptops rather than data centers. The 12B instruction-tuned version is already downloadable via Hugging Face — both from Google's own repo and an Unsloth GGUF conversion — and llama.cpp build b9544 can run it locally today. If you've been waiting for a capable open model that fits on consumer hardware, this is the most concrete option right now.

New models

Gemma 4 QAT is Google's answer to the question 'how do you get a 12B model onto a laptop without wrecking quality?' Quantization-aware training bakes compression into the training process rather than squashing a finished model after the fact. The result, according to Google's blog, is better quality at the same file size compared to post-training quantization. The GGUF files are live on Hugging Face right now — google/gemma-4-12B-it-qat-q4_0-gguf and unsloth/gemma-4-12B-it-qat-GGUF are both ranked in the top 50 trending models this week.

Google also published a May 2026 AI updates roundup covering announcements across Gemini, DeepMind, and other products. Nothing in it changes the Gemma 4 QAT story, but it's a useful single-page summary if you've missed the last few weeks of Google news.

Open-source releases

llama.cpp hit build b9544 this week. No dramatic feature announcement, but if you're running Gemma 4 QAT locally, you want a current build. Grab it from the ggml-org/llama.cpp repo.

Weaviate v1.38.0 also shipped. Weaviate is a vector database used in retrieval-augmented generation setups. If you're building anything that searches over documents, v1.38.0 is worth checking for changelog items.

Cline v3.88.1 is out. Cline is a VS Code extension that runs AI coding agents inside your editor. The 3.88.x line has been iterating fast — check the release notes if you use it daily.

Worth reading: small models doing real work

Two Hugging Face hackathon write-ups this week are worth your time, especially if you think multi-agent systems require giant models. 'Thousand Token Wood' is a post-mortem on shipping a multi-agent economic simulation — think competing AI traders — on a 3B parameter model. The follow-up, 'Five labs, five minds,' documents a team running a multi-model finance drama using several small models playing different roles. Both are concrete engineering stories, not demos. The key takeaway: orchestration and prompt design matter more than raw model size for many agent tasks.

What builders can do this week

1. Download google/gemma-4-12B-it-qat-q4_0-gguf from Hugging Face, load it in llama.cpp b9544, and run a simple Q&A loop over a local document. This is a real on-device AI setup with no API costs.

2. Read the Thousand Token Wood write-up and sketch a two-agent setup where one agent generates product ideas and a second agent critiques them — using a small model like Gemma 4 12B or Phi-3 Mini. You don't need GPT-4 for this.

3. If you're already using Weaviate for document search, upgrade to v1.38.0 and check whether any new filtering or indexing options apply to your use case.

// what we actually tested

What we can and can't confirm

Confirmed: Gemma 4 QAT models are live and downloadable on Hugging Face as of this week — both google/gemma-4-12B-it-qat-q4_0-gguf and unsloth/gemma-4-12B-it-qat-GGUF are real, public repos.

Confirmed: llama.cpp build b9544 is a real tagged release on the ggml-org/llama.cpp GitHub repo.

Not independently verified by CBW: We have not benchmarked Gemma 4 QAT quality against post-training quantized equivalents. Google's claim that QAT preserves more quality is plausible but we haven't run the numbers ourselves.

Worth noting: The Harness Engineering / OpenAI Codex post (id=15358) was flagged in our signals but the title contains 'harnesses' and 'leverages' — classic PR language. We skipped it because the concrete builder takeaway was thin.

Worth noting: The Anthropic 'Project Glasswing' URL in our signals resolves to an election safeguards update page. The mismatch between label and URL means we can't confirm what was actually announced — we left it out of today's digest.

Source: Google blog: Gemma 4 QAT launch — https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

Source: Hugging Face: google/gemma-4-12B-it-qat-q4_0-gguf — https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf

Source: Hugging Face: unsloth/gemma-4-12B-it-qat-GGUF — https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

Source: HF hackathon: Thousand Token Wood multi-agent economy — https://huggingface.co/blog/build-small-hackathon/thousand-token-wood-sim

Source: HF hackathon: Five labs, five minds — multi-model finance drama — https://huggingface.co/blog/build-small-hackathon/thousand-token-wood-sim-v2

Source: GitHub: ggml-org/llama.cpp release b9544 — https://github.com/ggml-org/llama.cpp