Gemma 4 is out and the small one is the interesting story
Google pushed Gemma 4 out earlier this month. Four models, Apache 2.0, weights on Hugging Face, Kaggle, and Ollama the same day. I've been poking at the smaller variants on a laptop for about a week now. The headline most people will write is about the 31B dense model cracking the top of the open-model leaderboards. The part I keep thinking about is that the E4B runs locally and doesn't feel like a toy.
The lineup
There are four models in the family. E2B and E4B are the edge variants, designed to actually fit on phones and laptops without heroic quantization. Then there's a 26B mixture-of-experts and a 31B dense model at the top end. The naming is a little confusing if you're used to thinking in active parameters versus total parameters, but the short version is that the E-prefixed models are the ones that will run on device, and the bigger two are for servers.
The 31B dense sits at number 3 on the Arena text leaderboard among open models. The 26B MoE lands at number 6. Both beat models more than twenty times their size, which is a sentence I'd have rolled my eyes at a year ago and now just seems like how this year is going.
What's actually new
Context window is 256K tokens across the family. That's a big jump for Gemma and it's the first time I've felt comfortable throwing a full codebase at a local model without chunking it myself.
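Before pasting a whole repo into a prompt, it helps to check roughly whether it fits. Here's a minimal sketch of that check. The 4-characters-per-token ratio and the reply reserve are assumptions I picked for illustration, not anything measured against Gemma's real tokenizer, and the helper names are made up for this example.

```python
from pathlib import Path

CONTEXT_TOKENS = 256_000     # window size quoted above
CHARS_PER_TOKEN = 4          # assumption: rough average, not the real tokenizer
RESERVED_FOR_REPLY = 8_000   # assumption: headroom left for the model's answer


def estimate_tokens(text: str) -> int:
    """Crude estimate: character count divided by the assumed ratio, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts, budget=CONTEXT_TOKENS - RESERVED_FOR_REPLY):
    """Return (estimated_total_tokens, whether that total fits in the budget)."""
    total = sum(estimate_tokens(t) for t in texts)
    return total, total <= budget


def read_sources(root, suffixes=(".py", ".md")):
    """Read every file under root whose extension is in suffixes."""
    return [
        p.read_text(errors="ignore")
        for p in sorted(Path(root).rglob("*"))
        if p.is_file() and p.suffix in suffixes
    ]
```

Calling `fits_in_context(read_sources("."))` gives a quick yes or no. For anything close to the limit, count with the model's actual tokenizer instead of trusting the estimate.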
Multimodal is native now. Vision and audio both, not bolted on. I've done some tests pointing E4B at screenshots and asking it to describe UI flows and it's genuinely useful. Not GPT-4V useful, but useful in a way that makes offline inference feel like a real option for certain workflows.
The agentic side is what Google keeps bringing up. Function calling is cleaner, the models are tuned for multi-step tool use, and they've published benchmarks showing that tool use holds up. I haven't wired mine into a full agent loop yet, but the tool-call formatting in my smaller tests has been consistent, which is half the battle.
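The reason consistent formatting matters is that the receiving side is just a parser and a lookup table. Here's a rough sketch of that side. The JSON shape (`name` plus `arguments`) is a generic assumption for illustration, not Gemma's documented wire format, and `get_weather` is a hypothetical tool.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call an actual weather service."""
    return f"sunny in {city}"


TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(raw: str, tools=TOOLS):
    """Parse one tool call emitted by the model and run the matching function.

    Assumed shape: {"name": "<tool>", "arguments": {...}}.
    Every failure comes back as a dict so it can be fed to the model as feedback.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"not valid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "tool call must be a JSON object"}
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in tools:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be an object"}
    try:
        return {"ok": True, "result": tools[name](**args)}
    except TypeError as exc:
        # Wrong or missing keyword arguments land here.
        return {"ok": False, "error": f"bad arguments: {exc}"}
```

A model that emits malformed JSON one time in ten makes every branch above a hot path. One that formats consistently makes them rare, which is what "half the battle" means in practice.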
There's also multilingual support across 140-ish languages, which I can't really evaluate from English but matters if you're shipping outside it.
The on-device thing
Here's what I didn't expect. The E4B running on my MacBook via Ollama feels like a model I'd have paid for API access to two years ago. It's fast enough that I stopped noticing the latency. It's competent at the small coding tasks I throw at it between bigger ones. It doesn't hallucinate APIs with confidence the way older small models did.
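If you want to script against a local model rather than use the chat prompt, Ollama serves an HTTP API on localhost. The sketch below uses its `/api/generate` endpoint with only the standard library. The model tag `gemma4:e4b` is an assumption on my part; run `ollama list` to see what your install actually calls it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:e4b"  # assumption: hypothetical tag, check `ollama list`


def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=MODEL_TAG, timeout=120):
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Write a one-line docstring for a function that reverses a list.")` returns the model's reply as a string. Setting `stream` to false keeps the example short; for interactive use you'd probably want streaming.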
Google is clearly leaning into this. The blog post and the developer materials keep coming back to edge inference, personal AI, Android integration. For Gemma specifically that feels like the right direction. The frontier stuff lives under the Gemini name. Gemma is where Google is putting the bet that a lot of useful AI will eventually run where the user is, not in a data center.
The license
Apache 2.0. The previous Gemma license had enough ambiguity that a lot of shops just defaulted to Llama for anything they planned to ship. That friction is gone now. If you've been putting off evaluating Gemma because of licensing, there's no reason to put it off anymore.
Should you care
If you're building something that needs to run on user devices, yes. The E2B and E4B models are the best open options I've tested for that use case. They're also small enough that you can fine-tune them on a single GPU, which matters if you're doing any real domain adaptation.
If you're running server-side inference and you were already happy with Llama 3 or Qwen, the 31B dense is worth benchmarking on your actual workload. The leaderboard numbers are a signal, not a verdict. Run it on your prompts before you believe anything.
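Running it on your own prompts doesn't need much tooling. Here's a bare-bones sketch of a harness: it takes any `generate` callable (a wrapper around whatever model you're testing) and a list of prompts, each paired with a pass/fail check. The function names and the result layout are made up for this example.

```python
import time
from statistics import mean


def run_eval(generate, cases):
    """Run each (prompt, check) pair and record pass/fail plus wall-clock time.

    generate: callable taking a prompt string and returning the model's output.
    check: callable taking that output and returning True when it is acceptable.
    """
    results = []
    for prompt, check in cases:
        start = time.perf_counter()
        output = generate(prompt)
        elapsed = time.perf_counter() - start
        results.append(
            {"prompt": prompt, "passed": bool(check(output)), "seconds": elapsed}
        )
    return results


def summarize(results):
    """Collapse per-case results into a pass rate and a mean latency."""
    if not results:
        return {"pass_rate": 0.0, "mean_seconds": 0.0}
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "mean_seconds": mean(r["seconds"] for r in results),
    }
```

Run the same `cases` list against each candidate model and compare the summaries. Twenty prompts pulled from your real traffic will tell you more than a leaderboard position.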
What I keep coming back to is how much the gap between "the best model you can run locally" and "the best model you can pay for" has shrunk in the last year. Gemma 4 is a real step in that direction, and I think the on-device story is where the interesting work is going to happen for a while.