Google shipped Gemma 4 on April 2, 2026, the same week the local-AI ecosystem started feeling crowded for the first time. Four variants, Apache 2.0, 256K context, native vision and audio, 140+ languages — and it was built from the same research as Gemini 3.

This is the most capable open model family Google has released so far. Here is what it covers.

The Four Variants

VariantActive paramsTotal paramsBest for
Gemma 4 E2B2B2BPhones, Raspberry Pi, edge
Gemma 4 E4B4B4BJetson Orin Nano, laptops
Gemma 4 26B A4B3.8B26BSingle H100 / consumer GPU with quantization
Gemma 4 31B31B31BSingle H100 unquantized, max quality

The 26B is a Mixture-of-Experts model: 26B parameters in memory, but only ~3.8B active per token. That makes it the speed/quality sweet spot if you have the VRAM to hold the weights.

The 31B is a dense model — the one to pick for fine-tuning and for the absolute best quality the family offers.

The E-series (“efficient”) models are designed to run completely offline on edge hardware. Near-zero latency, no network round trip.

Strengths

Multimodal out of the box. Text, image, and audio inputs without bolting on a separate vision encoder. Video is supported through frame sampling. This is the first Gemma generation where multimodal is a first-class input rather than a separate model variant.

256K context window. Four times what Gemma 3 offered. Enough to feed in a mid-size codebase, a long PDF, or a multi-hour transcript without chunking.

Intelligence per parameter. The 31B ranks #3 on Arena AI’s open-model leaderboard, with the 26B MoE at #6. Google’s own claim is that Gemma 4 “outcompetes models 20x its size” on math, reasoning, and instruction-following — that is marketing, but the leaderboard numbers do back the general direction.

Agentic-ready. Function calling is supported natively, which is the table-stakes feature for using a model as the brain of a coding agent or workflow runner.

Apache 2.0. No usage restrictions, commercial-friendly, no license-gate to click through.

140+ languages. Strong multilingual performance is a Gemma family tradition and Gemma 4 extends it.

Weaknesses

Coding is not the lead story. Qwen3.6-27B at 77.2% on SWE-bench Verified is currently the open-model champion for coding. Gemma 4 trades raw coding benchmark numbers for breadth — multimodality, languages, reasoning. If you specifically want a local coding model, Qwen is still the pick.

The MoE variant needs VRAM, not compute. 26B parameters have to fit in memory even though only 3.8B activate per token. So while inference is fast, you still need a 24GB+ GPU or unified memory setup to hold the weights.

Long context is not free. 256K is a marketing ceiling. Quality holds up well into the tens of thousands of tokens, less well as you approach the limit. This is true of every model with a long context window; treat it as a soft constraint, not a hard one.

Still behind Gemini 3 Pro at the very top end. The 1M-token, multimodal-reasoning-at-the-frontier regime is closed-model territory for now.

Use Cases Where Gemma 4 Fits

On-device assistants. The E2B and E4B variants are designed to run on phones, single-board computers, and edge devices. If you are building an assistant that should work offline — in a car, on a plane, in a factory — this is the variant family to start with.

Multilingual customer-facing tooling. 140+ languages with quality holding up across them. For a European product that needs to serve Norwegian, Swedish, Polish, and Greek users equally well, Gemma 4 is a stronger starting point than English-centric alternatives.

Multimodal pipelines. Image-plus-text-in, text-out workflows: receipt parsing, chart-reading, document understanding with diagrams, audio transcription with reasoning. Gemma 4 handles these natively without a separate vision-language model.

Agentic workflows on owned hardware. Function calling plus 256K context plus permissive licensing — this is a viable backbone for an agent loop you run on your own infrastructure, where you control the data flow end-to-end.

Fine-tuning targets. The 31B dense variant is the one to pick if you plan to specialize the model on a domain. Dense models fine-tune more predictably than MoE; 31B is large enough to absorb meaningful new knowledge.

How to Run It

The whole family is on Hugging Face and Kaggle, with Google Cloud Vertex AI as the managed option.

For edge (E2B / E4B):

  • MediaPipe LLM Inference on Android / iOS
  • llama.cpp with the GGUF builds on a Raspberry Pi 5 or Jetson Orin Nano
  • Ollama for the easiest desktop path: ollama run gemma4:e4b

For the 26B MoE on a single GPU:

  • llama.cpp with Unsloth Q4_K_XL quantization — fits in ~16GB
  • vLLM if you need throughput and have a 24GB+ card
  • MLX on Apple Silicon with 32GB+ unified memory

For the 31B dense on a single H100:

  • vLLM in bfloat16 for production serving
  • Transformers + bitsandbytes for fine-tuning at 4-bit / 8-bit
  • TGI (Text Generation Inference) if you want HuggingFace’s serving stack

The OpenAI-compatible API endpoint that llama.cpp ships with means most existing tooling — LiteLLM, LangChain, your custom agent loop — works against local Gemma 4 with a config change.

Where Gemma 4 Fits in the May 2026 Landscape

If you are choosing one open model family to standardize on right now:

  • Coding-heavy work → Qwen3.6-27B (10 points of SWE-bench above Gemma 4’s coding numbers)
  • Multimodal or multilingual work → Gemma 4 (breadth advantage)
  • Edge deployment → Gemma 4 E-series (purpose-built for it)
  • Mixed workload, single model → it depends on which axis you care about most

Gemma 4 is not the strongest open model on any single benchmark. It is the most complete open model family — covering edge to datacenter, single modality to four, English to 140+ languages — under a license that lets you ship it.

For most teams, that completeness is more useful than a few percentage points on a single eval.