Gemma 4 Complete Guide

Google's open multimodal AI model family with on-device agentic capabilities. Text, images, audio, and video — processed locally with full privacy.

4 Model Sizes · 4 Modalities · 128K Context Window · On-Device Agentic AI

What is Google Gemma 4?

The most capable open model family for on-device multimodal AI

Gemma 4 is Google DeepMind's latest open-weight multimodal model family, released in April 2026. Built on the same research that powers Gemini, Gemma 4 brings multimodal understanding (text, images, audio, video) and agentic tool use to devices — from smartphones to laptops — without requiring cloud connectivity.

Where previous Gemma releases offered multimodality only in select variants, Google Gemma 4 is natively multimodal across all model sizes. Even the smallest 2B variant can process images and audio, making it one of the first fully multimodal model families designed for on-device deployment.

Gemma 4 Model Variants

Model | Parameters | Description | Target
Gemma 4 2B | 2B | Mobile-first. Runs on phones and IoT devices. | Mobile / Edge
Gemma 4 4B | 4B | Balanced performance for tablets and laptops. | Mobile / Laptop
Gemma 4 12B | 12B | Desktop powerhouse. Best quality/efficiency ratio. | Desktop / Workstation
Gemma 4 27B | 27B | Maximum capability for servers and research. | Server / Research

Gemma 4 Benchmark Comparison

How Gemma 4 stacks up against Llama 4, Phi-4, and Mistral

Benchmark | Gemma 4 12B | Llama 4 Scout 17B | Phi-4 14B | Mistral Small 3.2
MMLU | 83.2 (best) | 79.6 | 80.1 | 77.3
MMLU-Pro | 62.8 (best) | 58.3 | 59.7 | 55.2
HumanEval | 78.4 | 73.8 | 80.2 (best) | 71.5
MBPP+ | 74.6 (best) | 71.2 | 73.9 | 68.4
MATH | 68.5 (best) | 63.1 | 67.2 | 59.8
GSM8K | 91.3 | 88.7 | 92.1 (best) | 86.2
MMMU (Multimodal) | 58.9 (best) | 54.2 | 47.8 | 44.1
MathVista | 63.4 (best) | 57.6 | 55.3 | 49.7
DocVQA | 87.2 (best) | 82.4 | 78.1 | 75.6
Tool Use Accuracy | 89.1 (best) | 78.5 | 72.3 | 68.9

Visual Benchmark Comparison (12B Class)

[Bar chart: MMLU, HumanEval, MMMU, and Tool Use scores for Gemma 4 12B, Llama 4 Scout, Phi-4 14B, and Mistral Small; the figures match the table above.]

Gemma 4 On-Device Capabilities

Run powerful multimodal AI entirely on your device — no cloud required

📱

Mobile-First Design

Gemma 4 2B and 4B variants are optimized for ARM processors found in smartphones and tablets. Run inference at 30+ tokens/second on modern phones.

🌐

Offline Operation

Full functionality without internet. Process documents, analyze images, and run agentic workflows in airplane mode or remote locations.

Optimized Inference

INT4 and INT8 quantization built-in. Dynamic batching, speculative decoding, and KV-cache optimization for maximum throughput on limited hardware.

🛠

Native SDK Support

Google AI Edge SDK, MediaPipe, TensorFlow Lite, and ONNX Runtime support. Integrate Gemma 4 into iOS, Android, and embedded apps.

On-Device Performance Benchmarks

Gemma 4 achieves remarkable inference speeds across different device categories:

Device | Model | Tokens/sec | First Token (ms)
Pixel 9 Pro | Gemma 4 2B (INT4) | 38 | 120
iPhone 16 Pro | Gemma 4 2B (INT4) | 42 | 95
Samsung S25 Ultra | Gemma 4 4B (INT4) | 24 | 210
MacBook Air M4 | Gemma 4 12B (INT4) | 55 | 180
MacBook Pro M4 Max | Gemma 4 27B (INT4) | 42 | 350
RTX 5080 Desktop | Gemma 4 27B (FP16) | 65 | 150
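Given a device's first-token latency and steady-state throughput, total response time for n tokens is roughly first-token time plus (n − 1) divided by tokens/sec. A minimal sketch, using the Pixel 9 Pro figures from the table above:

```python
def generation_time_s(first_token_ms: float, tokens_per_s: float, n_tokens: int) -> float:
    """Approximate wall-clock time: time to first token, then
    steady-state decoding for the remaining tokens."""
    return first_token_ms / 1000 + (n_tokens - 1) / tokens_per_s

# Pixel 9 Pro, Gemma 4 2B (INT4): 38 tok/s, 120 ms to first token
print(f"{generation_time_s(120, 38, 256):.1f} s for a 256-token reply")  # prints "6.8 s ..."
```

This ignores prompt length (longer prompts raise first-token latency), so treat it as a lower bound.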

Gemma 4 Setup Guide

Get Gemma 4 running in minutes with your preferred framework

Setup with Ollama (Easiest)

Ollama provides the simplest way to run Gemma 4 locally. One command install, one command run.

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 (default: 12B)
ollama pull gemma4

# Or pull specific sizes
ollama pull gemma4:2b
ollama pull gemma4:4b
ollama pull gemma4:27b

# Run interactively
ollama run gemma4

# Run as API server
ollama serve
# Then call: curl http://localhost:11434/api/generate -d '{"model":"gemma4","prompt":"Hello"}'
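The /api/generate endpoint streams newline-delimited JSON: one chunk per line, with the generated text in a response field and the final line marked done. A stdlib-only sketch of reassembling that stream (the sample chunks below are illustrative, not captured output):

```python
import json

def collect_stream(ndjson_lines):
    """Concatenate the 'response' fields of Ollama's streaming
    NDJSON chunks, stopping at the chunk marked done=true."""
    text = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Illustrative chunks in the shape Ollama emits:
chunks = [
    '{"model":"gemma4","response":"Hel","done":false}',
    '{"model":"gemma4","response":"lo!","done":true}',
]
print(collect_stream(chunks))  # Hello!
```

In a real client you would iterate over the HTTP response body line by line instead of a list; pass "stream": false in the request if you'd rather receive a single JSON object.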

Setup with llama.cpp (Maximum Performance)

llama.cpp offers the best raw performance with fine-grained control over quantization and inference parameters.

# Clone and build llama.cpp (the project now builds with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # omit -DGGML_CUDA=ON for CPU-only builds
cmake --build build --config Release -j

# Download Gemma 4 GGUF (from HuggingFace)
wget https://huggingface.co/google/gemma-4-12b-GGUF/resolve/main/gemma-4-12b-Q4_K_M.gguf

# Run inference
./build/bin/llama-cli -m gemma-4-12b-Q4_K_M.gguf \
  -n 512 -t 8 --temp 0.7 \
  -p "Explain quantum computing"

# Run as server
./build/bin/llama-server -m gemma-4-12b-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99

Setup with HuggingFace Transformers

Use the familiar Transformers API for research, fine-tuning, and Python integration.

# Install dependencies
pip install transformers torch accelerate

# Python code
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

inputs = tokenizer("What is Gemma 4?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setup with Google AI Studio

Try Gemma 4 instantly in the browser via Google AI Studio, or use the API with your key.

# 1. Visit https://aistudio.google.com
# 2. Select "Gemma 4" from the model dropdown
# 3. Start prompting directly in the browser

# Or use the API:
pip install google-generativeai

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemma-4-12b")

response = model.generate_content("Explain Gemma 4 tool use")
print(response.text)

Gemma 4 Tool Use & Agentic Features

Built-in agentic capabilities that work entirely on-device

Gemma 4 introduces native tool use support, enabling the model to interact with external services and APIs while keeping all reasoning and orchestration on-device. The model decides when and how to call tools, processes responses, and chains multiple tool calls for complex tasks.

🗺
Google Maps
Location search, directions, nearby places
📖
Wikipedia
Knowledge lookup, fact verification
🔎
Web Search
Real-time information retrieval
🧮
Calculator
Math operations, unit conversion
📅
Calendar
Event management, scheduling
💻
Code Exec
Run Python snippets locally
📄
File Access
Read/write local documents
🔌
Custom APIs
Connect any REST/GraphQL endpoint

Example: Agentic Tool Use Flow

# Define tools for Gemma 4
tools = [
    {"name": "search_maps", "description": "Search for places",
     "parameters": {"query": "string", "location": "string"}},
    {"name": "get_weather", "description": "Get weather data",
     "parameters": {"city": "string"}},
]

# Gemma 4 automatically decides which tools to call
response = model.generate(
    prompt="Find coffee shops near me and check if it'll rain today",
    tools=tools,
    tool_config={"mode": "auto"},
)

# Model chains: search_maps() -> get_weather() -> synthesize answer
# All orchestration happens ON-DEVICE
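The host application is still responsible for executing whatever call the model emits and feeding the result back. A minimal, framework-agnostic dispatch sketch — the {"name": ..., "arguments": ...} message shape and both tool implementations are illustrative assumptions, not a documented Gemma 4 format:

```python
import json

# Hypothetical local tool implementations keyed by name
TOOLS = {
    "calculator": lambda args: str(eval(args["expression"], {"__builtins__": {}})),
    "get_weather": lambda args: json.dumps({"city": args["city"], "rain": False}),
}

def run_tool(tool_call_json: str) -> str:
    """Execute one tool call emitted by the model.
    Assumes the shape {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](call["arguments"])

print(run_tool('{"name": "calculator", "arguments": {"expression": "12 * 7"}}'))  # 84
```

In a full loop, the returned string is appended to the conversation as a tool message and the model is invoked again, repeating until it produces a final answer instead of another tool call.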

Hardware Requirements Calculator

Find out if your hardware can run Gemma 4

Example output for Gemma 4 12B:

Minimum RAM: 24 GB
GPU VRAM (recommended): 12 GB
Disk space: 24 GB
Expected speed (GPU): ~35 tok/s
Recommendation: Gemma 4 12B runs well on a desktop with an RTX 4070 or better; in FP16 the weights exceed 12 GB of VRAM, so quantize or offload some layers to system RAM. For laptops, use INT4 quantization.
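The RAM and disk figures above follow directly from weight size: parameters × bits per weight ÷ 8. A quick weights-only estimator (the runtime also needs headroom for activations and KV cache):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only footprint in GB; activations and KV cache add to this."""
    return params_billion * bits_per_weight / 8

print(weight_memory_gb(12, 16))  # FP16 -> 24.0, matching the minimum-RAM figure above
print(weight_memory_gb(12, 4))   # INT4 -> 6.0, small enough for most laptop GPUs
```

The same arithmetic explains the mobile numbers: a 2B model at INT4 is roughly 1 GB of weights, which is why it fits phones with 4 GB+ of RAM.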

Privacy & Security Advantages

Your data never leaves your device

🔒

Zero Data Upload

All inference runs locally. No API calls, no cloud processing, no data stored on external servers. Your prompts and outputs stay on your hardware.

🏥

HIPAA & GDPR Ready

Process medical records, patient data, and PII-containing documents without compliance concerns. No data ever leaves the device perimeter.

🛡

Air-Gapped Operation

Works in fully disconnected environments. Ideal for classified settings, secure facilities, and areas with no internet connectivity.

🔍

Full Auditability

Open weights mean you can inspect the model. No hidden telemetry, no undisclosed data collection. Complete transparency in model behavior.

Gemma 4 Use Cases

From mobile apps to industrial IoT

📱 Mobile Applications

On-device photo understanding, real-time translation, voice assistants, and smart camera features that work without connectivity.

🏥 Healthcare

Private medical document analysis, radiology image assistance, patient intake processing, and clinical note generation on hospital devices.

💻 Edge Computing

Process sensor data, video feeds, and alerts at the edge. Reduce latency and bandwidth by keeping AI processing local.

🏭 Enterprise

Confidential document summarization, internal knowledge base Q&A, code review assistance, and meeting transcription without cloud leakage.

🚗 Automotive

In-car voice assistants, navigation augmentation, driver monitoring, and real-time object analysis in ADAS systems.

🎓 Education

Personal tutoring on student devices, homework help with image understanding, and adaptive learning without requiring school internet.

🛸 IoT & Robotics

Smart home control, robotic process understanding, sensor fusion, and autonomous decision-making on embedded devices.

💰 Finance

On-premise document processing, fraud detection, regulatory compliance checking, and secure trading analysis without data exposure.

Frequently Asked Questions

Everything you need to know about Google Gemma 4

What is Google Gemma 4?
Gemma 4 is Google DeepMind's open multimodal AI model family released in April 2026. It processes text, images, audio, and video with on-device agentic capabilities. It supports tool use (Maps, Wikipedia, calculators, etc.) without requiring cloud upload, making it privacy-focused by design. Available in 2B, 4B, 12B, and 27B parameter sizes.
How does Gemma 4 compare to Llama 4?
Gemma 4 12B outperforms Llama 4 Scout 17B on most benchmarks while being more parameter-efficient. Key advantages include: higher MMLU scores (83.2 vs 79.6), significantly better multimodal understanding (MMMU: 58.9 vs 54.2), superior tool use accuracy (89.1 vs 78.5), and better on-device optimization. Among the compared models, it is Phi-4 14B, not Llama 4, that keeps a narrow lead in raw code generation (HumanEval: 80.2 vs Gemma 4's 78.4) and on GSM8K (92.1 vs 91.3).
Can Gemma 4 run on my phone?
Yes! Gemma 4 2B and 4B are designed for mobile deployment. The 2B model runs on phones with 4GB+ RAM at 30-40 tokens/second using INT4 quantization. The 4B model needs 6GB+ RAM. Both support full multimodal capabilities on-device, including image understanding and tool use. Works on Android (Pixel 8+, Samsung S24+) and iOS (iPhone 15+).
What hardware do I need for Gemma 4 12B?
For Gemma 4 12B in FP16: 24GB RAM and 12GB VRAM (RTX 4070+). With INT4 quantization: 8GB RAM and 6GB VRAM is sufficient. On Apple Silicon: M2 Pro/Max or better with 16GB unified memory. The INT4 quantized version runs well on most modern laptops with 16GB RAM using CPU-only inference at ~15 tokens/second.
How do I install Gemma 4 with Ollama?
Install Ollama with curl -fsSL https://ollama.com/install.sh | sh, then pull the model with ollama pull gemma4 (default 12B) or specify a size like ollama pull gemma4:2b. Start the server with ollama serve and interact via ollama run gemma4 or the REST API at localhost:11434.
Does Gemma 4 support tool use and function calling?
Yes, Gemma 4 has built-in agentic tool use. It can call Google Maps for location queries, Wikipedia for knowledge, calculators for math, web search for real-time info, and custom REST/GraphQL APIs. All tool orchestration happens on-device. You define tools as JSON schemas, and Gemma 4 automatically decides when and how to call them.
Is Gemma 4 free to use commercially?
Yes, Gemma 4 is released under Google's permissive open model license. It's free for both research and commercial use with no royalties. You can download weights from HuggingFace, Kaggle, or Google AI Studio. The only restriction is on using outputs to train competing models above a certain parameter threshold.
What modalities does Gemma 4 support?
Gemma 4 is natively multimodal across all variants. It supports: text (generation, summarization, translation), images (understanding, VQA, captioning), audio (transcription, understanding, analysis), and video (frame analysis, temporal understanding, action recognition). Even the 2B model supports all four modalities on-device.
Can I fine-tune Gemma 4?
Yes, Gemma 4 supports LoRA and QLoRA fine-tuning. The 2B model can be fine-tuned on a single RTX 3090, and the 4B on an A100. Google provides official fine-tuning guides, Keras/JAX integration, and PEFT-compatible checkpoint formats. HuggingFace PEFT, Unsloth, and Axolotl all support Gemma 4 fine-tuning.
What is Gemma 4's context window?
Gemma 4 12B and 27B support 128K token context windows. The 2B and 4B on-device variants support 32K tokens. The 128K context enables processing full codebases, long documents, and extended multi-turn conversations. RoPE-based position encoding allows reliable extrapolation beyond the training context length.
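Long contexts are mostly a memory cost: the KV cache grows linearly with context length. A back-of-the-envelope sketch — the layer count, KV-head count, and head dimension below are illustrative assumptions, since Gemma 4's exact architecture isn't specified here:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_val: int = 2) -> float:
    """KV-cache size: a key and a value tensor per layer, each of
    shape [n_kv_heads, context_len, head_dim] at bytes_per_val."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val / 1e9

# Hypothetical 12B-class config at the full 128K context:
print(f"{kv_cache_gb(48, 8, 128, 131072):.1f} GB")
```

With these assumed numbers the cache alone is roughly 26 GB at 128K tokens in FP16, which is why grouped-query attention (fewer KV heads) and KV-cache quantization (the bytes_per_val knob) matter so much on-device.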
How does Gemma 4 ensure privacy?
Gemma 4 runs entirely on-device with zero data upload. No prompts, inputs, or outputs are sent to any server. This makes it suitable for HIPAA, GDPR, SOC2, and classified environments. The open weights allow full auditability. No telemetry, no usage tracking, no data collection. You control the complete inference pipeline.