How to run Gemma 4 12B locally with Ollama

11 views
3 mins read

Google’s Gemma 4 12B is one of the most capable open models you can run on a regular laptop right now. Not a server. Not a cloud subscription. Your actual laptop — as long as it has 16GB of RAM.

This guide walks you through the full setup using Ollama, the easiest way to run AI models locally. No Python environment. No CUDA headaches. Just a few terminal commands and you’re talking to the model in under 10 minutes.

What you need before you start

Gemma 4 12B at full precision needs around 26.7GB of memory, which is a lot. But with 4-bit quantization (the default in Ollama), you can get it down to roughly 6.7GB VRAM or 13–16GB system RAM.

Here’s the minimum you need:

  • 16GB of system RAM (works fine without a GPU using CPU inference)
  • 8–12GB VRAM if you want GPU acceleration for faster responses
  • About 8GB of free disk space for the model download
  • Windows 10/11, macOS 12+, or any modern Linux distro

No GPU? You can still run it on CPU. Responses will be slower (expect 3–6 tokens per second), but it works.

Step 1: Install Ollama

Head to ollama.com/download and grab the installer for your OS.

Windows: Download OllamaSetup.exe and run it. It installs like any normal app and starts a background service automatically.

macOS: Download the zip, unzip it, and drag the Ollama app into your Applications folder. Launch it and you’ll see the alpaca icon in your menu bar.

Linux: Open a terminal and run this single command:

curl -fsSL https://ollama.com/install.sh | sh

Once installed, confirm it’s working:

ollama --version

You should see a version number. If you do, you’re ready.

See Also
How to run Gemma 4 locally with Unsloth AI

Step 2: Pull the Gemma 4 12B model

This is the one command that downloads everything you need:

ollama pull gemma4:12b

The download is around 7–8GB, so it’ll take a few minutes depending on your connection. Ollama shows a progress bar while it runs. Let it finish before moving on.

If you’re not sure which model size fits your hardware, here’s the quick reference:

  • gemma4:e2b — runs on 4–8GB RAM, fastest, least capable
  • gemma4:e4b — runs on 8GB+ RAM, good all-rounder
  • gemma4:12b — runs on 16GB RAM, best quality-to-hardware ratio
  • gemma4:26b — needs 16GB+ VRAM, near-frontier quality

The 12B hits the sweet spot for most people with a standard developer machine.

Step 3: Run the model

Once the download finishes, start a chat session with:

ollama run gemma4:12b

Ollama loads the model and drops you into an interactive prompt. Type anything and hit Enter. Your first response will take a few extra seconds while it loads into memory. After that, it’s fast.

To exit the session, type /bye and press Enter.

Step 4: Use the API (optional but useful)

Ollama also runs a local REST API at http://localhost:11434 automatically. This means you can connect it to tools like Open WebUI, Obsidian, or any app that supports custom OpenAI-compatible endpoints.

To test it from a second terminal window:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b",
  "prompt": "Explain what an LLM is in two sentences.",
  "stream": false
}'

You’ll get a JSON response with the model’s output. From here, you can build scripts, connect to frontends, or integrate Gemma 4 into your own apps.

See Also
How to run Gemma 4 locally with Unsloth AI

What Gemma 4 12B is actually good at

After testing it, here’s what stood out:

  • Coding: It writes clean Python and JavaScript. Not perfect, but genuinely useful for everyday tasks
  • Image understanding: You can pass an image path and ask questions about it (supported via Ollama’s multimodal API)
  • Long context: The model supports up to 128K tokens, so you can paste large documents without truncation issues
  • Instruction following: It respects system prompts well, which makes it easy to customize for specific roles or workflows

Where it struggles: very long multi-step reasoning chains and tasks that need real-time information. For those, you still want a cloud model. For everything local and private, the 12B delivers.

Upgrade your setup with Open WebUI

The terminal works fine, but if you want a ChatGPT-style browser interface, install Open WebUI. It connects directly to your local Ollama instance and gives you chat history, model switching, and file uploads — all running 100% offline.

Install it with Docker in one command:

docker run -d -p 3000:80 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser. Select Gemma 4 12B from the model dropdown and you’re running a private, local AI assistant with a full UI.

Local AI is no longer a hobbyist experiment. A 12B model running on a regular laptop in 2026 is genuinely useful for daily work — and Gemma 4 12B is one of the best options to start with. Download Ollama today at ollama.com and have it running before dinner.

Leave a Reply

Your email address will not be published.

Previous Story

China’s new Lisuan LX 7G100 GPU is a technical marvel you shouldn’t buy

Latest from Blog