How to run Gemma 4 locally with Unsloth AI

Google’s Gemma 4 fits in 5GB of RAM. The largest variant tops out at 20GB at 4-bit. That means most people reading this can run a genuinely capable open model on hardware they already own, right now.

Unsloth updated their full stack for Gemma 4 on April 8, 2026, adding llama.cpp fixes and full support inside their no-code Studio UI. There are two ways to get this running: a browser-based interface that handles everything for you, or a manual llama.cpp setup for people who want full control. This guide covers both.

Pick the right variant before you download anything

Gemma 4 ships in four models. Downloading the wrong one wastes time, so sort this out first.

E2B: needs 4GB RAM at 4-bit. Built for phones and edge devices. Supports text, image, and audio. Good for ASR and speech translation.
E4B: needs 5.5–6GB RAM at 4-bit. Better quality than E2B, still runs on most laptops. Also supports text, image, and audio.
26B-A4B: needs 16–18GB RAM at 4-bit. A Mixture of Experts model with only 4B active parameters per pass. Fast, capable, and the best speed-to-quality tradeoff for desktop use.
31B: needs 17–20GB RAM at 4-bit. The strongest Gemma 4 model. Slightly slower than 26B-A4B but scores higher on every benchmark.

If you have an RTX 3090, 4090, or a Mac with 24GB+ unified memory, start with 26B-A4B. On a laptop with 8–16GB RAM, E4B hits the sweet spot. The 31B earns its extra memory only if you need maximum output quality and can accept slower generation.

All four variants are Apache 2.0 licensed, meaning commercial use is permitted without restrictions. The full Unsloth Gemma 4 GGUF collection is on Hugging Face.

Method 1: Unsloth Studio (no terminal experience needed)

Unsloth Studio is a local web UI that wraps llama.cpp with a chat interface, auto-sets inference parameters, and handles model downloads. It runs on macOS, Windows, and Linux. If you’ve never run a local model before, start here.

unsloth studio gemma4 model selector — Unsloth Studio’s model selector showing Gemma 4 E4B quantization options. UD-Q4_K_XL at 4.8GB is the recommended starting point for most users

Step 1: Install Unsloth.

On macOS, Linux, or WSL, run:

curl -fsSL https://unsloth.ai/install.sh | sh

On Windows PowerShell:

irm https://unsloth.ai/install.ps1 | iex

Step 2: Launch the Studio.

unsloth studio -H 0.0.0.0 -p 8888

Open http://localhost:8888 in your browser. On first launch, you’ll create a password and see a short onboarding wizard. You can skip the wizard at any time.

Step 3: Download Gemma 4.

Go to the Studio Chat tab and search “Gemma 4.” Pick your variant and your quantization: use 8-bit for E2B/E4B and Dynamic 4-bit (UD-Q4_K_XL) for 26B-A4B and 31B. Hit download and wait.

Step 4: Run it.

Inference parameters are auto-configured when using Unsloth Studio. You can still change the context length, chat template, and individual settings manually. The full settings reference is in the Unsloth Studio inference guide.

Method 2: llama.cpp (manual, more control)

This path is better for Linux servers, CPU-only machines, or anyone who wants to script their own setup. Apple Silicon users should follow this too, just swap the CUDA flag.

First, build llama.cpp. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don’t have an NVIDIA GPU. Metal support for Apple devices is on by default, so Mac users can just set CUDA to OFF and continue normally.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Now run the model directly. Here’s the command for 26B-A4B:

export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64

For E4B on a laptop:

export LLAMA_CACHE="unsloth/gemma-4-E4B-it-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64

If you’d rather download the model file first before running, install the Hugging Face CLI and pull it manually:

pip install huggingface_hub hf_transfer
hf download unsloth/gemma-4-26B-A4B-it-GGUF \
    --local-dir unsloth/gemma-4-26B-A4B-it-GGUF \
    --include "*mmproj-BF16*" \
    --include "*UD-Q4_K_XL*"

If downloads stall, check the Unsloth Hugging Face Hub debugging guide.

Inference settings that actually matter

Google published recommended defaults for Gemma 4, and they’re worth sticking to:

Temperature: 1.0
Top-p: 0.95
Top-k: 64
Repetition penalty: leave at 1.0 (disabled) unless the model starts looping output

For context window, start at 32K. Only go higher if you’re feeding long documents. The 26B-A4B and 31B models support up to 256K context, but that eats RAM quickly and slows generation noticeably.

One firm warning from the Unsloth team: do not use CUDA 13.2 runtime when running any Gemma 4 GGUF. It causes degraded outputs. This is a known issue as of the April 8 update.

How to enable thinking mode

Gemma 4 has an optional reasoning channel that shows the model’s internal thought process before the final answer. It’s off by default and easy to turn on.

Add <|think|> at the very start of your system prompt:

<|think|>
You are a careful coding assistant. Explain your answer clearly.

With thinking enabled, the model outputs a thought block first, then the final answer:

<|channel>thought
[internal reasoning]
<channel|>
[final answer]

unsloth studio gemma4 chat response tools — Gemma 4 26B-A4B running on UD-Q4_K_XL inside Unsloth Studio. The model used web search and Python tools to answer a live question, then returned a clean formatted table

If you’re running llama-server and want to toggle thinking via a flag instead of editing the system prompt, use:

./llama.cpp/llama-server \
    --model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --port 8001 \
    --chat-template-kwargs '{"enable_thinking":true}'

On Windows PowerShell, escape the inner quotes: --chat-template-kwargs "{\"enable_thinking\":false}"

One rule for multi-turn chats: never feed prior thought blocks back into the next turn. Only keep the final visible answer in the chat history. Passing thought blocks back inflates context and produces worse results. This is not optional, it’s how the model expects to be used.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Real Tech. No Filter

How to run Gemma 4 locally with Unsloth AI

Pick the right variant before you download anything

Method 1: Unsloth Studio (no terminal experience needed)

Method 2: llama.cpp (manual, more control)

Inference settings that actually matter

How to enable thinking mode

Tags:

How to run Gemma 4 12B locally with Ollama

Aji

Leave a Reply Cancel reply

NVIDIA GeForce RTX 5090 vs RTX 4090: A Comprehensive Performance Analysis

Google’s AI Search Faces Legal Reckoning: Antitrust, Copyright, and the Future of Online Search

The US just forced Anthropic to pull its most powerful AI models

How to run Gemma 4 12B locally with Ollama

China’s new Lisuan LX 7G100 GPU is a technical marvel you shouldn’t buy

8 creative apps Claude can now control directly

Aji

Latest from Blog

The US just forced Anthropic to pull its most powerful AI models

How to run Gemma 4 12B locally with Ollama

China’s new Lisuan LX 7G100 GPU is a technical marvel you shouldn’t buy

8 creative apps Claude can now control directly

GPT Image 2 knows who Sam Altman is. That’s the story.

Pick the right variant before you download anything

Method 1: Unsloth Studio (no terminal experience needed)

Method 2: llama.cpp (manual, more control)

Inference settings that actually matter

How to enable thinking mode

Tags:

You might be interested in

Leave a Reply Cancel reply

GPT Image 2 knows who Sam Altman is. That’s the story.

8 creative apps Claude can now control directly

Latest from Blog