Libraries

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF", dtype="auto")

llama-cpp-python

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF",
	filename="gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Use Docker

docker model run hf.co/llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

vLLM

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

SGLang

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
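
The curl calls above all target the same OpenAI-compatible route, so a small Python client can stand in for any of them. This is a minimal sketch using only the standard library; `build_chat_request` and `chat` are illustrative helper names, and the base URL depends on which server you started (port 8000 for vLLM and 30000 for SGLang in the commands above).

```python
import json
import urllib.request

MODEL_ID = "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF"


def build_chat_request(base_url, prompt, model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (needs one of the servers above to be running):
# print(chat("http://localhost:30000", "What is the capital of France?"))
```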

Ollama
How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with Ollama:
```
ollama run hf.co/llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M
```

Unsloth Studio

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF to start chatting

Pi

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
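
If you switch between quants often, the same JSON can be generated instead of edited by hand. A small sketch, assuming only the structure shown above; `pi_provider_config` is a hypothetical helper name, and the script prints the config rather than touching any file.

```python
import json


def pi_provider_config(model_id, base_url="http://localhost:8080/v1"):
    """Return the provider block shown above for a given model id."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


config = pi_provider_config(
    "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M"
)
print(json.dumps(config, indent=2))
```

Redirect the output to ~/.pi/agent/models.json once it looks right.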

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with Docker Model Runner:
```
docker model run hf.co/llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M
```

Lemonade

How to use llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-GGUF-Q4_K_M

List all available models

lemonade list

🚨⚠️ I HAVE REACHED HUGGING FACE'S FREE STORAGE LIMIT ⚠️🚨

I can no longer upload new models unless I can cover the cost of additional storage.
I host 70+ free models as an independent contributor and this work is unpaid.
Without your support, no more new models can be uploaded.

🎉 Patreon (Monthly) | ☕ Ko-fi (One-time)

Every contribution goes directly toward Hugging Face storage fees to keep models free for everyone.

87% fewer refusals (13/100 Uncensored vs 99/100 Original) while preserving model quality (0.0367 KL divergence).

❤️ Support My Work

Creating these models takes significant time, work, and compute. If you find them useful, consider supporting me:

Platform	Link	What you get
🎉 Patreon	Monthly support	Priority model requests
☕ Ko-fi	One-time tip	My eternal gratitude

Your support motivates me and goes toward improving my workflow and covering storage and compute fees; it may even help uncensor bigger models on rented cloud GPUs.

GGUF quantizations of llmfan46/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic.

This is a decensored version of yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF, made using Heretic v1.4.0 with a variant of the Magnitude-Preserving Orthogonal Ablation (MPOA) method.

Abliteration parameters

Parameter	Value
direction_index	29.18
attn.o_proj.max_weight	1.30
attn.o_proj.max_weight_position	35.73
attn.o_proj.min_weight	0.90
attn.o_proj.min_weight_distance	26.76
mlp.down_proj.max_weight	1.49
mlp.down_proj.max_weight_position	38.14
mlp.down_proj.min_weight	1.43
mlp.down_proj.min_weight_distance	18.44

Targeted components

attn.o_proj
mlp.down_proj
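
The card lists the parameters but not how they are applied, so here is a rough sketch under stated assumptions. It assumes each component's four values describe a per-layer kernel that peaks at max_weight_position and falls off linearly to min_weight over min_weight_distance layers (my reading of the parameter names, not something this card confirms). It then applies plain directional ablation, W' = W - w * r (r^T W); the magnitude-preserving step of MPOA and the exact variant used here are not reproduced. Both function names are illustrative.

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Per-layer ablation strength under an ASSUMED linear fall-off kernel."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # outside the kernel: layer left untouched
    fraction = distance / min_weight_distance
    return max_weight + fraction * (min_weight - max_weight)


def ablate_direction(matrix, direction, weight):
    """Return W - weight * r (r^T W).

    matrix: list of rows, one row per output dimension.
    direction: unit vector r over the output dimension.
    """
    n_cols = len(matrix[0])
    # r^T W: how strongly each input column writes along r
    projection = [
        sum(direction[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_cols)
    ]
    return [
        [matrix[i][j] - weight * direction[i] * projection[j] for j in range(n_cols)]
        for i in range(len(matrix))
    ]


# attn.o_proj values from the table above
attn = dict(max_weight=1.30, max_weight_position=35.73,
            min_weight=0.90, min_weight_distance=26.76)
for layer in (0, 9, 36):
    print(layer, round(ablation_weight(layer, **attn), 4))
```

With weight 1.0 the output of the matrix has no component left along r; weights above 1.0, as in the table, push past that point.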

Performance

Metric	This model	Original model (gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF)
KL divergence	0.0367	0 (by definition)
Refusals	✅ 13/100	❌ 99/100
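
For readers who want to check the two headline numbers, here is a small sketch. It recomputes the relative drop in refusals from the counts above and shows why a model compared with itself has a KL divergence of exactly 0. The distributions `p` and `q` are made-up toy values, and the card does not say which prompts or token positions the 0.0367 figure was measured on.

```python
import math


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


print(f"refusal reduction: {relative_reduction(99, 13):.1%}")

p = [0.7, 0.2, 0.1]  # toy next-token distribution (original model)
q = [0.6, 0.3, 0.1]  # toy next-token distribution (modified model)
print("KL(p || p) =", kl_divergence(p, p))
print("KL(p || q) =", round(kl_divergence(p, q), 4))
```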

MMLU test results:

Original:

============================================================

Total questions: 7021
Correct: 5024
Accuracy: 0.7156 (71.56%)
Parse failures: 313

============================================================

Tested subject scores:

professional_law: 0.6076 (477/785)
moral_scenarios: 0.6719 (297/442)
miscellaneous: 0.8277 (317/383)
professional_psychology: 0.7722 (244/316)
high_school_psychology: 0.8556 (231/270)
high_school_macroeconomics: 0.7868 (155/197)
elementary_mathematics: 0.6739 (124/184)
moral_disputes: 0.7414 (129/174)
prehistory: 0.8081 (139/172)
philosophy: 0.7421 (118/159)
high_school_biology: 0.9145 (139/152)
professional_accounting: 0.5385 (77/143)
clinical_knowledge: 0.8071 (113/140)
high_school_microeconomics: 0.8235 (112/136)
nutrition: 0.7852 (106/135)
professional_medicine: 0.4925 (66/134)
conceptual_physics: 0.7812 (100/128)
high_school_mathematics: 0.1890 (24/127)
human_aging: 0.7155 (83/116)
security_studies: 0.7857 (88/112)
high_school_statistics: 0.6486 (72/111)
marketing: 0.8991 (98/109)
high_school_world_history: 0.8585 (91/106)
sociology: 0.8738 (90/103)
high_school_government_and_politics: 0.8812 (89/101)
high_school_geography: 0.8485 (84/99)
high_school_chemistry: 0.6495 (63/97)
high_school_us_history: 0.8526 (81/95)
virology: 0.4944 (44/89)
college_medicine: 0.7500 (66/88)
world_religions: 0.7727 (68/88)
high_school_physics: 0.5000 (42/84)
electrical_engineering: 0.6790 (55/81)
astronomy: 0.7342 (58/79)
logical_fallacies: 0.8026 (61/76)
high_school_european_history: 0.8082 (59/73)
anatomy: 0.7606 (54/71)
college_biology: 0.8281 (53/64)
human_sexuality: 0.8125 (52/64)
formal_logic: 0.5000 (32/64)
public_relations: 0.6393 (39/61)
international_law: 0.8333 (50/60)
college_physics: 0.4035 (23/57)
college_mathematics: 0.3273 (18/55)
econometrics: 0.6667 (36/54)
jurisprudence: 0.7358 (39/53)
high_school_computer_science: 0.9038 (47/52)
machine_learning: 0.7115 (37/52)
medical_genetics: 0.7255 (37/51)
global_facts: 0.4314 (22/51)
management: 0.9200 (46/50)
us_foreign_policy: 0.9200 (46/50)
college_chemistry: 0.3617 (17/47)
abstract_algebra: 0.4681 (22/47)
business_ethics: 0.7174 (33/46)
college_computer_science: 0.6222 (28/45)
computer_security: 0.7674 (33/43)

Heretic:

============================================================

Total questions: 7021
Correct: 5016
Accuracy: 0.7144 (71.44%)
Parse failures: 346

============================================================

Tested subject scores:

professional_law: 0.5924 (465/785)
moral_scenarios: 0.6493 (287/442)
miscellaneous: 0.8277 (317/383)
professional_psychology: 0.7880 (249/316)
high_school_psychology: 0.8630 (233/270)
high_school_macroeconomics: 0.8173 (161/197)
elementary_mathematics: 0.6522 (120/184)
moral_disputes: 0.7471 (130/174)
prehistory: 0.8081 (139/172)
philosophy: 0.7799 (124/159)
high_school_biology: 0.9079 (138/152)
professional_accounting: 0.5804 (83/143)
clinical_knowledge: 0.7857 (110/140)
high_school_microeconomics: 0.8235 (112/136)
nutrition: 0.8074 (109/135)
professional_medicine: 0.4328 (58/134)
conceptual_physics: 0.7969 (102/128)
high_school_mathematics: 0.1732 (22/127)
human_aging: 0.7155 (83/116)
security_studies: 0.7768 (87/112)
high_school_statistics: 0.6036 (67/111)
marketing: 0.8991 (98/109)
high_school_world_history: 0.8396 (89/106)
sociology: 0.8738 (90/103)
high_school_government_and_politics: 0.9109 (92/101)
high_school_geography: 0.8586 (85/99)
high_school_chemistry: 0.6701 (65/97)
high_school_us_history: 0.8421 (80/95)
virology: 0.4831 (43/89)
college_medicine: 0.7727 (68/88)
world_religions: 0.8068 (71/88)
high_school_physics: 0.5000 (42/84)
electrical_engineering: 0.6420 (52/81)
astronomy: 0.7595 (60/79)
logical_fallacies: 0.8158 (62/76)
high_school_european_history: 0.8082 (59/73)
anatomy: 0.7887 (56/71)
college_biology: 0.8594 (55/64)
human_sexuality: 0.7969 (51/64)
formal_logic: 0.5312 (34/64)
public_relations: 0.6557 (40/61)
international_law: 0.8833 (53/60)
college_physics: 0.3684 (21/57)
college_mathematics: 0.2727 (15/55)
econometrics: 0.6111 (33/54)
jurisprudence: 0.7547 (40/53)
high_school_computer_science: 0.8654 (45/52)
machine_learning: 0.6538 (34/52)
medical_genetics: 0.7647 (39/51)
global_facts: 0.4510 (23/51)
management: 0.9000 (45/50)
us_foreign_policy: 0.9200 (46/50)
college_chemistry: 0.3617 (17/47)
abstract_algebra: 0.4468 (21/47)
business_ethics: 0.7391 (34/46)
college_computer_science: 0.6222 (28/45)
computer_security: 0.7907 (34/43)

MMLU (Massive Multitask Language Understanding) is a benchmark of multiple-choice questions across 57 subjects (math, history, law, medicine, etc.).
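
To put the two totals side by side, the sketch below recomputes both accuracies and compares the gap with a rough binomial standard error. Treating the questions as independent draws is a simplifying assumption (both runs answer the same questions, so a paired test would be the stricter tool); the point is only to give the 8-question difference some scale.

```python
import math


def accuracy(correct, total):
    return correct / total


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)


TOTAL = 7021
orig = accuracy(5024, TOTAL)
heretic = accuracy(5016, TOTAL)
delta = heretic - orig
se = binomial_se(orig, TOTAL)

print(f"original {orig:.4f} | heretic {heretic:.4f} | delta {delta:+.4f} | SE ~{se:.4f}")
```

The gap is about a tenth of a percentage point, well inside one standard error of roughly half a point, which is consistent with the claim that quality is preserved on this benchmark.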

Quantizations

Filename	Quant	Description
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-F16.gguf	F16	Full precision
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-Q8_0.gguf	Q8_0	Near-lossless, recommended
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-Q6_K.gguf	Q6_K	Excellent quality
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-Q5_K_M.gguf	Q5_K_M	Good balance
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-Q5_K_S.gguf	Q5_K_S	Smaller Q5
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-Q4_K_M.gguf	Q4_K_M	Good for limited VRAM

Vision Projector

Filename	Quant	Description
gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic-mmproj-F16.gguf	F16	Native precision

A vision projector file is required for vision/multimodal capabilities. Use it alongside any quantization above.
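
As a sketch of how the two files pair up, the helper below assembles a llama-server command that loads a quant together with the projector through llama.cpp's --mmproj flag. It assumes both files are already downloaded to the working directory; `llama_server_command` is an illustrative name, and the command is only printed, not executed.

```python
import shlex

PREFIX = "gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic"


def llama_server_command(model_path, mmproj_path=None, port=8080):
    """Build the argument list for llama-server, adding --mmproj when given."""
    cmd = ["llama-server", "-m", model_path, "--port", str(port)]
    if mmproj_path:
        cmd += ["--mmproj", mmproj_path]
    return cmd


cmd = llama_server_command(f"{PREFIX}-Q4_K_M.gguf", f"{PREFIX}-mmproj-F16.gguf")
print(shlex.join(cmd))
```

Pass the list to subprocess.run(cmd) to actually start the server.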

Usage

Works with llama.cpp, LM Studio, Ollama, and other GGUF-compatible tools.

💻🤖 Gemma4-12B v2 — safetensors master (full precision) ✨

Coding + Agentic Edition · Composer 2.5 × Fable 5 · v2

This is the full-precision safetensors master for my Gemma 4 12B coding + agentic fine-tune — the same model many of you have been running as GGUF, now in its original weights. 🧠🛠️ v2 is the big agentic upgrade: it reads, reasons, uses tools, and works through multi-step technical tasks before it acts. This repo is for builders — roll your own quants, fine-tune further, or run it in transformers.

🎉 Surprise!

A huge thank-you for all the attention this project has gotten — really, thank you. 🙏 I only managed to get out tonight to upload the full-precision original (safetensors master) of this model, so sorry for the wait — I'd planned to put it up last week. But the delay comes with two big surprises I've been dying to share:

1. v3 is coming soon. 🔮 The next version is on its way and will fix several of the known issues you've reported.

2. I'm now working with a top-tier AI lab to give back to the open-source community. 🤝 Many of you have already noticed the side effects in v1 and v2 — and honestly they come down to just two things: (1) not enough compute, and (2) one person with limited expertise behind the whole thing. This collaboration solves both of those completely. And the benchmarks you care about will absolutely be addressed — the things I simply couldn't fully pull off before because of time and compute limits. The people working on this with me are PhDs from top universities, with seriously strong papers and citation records. Just think about that for a second: the people who actually build large models are now contributing to the open-source community together with me — that is genuinely wild. 🤯 We're in active discussions right now, and the project is still in the R&D phase, so I can't share specifics yet — but the moment I have news, you'll be the first to know. 🚀

🎯 What this repo is for

This repo holds the un-quantized master weights (model.safetensors, bf16). Use it to:

🔧 Roll your own quants — make custom GGUF / MLX / AWQ / GPTQ builds from full precision.
🧪 Fine-tune further — it's a clean base for your own LoRA / continued training.
🤗 Run it in transformers (needs a recent build with gemma4_unified support).

🏃 Just want to run it? You don't need this repo: grab a ready-made quant from the GGUF repo (runs in ~4.5 GB of VRAM / unified memory in LM Studio, Ollama, llama.cpp, Jan…). This master is for builders. 💚

📊 The headline — it works as an agent (tau2-bench)

v2 is built for coding + agentic work — writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal is tau2-bench telecom, an agentic tool-use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work:

tau2-bench telecom · 20 tasks · local, same harness, all Q8_0	score
official `gemma-4-12B-it` (base)	~15%
🟢 Gemma4-12B v2 (this model)	~55%

→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯

🔬 Honest methodology: these are local, same-harness, relative numbers (all models tested at Q8_0, greedy decoding, self-simulated user, 20 tasks). They are not directly comparable to published tau2-bench leaderboard figures (different user-simulator, full task sets, full precision) — local self-eval runs systematically lower than published scores. Read them as "v2 vs the base model under identical conditions", which is the comparison that actually matters here.
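
A quick bit of arithmetic helps read these scores. The sketch assumes "~15%" and "~55%" of 20 tasks correspond to 3 and 11 passed tasks; the tildes suggest the figures may be rounded or averaged over runs, so treat the task counts as illustrative rather than reported.

```python
def pass_rate(passed, total):
    return passed / total


TASKS = 20
base_passed = 3   # ASSUMPTION: "~15%" of 20 tasks
v2_passed = 11    # ASSUMPTION: "~55%" of 20 tasks

ratio = pass_rate(v2_passed, TASKS) / pass_rate(base_passed, TASKS)
granularity = 1 / TASKS  # how much a single task moves the score

print(f"ratio ~{ratio:.2f}x, one task = {granularity:.0%} of the score")
```

With only 20 tasks, each one is worth 5 points, so a single flipped task changes the ratio noticeably. That is one more reason to read the result as a relative, same-harness comparison.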

Grounded, not made-up. A coding/terminal fabrication probe (tasks that deliberately tempt the model to invent file paths / function signatures / values) found v2 grounds before it acts just like the base — it grep/read/ls first, and doesn't make things up (0% fabrication, on par with the base).

The trade-off — no free lunch. On a general-knowledge benchmark (MMLU-Pro), v2 lands a little below the base — completely normal for a focused fine-tune: you trade a sliver of broad-knowledge breadth for coding + agentic strength. Need a generalist? Try my general-purpose Claude Opus 4.6/4.8 distillation or the base google/gemma-4-12B-it. Need a local coding/agentic worker? That's what v2 is tuned for. 💚

🤗 Run it in transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

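msgs = [{"role": "user", "content": "Write a Python function to check if a string is a valid IPv4 address."}]
# return_dict=True yields input_ids plus attention_mask, so generate() receives both
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))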

🧠 Thinking mode: it thinks in Gemma's native thought channel before answering (keep enable_thinking=true; the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0). Needs a recent transformers build that knows the gemma4_unified architecture.
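
The recommended sampling settings map onto standard generate() keyword arguments. A minimal sketch: `generation_kwargs` is an illustrative helper, and "greedy (temp 0)" is expressed as do_sample=False, since transformers does not accept a temperature of 0 when sampling.

```python
def generation_kwargs(greedy=False, max_new_tokens=1024):
    """Keyword arguments for model.generate() matching the settings above."""
    if greedy:
        # Deterministic decoding: always pick the most likely token
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    return {
        "do_sample": True,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
        "max_new_tokens": max_new_tokens,
    }


print(generation_kwargs())
print(generation_kwargs(greedy=True))

# Usage with a loaded model and tokenized inputs:
# out = model.generate(**inputs, **generation_kwargs())
```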

🛠️ Agentic / tool use: v2 emits structured tool-calls in Gemma 4's native protocol. The smoothest agent setup is a GGUF quant served with llama.cpp --jinja (pass your tools via the OpenAI tools field) — see the GGUF repo for the full command.
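
For orientation, this is roughly what "pass your tools via the OpenAI tools field" looks like on the client side. It is a sketch of the request body and of reading tool calls back out of an OpenAI-style response; the `read_file` tool, the "local-model" name, and the sample response are all made up for illustration, and nothing here talks to a server.

```python
import json


def build_tool_request(prompt, model="local-model"):
    """Chat-completions payload advertising one hypothetical tool."""
    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",  # hypothetical tool name
            "description": "Read a UTF-8 text file and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File to read."}
                },
                "required": ["path"],
            },
        },
    }]
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def parse_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))  # arguments arrive as a JSON string
    return calls


payload = build_tool_request("Summarize README.md")

# Made-up response in the OpenAI shape, for demonstration only
sample_response = {"choices": [{"message": {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "read_file", "arguments": json.dumps({"path": "README.md"})},
    }],
}}]}
print(parse_tool_calls(sample_response))
```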

📦 Ready-made GGUF quants

All from the GGUF repo:

Quant	Size	Vibe
🟡 Q3_K_M	5.7 GB	great for 8 GB VRAM
🔵 Q4_K_M	6.87 GB	the sweet spot 👌 (recommended)
🟣 Q6_K	9.11 GB	near-lossless
⚪ Q8_0	11.8 GB	basically full quality

⚠️ GGUF needs a recent llama.cpp: this is the gemma4_unified architecture, and older builds won't load it.
ℹ️ No Q2_K this release: it didn't pass real stress-testing (2-bit is too lossy for 12B coding). The smallest reliable quant is Q3_K_M.
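
If you are unsure which row to pick, a tiny helper can choose the largest quant that fits a memory budget. The sizes come from the table above; the 1.5 GB headroom for context and runtime overhead is a placeholder assumption, not a measured figure, so adjust it for your context length.

```python
QUANT_SIZES_GB = {"Q3_K_M": 5.7, "Q4_K_M": 6.87, "Q6_K": 9.11, "Q8_0": 11.8}


def pick_quant(memory_gb, headroom_gb=1.5):
    """Largest quant whose file size fits in memory_gb minus headroom, else None."""
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


for vram in (6, 8, 12, 16):
    print(f"{vram} GB -> {pick_quant(vram)}")
```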

📚 What's new in v2 (training)

v2 continues from the v1 coder and adds a big agentic push — the piece v1 was missing:

🛠️ Agentic / terminal — real multi-step tool-use trajectories (read → reason → act → verify), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior.
💻 Coding — verified chain-of-thought over Python tasks (real CoT, gated on passing tests) plus the Fable-5-redo set for the hard cases.
📚 General — a curated slice of reasoning/instruction data to keep broad competence.

All reasoning is distilled CoT. A bittersweet note: none of us saw it coming that Fable 5 would be retired, and only my own dataset holds Fable 5's genuine, self-authored traces — so for the community-contributed data I rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh). It may diverge from the original Fable 5 traces, but it was the only workable path — and the improvement turned out really huge. 💚

⚡ Speculative decoding (MTP draft) — verified build

The GGUF repo's MTP/ folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official gemma-4-12B-it-assistant) for speculative decoding. Gemma 4 MTP is in llama.cpp mainline (PR #23398) — no fork needed — but the gemma4-assistant loader is build-sensitive right now, so use the exact build below:

✅ Verified working: llama.cpp b9553 (commit 9e3b928fd). Reproduced with gemma4-v2-Q8_0 + the MTP-Q8_0 draft: loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt; expect ~1.2–1.3× on real coding/thinking). Lossless either way.
⚠️ Newer builds (e.g. b9702 / b9717) currently crash while loading the draft with invalid vector subscript — an upstream regression in the gemma4-assistant loader path, not a problem with the GGUFs. Stick with b9553 until it's fixed upstream.

llama-server -m gemma4-v2-Q8_0.gguf ^
  --model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
  --spec-type draft-mtp --spec-draft-n-max 4 ^
  -ngl 99 -ngld 99 -fa on --jinja

ℹ️ The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific draft would give — still 100% lossless.
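
Why is it lossless? The toy sketch below shows the greedy case: a cheap draft proposes a few tokens, the target model checks them, and every token that ends up in the output is the target's own choice. The draft only affects how many tokens are confirmed per step. Both "models" here are made-up arithmetic functions, the loop is not llama.cpp's implementation, and sampling-based decoding uses a rejection step instead of the exact-match check.

```python
def target_next(tokens):
    """Toy 'large model': deterministic greedy next token."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Toy 'draft model': agrees with the target except on every third position."""
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


def greedy_decode(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, draft_n_max=4):
    tokens = list(prompt)
    goal = len(prompt) + n_new
    accepted = proposed = 0
    while len(tokens) < goal:
        # 1) The draft proposes up to draft_n_max tokens
        draft = []
        for _ in range(draft_n_max):
            draft.append(draft_next(tokens + draft))
        proposed += len(draft)
        # 2) The target verifies them (a real engine does this in one batched pass)
        for tok in draft:
            expected = target_next(tokens)
            tokens.append(expected)  # the target's token is always what gets kept
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
            accepted += 1
        else:
            # Every draft token matched, so the same pass yields one bonus token
            tokens.append(target_next(tokens))
    rate = accepted / proposed if proposed else 0.0
    return tokens[:goal], rate


output, acceptance = speculative_decode([1, 2, 3], 20)
assert output == greedy_decode([1, 2, 3], 20)  # identical to plain greedy decoding
print(f"acceptance rate: {acceptance:.2f}")
```

A lower acceptance rate, as expected from a generic draft, means fewer tokens confirmed per step and therefore less speed-up, but the output stays the same.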

⚠️ Good to know

Specialized for coding / terminal / agentic. General-knowledge facts/numbers should still be double-checked.
Reduced refusals: task-focused training, not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
English-centric.

📚 Base & License

License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too — free to use, modify, and redistribute. 🎉
Base model: google/gemma-4-12B-it.
Personal/hobby project — shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun, and happy hacking! 🐾✨