GFusion-10B-A1.8B

GFusion-10B-A1.8B is an experimental instruction-tuned diffusion language model. It was trained by adapting GigaChat3-10B-A1.8B-base to block diffusion generation, followed by context extension, SFT, and confidence tuning (CT).

GFusion uses a block size of 32 tokens and performs decoding with entropy-bounded sampling. In contrast to standard autoregressive generation, the model iteratively refines partially masked token blocks. This allows it to finalize multiple tokens in a single forward pass and provides a controllable trade-off between generation quality and decoding speed.
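The selection step inside one refinement pass can be illustrated with a small sketch. It assumes one common formulation of entropy-bounded unmasking: rank the masked positions by predictive entropy and finalize as many of the most confident ones as fit under a budget `gamma`, always finalizing at least one. The function names and the exact rule are illustrative assumptions, not GFusion's actual implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_positions(masked_probs, gamma):
    """Choose which masked positions to finalize in this forward pass.

    masked_probs maps a position to the model's predicted distribution there.
    Assumed rule (hypothetical, for illustration): walk positions from lowest
    to highest entropy and keep adding one while the entropy already spent on
    earlier picks is <= gamma. The first position is always taken, so every
    pass finalizes at least one token.
    """
    ranked = sorted(masked_probs, key=lambda pos: entropy(masked_probs[pos]))
    chosen, spent = [], 0.0
    for pos in ranked:
        if chosen and spent > gamma:
            break
        chosen.append(pos)
        spent += entropy(masked_probs[pos])
    return chosen

# Toy block with four masked positions over a two-token vocabulary.
example = {
    0: [1.0, 0.0],   # fully confident
    1: [0.9, 0.1],   # fairly confident
    2: [0.5, 0.5],   # uncertain
    3: [0.5, 0.5],   # uncertain
}
print(select_positions(example, gamma=0.15))  # [0, 1]
print(select_positions(example, gamma=0.70))  # [0, 1, 2]
```

Under this rule a larger `gamma` finalizes more tokens per pass (faster, potentially lower quality), while a smaller `gamma` approaches one-token-at-a-time decoding.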

For architecture details, please refer to the GigaChat3-10B-A1.8B-base model card.

More details about GFusion are available in the Habr article.

Inference

Single-user (concurrency=1) decode throughput comparison, measured with aiperf and SGLang.

Benchmarks

| Benchmark | GFusion 10B-A1.8B | GFusion + CT 10B-A1.8B | GigaChat3 10B-A1.8B | LLaDA-MoE 7B-A1.4B | LLaDA2.0-mini preview 16B-A1.4B |
|---|---|---|---|---|---|
| MMLU | 73.38 | 73.09 | 71.20 | 67.18 | 72.49 |
| MMLU-Pro | 58.48 | 58.04 | 59.60 | 44.64 | 49.22 |
| IFEval | 70.38 | 71.22 | 66.55 | 59.33 | 62.50 |
| GPQA | 33.84 | 32.12 | 35.02 | -- | 23.74 |
| TruthfulQA | 44.84 | 44.68 | 45.90 | -- | 56.54 |
| GSM8K | 84.48 | 83.78 | 85.44 | 82.41 | 89.01 |
| MGSM | 78.80 | 79.20 | 76.80 | -- | 81.44 |
| MATH | 68.08 | 66.86 | 70.00 | 58.68 | 73.50 |
| MBPP+ | 67.20 | 65.81 | 63.60 | -- | 66.67 |
| HumanEval | 75.00 | 71.34 | 72.56 | 61.59 | 80.49 |
| HumanEval+ | 65.63 | 63.63 | 66.46 | -- | 71.95 |
| LCB-Lite | 29.10 | 29.09 | 31.94 | -- | 29.07 |
| RUBQ | 63.49 | 62.56 | 65.16 | -- | 16.84 |
| MMLU-RU | 67.92 | 67.74 | 66.20 | -- | 50.48 |
| IFEval-RU | 61.27 | 64.51 | 64.19 | -- | 55.75 |

Quickstart

HF Transformers 🤗

from transformers import AutoTokenizer, AutoModelForCausalLM

device = "auto"
model_path = "ai-sage/GFusion-10B-A1.8B"

# trust_remote_code=True allows the custom code in the model repository to run.
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, trust_remote_code=True
)
# The tokenizer does not need a device_map.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {"role": "user", "content": "What are the KKT optimality conditions?"}
]
# Render the chat template and move the prompt tokens to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    block_size=32,  # tokens per diffusion block
    gamma=0.70,     # entropy-bounded sampling parameter
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
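
To get a feel for the speed side of the trade-off, the snippet below bounds the number of forward passes for the settings used above. It is a back-of-the-envelope sketch: `forward_pass_bounds` is a hypothetical helper, and it assumes every forward pass finalizes at least one token, blocks are decoded one after another, and generation does not stop early.

```python
import math

def forward_pass_bounds(max_new_tokens, block_size):
    """Return (blocks, fewest_passes, most_passes) for one generation.

    Assumptions (not taken from the model card): each pass finalizes between
    1 and block_size tokens of the current block, and no early stop occurs.
    """
    blocks = math.ceil(max_new_tokens / block_size)
    fewest = blocks                # whole block finalized in a single pass
    most = blocks * block_size     # one token finalized per pass
    return blocks, fewest, most

blocks, fewest, most = forward_pass_bounds(max_new_tokens=512, block_size=32)
print(blocks, fewest, most)  # 16 16 512
```

The actual count falls between these bounds and depends on the prompt and on `gamma`.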

SGLang

GFusion support is available in SGLang PR #29776:

git clone https://github.com/sgl-project/sglang.git
cd sglang

git fetch origin refs/pull/29776/head:gfusion
git switch gfusion

python -m pip install --upgrade pip setuptools wheel
python -m pip install -e "python"

Create an EBSampling config file:

# eb_sampling.yaml
gamma: 0.15

Start the server with entropy-bounded sampling and FA3 attention:

python -m sglang.launch_server \
  --model-path ai-sage/GFusion-10B-A1.8B \
  --dllm-algorithm EBSampling \
  --dllm-algorithm-config eb_sampling.yaml \
  --attention-backend fa3 \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype auto \
  --mem-fraction-static 0.88 \
  --cuda-graph-bs-decode 1

If FA3 is not available in your environment, use the Triton backend instead:

python -m sglang.launch_server \
  --model-path ai-sage/GFusion-10B-A1.8B \
  --dllm-algorithm EBSampling \
  --dllm-algorithm-config eb_sampling.yaml \
  --attention-backend triton \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype auto \
  --mem-fraction-static 0.88

Example request for the instruct model:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai-sage/GFusion-10B-A1.8B",
    "temperature": 0,
    "max_tokens": 512,
    "messages": [
      {
        "role": "user",
        "content": "What are the KKT optimality conditions?"
      }
    ]
  }'
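
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server started above is listening on `localhost:30000` and returns the usual OpenAI-style chat completion JSON; `build_chat_request` and `chat` are illustrative helper names, not part of SGLang.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:30000",
                       model="ai-sage/GFusion-10B-A1.8B", max_tokens=512):
    """Build the POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "temperature": 0,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, timeout=120, **kwargs):
    """Send the prompt and return the assistant's reply text.

    Assumes an OpenAI-style response body:
    {"choices": [{"message": {"content": ...}}]}.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What are the KKT optimality conditions?")` returns the generated answer as a string.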