Instructions to use Chinzhu/BirdAgent-Qwen3VL-4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use Chinzhu/BirdAgent-Qwen3VL-4B with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-VL-4B-Instruct") model = PeftModel.from_pretrained(base_model, "Chinzhu/BirdAgent-Qwen3VL-4B") - Notebooks
- Google Colab
- Kaggle
BirdAgent · Qwen3-VL-4B
A 4B vision–language agent that identifies birds by orchestrating domain tools,
beating much larger models that are handed the very same tools.
📄 Paper (under review) · 💻 Code · 📊 Benchmarks · 🧩 Base model
TL;DR — Fine-grained bird ID is hard for every model class because the deciding evidence often is not in the image being looked at (it is in a diagnostic call, a range prior, or a magnified detail crop). Instead of scaling the model, we teach a small one to orchestrate the tools that recover that evidence. This repo is the final GSPO checkpoint of a three-stage recipe (SFT → DPO → GSPO). Head-to-head against API models under the same tools, our approach reaches a pooled solve rate of 0.34 (measured at the DPO stage — see the note below), beating every same-tool API model (Qwen3-235B + tools = 0.00, Doubao-2.1-Pro + tools = 0.17), beating Sonnet, and trailing only Opus (0.46), at ~50–60× fewer parameters. The final GSPO stage (this checkpoint) improves further — matched +0.02–0.03 pooled solve with overclaim staying 0.00.

Pooled solve rate (correct at the declared taxonomic grain). Our 4B agent
(orange/vermilion) beats every same-tool API model and Sonnet, and trails only
Opus. Dashed line = pure-recognition floor (T0).
Highlights
- Orchestration beats scale. A trained 4B agent outperforms much larger API models (up to 235B) given the identical tool interface.
- Having tools ≠ using tools. Qwen3-235B calls the exact same tools yet scores 0.00 — tool access does not confer tool use.
- Calibrated, not reckless. The agent names a species when the evidence supports it and gracefully degrades to genus/family when it does not; species-declaration precision rises to 1.0 with overclaim = 0.00.
- Three-modality. Image, sound-only, or image+sound (audio is tool-mediated).
- Reproducible & honest. All numbers are Δ-over-base under one identical tool harness; no leaderboard gaming, no closed-model distillation.
Model at a glance
| Base model | Qwen/Qwen3-VL-4B-Instruct (Apache-2.0) |
| This repo | LoRA adapter (the released checkpoint = GSPO ckpt-200) |
| Parameters | 4B base + LoRA (r=64, α=128, dropout=0.05, bf16) |
| Adapter targets | language-model q/k/v/o/gate/up/down_proj (peft ≥ 0.19) |
| Training | SFT cold-start → on-policy DPO → GSPO (RLVR) |
| Task | agentic fine-grained bird identification with calibrated abstention |
| Modalities | image · sound · image+sound |
| License | Apache-2.0 (adapter); base is Apache-2.0 |
⚠️ This repository is the policy only. To run the full agent you also need the tool servers (Grounding-DINO, BioCLIP-2, Perch-2, SINR) and the evaluation harness — see the code repository. Loaded standalone, the model emits Hermes-format
<tool_call>turns that expect tool responses to be fed back; it is not a plain image→label classifier.
Results
Two self-built, tier-stratified benchmarks — an agentic (information-gap) set and a calibration set — evaluated under an identical tool harness for every model. Metric = solve: correct at the declared grain (a genus verdict counts iff the genus is right; over-committed species are penalized).
Which stage is in this table? This release is the final GSPO checkpoint. The head-to-head comparison below was measured at the SFT and DPO stages; the final GSPO stage was evaluated separately as a clean apples-to-apples ablation against its DPO initialization (it was not re-run against every API baseline on the full sets). GSPO improves on DPO, so the DPO row (0.34) is a conservative floor for this released model — the extra GSPO gain is shown in the Training recipe figures below (right panel).
| # | model / condition | common | uncommon | rare | overall | n |
|---|---|---|---|---|---|---|
| 1 | Opus · bare | 0.75 | 0.25 | 0.38 | 0.46 | 24 |
| 2 | Opus · web | 0.62 | 0.25 | 0.25 | 0.38 | 24 |
| 3 | BirdAgent-DPO (4B) · tools | 0.61 | 0.23 | 0.22 | 0.34 | 80 |
| 4 | Sonnet · bare | 0.62 | 0.12 | 0.25 | 0.33 | 24 |
| 5 | BirdAgent-SFT (4B) · tools | 0.42 | 0.25 | 0.22 | 0.29 | 90 |
| 6 | Sonnet · web | 0.50 | 0.25 | 0.12 | 0.29 | 24 |
| 7 | Doubao-2.1-Pro · bare | 0.41 | 0.21 | 0.08 | 0.23 | 70 |
| 8 | T0 tool floor | 0.46 | 0.10 | 0.03 | 0.20 | 900 |
| 9 | Doubao-2.1-Pro · our-tools | 0.45 | 0.04 | 0.04 | 0.17 | 70 |
| 10 | Qwen3-235B · bare | 0.14 | 0.00 | 0.08 | 0.07 | 70 |
| 11 | base-4B · tools | 0.12 | 0.06 | 0.00 | 0.06 | 90 |
| 12 | Qwen3-235B · web | 0.09 | 0.04 | 0.00 | 0.04 | 70 |
| 13 | Qwen3-235B · our-tools | 0.00 | 0.00 | 0.00 | 0.00 | 70 |
Big models: C0 = bare, N = vendor web search, C1 = our exact tools. API baselines run on cost-bounded stratified subsets; our models on the full sets. Vendor web search does not help image ID (C0 ≈ N).
Quickstart
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel
base = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
base, torch_dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(model, "Chinzhu/BirdAgent-Qwen3VL-4B")
processor = AutoProcessor.from_pretrained(base)
# BirdAgent is an *agent*: give it the BirdAgent system prompt + the nine tool
# schemas, then run a loop that answers its <tool_call> turns with real tool
# observations (detection boxes, classifier top-k, geo prior, ...). The full
# harness + tool servers are in the code repo:
# https://github.com/xinzhuwang-wxz/Bird-Agent
How it works
The nine tools
| group | tools |
|---|---|
| Perception (local, deterministic) | detect_bird (Grounding-DINO) · quality_gate · zoom_in · crop · enhance · audio_quality |
| Recognition (served models) | classify_image (BioCLIP-2) · classify_sound (Perch-2) |
| External | geo_prior (SINR range × month) |
The agent plans a sequence of calls, accumulates the returns in an explicit evidence ledger, and commits a verdict at a calibrated taxonomic grain. All tools are Apache/MIT-licensed and deterministic given input (so their outputs can be pre-cached for single-GPU RL).
Training recipe (SFT → DPO → GSPO)

Left: DPO sharpens calibration (species-declaration precision 0.33→1.0),
then GSPO adds orchestration; overclaim stays 0.00 throughout. Right: GSPO lifts
both benchmark sets over its DPO initialization, apples-to-apples.
- SFT cold-start on code-authored blueprint tool-use trajectories (loss on assistant + tool-call tokens only; tool observations masked).
- On-policy DPO on soft preferences only (ledger discipline, call parsimony) — sharpens calibration (species-declaration precision 0.33 → 1.0).
- GSPO (sequence-level importance weighting;
β=0.04,lr=1e-6,num_generations=4) with a grain-graded, reachability-aware reward: species+1.0/−0.5, genus+0.4/−0.3, family+0.2/−0.1, abstain0, with full species credit gated on the truth being present in some tool's top-k. This released checkpoint (GSPO-200) improves over its DPO initialization in a matched apples-to-apples comparison — agentic 0.20 → 0.23 (common 0.22 → 0.33), calibration 0.45 → 0.47 (uncommon 0.31 → 0.39), pooled ≈ 0.325 → 0.35 — while keeping overclaim = 0.00 (bolder but not reckless). The gain saturates by ~50 steps; what remains is the tool ceiling.
Limitations & responsible use

The common→rare cliff is universal — even Opus collapses on the tail —
because the true species leaves the classifiers' top-k. This is the tool
ceiling, orthogonal to the agent.
- Tool ceiling. Species accuracy is capped by the classifiers: on hard rare-tier items ~76% of errors are cases where the truth is absent from every classifier's top-k. A stronger fine-grained recognizer is an orthogonal lever (the agent already degrades honestly in this regime).
- General-VLM regression (format lock). Agentic training locks the model into emitting tool calls; on generic MCQ probes it drops sharply (MMStar 0.51→0.04, MMBench 0.89→0.15). Use it as a bird agent, not a general VLM, unless you mix general trajectories back in.
- Evaluation n. API baselines were run on cost-bounded stratified subsets (n = 12–40); numbers are Δ-over-base under one harness, not leaderboard ranks.
- Responsible use. Research / non-commercial for v0. The API models above are evaluation controls, not teachers — no closed-model output was distilled into this policy.
Related releases
Chinzhu/BirdAgent-Qwen3VL-4B-DPO— on-policy DPO adapter (this model's GSPO initialization).Chinzhu/BirdAgent-Qwen3VL-4B-SFT— SFT cold-start adapter.Chinzhu/BirdAgent-Benchmarks— the agentic (500) and calibration (400) evaluation sets.Chinzhu/BirdAgent-Train— the training data: SFT (3,953) + DPO (98) + GRPO (10,129).
Citation
@inproceedings{wang2026birdagent,
title = {BirdAgent: A Small Vision--Language Model that Orchestrates
Domain Tools Beats Large Models that Merely Hold Them},
author = {Wang, Xinzhu},
booktitle = {Under review},
year = {2026}
}
Acknowledgements
Built on Qwen3-VL, BioCLIP-2, Perch-2, Grounding-DINO, and SINR. Trained with ms-swift.
- Downloads last month
- -
Model tree for Chinzhu/BirdAgent-Qwen3VL-4B
Base model
Qwen/Qwen3-VL-4B-Instruct