BirdAgent · Qwen3-VL-4B

A 4B vision–language agent that identifies birds by orchestrating domain tools,
beating much larger models that are handed the very same tools.

📄 Paper (under review) · 💻 Code · 📊 Benchmarks · 🧩 Base model

TL;DR — Fine-grained bird ID is hard for every model class because the deciding evidence often is not in the image being looked at (it is in a diagnostic call, a range prior, or a magnified detail crop). Instead of scaling the model, we teach a small one to orchestrate the tools that recover that evidence. This repo is the final GSPO checkpoint of a three-stage recipe (SFT → DPO → GSPO). Head-to-head against API models under the same tools, our approach reaches a pooled solve rate of 0.34 (measured at the DPO stage — see the note below), beating every same-tool API model (Qwen3-235B + tools = 0.00, Doubao-2.1-Pro + tools = 0.17), beating Sonnet, and trailing only Opus (0.46), at ~50–60× fewer parameters. The final GSPO stage (this checkpoint) improves further — matched +0.02–0.03 pooled solve with overclaim staying 0.00.

Pooled solve rate: our 4B agent beats every same-tool API model and Sonnet, second only to Opus.
Pooled solve rate (correct at the declared taxonomic grain). Our 4B agent (orange/vermilion) beats every same-tool API model and Sonnet, and trails only Opus. Dashed line = pure-recognition floor (T0).

Highlights

Orchestration beats scale. A trained 4B agent outperforms much larger API models (up to 235B) given the identical tool interface.
Having tools ≠ using tools. Qwen3-235B calls the exact same tools yet scores 0.00 — tool access does not confer tool use.
Calibrated, not reckless. The agent names a species when the evidence supports it and gracefully degrades to genus/family when it does not; species-declaration precision rises to 1.0 with overclaim = 0.00.
Three-modality. Image, sound-only, or image+sound (audio is tool-mediated).
Reproducible & honest. All numbers are Δ-over-base under one identical tool harness; no leaderboard gaming, no closed-model distillation.

Model at a glance


Base model	`Qwen/Qwen3-VL-4B-Instruct` (Apache-2.0)
This repo	LoRA adapter (the released checkpoint = GSPO ckpt-200)
Parameters	4B base + LoRA (`r=64`, `α=128`, `dropout=0.05`, bf16)
Adapter targets	language-model `q/k/v/o/gate/up/down_proj` (`peft ≥ 0.19`)
Training	SFT cold-start → on-policy DPO → GSPO (RLVR)
Task	agentic fine-grained bird identification with calibrated abstention
Modalities	image · sound · image+sound
License	Apache-2.0 (adapter); base is Apache-2.0

⚠️ This repository is the policy only. To run the full agent you also need the tool servers (Grounding-DINO, BioCLIP-2, Perch-2, SINR) and the evaluation harness — see the code repository. Loaded standalone, the model emits Hermes-format <tool_call> turns that expect tool responses to be fed back; it is not a plain image→label classifier.

Results

Two self-built, tier-stratified benchmarks — an agentic (information-gap) set and a calibration set — evaluated under an identical tool harness for every model. Metric = solve: correct at the declared grain (a genus verdict counts iff the genus is right; over-committed species are penalized).

Which stage is in this table? This release is the final GSPO checkpoint. The head-to-head comparison below was measured at the SFT and DPO stages; the final GSPO stage was evaluated separately as a clean apples-to-apples ablation against its DPO initialization (it was not re-run against every API baseline on the full sets). GSPO improves on DPO, so the DPO row (0.34) is a conservative floor for this released model — the extra GSPO gain is shown in the Training recipe figures below (right panel).

#	model / condition	common	uncommon	rare	overall	n
1	Opus · bare	0.75	0.25	0.38	0.46	24
2	Opus · web	0.62	0.25	0.25	0.38	24
3	BirdAgent-DPO (4B) · tools	0.61	0.23	0.22	0.34	80
4	Sonnet · bare	0.62	0.12	0.25	0.33	24
5	BirdAgent-SFT (4B) · tools	0.42	0.25	0.22	0.29	90
6	Sonnet · web	0.50	0.25	0.12	0.29	24
7	Doubao-2.1-Pro · bare	0.41	0.21	0.08	0.23	70
8	T0 tool floor	0.46	0.10	0.03	0.20	900
9	Doubao-2.1-Pro · our-tools	0.45	0.04	0.04	0.17	70
10	Qwen3-235B · bare	0.14	0.00	0.08	0.07	70
11	base-4B · tools	0.12	0.06	0.00	0.06	90
12	Qwen3-235B · web	0.09	0.04	0.00	0.04	70
13	Qwen3-235B · our-tools	0.00	0.00	0.00	0.00	70

Big models: C0 = bare, N = vendor web search, C1 = our exact tools. API baselines run on cost-bounded stratified subsets; our models on the full sets. Vendor web search does not help image ID (C0 ≈ N).

Quickstart

from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

base = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    base, torch_dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(model, "Chinzhu/BirdAgent-Qwen3VL-4B")
processor = AutoProcessor.from_pretrained(base)

# BirdAgent is an *agent*: give it the BirdAgent system prompt + the nine tool
# schemas, then run a loop that answers its <tool_call> turns with real tool
# observations (detection boxes, classifier top-k, geo prior, ...). The full
# harness + tool servers are in the code repo:
#   https://github.com/xinzhuwang-wxz/Bird-Agent

How it works

The nine tools

group	tools
Perception (local, deterministic)	`detect_bird` (Grounding-DINO) · `quality_gate` · `zoom_in` · `crop` · `enhance` · `audio_quality`
Recognition (served models)	`classify_image` (BioCLIP-2) · `classify_sound` (Perch-2)
External	`geo_prior` (SINR range × month)

The agent plans a sequence of calls, accumulates the returns in an explicit evidence ledger, and commits a verdict at a calibrated taxonomic grain. All tools are Apache/MIT-licensed and deterministic given input (so their outputs can be pre-cached for single-GPU RL).

Training recipe (SFT → DPO → GSPO)

Stage chain: calibration solve and species-declaration precision across Base/SFT/DPO/GSPO. GSPO-200 vs DPO initialization on both benchmark sets.
Left: DPO sharpens calibration (species-declaration precision 0.33→1.0), then GSPO adds orchestration; overclaim stays 0.00 throughout. Right: GSPO lifts both benchmark sets over its DPO initialization, apples-to-apples.

SFT cold-start on code-authored blueprint tool-use trajectories (loss on assistant + tool-call tokens only; tool observations masked).
On-policy DPO on soft preferences only (ledger discipline, call parsimony) — sharpens calibration (species-declaration precision 0.33 → 1.0).
GSPO (sequence-level importance weighting; β=0.04, lr=1e-6, num_generations=4) with a grain-graded, reachability-aware reward: species +1.0/−0.5, genus +0.4/−0.3, family +0.2/−0.1, abstain 0, with full species credit gated on the truth being present in some tool's top-k. This released checkpoint (GSPO-200) improves over its DPO initialization in a matched apples-to-apples comparison — agentic 0.20 → 0.23 (common 0.22 → 0.33), calibration 0.45 → 0.47 (uncommon 0.31 → 0.39), pooled ≈ 0.325 → 0.35 — while keeping overclaim = 0.00 (bolder but not reckless). The gain saturates by ~50 steps; what remains is the tool ceiling.

Limitations & responsible use

Tier-stratified solve: the common/uncommon/rare cliff is universal — even Opus collapses on the tail.
The common→rare cliff is universal — even Opus collapses on the tail — because the true species leaves the classifiers' top-k. This is the tool ceiling, orthogonal to the agent.

Tool ceiling. Species accuracy is capped by the classifiers: on hard rare-tier items ~76% of errors are cases where the truth is absent from every classifier's top-k. A stronger fine-grained recognizer is an orthogonal lever (the agent already degrades honestly in this regime).
General-VLM regression (format lock). Agentic training locks the model into emitting tool calls; on generic MCQ probes it drops sharply (MMStar 0.51→0.04, MMBench 0.89→0.15). Use it as a bird agent, not a general VLM, unless you mix general trajectories back in.
Evaluation n. API baselines were run on cost-bounded stratified subsets (n = 12–40); numbers are Δ-over-base under one harness, not leaderboard ranks.
Responsible use. Research / non-commercial for v0. The API models above are evaluation controls, not teachers — no closed-model output was distilled into this policy.

Related releases

Chinzhu/BirdAgent-Qwen3VL-4B-DPO — on-policy DPO adapter (this model's GSPO initialization).
Chinzhu/BirdAgent-Qwen3VL-4B-SFT — SFT cold-start adapter.
Chinzhu/BirdAgent-Benchmarks — the agentic (500) and calibration (400) evaluation sets.
Chinzhu/BirdAgent-Train — the training data: SFT (3,953) + DPO (98) + GRPO (10,129).

Citation

@inproceedings{wang2026birdagent,
  title     = {BirdAgent: A Small Vision--Language Model that Orchestrates
               Domain Tools Beats Large Models that Merely Hold Them},
  author    = {Wang, Xinzhu},
  booktitle = {Under review},
  year      = {2026}
}

Acknowledgements

Built on Qwen3-VL, BioCLIP-2, Perch-2, Grounding-DINO, and SINR. Trained with ms-swift.

Downloads last month: -

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Chinzhu/BirdAgent-Qwen3VL-4B

Base model

Qwen/Qwen3-VL-4B-Instruct

Adapter

(76)

this model