BirdAgent · Qwen3-VL-4B

A 4B vision–language agent that identifies birds by orchestrating domain tools,
beating much larger models that are handed the very same tools.

📄 Paper (under review)  ·  💻 Code  ·  📊 Benchmarks  ·  🧩 Base model


TL;DR: Fine-grained bird ID is hard for every model class because the deciding evidence is often absent from the image itself: it lies in a diagnostic call, a range prior, or a magnified detail crop. Instead of scaling the model, we teach a small one to orchestrate the tools that recover that evidence. This repo is the final GSPO checkpoint of a three-stage recipe (SFT → DPO → GSPO). Head-to-head against API models, our agent reaches a pooled solve rate of 0.34 (measured at the DPO stage; see the note under Results). That beats every same-tool API model (Qwen3-235B + tools = 0.00, Doubao-2.1-Pro + tools = 0.17) and edges out Sonnet (0.33), trailing only Opus (0.46), at ~50–60× fewer parameters than Qwen3-235B. The final GSPO stage (this checkpoint) adds a further +0.02–0.03 pooled solve in a matched comparison against its DPO initialization, with overclaim staying at 0.00.

Pooled solve rate (correct at the declared taxonomic grain). Our 4B agent (orange/vermilion) beats every same-tool API model and Sonnet, and trails only Opus. Dashed line = pure-recognition floor (T0).

Highlights

  • Orchestration beats scale. A trained 4B agent outperforms much larger API models (up to 235B) given the identical tool interface.
  • Having tools ≠ using tools. Qwen3-235B is given the exact same tools yet scores 0.00: tool access does not confer tool use.
  • Calibrated, not reckless. The agent names a species when the evidence supports it and gracefully degrades to genus/family when it does not; species-declaration precision rises to 1.0 with overclaim = 0.00.
  • Three-modality. Image, sound-only, or image+sound (audio is tool-mediated).
  • Reproducible & honest. All numbers are Δ-over-base under one identical tool harness; no leaderboard gaming, no closed-model distillation.

Model at a glance

Base model       Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
This repo        LoRA adapter (the released checkpoint = GSPO ckpt-200)
Parameters       4B base + LoRA (r=64, α=128, dropout=0.05, bf16)
Adapter targets  language-model q/k/v/o/gate/up/down_proj (peft ≥ 0.19)
Training         SFT cold-start → on-policy DPO → GSPO (RLVR)
Task             agentic fine-grained bird identification with calibrated abstention
Modalities       image · sound · image+sound
License          Apache-2.0 (adapter); base is Apache-2.0
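
For readers who want to reproduce the adapter shape, the hyperparameters above can be written out as keyword arguments. This is a minimal sketch: the key names follow peft's `LoraConfig` parameters (`r`, `lora_alpha`, `lora_dropout`, `target_modules`), but the expanded module-name list is an assumption about how "q/k/v/o/gate/up/down_proj" maps onto the checkpoint, and restricting the adapter to the language model (excluding the vision tower) is not shown.

```python
# Hyperparameters copied from the "Model at a glance" table.
# ASSUMPTION: the seven module names below are the usual expansion of
# "q/k/v/o/gate/up/down_proj"; check the adapter_config.json in the repo.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": [
        f"{name}_proj" for name in ("q", "k", "v", "o", "gate", "up", "down")
    ],
}


def lora_scaling(r: int, lora_alpha: float) -> float:
    """Standard (non-rank-stabilized) LoRA scales the update by alpha / r."""
    return lora_alpha / r


if __name__ == "__main__":
    print(LORA_KWARGS["target_modules"])
    print("update scaling:", lora_scaling(LORA_KWARGS["r"], LORA_KWARGS["lora_alpha"]))
    # With peft installed, these kwargs could be passed on as:
    #   from peft import LoraConfig
    #   config = LoraConfig(**LORA_KWARGS)
```

With r=64 and α=128 the low-rank update is scaled by 2.0 under the standard LoRA formulation.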

⚠️ This repository is the policy only. To run the full agent you also need the tool servers (Grounding-DINO, BioCLIP-2, Perch-2, SINR) and the evaluation harness — see the code repository. Loaded standalone, the model emits Hermes-format <tool_call> turns that expect tool responses to be fed back; it is not a plain image→label classifier.

Results

Two self-built, tier-stratified benchmarks — an agentic (information-gap) set and a calibration set — evaluated under an identical tool harness for every model. Metric = solve: correct at the declared grain (a genus verdict counts iff the genus is right; over-committed species are penalized).
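
To make the solve metric concrete, here is a minimal sketch. It assumes a verdict is a (grain, name) pair and the ground truth is a dict keyed by grain. The names `Verdict`, `solve`, and `solve_rate` are hypothetical, and two choices are assumptions rather than the harness's confirmed behavior: abstention counts as unsolved, and a wrong species scores 0 even when its genus is right.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

GRAINS = ("species", "genus", "family")


@dataclass(frozen=True)
class Verdict:
    grain: Optional[str]          # None means the agent abstained
    name: Optional[str] = None    # taxon name declared at that grain


def solve(verdict: Verdict, truth: Dict[str, str]) -> bool:
    """True iff the verdict is correct at the grain the agent declared."""
    if verdict.grain is None:
        return False  # ASSUMPTION: abstention is counted as unsolved
    if verdict.grain not in GRAINS:
        raise ValueError(f"unknown grain: {verdict.grain}")
    declared = (verdict.name or "").strip().lower()
    return declared == truth[verdict.grain].strip().lower()


def solve_rate(items: Iterable[Tuple[Verdict, Dict[str, str]]]) -> float:
    """Fraction of items solved; 0.0 for an empty set."""
    results = [solve(verdict, truth) for verdict, truth in items]
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    truth = {"species": "Setophaga coronata", "genus": "Setophaga", "family": "Parulidae"}
    print(solve(Verdict("genus", "Setophaga"), truth))              # genus verdict, genus right
    print(solve(Verdict("species", "Setophaga townsendi"), truth))  # over-committed species
```

Under this sketch a cautious genus verdict can outscore a confident but wrong species verdict, which is the behavior the metric is meant to reward.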

Which stage is in this table? This release is the final GSPO checkpoint. The head-to-head comparison below was measured at the SFT and DPO stages; the final GSPO stage was evaluated separately as a clean apples-to-apples ablation against its DPO initialization (it was not re-run against every API baseline on the full sets). GSPO improves on DPO, so the DPO row (0.34) is a conservative floor for this released model — the extra GSPO gain is shown in the Training recipe figures below (right panel).

 #  model / condition           common  uncommon  rare  overall    n
 1  Opus · bare                   0.75      0.25  0.38     0.46   24
 2  Opus · web                    0.62      0.25  0.25     0.38   24
 3  BirdAgent-DPO (4B) · tools    0.61      0.23  0.22     0.34   80
 4  Sonnet · bare                 0.62      0.12  0.25     0.33   24
 5  BirdAgent-SFT (4B) · tools    0.42      0.25  0.22     0.29   90
 6  Sonnet · web                  0.50      0.25  0.12     0.29   24
 7  Doubao-2.1-Pro · bare         0.41      0.21  0.08     0.23   70
 8  T0 tool floor                 0.46      0.10  0.03     0.20  900
 9  Doubao-2.1-Pro · our-tools    0.45      0.04  0.04     0.17   70
10  Qwen3-235B · bare             0.14      0.00  0.08     0.07   70
11  base-4B · tools               0.12      0.06  0.00     0.06   90
12  Qwen3-235B · web              0.09      0.04  0.00     0.04   70
13  Qwen3-235B · our-tools        0.00      0.00  0.00     0.00   70

Conditions for the large API models: bare (C0), web (N = vendor web search), and our-tools (C1 = our exact tools). API baselines were run on cost-bounded stratified subsets; our models were run on the full sets. Vendor web search does not help image ID (C0 ≈ N).

Quickstart

from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

base = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    base, torch_dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(model, "Chinzhu/BirdAgent-Qwen3VL-4B")
processor = AutoProcessor.from_pretrained(base)

# BirdAgent is an *agent*: give it the BirdAgent system prompt + the nine tool
# schemas, then run a loop that answers its <tool_call> turns with real tool
# observations (detection boxes, classifier top-k, geo prior, ...). The full
# harness + tool servers are in the code repo:
#   https://github.com/xinzhuwang-wxz/Bird-Agent
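
The loop described in the comments above needs a step that extracts `<tool_call>` turns from the model's text and feeds observations back. The following is a minimal sketch of that step, assuming the common Hermes layout of a JSON object with `name` and `arguments` keys inside `<tool_call>` tags. The helper names and the shape of the returned tool messages are hypothetical; the real harness in the code repo may differ.

```python
import json
import re

# Hermes-style calls wrap one JSON object in <tool_call> ... </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in generated text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            call = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of crashing the loop
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


def answer_tool_calls(text, tools):
    """Run each parsed call against `tools` (name -> callable) and build
    tool-role messages to append to the conversation.
    ASSUMPTION: the chat template accepts {"role": "tool", ...} messages."""
    messages = []
    for call in parse_tool_calls(text):
        fn = tools.get(call["name"])
        if fn is None:
            result = {"error": f"unknown tool: {call['name']}"}
        else:
            result = fn(**call["arguments"])
        messages.append(
            {"role": "tool", "name": call["name"], "content": json.dumps(result)}
        )
    return messages


if __name__ == "__main__":
    generated = (
        "I need a closer look.\n"
        '<tool_call>\n{"name": "zoom_in", "arguments": {"box": [0, 0, 10, 10]}}\n</tool_call>'
    )
    stub_tools = {"zoom_in": lambda box: {"ok": True, "box": box}}  # stand-in tool
    print(answer_tool_calls(generated, stub_tools))
```

A full agent would alternate generation and `answer_tool_calls` until the model emits a turn with no tool call, which is then treated as the verdict.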

How it works

The nine tools

Group                              Tools
Perception (local, deterministic)  detect_bird (Grounding-DINO) · quality_gate · zoom_in · crop · enhance · audio_quality
Recognition (served models)        classify_image (BioCLIP-2) · classify_sound (Perch-2)
External                           geo_prior (SINR range × month)

The agent plans a sequence of calls, accumulates the returns in an explicit evidence ledger, and commits a verdict at a calibrated taxonomic grain. All tools are Apache/MIT-licensed and deterministic given input (so their outputs can be pre-cached for single-GPU RL).
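
Because the tools are deterministic given their input, their outputs can be memoized. A minimal sketch of such a cache follows, assuming tool arguments are JSON-serializable and using an in-memory dict as the store. `cache_key` and `CachedTool` are hypothetical names, not part of the released harness.

```python
import hashlib
import json


def cache_key(tool_name, arguments):
    """Stable key: the same tool and arguments hash identically regardless
    of keyword order."""
    payload = json.dumps(
        {"tool": tool_name, "arguments": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedTool:
    """Wraps a deterministic tool so repeated calls are served from a store."""

    def __init__(self, name, fn, store=None):
        self.name = name
        self.fn = fn
        self.store = {} if store is None else store  # could be a disk-backed mapping
        self.misses = 0

    def __call__(self, **arguments):
        key = cache_key(self.name, arguments)
        if key not in self.store:
            self.store[key] = self.fn(**arguments)  # the only real tool invocation
            self.misses += 1
        return self.store[key]


if __name__ == "__main__":
    def fake_classifier(image_id, top_k=3):
        # Stand-in for a served classifier; returns placeholder labels.
        return [f"{image_id}-label-{i}" for i in range(top_k)]

    classify = CachedTool("classify_image", fake_classifier)
    classify(image_id="img1", top_k=2)
    classify(top_k=2, image_id="img1")  # served from the cache
    print("real invocations:", classify.misses)
```

During RL rollouts the same image is typically queried many times across sampled trajectories, so a cache like this keeps the served models off the critical path.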

Training recipe (SFT → DPO → GSPO)

Stage chain: calibration solve and species-declaration precision across Base/SFT/DPO/GSPO. GSPO-200 vs DPO initialization on both benchmark sets.
Left: DPO sharpens calibration (species-declaration precision 0.33→1.0), then GSPO adds orchestration; overclaim stays 0.00 throughout. Right: GSPO lifts both benchmark sets over its DPO initialization, apples-to-apples.

  • SFT cold-start on code-authored blueprint tool-use trajectories (loss on assistant + tool-call tokens only; tool observations masked).
  • On-policy DPO on soft preferences only (ledger discipline, call parsimony) — sharpens calibration (species-declaration precision 0.33 → 1.0).
  • GSPO (sequence-level importance weighting; β=0.04, lr=1e-6, num_generations=4) with a grain-graded, reachability-aware reward: species +1.0/−0.5, genus +0.4/−0.3, family +0.2/−0.1, abstain 0, with full species credit gated on the truth being present in some tool's top-k. This released checkpoint (GSPO-200) improves over its DPO initialization in a matched apples-to-apples comparison — agentic 0.20 → 0.23 (common 0.22 → 0.33), calibration 0.45 → 0.47 (uncommon 0.31 → 0.39), pooled ≈ 0.325 → 0.35 — while keeping overclaim = 0.00 (bolder but not reckless). The gain saturates by ~50 steps; what remains is the tool ceiling.
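
The grain-graded reward in the GSPO bullet above can be sketched as follows. The correct/incorrect values per grain are taken from the text. How a correct but unreachable species is scored is not stated here, so falling back to genus-level credit (0.4) is an assumption exposed as a parameter, and the function names are hypothetical.

```python
# (reward if correct, reward if wrong) per declared grain, from the recipe above.
GRAIN_REWARD = {
    "species": (1.0, -0.5),
    "genus": (0.4, -0.3),
    "family": (0.2, -0.1),
}


def is_reachable(truth_species, tool_topk):
    """True if the true species appears in at least one tool's top-k list."""
    return any(truth_species in topk for topk in tool_topk.values())


def grain_reward(grain, name, truth, reachable, unreachable_species_credit=0.4):
    """Reward for one verdict. `grain` is None when the agent abstains.

    ASSUMPTION: a correct species that no tool surfaced earns
    `unreachable_species_credit` instead of the full +1.0.
    """
    if grain is None:
        return 0.0  # abstain
    right, wrong = GRAIN_REWARD[grain]
    if name != truth[grain]:
        return wrong
    if grain == "species" and not reachable:
        return unreachable_species_credit
    return right


if __name__ == "__main__":
    truth = {"species": "Setophaga coronata", "genus": "Setophaga", "family": "Parulidae"}
    topk = {"classify_image": ["Setophaga coronata", "Setophaga townsendi"], "classify_sound": []}
    reachable = is_reachable(truth["species"], topk)
    print(grain_reward("species", "Setophaga coronata", truth, reachable))
    print(grain_reward("species", "Setophaga townsendi", truth, reachable))
    print(grain_reward("genus", "Setophaga", truth, reachable))
```

The asymmetry matters: a wrong species costs more (−0.5) than a wrong genus (−0.3), so the policy is pushed to commit to species only when the evidence supports it.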

Limitations & responsible use

Tier-stratified solve. The common→rare cliff is universal: even Opus collapses on the tail, because the true species leaves the classifiers' top-k. This is the tool ceiling, orthogonal to the agent.

  • Tool ceiling. Species accuracy is capped by the classifiers: on hard rare-tier items ~76% of errors are cases where the truth is absent from every classifier's top-k. A stronger fine-grained recognizer is an orthogonal lever (the agent already degrades honestly in this regime).
  • General-VLM regression (format lock). Agentic training locks the model into emitting tool calls; on generic MCQ probes it drops sharply (MMStar 0.51→0.04, MMBench 0.89→0.15). Use it as a bird agent, not a general VLM, unless you mix general trajectories back in.
  • Evaluation n. API baselines were run on cost-bounded stratified subsets (n = 12–40); numbers are Δ-over-base under one harness, not leaderboard ranks.
  • Responsible use. Research / non-commercial for v0. The API models above are evaluation controls, not teachers — no closed-model output was distilled into this policy.

Citation

@inproceedings{wang2026birdagent,
  title     = {BirdAgent: A Small Vision--Language Model that Orchestrates
               Domain Tools Beats Large Models that Merely Hold Them},
  author    = {Wang, Xinzhu},
  booktitle = {Under review},
  year      = {2026}
}

Acknowledgements

Built on Qwen3-VL, BioCLIP-2, Perch-2, Grounding-DINO, and SINR. Trained with ms-swift.
