DuoTeach: Dual Role Self-Teaching for Coarse-to-Fine Decision Coordination in Vision-Language Models

Wei Yang; Yiran Zhu; Zilin Li; Xunjia Zhang; Jun Xia; Hongtao Wang

arXiv:2511.18415·cs.MM·March 20, 2026

Open Access

TL;DR

This paper introduces DuoTeach, a self-distillation framework that enhances vision-language models' ability to make coherent, multi-level taxonomy decisions in a coarse-to-fine manner, significantly improving accuracy and zero-shot generalization.

Contribution

DuoTeach is a novel dual-role self-teaching distillation method that improves cross-level decision coordination in vision-language models without requiring ground-truth labels.

Findings

1. Up to 30.24-point improvement in DWPA on benchmarks.

2. Zero-shot performance on unseen taxonomies increased from 17.17% to 43.66%.

3. Enhanced within-call multi-level decision coordination.

Abstract

Coarse-to-fine path decision-making requires predicting a valid taxonomy path in which earlier decisions constrain later ones. However, existing benchmarks score each level independently, obscuring cross-level validity and consistency. To better align evaluation with this setting, we introduce a Joint Path Decision (JPD) protocol that requires predicting the full path in one call, together with Depth-Weighted Prefix Accuracy (DWPA), a metric family that measures path reliability with tunable emphasis on deeper levels. Under JPD, strong vision-language models (VLMs) frequently produce invalid parent-child pairs and brittle full-path predictions, suggesting that their failures stem not only from incomplete taxonomic knowledge but also from unstable cross-level decision coordination. To address this problem, we propose DuoTeach, a dual-role self-teaching distillation framework that…
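
To make the idea of a depth-weighted prefix score concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definition: the abstract does not give the DWPA formula, so the geometric weighting `gamma ** k`, the function names, and the `parent_of` taxonomy mapping are all hypothetical. A level counts as a hit only if every level up to and including it matches the reference path, which is one plausible reading of "prefix accuracy".

```python
from typing import Dict, List, Sequence


def prefix_hits(pred: Sequence[str], gold: Sequence[str]) -> List[int]:
    """hits[k] is 1 only if pred matches gold at every level up to and including k."""
    hits, ok = [], True
    for k in range(len(gold)):
        ok = ok and k < len(pred) and pred[k] == gold[k]
        hits.append(1 if ok else 0)
    return hits


def depth_weighted_prefix_accuracy(
    pred: Sequence[str], gold: Sequence[str], gamma: float = 1.0
) -> float:
    """Weighted mean of prefix hits for one sample.

    ASSUMPTION: level k (1-indexed) gets weight gamma**k, so gamma > 1
    emphasizes deeper levels and gamma == 1 weights all levels equally.
    The paper's actual weighting may differ.
    """
    if not gold:
        raise ValueError("gold path must contain at least one level")
    weights = [gamma ** (k + 1) for k in range(len(gold))]
    hits = prefix_hits(pred, gold)
    return sum(w * h for w, h in zip(weights, hits)) / sum(weights)


def is_valid_path(pred: Sequence[str], parent_of: Dict[str, str]) -> bool:
    """True if each consecutive (parent, child) pair exists in the taxonomy.

    parent_of is a hypothetical child -> parent mapping.
    """
    return all(parent_of.get(child) == parent for parent, child in zip(pred, pred[1:]))


gold = ["bird", "sparrow", "house sparrow"]
pred = ["bird", "sparrow", "song sparrow"]
print(depth_weighted_prefix_accuracy(pred, gold, gamma=1.0))  # 2 of 3 levels hit
print(depth_weighted_prefix_accuracy(pred, gold, gamma=2.0))  # deeper miss costs more
```

With equal weights the example path scores 2/3; with `gamma=2.0` the same miss at the deepest level lowers the score to 3/7, showing how the weighting shifts emphasis toward deeper levels. The separate `is_valid_path` check corresponds to the invalid parent-child pairs the abstract mentions: a prediction can be structurally invalid regardless of whether it matches the reference.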

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications