LVLM-Aided Alignment of Task-Specific Vision Models

Alexander Koebler; Lukas Kuhn; Ingo Thon; Florian Buettner

arXiv:2512.21985·cs.CV·May 5, 2026

TL;DR

This paper introduces LVLM-VA, a novel method that uses large vision language models to align small, task-specific vision models with human knowledge, reducing reliance on spurious features.

Contribution

The paper presents a new LVLM-aided approach for aligning vision models with human knowledge, improving interpretability and robustness without fine-grained feedback.

Findings

01 Significant improvement in model alignment with human specifications.

02 Reduces dependence on spurious features and biases.

03 Validated on synthetic and real-world datasets.

Abstract

In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model.…
