Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models

Hulingxiao He; Zhi Tan; Yuxin Peng

arXiv:2603.00431·cs.CV·March 24, 2026

TL;DR

This paper introduces TARA, a method that improves hierarchical visual recognition in large multimodal models by aligning visual features with biological taxonomy knowledge, enhancing accuracy for known and novel categories.

Contribution

TARA injects taxonomic knowledge into LMMs by leveraging representations from biology foundation models (BFMs), which encode biological relationships through hierarchical contrastive learning. This improves both hierarchical consistency and recognition accuracy.

Findings

01

Enhances LMMs' hierarchical consistency.

02

Improves leaf node accuracy for known and novel categories.

03

Effective in complex biological taxonomies.

Abstract

A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels and identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR), which aims to predict consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of…
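To make the phrase "consistent label paths from coarse to fine categories" concrete, here is a minimal sketch of one way such consistency could be checked. It is not taken from the paper: the toy taxonomy `PARENT`, the helper names `is_consistent_path` and `hierarchical_consistency`, and the exact definition of the score are all illustrative assumptions, and the paper's own metric may differ.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (an assumption for illustration):
# child label -> parent label, with None marking the coarsest level.
PARENT: Dict[str, Optional[str]] = {
    "Aves": None,
    "Passeriformes": "Aves",
    "Corvidae": "Passeriformes",
    "Corvus corax": "Corvidae",
    "Strigiformes": "Aves",
    "Strigidae": "Strigiformes",
}


def is_consistent_path(path: List[str], parent: Dict[str, Optional[str]]) -> bool:
    """Return True if `path` runs coarse to fine and every label's
    parent in the taxonomy is the label immediately before it."""
    if not path or path[0] not in parent:
        return False
    for coarse, fine in zip(path, path[1:]):
        # An unknown `fine` label yields None here, which never equals `coarse`.
        if parent.get(fine) != coarse:
            return False
    return True


def hierarchical_consistency(
    paths: List[List[str]], parent: Dict[str, Optional[str]]
) -> float:
    """Fraction of predicted label paths that are consistent with the taxonomy."""
    if not paths:
        return 0.0
    return sum(is_consistent_path(p, parent) for p in paths) / len(paths)


good = ["Aves", "Passeriformes", "Corvidae", "Corvus corax"]
bad = ["Aves", "Strigiformes", "Corvidae"]  # Corvidae is not under Strigiformes
score = hierarchical_consistency([good, bad], PARENT)
```

In this sketch a prediction counts as consistent only when every step of the path follows a parent-child edge, so a model that names the right species under the wrong order would still be penalized.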

Code & Models

Datasets

StevenHH2000/iNat21-1shot-fewshots
dataset · 33 dl

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Smart Agriculture and AI · Multimodal Machine Learning Applications