Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Xudong Yan; Songhe Feng; Yang Zhang; Jian Yang; Yueguan Lin; Haojun Fei

arXiv:2411.12584·cs.CV·June 10, 2025

TL;DR

This paper introduces a novel CZSL framework utilizing MLLM embeddings and attribute smoothing to improve the recognition of unseen attribute-object compositions, addressing background interference, semantic limitations of word embeddings, and overconfidence issues.

Contribution

It proposes a new disentanglement method using MLLM embeddings and attribute smoothing with LLM-generated auxiliary attributes for better generalization in CZSL.
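The attribute-smoothing idea can be illustrated with a minimal sketch. This page does not give the paper's exact formulation, so the scheme below is an assumption: the labeled attribute keeps probability `1 - alpha`, and the remaining `alpha` is shared equally among the LLM-generated auxiliary attributes. The names `smoothed_attribute_target`, `soft_cross_entropy`, and `alpha` are hypothetical and are not taken from the authors' repository.

```python
import math


def smoothed_attribute_target(num_attrs, label, auxiliary, alpha=0.1):
    """Build a soft target over attribute classes.

    Assumed scheme (not confirmed by the source): the labeled attribute
    keeps 1 - alpha, and alpha is split equally among auxiliary attributes.
    """
    target = [0.0] * num_attrs
    # Drop duplicates and the labeled attribute itself.
    aux = sorted(set(auxiliary) - {label})
    if not aux:
        # No auxiliary attributes: fall back to a one-hot target.
        target[label] = 1.0
        return target
    target[label] = 1.0 - alpha
    for a in aux:
        target[a] = alpha / len(aux)
    return target


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


# Four attribute classes; class 0 is the label, classes 1 and 2 are auxiliary.
target = smoothed_attribute_target(4, label=0, auxiliary=[1, 2], alpha=0.2)
loss = soft_cross_entropy([2.0, 0.5, 0.5, -1.0], target)
```

Compared with a one-hot target, the soft target penalizes the model less for assigning probability to plausible co-occurring attributes, which is one way such smoothing could reduce overconfidence on seen compositions.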

Findings

01. Achieves state-of-the-art results on three datasets.

02. Effectively mitigates background interference in disentanglement.

03. Addresses overconfidence in models for unseen compositions.

Abstract

Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) The efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attributes with objects in the same parts. (2) Existing word embeddings fail to capture complex multimodal semantic information. (3) Overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Being aware of these, we propose a…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1) A simple pipeline that works well for the problem of Compositional Zero-Shot Learning. 2) Experiment results are good compared with some existing methods.

Weaknesses

1) Novelty is very limited. The pipeline consists of three modules: a feature extractor/aggregator; the so-called Attribute-Object Disentanglement, which uses an LLM to generate some potential adjective attributes; and feature alignment. The only novelty is the use of an LLM to generate potential attributes, which seems to me very simple. While I understand that using an LLM might lead to better generalization than training an attribute classifier, as seen in the literature, this is very simple. 2) It

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. This paper considers the impact of the background noise of the CZSL datasets, which is a critical problem, and proposes a solution using feature adaptive aggregation (FAA) modules. Ablation experiments show that the module is effective. 2. This paper points out the problem that objects in CZSL datasets naturally have multiple attributes, while there is only one label, which is also critical, and uses a large language model to solve this problem. Attribute smoothing is also proposed. By using

Weaknesses

1. Since the authors believe that word embeddings such as Word2Vec (Mikolov, 2013) and GloVe (Pennington et al., 2014) capture cross-modal information poorly, why not use CLIP (Nayak et al., 2023)? CLIP is trained on image-text pairs and thus can solve this problem. 2. Comparisons with several recent CLIP-based papers [1,2,3] are missing. CLIP-based methods outperform TRIDENT in Table 1. Comparative experiments between using the last hidden states of MLLM as word embe

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. The paper is well-organized and easy to follow. 2. This paper conducts comprehensive research on CZSL, with a clear and straightforward motivation.

Weaknesses

1. Some annotations could be simplified. For example, in Eq. (5), certain parts of the equation appear to be duplicated. Simplifying these would improve clarity. 2. In Section 3.2.2, the approach of using a weighted disentanglement module to separate object and attribute features, while elegant, is somewhat difficult to follow. Adding a small figure to illustrate the mechanism would enhance understanding. Additionally, this section provides limited evidence to demonstrate that these designs are

Reviewer 04 · Rating 5 · Confidence 4

Strengths

1. This paper introduces a novel method of attribute-object disentanglement with adaptive aggregation and learnable masks. 2. The framework’s effectiveness is substantiated through extensive experiments. 3. Attribute smoothing using auxiliary attributes generated by LLM shows promise in reducing overconfidence and enhancing model generalization.

Weaknesses

1. Leveraging MLLMs like LLaVA to extract attribute embeddings raises potential concerns of data leakage, especially if the LLaVA model was trained on images from unseen pairs. This could inadvertently influence performance in the zero-shot setting. 2. While the paper claims to address overconfidence in seen compositions, Table 1 suggests that the primary performance improvements are concentrated in the seen classes, which appears to contradict this claim. 3. The performance gains over previou

Reviewer 05 · Rating 5 · Confidence 4

Strengths

**1.** This work innovatively leverages a multimodal large language model for CZSL; the idea is novel. **2.** The article is reasonably organized and well written. **3.** Extensive experiments on three benchmarks show that the improvement in performance is noteworthy.

Weaknesses

**1.** Figure 2 is ambiguous: the trained and frozen modules are not clearly labeled; for example, the last hidden states of the MLLM are trained but not the LLM, and the image embedder is trained but not the visual backbone. The graphical representation is also inconsistent; for example, the image embedder network module is represented by a rectangle, but FAA and MLP are represented by text lines, which can easily be confused with other text such as “patches”; in the attribute-object disentanglement

Code & Models

Repositories

xud-yan/Trident
PyTorch · Official

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Geophysical Methods and Applications

Methods: Attentive Walk-Aggregating Graph Neural Network