Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning
Xudong Yan, Songhe Feng, Yang Zhang, Jian Yang, Yueguan Lin, Haojun Fei

TL;DR
This paper introduces a novel CZSL framework utilizing MLLM embeddings and attribute smoothing to improve the recognition of unseen attribute-object compositions, addressing background interference, semantic limitations of word embeddings, and overconfidence issues.
Contribution
It proposes a new disentanglement method using MLLM embeddings and attribute smoothing with LLM-generated auxiliary attributes for better generalization in CZSL.
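As a rough illustration of the attribute-smoothing idea, the sketch below builds a soft target that keeps most of the probability mass on the labeled attribute and spreads the remainder over auxiliary attributes. It is a minimal sketch under stated assumptions, not the paper's formulation, which this summary does not give. The function names, the `alpha` parameter, the even split across auxiliary attributes, and the example attribute words are all hypothetical, and the auxiliary list is hard-coded where an LLM would supply it.

```python
import math


def smoothed_attribute_targets(attributes, gold, auxiliary, alpha=0.1):
    """Build a soft target distribution over `attributes`.

    Assumption: the gold attribute keeps 1 - alpha and alpha is split evenly
    across the auxiliary attributes (hard-coded here; an LLM would supply
    them). With no usable auxiliary attribute the target stays one-hot.
    """
    # Keep only auxiliary attributes that are in the vocabulary, drop duplicates.
    aux = list(dict.fromkeys(a for a in auxiliary if a in attributes and a != gold))
    targets = {a: 0.0 for a in attributes}
    if not aux:
        targets[gold] = 1.0
        return targets
    targets[gold] = 1.0 - alpha
    for a in aux:
        targets[a] = alpha / len(aux)
    return targets


def soft_cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return -sum(t * (logits[a] - log_z) for a, t in targets.items())


# Hypothetical example: an image labeled "red" that is plausibly also "ripe" and "shiny".
attributes = ["red", "ripe", "shiny", "sliced"]
soft = smoothed_attribute_targets(attributes, "red", ["ripe", "shiny"], alpha=0.2)
hard = smoothed_attribute_targets(attributes, "red", [])

# Logits that are very confident in the single labeled attribute.
logits = {"red": 4.0, "ripe": 1.0, "shiny": 1.0, "sliced": 0.0}
loss_soft = soft_cross_entropy(logits, soft)
loss_hard = soft_cross_entropy(logits, hard)
```

With these logits, the soft target yields a larger loss than the one-hot target. That is the intended effect of smoothing: a prediction that puts nearly all its confidence on the single labeled attribute is penalized relative to one that leaves some mass on related attributes.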
Findings
Achieves state-of-the-art results on three datasets.
Effectively mitigates background interference in disentanglement.
Addresses overconfidence in models for unseen compositions.
Abstract
Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) The efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attributes with objects in the same parts. (2) Existing word embeddings fail to capture complex multimodal semantic information. (3) Overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Being aware of these, we propose a…
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
1) A simple pipeline that works well for the problem of Compositional Zero-Shot Learning. 2) Experimental results are good compared with some existing methods.
1) Novelty is very limited. The pipeline consists of three modules: a feature extractor/aggregator; the so-called Attribute-Object Disentanglement, which uses an LLM to generate potential adjective attributes; and feature alignment. The only novelty is the use of an LLM to generate potential attributes. While I understand that this might generalize better than training an attribute classifier, as seen in the literature, the idea is very simple. 2) It
1. This paper considers the impact of the background noise of the CZSL datasets, which is a critical problem, and proposes a solution using feature adaptive aggregation (FAA) modules. Ablation experiments show that the module is effective. 2. This paper points out the problem that objects in CZSL datasets naturally have multiple attributes, while there is only one label, which is also critical, and uses a large language model to solve this problem. Attribute smoothing is also proposed. By using
1. Since the authors believe that word embeddings such as Word2Vec (Mikolov, 2013) and GloVe (Pennington et al., 2014) capture cross-modal information poorly, why not use CLIP (Nayak et al., 2023)? CLIP is trained on image-text pairs and thus can solve this problem. 2. Missing comparisons with several recent papers [1,2,3] which are based on CLIP. CLIP-based methods outperform TRIDENT in Table 1. Comparative experiments between using the last hidden states of MLLM as word embe
1. The paper is well-organized and easy to follow. 2. This paper conducts comprehensive research on CZSL, with a clear and straightforward motivation.
1. Some annotations could be simplified. For example, in Eq. (5), certain parts of the equation appear to be duplicated. Simplifying these would improve clarity. 2. In Section 3.2.2, the approach of using a weighted disentanglement module to separate object and attribute features, while elegant, is somewhat difficult to follow. Adding a small figure to illustrate the mechanism would enhance understanding. Additionally, this section provides limited evidence to demonstrate that these designs are
1. This paper introduces a novel method of attribute-object disentanglement with adaptive aggregation and learnable masks. 2. The framework’s effectiveness is substantiated through extensive experiments. 3. Attribute smoothing using auxiliary attributes generated by LLM shows promise in reducing overconfidence and enhancing model generalization.
1. Leveraging MLLMs like LLaVA to extract attribute embeddings raises potential concerns of data leakage, especially if the LLaVA model was trained on images from unseen pairs. This could inadvertently influence performance in the zero-shot setting. 2. While the paper claims to address overconfidence in seen compositions, Table 1 suggests that the primary performance improvements are concentrated in the seen classes, which appears to contradict this claim. 3. The performance gains over previou
**1.** This work innovatively leverages a multimodal large language model for CZSL; the idea is novel. **2.** The article is reasonably organized and well written. **3.** Extensive experiments on three benchmarks show that the improvement in performance is noteworthy.
**1.** Figure 2 is ambiguous: the trained and frozen modules are not clearly labeled, for example, the last hidden states of the MLLM are trained but not the LLM, and the image embedder is trained but not the visual backbone; the graphical representation is inconsistent, for example, the network module image embedder is represented by a rectangle, but FAA and MLP are represented by text lines, which can easily be confused with other text such as “patches”; in the attribute-object disentanglement
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Geophysical Methods and Applications
Methods: Attentive Walk-Aggregating Graph Neural Network
