Feasibility with Language Models for Open-World Compositional Zero-Shot Learning

Jae Myung Kim; Stephan Alaniz; Cordelia Schmid; Zeynep Akata

arXiv:2505.11181 · cs.AI · May 19, 2025

TL;DR

This paper introduces FLM, a method leveraging large language models to assess the feasibility of state-object pairs, significantly improving open-world compositional zero-shot learning performance.

Contribution

The work demonstrates how external LLMs can effectively determine feasibility of unseen combinations, enhancing zero-shot learning in open-world scenarios.

Findings

01. FLM improves OW-CZSL performance across benchmarks.

02. Vicuna and ChatGPT are identified as top-performing LLMs.

03. In-context learning is crucial for LLM effectiveness in feasibility assessment.

Abstract

Humans can easily tell if an attribute (also called state) is realistic, i.e., feasible, for an object, e.g. fire can be hot, but it cannot be wet. In Open-World Compositional Zero-Shot Learning, when all possible state-object combinations are considered as unseen classes, zero-shot predictors tend to perform poorly. Our work focuses on using external auxiliary knowledge to determine the feasibility of state-object combinations. Our Feasibility with Language Model (FLM) is a simple and effective approach that leverages Large Language Models (LLMs) to better comprehend the semantic relationships between states and objects. FLM involves querying an LLM about the feasibility of a given pair and retrieving the output logit for the positive answer. To mitigate potential misguidance of the LLM given that many of the state-object compositions are rare or completely infeasible, we observe that…
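The querying step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the in-context examples, the function names, the zero threshold, and the `toy_answer_logits` stand-in for a real LLM are all assumptions made for illustration. The sketch builds a few-shot prompt for a state-object pair, reads the logit of the positive answer, and keeps only pairs whose logit exceeds a threshold.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (state, object)

# Hypothetical in-context examples, borrowed from the abstract's fire example.
EXAMPLES: List[Tuple[str, str, str]] = [
    ("hot", "fire", "Yes"),
    ("wet", "fire", "No"),
]


def build_prompt(state: str, obj: str) -> str:
    """Build a few-shot prompt; the wording is an assumption, not the paper's."""
    shots = [f"Q: Can {o} be {s}? A: {a}" for s, o, a in EXAMPLES]
    shots.append(f"Q: Can {obj} be {state}? A:")
    return "\n".join(shots)


def feasibility_logit(
    state: str, obj: str, answer_logits: Callable[[str], Dict[str, float]]
) -> float:
    """Return the logit the model assigns to the positive answer."""
    return answer_logits(build_prompt(state, obj))["Yes"]


def filter_feasible(
    pairs: Iterable[Pair],
    answer_logits: Callable[[str], Dict[str, float]],
    threshold: float = 0.0,  # assumed value; a real threshold would be tuned
) -> List[Pair]:
    """Keep the pairs whose positive-answer logit exceeds the threshold."""
    return [
        (s, o) for s, o in pairs
        if feasibility_logit(s, o, answer_logits) > threshold
    ]


def toy_answer_logits(prompt: str) -> Dict[str, float]:
    """Stand-in for an LLM call: hard-coded logits for the final question only."""
    query = prompt.splitlines()[-1]
    if "apple be ripe" in query:
        return {"Yes": 2.0, "No": -1.0}
    return {"Yes": -3.0, "No": 1.5}


pairs = [("ripe", "apple"), ("ripe", "laptop")]
feasible = filter_feasible(pairs, toy_answer_logits)
print(feasible)  # [('ripe', 'apple')]
```

In practice `toy_answer_logits` would be replaced by a call that returns the next-token logits of an actual LLM for the answer tokens. Because the result depends on the threshold, it would need to be chosen on held-out data.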

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

S1: The studied problem, open-world compositional zero-shot learning, is significant and applies to real-world scenes. S2: Large language models are used to reduce the gap between machines and humans. S3: Extensive experiments on many prompt variants and six LLMs show the best performance.

Weaknesses

W1: Is this the first paper to solve the CZSL problem using LLMs? If yes, I am curious about the motivation, or about motivating experiments that demonstrate the effectiveness of LLMs. If no, I would like to see how it differs from other published related work. W2: The method in this paper is not novel, and the performance improvement depends entirely on the language model. If the language model introduces biases, such as racial discrimination, during training, will this also affect downst

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

1. Figure 1 is well designed and helps the reader to understand the content. 2. The proposed method is simple and easy to understand.

Weaknesses

1. The main concern with this work is its contribution. The paper basically uses an existing LLM to determine the feasibility of a state-object combination. This only shows that existing LLMs are able to determine the feasibility of a state-object combination, but what is the authors’ contribution throughout the process? 2. Since different thresholds will affect the binary classification performance, wouldn’t a metric like the ROC curve suit the task better? 3. For Figure 2, it seems that both gr

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

To the best of the reviewer’s knowledge, the method proposed in this paper is novel. The paper is clearly motivated, and the intuition behind the proposed method is also very clear. The idea of using LLMs to resolve feasibility conflicts is simple yet quite effective. The authors also show that, as a component orthogonal to existing compositional zero-shot learning methods, LLM-guided feasibility calibration clearly boosts performance in most scenarios.

Weaknesses

Despite the work’s obvious merit, the idea itself is very simple. Within the ablations, it would be helpful if the authors thoroughly examined more prompt variants, since LLM outputs can vary a lot. The performance variation under such scenarios would be very informative to the community.

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications