Data Descriptions from Large Language Models with Influence Estimation

Chaeri Kim; Jaeyeon Bae; Taehwan Kim

arXiv:2511.07897 · cs.AI · November 12, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a novel method to generate and select textual data descriptions from large language models using influence estimation, enhancing interpretability and improving zero-shot classification performance across multiple datasets.

Contribution

The paper presents a new pipeline for explaining data with language models, incorporating influence estimation and a cross-modal transferability benchmark to evaluate effectiveness.

Findings

01. Textual descriptions outperform baselines in zero-shot classification

02. Boosts performance of image-only trained models across nine datasets

03. Evaluation with GPT-4o supports the interpretability of the approach

Abstract

Deep learning models have been successful in many areas, but their behavior remains a black box. Most prior explainable AI (XAI) approaches have focused on interpreting and explaining how models make predictions. In contrast, we aim to understand how data can be explained through deep learning model training, and we propose a novel approach that describes the data in one of the most common media, language, so that humans can easily understand it. Our approach is a pipeline that uses large language models, together with external knowledge bases, to generate textual descriptions that explain the data. Because the generated descriptions may still include irrelevant information, we use influence estimation, along with the CLIP score, to choose the most informative textual descriptions. Furthermore, based on the phenomenon of cross-modal…
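The selection and zero-shot steps the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the influence estimate and the CLIP score are combined, so the weighted sum in `combined_score` and its `alpha` parameter are assumptions, as are all function names. Embeddings are plain lists of floats standing in for CLIP image and text features, and classification follows the description-averaging scheme of the Menon & Vondrick baseline that the reviews mention.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_score(influence, clip, alpha=0.5):
    """Hypothetical weighted sum of the two scores (an assumption, not from the paper)."""
    return alpha * influence + (1.0 - alpha) * clip


def select_descriptions(scored, k):
    """Keep the k highest-scoring descriptions from (description, score) pairs."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [desc for desc, _ in ranked[:k]]


def classify(image_emb, class_desc_embs):
    """Return the class whose description embeddings have the highest
    mean cosine similarity to the image embedding, plus all class scores."""
    scores = {
        label: sum(cosine(image_emb, e) for e in embs) / len(embs)
        for label, embs in class_desc_embs.items()
    }
    return max(scores, key=scores.get), scores
```

In use, each class's candidate descriptions would first be scored and filtered with `select_descriptions`, and only the embeddings of the surviving descriptions would be passed to `classify`. Any real influence estimator and CLIP encoder would have to supply the numbers that this sketch takes as given.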

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

- Good coverage of datasets - both commonly used for classification tasks, and more niche ones - in a variety of domains.
- Thorough re-implementation of the baseline (Menon & Vondrick), with both reported performance, reproduced performance, and performance of the baseline method using GPT-3.5 instead of the original GPT-3 for a fair comparison.
- Interesting problem and relevant to practitioners of computer vision for a variety of tasks at scale: labelling new datasets, object classificatio

Weaknesses

- Paper is generally poorly structured and has some grammatical errors that make the reading flow difficult. It would be helpful if the abstract was shorter and to the point, if the introduction was clearer on the problem the paper addresses and why it is important, and if the experiment section had small introductions to each subsection to assist the flow.
- The two-step process for extracting textual descriptions with GPT-3.5 seems difficult to reproduce: the first step is to get the componen

Reviewer 02 · Rating 1 (strong reject) · Confidence 5

Strengths

I think the idea of using text to enhance classification performance of images is intriguing.

Weaknesses

I find the paper poorly written and contributions marginal relative to prior work. Thus, I recommend rejection of the manuscript in its current form. Detailed reasons in questions.

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

- Multiple datasets and baselines are comprehensively compared in evaluating the effectiveness of the proposed framework.
- The proposed method is straightforward, intuitive, and easy-to-follow. Readers can easily understand the purposes and follow the content of the paper.

Weaknesses

- The hallucination of GPT-3.5 can undermine the credibility of explanations from the suggested framework. The paper overlooks the topic of countering the effects of hallucination, rendering the paper less persuasive. Although the paper claims to use external knowledge to mitigate the misinformation aspects, there's a lack of discussion on how to deal with implicit hallucinations, which are even more challenging than preventing explicit hallucinations.
- The entire framework appears to pipe vari

Reviewer 04 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

- Incorporating external knowledge in Step 1 and selecting the most informative class description using scoring functions in Step 2 are innovative approaches that could offer deeper insights into model predictions.
- Experimental results show improvements in model performance when the proposed method is used, thereby demonstrating its practical value.

Weaknesses

- The paper is challenging to follow, particularly the underlying motivation. I had to read it many times to understand the motivation, methods, and results (see my summary). It requires significant revision for clarity before being published.
- The improvements over the [1] baseline are modest. In the most critical comparison of "Baseline (GPT-3.5)" versus "Ours Zero-shot" (Table 1), which compares the quality of class descriptions, the gains on 7 out of 9 datasets are relatively minor (typicall

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis