Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng

TL;DR
This paper introduces TRON, a versatile framework for risk control and assessment in multimodal large language models, enabling reliable open-ended and closed-ended response prediction with statistical guarantees.
Contribution
The paper presents TRON, a novel two-step conformal prediction framework that generalizes risk control methods to any MLLM supporting sampling, including open-ended scenarios.
Findings
TRON achieves controlled error rates bounded by user-defined risk levels.
Deduplicated prediction sets improve efficiency and stability.
Semantic redundancy analysis offers a new evaluation metric for MLLMs.
Abstract
Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, which hampers their generalizability and adaptability in dynamic, open-ended environments. In this paper, we introduce TRON, a two-step framework for risk control and assessment, applicable to any MLLM that supports sampling in both open-ended and closed-ended scenarios. TRON comprises two main components: (1) a novel conformal score to sample response sets of minimum size, and (2) a nonconformity score to identify high-quality responses based on self-consistency theory, controlling the error…
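The split conformal calibration that the abstract builds on can be sketched as follows. This is a minimal illustration under stated assumptions, not TRON's actual scores, which are cut off in the abstract above. It assumes the nonconformity of a response is one minus its relative frequency among the sampled responses, and it uses exact string matching as a stand-in for semantic equivalence. All function names are hypothetical.

```python
import math
from collections import Counter


def frequency_scores(samples):
    """Nonconformity per distinct response: 1 - relative frequency (assumed score)."""
    counts = Counter(samples)
    total = len(samples)
    return {resp: 1.0 - c / total for resp, c in counts.items()}


def true_answer_score(samples, truth):
    """Calibration score of the reference answer; 1.0 if it was never sampled."""
    return frequency_scores(samples).get(truth, 1.0)


def calibrate_threshold(cal_scores, alpha):
    """Standard split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this risk level: no finite threshold.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(samples, threshold):
    """Keep every distinct sampled response whose score is within the threshold."""
    return {r for r, s in frequency_scores(samples).items() if s <= threshold}


# Toy usage with made-up calibration scores (not data from the paper).
cal_scores = [0.0, 0.25, 0.25, 0.5, 0.5, 0.75, 1.0]
threshold = calibrate_threshold(cal_scores, alpha=0.25)
print(threshold, prediction_set(["a", "a", "a", "b"], threshold))
```

Under exchangeability of calibration and test points, the reference answer's score falls at or below this threshold with probability at least 1 - alpha, marginally over test points. A reference answer that was never sampled cannot appear in the set, which is presumably why the first step of a two-step scheme has to calibrate how many responses are sampled.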
Peer Reviews
Decision: ICLR 2025 Spotlight
- The paper's two-step risk control methodology addresses a general shortcoming of conformal prediction and provides statistical guarantees for error rates, increasing the reliability of MLLM responses.
- Since open-ended tasks are more challenging due to the large number of possible generations, this method (the two-step approach and the use of semantic similarity) seems to adapt dynamically, providing flexible prediction set sizes for complex and generic generative scenarios (although we need mo
- The authors already note under limitations that the guarantees are marginal over the test set, not conditional on individual data points. This may still be a bottleneck for critical applications that require more stringent guarantees and/or compliance requirements.
- More open-ended evaluations and experiments on the open-ended datasets would have shed more light on the strengths and weaknesses of the method (like Fig 4b). This is a k
1. TRON's two-step approach combines conformal scores and self-consistency theory to establish a flexible and robust risk assessment framework for MLLMs, particularly in open-ended scenarios, where traditional SCP methods fall short.
2. The paper presents extensive experiments across four VideoQA datasets and eight MLLMs, showcasing TRON's effectiveness and consistency in different VideoQA tasks.
3. By avoiding reliance on model logits, TRON is adaptable for API-restricted MLLMs, expanding its u
I raise the concern that although TRON is evaluated on diverse datasets, it primarily focuses on VideoQA tasks. Could it be tested on additional multimodal tasks to enhance the generalizability of its risk management capabilities?
- The paper is easy to follow and understand.
- The proposed method extends SCP to open-ended scenarios by estimating the confidence from frequency.
- The confidence estimation in Step 2 relies on the predictions of another model (e.g., DeBERTa-large-mnli). The reliability of this semantic classifier should at least be discussed; otherwise, the identification step of the risk control is less convincing.
- It is unclear how the silence percentage is applied to the audio, and how the conclusion 'introduce audio modality enhances the confidence level' (Lines 371-372) is reached. Fig. 4 shows that increasing SPs leads to higher APSS.
Taxonomy
Topics: Topic Modeling
Methods: TRON · Sparse Evolutionary Training
