Unified Multimodal Uncertain Inference
Dengjia Zhang, Alexander Martin, William Jurayj, Kenton Murray, Benjamin Van Durme, Reno Kriz

TL;DR
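This paper introduces Unified Multimodal Uncertain Inference (UMUI), a task in which models produce calibrated probability estimates for hypotheses given a premise in text, audio, video, or any combination, together with the CLUE calibration method and a human-annotated evaluation set.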
Contribution
It introduces the UMUI task for probabilistic multimodal inference, the CLUE calibration technique, and a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings.
Findings
The 3B-parameter model matches or exceeds larger baselines in all modalities.
The evaluation set enables fine-grained probabilistic reasoning assessment.
The CLUE calibration method improves prediction reliability.
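To make "fine-grained probabilistic reasoning assessment" concrete, the sketch below scores model probabilities against human scalar probability judgments. It is only an illustration: the choice of mean squared error and Pearson correlation, the function names, and the sample values are all assumptions, since this summary does not state which metrics the paper actually reports.

```python
from math import sqrt


def mean_squared_error(predicted, human):
    """Average squared gap between model and human probabilities."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted)


def pearson_correlation(predicted, human):
    """Linear agreement between model and human probabilities."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_h = sum(human) / n
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / sqrt(var_p * var_h)


# Hypothetical values: each entry is P(hypothesis | premise) for one
# premise-hypothesis pair, on a 0-1 scale. Not taken from the paper.
human_judgments = [0.9, 0.1, 0.5, 0.7]
model_predictions = [0.8, 0.2, 0.5, 0.6]

mse = mean_squared_error(model_predictions, human_judgments)
r = pearson_correlation(model_predictions, human_judgments)
print(f"MSE: {mse:.4f}, Pearson r: {r:.4f}")
```

Unlike a binary entailment label, a scalar target lets a score of this kind distinguish a prediction that is slightly off from one that is badly off, which is the motivation the abstract gives for moving beyond binary judgments.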
Abstract
We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We…
