Listening to the Unspoken: Exploring "365" Aspects of Multimodal Interview Performance Assessment
Jia Li, Yang Wang, Wenhao Qian, Jialong Hu, Zhenzhen Hu, Richang Hong, Meng Wang

TL;DR
This paper introduces a comprehensive multimodal framework for interview performance assessment that integrates video, audio, and text data across multiple responses and evaluation dimensions, achieving state-of-the-art results in the AVI Challenge 2025.
Contribution
It presents a novel multimodal assessment framework with modality-specific feature extraction, shared compression, and ensemble learning, advancing automated interview evaluation methods.
Findings
Achieved a multi-dimensional average MSE of 0.1824.
Secured first place in the AVI Challenge 2025.
Demonstrated robustness and effectiveness in multimodal assessment.
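As a rough illustration of what a "multi-dimensional average MSE" could mean, the sketch below computes a per-dimension mean squared error and then averages across the five evaluation dimensions. The exact metric definition used by the challenge is not given here, so this averaging order, the function name `multi_dim_average_mse`, and the toy scores are all assumptions for illustration only.

```python
def multi_dim_average_mse(y_true, y_pred):
    """Average of per-dimension MSEs (assumed metric definition).

    y_true, y_pred: lists of per-candidate score lists, one score
    per evaluation dimension.
    """
    n_dims = len(y_true[0])
    per_dim = []
    for d in range(n_dims):
        # Squared errors for dimension d across all candidates.
        errs = [(t[d] - p[d]) ** 2 for t, p in zip(y_true, y_pred)]
        per_dim.append(sum(errs) / len(errs))
    return sum(per_dim) / n_dims, per_dim


# Toy data (made up): two candidates, five dimensions each.
y_true = [[3.0, 4.0, 2.0, 5.0, 3.0],
          [4.0, 3.0, 3.0, 4.0, 2.0]]
y_pred = [[3.5, 4.0, 2.0, 4.0, 3.0],
          [4.0, 3.5, 3.0, 4.0, 3.0]]

avg_mse, per_dim_mse = multi_dim_average_mse(y_true, y_pred)
print(avg_mse, per_dim_mse)
```

Reporting the per-dimension values alongside the average makes it easier to see whether a low overall score hides one poorly predicted dimension.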
Abstract
Interview performance assessment is essential for determining candidates' suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores "365" aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses…
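The pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal forward-pass sketch under stated assumptions, not the authors' implementation: the feature sizes, layer widths, and random weights are hypothetical; the compression MLP is assumed to act on concatenated modality features with weights shared across the six responses; and since the abstract is truncated before it says how predictions are aggregated, a plain mean across responses is used purely as a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

N_RESPONSES, N_DIMS = 6, 5                          # from the abstract
FEAT_DIMS = {"video": 32, "audio": 16, "text": 24}  # hypothetical sizes
HIDDEN, LATENT = 64, 8                              # hypothetical widths


def init_linear(n_in, n_out):
    """Random weights and zero bias for one linear layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)


# One compression MLP, shared by every response (assumption).
W1, b1 = init_linear(sum(FEAT_DIMS.values()), HIDDEN)
W2, b2 = init_linear(HIDDEN, LATENT)

# Level 1: an independent regression head per response.
heads = [init_linear(LATENT, N_DIMS) for _ in range(N_RESPONSES)]


def shared_compress(x):
    """Two-layer MLP with ReLU mapping fused features to the latent space."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2


def predict(features):
    """features: dict modality -> array of shape (batch, N_RESPONSES, dim)."""
    # Fuse modalities by concatenation along the feature axis.
    x = np.concatenate([features[m] for m in FEAT_DIMS], axis=-1)
    z = shared_compress(x)                          # (batch, 6, LATENT)
    per_response = np.stack(
        [z[:, r] @ W + b for r, (W, b) in enumerate(heads)], axis=1
    )                                               # (batch, 6, 5)
    # Level 2: aggregate across responses (mean is a placeholder choice).
    return per_response, per_response.mean(axis=1)  # (batch, 5)


batch = 4
features = {
    m: rng.normal(size=(batch, N_RESPONSES, d)) for m, d in FEAT_DIMS.items()
}
per_response, final = predict(features)
print(per_response.shape, final.shape)
```

In a real system the random arrays would be replaced by outputs of pretrained video, audio, and text encoders, and the weights would be learned by minimizing a regression loss against the annotated scores.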
