# Structured matching models in multimodal information fusion: An optimized Kuhn-Munkres algorithm

**Authors:** Qingnan Ji, Jinxia Wang, Lixian Wang, Yosi Kristian

PMC · DOI: 10.1371/journal.pone.0326662 · PLOS One · 2025-11-21

## TL;DR

This paper optimizes the Kuhn-Munkres matching algorithm to fuse speech, visual, and text data, and reports higher matching accuracy, shorter computation time, and higher user satisfaction than a baseline method.

## Contribution

An improved Kuhn-Munkres algorithm with dynamic weighting and cross-modal constraints for efficient multimodal information fusion.

## Key findings

- The algorithm improves matching accuracy by 28.2% over the baseline method.
- Integration efficiency increases by 18.7%, and average computation time decreases by 15.4%.
- User satisfaction with interaction experiences rises by 19.5%.

## Abstract

In modern multimodal interaction design, integrating information from diverse modalities—such as speech, vision, and text—presents a significant challenge. These modalities differ in structure, timing, and data volume, often leading to mismatches, low computational efficiency, and suboptimal user experiences during the integration process. This study aims to enhance both the efficiency and accuracy of multimodal information fusion. To achieve this, publicly available datasets—Carnegie Mellon University Multimodal Opinion Sentiment Intensity (CMU-MOSI) and Interactive Emotional Dyadic Motion Capture (IEMOCAP)—are employed to collect speech, visual, and textual data relevant to multimodal interaction scenarios. The data undergo preprocessing steps including noise reduction, feature extraction (e.g., Mel Frequency Cepstral Coefficients and keypoint detection), and temporal alignment. An improved Kuhn-Munkres algorithm is then proposed, extending the traditional bipartite graph matching model to support weighted multimodal matching. The algorithm dynamically adjusts weight coefficients based on the importance scores of each modality, while also incorporating a cross-modal correlation matrix as a constraint to improve the robustness of the matching process. The enhanced algorithm’s performance is validated through information matching efficiency tests and user interaction satisfaction surveys. Experimental results show that it improves multimodal information matching accuracy by 28.2% over the baseline method. Integration efficiency increases by 18.7%, and computational complexity is significantly reduced, with average computation time decreased by 15.4%. User satisfaction also improves, with a 19.5% increase in experience ratings. Ablation studies further confirm the critical contribution of both the dynamic weighting mechanism and the correlation matrix constraint to the overall performance. 
This study introduces a novel optimization strategy for multimodal information integration, offering substantial theoretical value and broad applicability in intelligent interaction design and human-computer collaboration. These advancements contribute meaningfully to the development of next-generation multimodal interaction systems.
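
The abstract describes the matching step only at a high level, so the following Python sketch illustrates the general idea rather than the authors' implementation. It assumes the modality weights are a softmax over importance scores, that the cross-modal correlation matrix acts as a soft constraint (pairs below a threshold are penalized), and that each modality supplies its own similarity matrix between the units being matched. All function names, parameters (`min_corr`, `penalty`), and numbers are hypothetical.

```python
import math


def modality_weights(importance):
    """Turn importance scores into weights that sum to 1 (softmax, assumed)."""
    peak = max(importance)
    exps = [math.exp(s - peak) for s in importance]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(similarities, importance, correlation, min_corr=0.2, penalty=2.0):
    """Weighted sum of per-modality similarities, penalizing weakly correlated pairs."""
    weights = modality_weights(importance)
    rows, cols = len(correlation), len(correlation[0])
    fused = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            score = sum(w * sim[i][j] for w, sim in zip(weights, similarities))
            if correlation[i][j] < min_corr:  # soft cross-modal constraint (assumed form)
                score -= penalty
            fused[i][j] = score
    return fused


def kuhn_munkres_min_cost(cost):
    """Standard O(n^3) Kuhn-Munkres with potentials; needs rows <= columns.

    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials
    v = [0.0] * (m + 1)    # column potentials
    p = [0] * (m + 1)      # p[j]: row currently matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)    # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_units(similarities, importance, correlation, min_corr=0.2, penalty=2.0):
    """Maximize the fused score by minimizing (max score - score)."""
    fused = fuse_scores(similarities, importance, correlation, min_corr, penalty)
    top = max(max(row) for row in fused)
    cost = [[top - s for s in row] for row in fused]
    return kuhn_munkres_min_cost(cost)


# Hypothetical data: two modalities scoring 3 x 3 candidate pairs.
similarities = [
    [[0.9, 0.1, 0.2], [0.2, 0.8, 0.3], [0.1, 0.3, 0.7]],
    [[0.2, 0.9, 0.1], [0.9, 0.1, 0.2], [0.2, 0.2, 0.6]],
]
importance = [2.0, 0.0]                        # first modality judged more reliable
all_ok = [[1.0] * 3 for _ in range(3)]         # every pair passes the constraint
weak_first_pair = [[0.0, 1.0, 1.0], [1.0] * 3, [1.0] * 3]

print(match_units(similarities, importance, all_ok))
print(match_units(similarities, importance, weak_first_pair))
```

With the first modality weighted more heavily, the match follows its diagonal; marking pair (0, 0) as weakly correlated pushes the first unit to a different partner. A hard constraint, which forbids such pairs outright, would be an equally plausible reading of the abstract.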

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12637911/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12637911/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12637911/full.md

---
Source: https://tomesphere.com/paper/PMC12637911