# TBKIN: Threshold-based explicit selection for enhanced cross-modal semantic alignments

**Authors:** Zihan Guo, Xiang Shen, Chongqing Chen, Yongjie Li, Yongjie Li, Yongjie Li

PMC · DOI: 10.1371/journal.pone.0325543 · PLOS One · 2025-06-10

## TL;DR

TBKIN is a new vision-language model that improves cross-modal understanding by reducing irrelevant information and enhancing semantic alignment between images and text.

## Contribution

TBKIN introduces a threshold-based fine-tuning strategy and unified scene graphs to reduce interference and improve semantic alignment in vision-language tasks.

## Key findings

- TBKIN achieved state-of-the-art accuracy of 73.90% on the VQA 2.0 dataset.
- The model reached 84.60% accuracy on the RefCOCO dataset.
- Attention visualization confirmed TBKIN's robustness in handling vision-language tasks.

## Abstract

Vision-language models aim to seamlessly integrate visual and linguistic information for multi-modal tasks, demanding precise semantic alignments between image-text pairs while minimizing the influence of irrelevant data. While existing methods leverage intra-modal and cross-modal knowledge to enhance alignments, they often fall short in sufficiently reducing interference, which ultimately constrains model performance. To address this gap, we propose a novel vision-language model, the threshold-based knowledge integration network (TBKIN), designed to effectively capture intra-modal and cross-modal knowledge while systematically mitigating the impact of extraneous information. TBKIN employs unified scene graph structures and advanced masking strategies to strengthen semantic alignments and introduces a fine-tuning strategy based on threshold selection to eliminate noise. Comprehensive experimental evaluations demonstrate the efficacy of TBKIN, with our best model achieving state-of-the-art accuracy of 73.90% on the VQA 2.0 dataset and 84.60% on the RefCOCO dataset. Attention visualization and detailed result analysis further validate the robustness of TBKIN in tackling vision-language tasks. The model’s ability to reduce interference while enhancing semantic alignments underscores its potential for advancing multi-modal learning. Extensive experiments across four widely-used benchmark datasets confirm its superior performance on two typical vision-language tasks, offering a practical and effective solution for real-world applications.

## Full-text entities

- **Diseases:** REC (MESH:D053591), TBKIN (MESH:D000081042)
- **Chemicals:** PONE-D-24-60420 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12151420/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12151420/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/PMC12151420/full.md

---
Source: https://tomesphere.com/paper/PMC12151420