# Mm-VitnNet: a gated image-text interaction network for soybean salt tolerance recognition using chlorophyll fluorescence phenotypes

**Authors:** Wenxiang Liang, Xiaoyan Zhang, Ziqiu Luo, Qingyang Li, Hao Wang, Yixin Feng, Licheng Zhao, Ziyan Lu, Xiaotian Yuan, Xiouxiou Zhou, Lu Huang, Xin Chen, Zhe Yan, Shangbing Gao, Chenchen Xue

PMC · DOI: 10.3389/fpls.2026.1721287 · 2026-03-17

## TL;DR

This paper introduces Mm-VitnNet, a gated multimodal network that combines chlorophyll fluorescence images with their corresponding parameter text data to identify salt tolerance levels in soybean varieties.

## Contribution

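The main contribution is a gated image-text interaction network: a gate dynamically regulates how strongly cross-modal information is fused, and two learnable tokens each focus on one modality, which limits interference between modalities while preserving modality-specific features.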

## Key findings

- Mm-VitnNet achieves 98.97% accuracy, outperforming EfficientNetV2-s (97.88%), MobileNetV2 (96.64%), MobileViT_S (96.37%), and Swin Transformer_tiny (95.76%).
- The model balances accuracy and efficiency with 10.22M parameters and 1.84G FLOPs.
- It enables non-destructive, precise identification of soybean salt tolerance levels.
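
The accuracy margins quoted in the abstract can be rechecked with a few lines of arithmetic. The sketch below uses only the accuracies reported there; the names `MM_VITNNET`, `BASELINES`, and `margins` are illustrative and do not come from the paper. `Decimal` is used so the two-decimal differences are exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

# Accuracies (%) as reported in the abstract.
MM_VITNNET = Decimal("98.97")
BASELINES = {
    "EfficientNetV2-s": Decimal("97.88"),
    "MobileNetV2": Decimal("96.64"),
    "Swin Transformer_tiny": Decimal("95.76"),
    "MobileViT_S": Decimal("96.37"),
}


def margins(ours, baselines):
    """Return the gap in percentage points between `ours` and each baseline."""
    return {name: ours - acc for name, acc in baselines.items()}


MARGINS = margins(MM_VITNNET, BASELINES)

if __name__ == "__main__":
    for name, gap in MARGINS.items():
        print(f"{name}: +{gap} percentage points")
```

These are differences in percentage points, not relative improvements.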

## Abstract

Traditional methods for identifying salt tolerance levels in soybean varieties are often cumbersome, time-consuming, and labor-intensive. These challenges are further exacerbated by the limited utility of chlorophyll fluorescence imaging phenotype data, which are insufficiently diverse and difficult to analyze. Additionally, the corresponding parameter text data have not been fully explored and utilized. In this study, salt stress experiments were conducted on 178 soybean varieties, and a multimodal dataset comprising chlorophyll fluorescence images and corresponding textual data was constructed using a chlorophyll fluorescence imaging instrument. A novel gated mechanism network for learnable image-text interaction (Mm-VitnNet) is proposed, which enables global cross-modal interaction between image and text data. The model introduces a gated mechanism to dynamically regulate the fusion intensity of cross-modal information and incorporates two learnable tokens that focus on feature learning for each individual modality. This approach effectively mitigates interference between modalities while preserving modality-specific features, thereby enhancing model performance. The proposed model demonstrates an accuracy rate of 98.97%, significantly outperforming typical models: it improves by 1.09 and 2.33 percentage points compared to CNN-based models such as EfficientNetV2-s (97.88%) and MobileNetV2 (96.64%), respectively, and by 3.21 and 2.60 percentage points compared to Transformer-based Swin Transformer_tiny (95.76%) and hybrid models like MobileViT_S (96.37%), respectively. The model has 10.22M parameters and a computational cost (FLOPs) of 1.84G, which is significantly lower than models like VGG and ResNet50, and only slightly higher than some lightweight CNNs, achieving an effective balance between accuracy and efficiency. 
The improved model reliably classifies samples with varying salt tolerance levels, even under limited computational resources. Moreover, this multimodal, non-destructive identification method based on chlorophyll fluorescence technology offers an efficient and feasible approach for assessing the salt tolerance levels of soybeans, while also advancing agricultural phenotyping towards greater precision and intelligence.
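
The abstract describes a gate that regulates the fusion intensity of cross-modal information but does not give its formula. The following is a minimal, dependency-free sketch of one common form of gated fusion, offered as an assumption rather than the authors' implementation: a per-feature sigmoid gate `g` scales the cross-modal context before it is added to the modality's own features, so `fused = own + g * cross`. The function name `gated_fusion`, the weights `w_own`, `w_cross`, and `bias`, and the toy feature vectors are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(own, cross, w_own, w_cross, bias):
    """Fuse cross-modal context into a modality's own features.

    Assumed form (not taken from the paper):
        g     = sigmoid(w_own * own + w_cross * cross + bias)
        fused = own + g * cross
    All arguments are equal-length lists of floats (element-wise gate).
    """
    fused, gates = [], []
    for o, c, wo, wc, b in zip(own, cross, w_own, w_cross, bias):
        g = sigmoid(wo * o + wc * c + b)
        gates.append(g)
        fused.append(o + g * c)
    return fused, gates


# Toy features: image features and the text-derived context for them.
image_feat = [0.5, -1.0, 2.0]
text_ctx = [1.0, 1.0, -0.5]
zeros = [0.0, 0.0, 0.0]

# A strongly negative bias closes the gate: the image features pass through
# almost unchanged, so the text cannot interfere with them.
closed, g_closed = gated_fusion(image_feat, text_ctx, zeros, zeros, [-20.0] * 3)

# A strongly positive bias opens the gate: the text context is added in full.
opened, g_open = gated_fusion(image_feat, text_ctx, zeros, zeros, [20.0] * 3)
```

In a trained network the gate parameters would be learned, so the fusion strength varies per feature and per sample instead of being fixed by hand as in this toy example.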

## Linked entities

- **Species:** Glycine max (taxon 3847)

## Full-text entities

- **Chemicals:** chlorophyll (MESH:D002734), salt (MESH:D012492)
- **Species:** Glycine max (soybean, species) [taxon 3847]

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13036226/full.md

---
Source: https://tomesphere.com/paper/PMC13036226