# T-ECBM: a deep learning-based text-image multimodal model for tourist attraction recommendation

**Authors:** Jianfu Chen, Jiaxu Cong, Mingxiao Li, Yan Sun, Junying Zhang

PMC · DOI: 10.1038/s41598-025-25630-z · Scientific Reports · 2025-11-24

## TL;DR

This paper introduces T-ECBM, a deep learning model that fuses review text and attraction images to recommend tourist attractions in Northwest China, and reports higher accuracy than text-only or image-only baselines.

## Contribution

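T-ECBM fuses BERT features extracted from user reviews with EfficientNet-CA features extracted from attraction images, then classifies the fused representation with a multilayer perceptron, treating recommendation as a multi-class classification problem over 52 attractions.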

## Key findings

- T-ECBM achieved 96.71% Top-1 accuracy, 99.82% Top-5 accuracy, and an F1-score of 96.70%, compared with Top-1 accuracies of 82.67% for text-only BERT and 83.68% for image-only EfficientNet-CA.
- The model effectively reduces information asymmetry and supports personalized decision-making for tourists.
- Experimental results show T-ECBM's superiority in capturing both subjective preferences and visual elements of attractions.
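The Top-1 and Top-5 figures above are top-k accuracies: the share of test samples whose true attraction appears among the k highest-scoring classes. A minimal sketch of that metric follows; the function name `top_k_accuracy` and the toy scores are illustrative assumptions, not code or data from the paper.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy scores for 3 samples over 4 hypothetical classes (not data from the paper).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # true class 1 is ranked second
    [0.1, 0.2, 0.3, 0.4],  # true class 0 is ranked last
]
labels = [1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

With 52 attraction classes, Top-5 accuracy credits a prediction whenever the true attraction is among the five best-scored candidates, which is why it is always at least as high as Top-1.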

## Abstract

In recent years, tourism revenue and visitor numbers in Northwest China have increased steadily. However, many tourists still have limited knowledge of scenic destinations across the five northwestern provinces. When travelers intend to visit the region but have not yet decided on specific destinations, an intelligent recommendation system is urgently needed to assist their decision-making. Existing systems, which are based on collaborative filtering, content matching, or knowledge graphs, face three major challenges: weak recommendation performance for new users and new attractions, caused by their reliance on historical data; limited ability to capture tourists’ current intentions and personalized needs; and insufficient use of multimodal information. To address these challenges, we propose a novel deep learning-based multimodal recommendation model, T-ECBM. A dataset comprising 23,488 user reviews and 4,160 images of 52 attractions was collected. BERT was employed to extract semantic features from reviews, capturing subjective preferences and sentiment, while an improved EfficientNet-CA model extracted visual features from images to identify key scenic elements. The two feature sets were fused and fed into a multilayer perceptron, formulating the recommendation task as a multi-class classification problem. Experimental results show that text-only BERT achieved a Top-1 accuracy of 82.67%, while image-only EfficientNet-CA reached 83.68%. In contrast, the proposed T-ECBM achieved 96.71% Top-1 accuracy, 99.82% Top-5 accuracy, and an F1-score of 96.70%, a substantial improvement over both unimodal approaches. By integrating textual and visual modalities, T-ECBM reduces information asymmetry, enriches decision-making support, and delivers intelligent, efficient, and personalized recommendations for tourists exploring northwestern China.
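The fusion step described in the abstract can be illustrated with a small forward pass. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not say how the two feature sets are fused, so plain concatenation is assumed here, and the feature sizes, hidden width, random weights, and the `recommend` helper are all hypothetical. Only the count of 52 attraction classes comes from the abstract.

```python
import math
import random

random.seed(0)

NUM_CLASSES = 52  # attractions in the dataset, per the abstract
TEXT_DIM = 8      # toy size (assumption); a real text encoder output is much wider
IMAGE_DIM = 6     # toy size (assumption); depends on the image backbone
HIDDEN_DIM = 16   # hypothetical MLP hidden width


def random_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, bias, x):
    """Dense layer: one dot product plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


W1 = random_matrix(HIDDEN_DIM, TEXT_DIM + IMAGE_DIM)
b1 = [0.0] * HIDDEN_DIM
W2 = random_matrix(NUM_CLASSES, HIDDEN_DIM)
b2 = [0.0] * NUM_CLASSES


def recommend(text_features, image_features, k=5):
    """Return the k highest-probability class indices and the full distribution."""
    fused = text_features + image_features  # list concatenation (assumed fusion)
    hidden = relu(linear(W1, b1, fused))
    probs = softmax(linear(W2, b2, hidden))
    ranked = sorted(range(NUM_CLASSES), key=lambda c: probs[c], reverse=True)
    return ranked[:k], probs


# Random vectors stand in for the text and image encoder outputs.
text_vec = [random.gauss(0, 1) for _ in range(TEXT_DIM)]
image_vec = [random.gauss(0, 1) for _ in range(IMAGE_DIM)]
top5, probs = recommend(text_vec, image_vec)
```

Framing recommendation as classification means the output is a probability distribution over a fixed set of attractions, so a ranked recommendation list is simply the top-k classes of that distribution.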

The online version contains supplementary material available at 10.1038/s41598-025-25630-z.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12644614/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12644614/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/PMC12644614/full.md

---
Source: https://tomesphere.com/paper/PMC12644614