# Scene Graph and Natural Language-Based Semantic Image Retrieval Using Vision Sensor Data

**Authors:** Jaehoon Kim, Byoung Chul Ko

PMC · DOI: 10.3390/s25113252 · Sensors (Basel, Switzerland) · 2025-05-22

## TL;DR

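This paper introduces an image retrieval method that converts natural language queries into semantic graphs and images into scene graphs, then compares the two with a contrastive graph neural network (GNN) so that complex scene descriptions can be matched to images.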

## Contribution

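The approach is a contrastive GNN-based framework that learns node and edge features, embeds sentence-derived semantic graphs and image-derived scene graphs, and applies hard negative mining during training. This enables retrieval from natural language queries without relying on additional image metadata.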
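To make the idea of a graph embedding concrete, the following is a minimal sketch of turning a small graph into a single vector by neighbour averaging and mean pooling. It is an illustration only and not the authors' architecture: it has no learned weights, it ignores edge features, and the names `message_pass` and `graph_embedding` as well as the toy node features are hypothetical.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def message_pass(node_feats, edges):
    """One round of aggregation: each node takes the mean of its own
    feature vector and those of its neighbours (edges are undirected here)."""
    neighbours = {node: [node] for node in node_feats}
    for src, dst in edges:
        neighbours[src].append(dst)
        neighbours[dst].append(src)
    return {
        node: mean([node_feats[n] for n in nbrs])
        for node, nbrs in neighbours.items()
    }


def graph_embedding(node_feats, edges, rounds=2):
    """Run a few aggregation rounds, then mean-pool all nodes into one vector."""
    feats = node_feats
    for _ in range(rounds):
        feats = message_pass(feats, edges)
    return mean(list(feats.values()))


# Toy scene graph for "man riding horse", with the predicate as its own node.
# The 2-dimensional features are made up for illustration.
toy_nodes = {"man": [1.0, 0.0], "riding": [0.5, 0.5], "horse": [0.0, 1.0]}
toy_edges = [("man", "riding"), ("riding", "horse")]
toy_embedding = graph_embedding(toy_nodes, toy_edges)
```

A trained GNN would replace the plain averaging with learned transformations of node and edge features; the pooling step that yields one vector per graph is the part that makes semantic graphs and scene graphs directly comparable.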

## Key findings

- The proposed method achieves a top nDCG@50 score of 0.745 on the Visual Genome dataset.
- It improves retrieval performance by approximately 7.7 percentage points compared to random sampling with full graphs.
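For readers unfamiliar with the metric, nDCG@k can be sketched as follows. This is a generic formulation that assumes linear gain and a log2 rank discount, and it computes the ideal ranking from the same list of graded relevances. How the paper assigns relevance scores to retrieved images is not stated in this summary, so the inputs here are placeholders.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list.
    The item at 0-based position `rank` is discounted by log2(rank + 2)."""
    return sum(
        rel / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking
    (the same relevances sorted in descending order)."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


# Relevance of each retrieved image, in the order the system returned them.
score = ndcg_at_k([3, 2, 0, 1], k=50)
```

A score of 1.0 means the retrieved images are already in the best possible order; lower values indicate that relevant images were ranked too low.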

## Abstract

Text-based image retrieval is one of the most common approaches for searching images acquired from vision sensors such as cameras. However, this method suffers from limitations in retrieval accuracy, particularly when the query contains limited information or involves previously unseen sentences. These challenges arise because keyword-based matching fails to adequately capture contextual and semantic meanings. To address these limitations, we propose a novel approach that transforms sentences and images into semantic graphs and scene graphs, enabling a quantitative comparison between them. Specifically, we utilize a graph neural network (GNN) to learn features of nodes and edges and generate graph embeddings, enabling image retrieval through natural language queries without relying on additional image metadata. We introduce a contrastive GNN-based framework that matches semantic graphs with scene graphs to retrieve semantically similar images. In addition, we incorporate a hard negative mining strategy, allowing the model to effectively learn from more challenging negative samples. The experimental results on the Visual Genome dataset show that the proposed method achieves a top nDCG@50 score of 0.745, improving retrieval performance by approximately 7.7 percentage points compared to random sampling with full graphs. This confirms that the model effectively retrieves semantically relevant images by structurally interpreting complex scenes.
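The hard negative mining idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the loss function, so the code below substitutes a common stand-in: a hinge loss in which each query is contrasted against the most similar non-matching image in the batch. The function names, the margin value, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hardest_negative_loss(query_embs, image_embs, margin=0.2):
    """Mean hinge loss over a batch of at least two pairs.
    query_embs[i] and image_embs[i] are assumed to be a matching pair;
    every other image in the batch is treated as a negative, and only the
    most similar (hardest) negative contributes to the loss."""
    total = 0.0
    for i, query in enumerate(query_embs):
        positive = cosine(query, image_embs[i])
        hardest = max(
            cosine(query, image)
            for j, image in enumerate(image_embs)
            if j != i
        )
        total += max(0.0, margin + hardest - positive)
    return total / len(query_embs)


# Toy embeddings: in practice these would be the graph embeddings of
# semantic graphs (queries) and scene graphs (images).
queries = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
loss = hardest_negative_loss(queries, images)
```

Selecting only the hardest negative focuses training on confusable images, in contrast to drawing negatives at random, which is the baseline the reported improvement is measured against.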

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12158163/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12158163/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/PMC12158163/full.md

---
Source: https://tomesphere.com/paper/PMC12158163