DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

Shaoqing Lin; Chong Teng; Fei Li; Donghong Ji; Lizhen Qu; Zhuang Li

arXiv:2506.15583 · cs.CL · October 27, 2025

TL;DR

This paper introduces DiscoSG, a new discourse-level text scene graph parsing task and dataset, along with a lightweight iterative graph refinement method that significantly improves parsing accuracy and efficiency for complex multi-sentence visual descriptions.

Contribution

The paper defines a new task, discourse-level text scene graph parsing (DiscoSG), releases the DiscoSG-DS dataset for it, and proposes DiscoSG-Refiner, an open-source iterative refinement model that outperforms baselines in accuracy and speed.

Findings

01

Fine-tuning GPT-4o on DiscoSG-DS yields over 40% higher SPICE than the best sentence-merging baseline, but at a high inference cost.

02

Smaller models perform well on simple graphs but struggle with complex ones.

03

DiscoSG-Refiner improves SPICE by about 30% over the best baseline and runs inference 86x faster than GPT-4o.

Abstract

Vision-Language Models (VLMs) generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers built for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. We introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), and release DiscoSG-DS, a dataset of 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. Fine-tuning GPT-4o on DiscoSG-DS yields over 40% higher SPICE metric than the best sentence-merging baseline. However, its high inference cost and licensing restrict open-source…
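To make the fragmentation problem concrete, here is a minimal sketch, not the authors' implementation. It assumes a scene graph can be represented as a set of (subject, predicate, object) triples, with attributes written as `("umbrella", "is", "red")`. The caption, the triples, and the helper names `merge_sentence_graphs` and `triple_f1` are all hypothetical. The scoring function is an exact-match F1 over triples, which is only a simplified stand-in for SPICE (the real metric also handles synonym matching).

```python
from typing import List, Set, Tuple

# Assumed representation: a scene graph is a set of (subject, predicate, object) triples.
Triple = Tuple[str, str, str]


def merge_sentence_graphs(sentence_graphs: List[Set[Triple]]) -> Set[Triple]:
    """Sentence-merging baseline: union the triples parsed from each sentence."""
    merged: Set[Triple] = set()
    for graph in sentence_graphs:
        merged |= graph
    return merged


def triple_f1(predicted: Set[Triple], gold: Set[Triple]) -> float:
    """Exact-match F1 over triples (a simplified stand-in for SPICE)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical caption: "A man holds a red umbrella. He stands next to a dog."
# Parsing each sentence on its own leaves the pronoun "he" unresolved.
sentence_graphs = [
    {("man", "hold", "umbrella"), ("umbrella", "is", "red")},
    {("he", "stand next to", "dog")},
]

# Discourse-level gold graph: "he" is resolved to "man".
gold = {
    ("man", "hold", "umbrella"),
    ("umbrella", "is", "red"),
    ("man", "stand next to", "dog"),
}

merged = merge_sentence_graphs(sentence_graphs)
score = triple_f1(merged, gold)
print(f"merged graph F1 against gold: {score:.3f}")
```

In this toy case the merged graph keeps a disconnected `he` node, so one of its three triples fails to match the gold graph. This is the kind of cross-sentence coreference error the abstract attributes to sentence-merging approaches.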

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · GPT-4 · Balanced Selection