Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches

Margarita Bugueño; Gerard de Melo

arXiv:2508.00864·cs.CL·August 5, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a data-driven method for constructing graph structures in document classification, replacing heuristic approaches with learned dependencies, leading to improved accuracy and robustness.

Contribution

It proposes a novel approach to learn graph structures using self-attention, reducing reliance on heuristics and domain-specific rules in document classification.
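As a rough illustration of how self-attention can turn sentence representations into pairwise edge weights, here is a minimal sketch. It assumes each sentence is already a fixed-size embedding vector and uses plain scaled dot-product attention with no learned projections. The function name `attention_edge_weights` and the toy vectors are hypothetical; the paper's model is trained, and its details may differ.

```python
import math


def attention_edge_weights(embeddings):
    """Row-wise softmax of scaled dot products between sentence embeddings.

    Illustrative assumption: no learned query/key projections are applied.
    """
    d = len(embeddings[0])
    weights = []
    for q in embeddings:
        # Scaled dot-product score of this sentence against every sentence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy sentence embeddings (hypothetical values).
sentences = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
W = attention_edge_weights(sentences)
```

Each row of `W` sums to 1, and sentence 0 attends more strongly to the similar sentence 1 than to the dissimilar sentence 2.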

Findings

01

Learned graphs outperform heuristic-based graphs in accuracy and F1 score.

02

Statistical filtering enhances classification robustness.

03

The approach generalizes well across multiple datasets.

Abstract

In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs,…
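The filtering step described in the abstract can be pictured with a small sketch. The abstract does not state the exact statistic used, so the mean-plus-`k`-standard-deviations threshold below is an assumed stand-in, and `filter_edges` and the toy weight matrix are hypothetical names and values.

```python
import statistics


def filter_edges(weights, k=1.0):
    """Keep off-diagonal edges whose weight exceeds mean + k * std.

    The mean + k * std rule is an assumption, not the paper's stated filter.
    """
    n = len(weights)
    off_diag = [weights[i][j] for i in range(n) for j in range(n) if i != j]
    threshold = statistics.mean(off_diag) + k * statistics.pstdev(off_diag)
    # Return a sparse weighted edge list: (source, target, weight).
    return [(i, j, weights[i][j])
            for i in range(n) for j in range(n)
            if i != j and weights[i][j] > threshold]


# Toy sentence-pair weights for a four-sentence document (hypothetical values).
W = [
    [0.0, 0.8, 0.1, 0.1],
    [0.7, 0.0, 0.2, 0.1],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.2, 0.7, 0.0],
]
edges = filter_edges(W)
```

In this toy case only the four strongest sentence pairs survive, so the graph shrinks from twelve candidate edges to four.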

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. This paper introduces a self-attention-based approach that eliminates the dependency on heuristics and domain expertise.
2. Experimental results verify the effectiveness of the proposed model.

Weaknesses

1. The novelty is insufficient, as the paper only applies graph attention neural networks to document graph tasks.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Simplicity and generality.
- Empirical evidence.

Weaknesses

- Limited task scope.
- Weak theoretical justification.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The proposed graph inference method is indeed more adaptable to different datasets than the different heuristic graph constructions that are listed by the authors.
- The overview Figure 2 provides a good understanding of the approach.

Weaknesses

- Evaluating on only three datasets is not a lot, and the transformer baselines you use change for each dataset. I think the empirical evidence in favour of your method should be extended for the method to really be of proven practical relevance.
- I am not convinced that the methodological contribution or the empirical work offer sufficient novelty to warrant publication at the ICLR conference.
- You say that heterogeneous graphs are "not comparable" to your homogeneous construction and are t

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. This paper points out some limitations of existing work on text classification, especially graph-based frameworks.
2. The proposed framework achieves better performance on document classification tasks on the selected benchmark datasets.

Weaknesses

- Lack of Novelty: The proposed framework applies self-attention to model correlations between sentences within a document, which is a well-established approach. Moreover, the handling of repeated sentences ignores contextual information, and the method treats sentence order in a bag-of-words manner without modeling reading sequence.
- Limited Evaluation: The experimental validation is insufficient. The framework is only tested on limited text classification settings. Broader evaluation is expe


Taxonomy

Topics: Advanced Graph Neural Networks · Text and Document Classification Technologies · Topic Modeling