Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
Thierry Despeyroux (INRIA Rocquencourt / INRIA Sophia Antipolis), Yves Lechevallier (INRIA Rocquencourt / INRIA Sophia Antipolis), Brigitte Trousse (INRIA Rocquencourt / INRIA Sophia Antipolis), Anne-Marie Vercoustre (INRIA Rocquencourt / INRIA Sophia Antipolis)

TL;DR
This study explores clustering homogeneous XML documents, specifically Inria activity reports, to validate an existing organizational classification by analyzing the impact of feature selection on clustering outcomes.
Contribution
It introduces an approach combining structured and textual feature selection for XML document clustering to assess alignment with predefined organizational themes.
Findings
Clustering results vary significantly with different feature selections.
The approach effectively groups reports into themes consistent with official classifications.
Feature selection impacts the quality of the resulting document clusters.
Abstract
This paper presents some experiments in clustering homogeneous XML documents to validate an existing classification or, more generally, an organisational structure. Our approach integrates techniques for extracting knowledge from documents with unsupervised classification (clustering) of documents. We focus on the feature selection used for representing documents and its impact on the emerging classification. We mix the selection of structured features with fine textual selection based on syntactic characteristics. We illustrate and evaluate this approach with a collection of Inria activity reports for the year 2003. The objective is to cluster projects into larger groups (Themes), based on the keywords or different chapters of these activity reports. We then compare the results of clustering using different feature selections with the official theme structure used by Inria.
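The workflow described in the abstract (choose which XML elements represent a document, cluster on them, then compare with the official themes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy reports, the element names `keywords` and `presentation`, and the theme labels are hypothetical, and the single-link agglomerative clustering with a purity score is a simple stand-in, not the clustering or evaluation method used by the authors.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy stand-ins for activity reports: (XML text, official theme).
# Element names and theme labels are hypothetical.
REPORTS = {
    "projA": ("<report><keywords>xml clustering mining</keywords>"
              "<presentation>we study web usage</presentation></report>", "Theme1"),
    "projB": ("<report><keywords>xml mining documents</keywords>"
              "<presentation>we study robots</presentation></report>", "Theme1"),
    "projC": ("<report><keywords>robot vision control</keywords>"
              "<presentation>we study web usage</presentation></report>", "Theme2"),
    "projD": ("<report><keywords>vision control sensors</keywords>"
              "<presentation>we study robots</presentation></report>", "Theme2"),
}

def features(xml_text, tag):
    """Structured feature selection: words found under every <tag> element."""
    root = ET.fromstring(xml_text)
    words = set()
    for node in root.iter(tag):
        words.update("".join(node.itertext()).lower().split())
    return words

def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as dissimilar."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(feats, k):
    """Naive single-link agglomerative clustering down to k clusters."""
    clusters = [[name] for name in sorted(feats)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(jaccard(feats[x], feats[y])
                          for x in clusters[i] for y in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

def purity(clusters, labels):
    """Share of documents that carry the majority theme of their cluster."""
    hits = sum(Counter(labels[n] for n in c).most_common(1)[0][1]
               for c in clusters)
    return hits / len(labels)

def evaluate(tag, k=2):
    """Cluster on one element type and score against the official themes."""
    feats = {name: features(xml, tag) for name, (xml, _) in REPORTS.items()}
    labels = {name: theme for name, (_, theme) in REPORTS.items()}
    clusters = cluster(feats, k)
    return clusters, purity(clusters, labels)

if __name__ == "__main__":
    for tag in ("keywords", "presentation"):
        print(tag, *evaluate(tag))
```

On this toy data, clustering on the `keywords` element recovers the two hypothetical themes, while clustering on the `presentation` element does not, which mirrors the abstract's point that the choice of features shapes the emerging classification.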
Taxonomy
Topics: Semantic Web and Ontologies · Web Data Mining and Analysis · Advanced Database Systems and Queries
