A Retrieval Framework and Implementation for Electronic Documents with Similar Layouts
Hyunji Chung

TL;DR
This paper presents a novel framework and tool for retrieving electronic documents with similar visual layouts. The approach supports digital forensic investigations by supplementing traditional keyword search, especially when metadata is unavailable.
Contribution
The study introduces a new framework for layout-based document similarity and implements a tool for finding similar Microsoft OOXML files, addressing a gap in layout-focused digital forensics research.
Findings
Effective retrieval of documents with similar layouts is demonstrated.
The tool improves digital investigation efficiency.
Layout similarity complements keyword search results.
Abstract
As the number of digital documents requiring investigation increases, it has become more important to identify documents relevant to a given case. There has been continual demand for ways to find relevant files in order to overcome this kind of issue. When searching for similar files, there can be situations where no metadata is available, such as timestamp, file size, title, subject, template, or author. In this situation, investigators focus on searching for document files that contain specific keywords related to the case. Although traditional keyword search with elaborate regular expressions is useful for digital forensics, closely related documents may be missed because their body contents are totally different. In this paper, we introduce a recent actual case on handling large amounts of document files. This case suggests that similar layout…
Taxonomy
Topics: Semantic Web and Ontologies · Web Data Mining and Analysis · Advanced Database Systems and Queries
