Back-of-the-Book Index Automation for Arabic Documents

Nawal Haidar; Fadi A. Zaraket

arXiv:2410.10286 · cs.CL · October 15, 2024

Open Access

TL;DR

This paper presents an automated method for verifying and identifying index term occurrences in Arabic books, using NLP techniques and similarity metrics, achieving high accuracy and facilitating index creation and review.

Contribution

It introduces a novel automated approach for back-of-the-book index verification in Arabic, combining noun phrase extraction, vector similarity, and heuristic scoring.

Findings

1. Achieved an F1-score of 0.966 in index term occurrence identification.
2. Demonstrated effective use of lexical and semantic similarity metrics.
3. Facilitated automation in Arabic book indexing processes.
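
For reference, the F1-score cited in the findings is the harmonic mean of precision and recall. The sketch below shows the standard computation only; the function name and the count-based interface are illustrative assumptions, and it does not reproduce the paper's evaluation setup or data.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    The counts are hypothetical inputs; how the paper tallies correct and
    incorrect occurrences is not specified here.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A score near 1.0 requires both precision and recall to be high, since the harmonic mean is pulled toward the lower of the two.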

Abstract

Back-of-the-book indexes are crucial for book readability. Their manual creation is laborious and error prone. In this paper, we consider automating back-of-the-book index extraction for Arabic books to help simplify both the creation and review tasks. Given a back-of-the-book index, we aim to check and identify the accurate occurrences of index terms relative to the associated pages. To achieve this, we first define a pool of candidates for each term by extracting all possible noun phrases from paragraphs appearing on the relevant index pages. These noun phrases, identified through part-of-speech analysis, are stored in a vector database for efficient retrieval. We use several metrics, including exact matches, lexical similarity, and semantic similarity, to determine the most appropriate occurrence. The candidate with the highest score based on these metrics is chosen as the occurrence…

Taxonomy

Topics: Mathematics, Computing, and Information Processing