Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Luigi Curini; Alfio Ferrara; Giovanni Pagano; Sergio Picascia

arXiv:2603.28103 · cs.DL · May 21, 2026

TL;DR

This paper introduces a Vision-Language Model pipeline for the transcription, semantic segmentation, and entity linking of Italian parliamentary speeches preserved as scanned documents, and reports that it outperforms traditional OCR methods.

Contribution

The authors develop a pipeline that pairs a specialised OCR model with a large-scale Vision-Language Model to transcribe parliamentary speeches, classify document elements, identify speakers, and link those speakers to a knowledge base.

Findings

01. Significant improvements in transcription accuracy over traditional OCR methods.

02. More accurate speaker tagging, achieved by linking extracted speakers to a knowledge base.

03. Effective semantic segmentation and entity linking in complex document layouts.

Abstract

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies…
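The final stage described above, linking extracted speakers to a knowledge base, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the earlier stages have already produced classified elements in reading order, and it swaps in plain fuzzy string matching from the Python standard library for whatever linking method the paper uses. The `Element` class, the `"speaker"` and `"speech"` labels, the threshold of 0.8, and the toy knowledge base are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Element:
    """One segmented unit of a page, in reading order (hypothetical structure)."""
    text: str
    kind: str                          # hypothetical labels: "speaker" or "speech"
    speaker_id: Optional[str] = None   # filled in by link_elements


def normalise(name: str) -> str:
    """Lowercase and keep only letters and spaces, so OCR noise matters less."""
    return "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace()).strip()


def link_speaker(mention: str, knowledge_base: dict, threshold: float = 0.8) -> Optional[str]:
    """Return the id of the most similar knowledge-base name, or None if no
    candidate reaches the (assumed) similarity threshold."""
    best_id, best_score = None, 0.0
    for entry_id, entry_name in knowledge_base.items():
        score = SequenceMatcher(None, normalise(mention), normalise(entry_name)).ratio()
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id if best_score >= threshold else None


def link_elements(elements: list, knowledge_base: dict) -> list:
    """Resolve each speaker element and carry its id over to the elements
    that follow it, until the next speaker element appears."""
    current = None
    for element in elements:
        if element.kind == "speaker":
            current = link_speaker(element.text, knowledge_base)
        element.speaker_id = current
    return elements


# Toy knowledge base with invented names and ids, for illustration only.
toy_kb = {"d1": "Mario Rossi", "d2": "Anna Bianchi"}
page = [Element("MARIO R0SSI.", "speaker"), Element("Onorevoli colleghi", "speech")]
linked = link_elements(page, toy_kb)
```

In this sketch a mention with an OCR error such as `MARIO R0SSI.` still resolves to the toy entry `d1`, while a role title such as `Presidente` falls below the threshold and stays unlinked. A real system would need a separate rule for role titles and for deputies who share a surname.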
