Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU
Vincenzo Timmel, Manfred Vogel, Daniel Perruchoud, Reza Kakooee

TL;DR
This paper introduces a long-form release of the Swiss Parliaments Corpus. Transcriptions are produced by Whisper Large-v3, corrected in two GPT-4o passes, and filtered by Predicted BLEU and GPT-4o evaluation scores; 555 of the 801 hours of audio pass quality control.
Contribution
It presents a pipeline that combines Whisper Large-v3 transcription, two-step GPT-4o correction against the official session protocols, and filtering on Predicted BLEU to produce a long-form corpus pairing Swiss German speech with Standard German text.
Findings
Achieved a 6-point BLEU score improvement over previous corpus versions.
Retained 555 hours of transcriptions that pass quality control, out of 801 hours of audio (about 69%).
Demonstrated the effectiveness of LLM-based correction and filtering for low-resource domains.
Abstract
This paper presents a new long-form release of the Swiss Parliaments Corpus, converting entire multi-hour Swiss German debate sessions (each aligned with the official session protocols) into high-quality speech-text pairs. Our pipeline starts by transcribing all session audio into Standard German using Whisper Large-v3 under high-compute settings. We then apply a two-step GPT-4o correction process: first, GPT-4o ingests the raw Whisper output alongside the official protocols to refine misrecognitions, mainly named entities. Second, a separate GPT-4o pass evaluates each refined segment for semantic completeness. We filter out any segments whose Predicted BLEU score (derived from Whisper's average token log-probability) and GPT-4o evaluation score fall below a certain threshold. The final corpus contains 801 hours of audio, of which 555 hours pass our quality control. Compared to the…
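The filtering step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how Predicted BLEU is derived from Whisper's average token log-probability, so the linear mapping in `predicted_bleu` (with its `slope` and `intercept`) is hypothetical, as are the `Segment` fields, the scale of the GPT-4o evaluation score, and both threshold values. The abstract also leaves open how the two scores are combined; the sketch assumes a segment is kept only when both meet their thresholds.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    avg_logprob: float   # Whisper's average token log-probability for the segment
    eval_score: float    # score from the GPT-4o evaluation pass (scale is assumed)
    duration_s: float    # segment length in seconds


def predicted_bleu(avg_logprob: float, slope: float = 100.0, intercept: float = 100.0) -> float:
    """Hypothetical linear mapping from average log-probability to a BLEU estimate.

    The real mapping is not given in the abstract; slope and intercept are
    placeholders. The result is clamped to the usual 0..100 BLEU range.
    """
    return max(0.0, min(100.0, intercept + slope * avg_logprob))


def passes_quality_control(seg: Segment, min_bleu: float, min_eval: float) -> bool:
    """Assumed rule: keep a segment only if both scores reach their thresholds."""
    return predicted_bleu(seg.avg_logprob) >= min_bleu and seg.eval_score >= min_eval


def filter_segments(segments: list[Segment], min_bleu: float, min_eval: float) -> list[Segment]:
    """Return the segments that pass quality control, in their original order."""
    return [s for s in segments if passes_quality_control(s, min_bleu, min_eval)]


def total_hours(segments: list[Segment]) -> float:
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration_s for s in segments) / 3600.0


# Toy data with made-up values, purely for illustration.
segments = [
    Segment("confident and complete", avg_logprob=-0.1, eval_score=3, duration_s=3600),
    Segment("low-confidence transcript", avg_logprob=-1.0, eval_score=3, duration_s=3600),
    Segment("confident but incomplete", avg_logprob=-0.1, eval_score=1, duration_s=1800),
]

kept = filter_segments(segments, min_bleu=70.0, min_eval=2)
print(f"kept {total_hours(kept):.2f} h of {total_hours(segments):.2f} h")
```

With real data, the same bookkeeping would report how many hours survive filtering, analogous to the 555 of 801 hours reported above.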
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
