Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis
Yves Pauli, Jan-Bernard Marsman, Finn Rabe, Victoria Edkins, Roya Hüppi, Silvia Ciampelli, Akhil Ratan Misra, Nils Lang, Wolfram Hinzen, Iris Sommer, Philipp Homan

TL;DR
This paper introduces a standardised data structure and a Python toolkit that make linguistic data analysis workflows more reproducible and transparent, addressing the current lack of standards for organising, sharing, and processing language data.
Contribution
The paper proposes two things: the Language Processing Data Structure (LPDS), a data structure inspired by BIDS, and pelican nlp, a Python toolkit for streamlined, reproducible language processing workflows.
Findings
LPDS provides a standardised folder structure and file naming convention.
pelican nlp enables end-to-end reproducible linguistic analysis.
Workflow specifications are shareable and executable from a single config file.
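
To make the idea of a folder and file naming convention concrete, here is a minimal Python sketch of a BIDS-inspired layout. It is an illustration only and does not reproduce the LPDS specification: the entity keys (`sub`, `ses`, `task`), the `transcript` suffix, and the helper names `lpds_style_path` and `parse_entities` are assumptions borrowed from the general BIDS key-value pattern, and the convention actually defined in the paper may differ.

```python
from pathlib import PurePosixPath


def lpds_style_path(root, subject, session, task, suffix, ext="txt"):
    """Build a BIDS-inspired path (hypothetical entity names, not the LPDS spec)."""
    sub = f"sub-{subject}"
    ses = f"ses-{session}"
    # Key-value entities joined by underscores, followed by a descriptive suffix.
    filename = f"{sub}_{ses}_task-{task}_{suffix}.{ext}"
    return PurePosixPath(root) / sub / ses / filename


def parse_entities(filename):
    """Recover the key-value entities encoded in a file name."""
    stem = filename.rsplit(".", 1)[0]
    # Only parts containing a hyphen are key-value entities; the suffix is skipped.
    return dict(part.split("-", 1) for part in stem.split("_") if "-" in part)


path = lpds_style_path("my_project", "01", "01", "interview", "transcript")
print(path)
print(parse_entities(path.name))
```

Because the metadata lives in the path itself, a tool can locate and group files by participant, session, or task without a separate index, which is the property that makes such conventions useful for sharing data between labs.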
Abstract
The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from…
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Natural Language Processing Techniques · Epilepsy research and treatment
