Multilingual corpora for the study of new concepts in the social sciences and humanities:
Revekka Kyriakoglou (LIASD), Anna Pappa (LIASD)

TL;DR
This paper introduces a hybrid methodology for constructing a multilingual corpus from company websites and reports to study emerging concepts in social sciences and humanities, supporting NLP and classification tasks.
Contribution
It presents a novel, reproducible pipeline for building a multilingual, annotated corpus focused on new social science concepts, combining automated extraction, filtering, and annotation.
Findings
Created a multilingual corpus for emerging social science concepts
Developed a dataset with contextual annotations for machine learning
Enabled analysis of lexical variability and NLP applications
Abstract
This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences (HSS), illustrated here through the case of "non-technological innovation". The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata. From this initial corpus, a derived dataset in English is created for machine learning purposes. For each occurrence of a term from the expert lexicon, a contextual block of five sentences is extracted (two…
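The context-block extraction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract is truncated on this page, so the split of the five-sentence window (two sentences before and two after the sentence containing the term) is an assumption, exposed here as the `before` and `after` parameters. The function names, the regex-based sentence splitter, and the output fields are hypothetical and chosen only for illustration.

```python
import re

# Naive sentence splitter (an assumption): break after ., ! or ? followed by whitespace.
# A real pipeline would likely use a dedicated sentence segmenter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split raw text into a list of non-empty, stripped sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def extract_context_blocks(text, lexicon, before=2, after=2):
    """Return one context block per occurrence of a lexicon term.

    The window size (before=2, after=2, five sentences in total) is an
    assumption; windows are clipped at the start and end of the text.
    """
    sentences = split_sentences(text)
    # Case-insensitive, whole-word matching for each lexicon term.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in lexicon
    }
    blocks = []
    for i, sentence in enumerate(sentences):
        for term, pattern in patterns.items():
            # One block per occurrence, even if a term repeats in a sentence.
            for _ in pattern.finditer(sentence):
                start = max(0, i - before)
                end = min(len(sentences), i + after + 1)
                blocks.append({
                    "term": term,
                    "sentence_index": i,
                    "context": " ".join(sentences[start:end]),
                })
    return blocks


sample = (
    "A one. B two. C three. "
    "We invest in non-technological innovation today. "
    "E five. F six. G seven."
)
blocks = extract_context_blocks(sample, ["non-technological innovation"])
```

Keeping the window sizes as parameters makes it easy to adjust the sketch once the exact window definition is known from the full paper.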
Taxonomy
Topics: Computational and Text Analysis Methods · Language and cultural evolution · Discourse Analysis in Language Studies
