# A multimodal dataset for automating language vitality and endangerment assessment in south-south Nigeria

**Authors:** Moses Ekpenyong, Imelda Udoh, Eno-Abasi Urua, Nse Udoh, Ebitare Obikudo, Ogbonna Anyanwu, Ahmadu Shehu, Esther Sylvanus, Richard Bassey, Unyime Saturday, Temitope Fakiyesi, Celestina-Predia Kekai, Ememobong Udoh, Stella Ansa, Emeka Ifesieh, Gladys Ikhimwin, Unyime Udoeyo, Emem Alexander, Emmanuel Okon, Mfon Ekpe, Benjamin Okon Nyong, Moses Darah, Akpobome Diffre-Odiete, Lucky Ejobee, William Aigbedo, Francis Imoudu, Chima Manda, Mee-eebari Kiine, Doris Ugwu, Aniefon Akpan

PMC · DOI: 10.1038/s41597-025-05337-6 · Scientific Data · 2025-07-01

## TL;DR

This paper introduces a new dataset from South-South Nigeria to study language vitality and endangerment using household surveys and audio recordings.

## Contribution

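The novelty lies in a multimodal dataset that combines household survey data, structured around the UNESCO 2003 Language Vitality and Endangerment (LVE) framework, with audio recordings of Swadesh-list words for language vitality assessment.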

## Key findings

- The dataset includes 543 validated household responses from South-South Nigeria, collected from five households per Local Government Area (LGA).
- Audio recordings of 108 Swadesh words with transcriptions and tone patterns were collected for language analysis.
- The dataset supports linguistic and social science research on language sustainability and diversity.

## Abstract

In this paper, a multimodal dataset was collected between July 2023 and April 2024 through purposive sampling from a field survey of proper households (households with at least one parent and one child) in the South-South Geopolitical Zone of Nigeria. The dataset includes 543 validated responses captured in real time using an online survey developed with Google Forms. The survey instrument synthesised attributes derived from the United Nations Educational, Scientific and Cultural Organisation (UNESCO) 2003 Language Vitality and Endangerment (LVE) framework to capture household-specific data from five households per Local Government Area (LGA). The dataset also includes audio recordings of 108 words selected from the Swadesh wordlist, along with a transcription of the gloss and the tone patterns of each word, for proper description of the language’s speech system. The multimodal dataset can support the analysis of LVE patterns, linguistic trends, and complex interactions affecting language sustainability. It is reusable in linguistic, cultural and social science research, providing a robust resource for examining language diversity and preservation.
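The two record types described in the abstract can be sketched in Python as follows. This is only an illustrative sketch: the class names, field names, and helper function are hypothetical and are not taken from the published dataset, whose actual column layout is documented in the full paper. The only rule taken from the abstract is the definition of a proper household (at least one parent and one child).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HouseholdResponse:
    """One survey response. All field names are hypothetical."""
    lga: str        # Local Government Area where the household was surveyed
    parents: int    # number of parents in the household
    children: int   # number of children in the household

    def is_proper_household(self) -> bool:
        # Abstract's definition: at least one parent and one child.
        return self.parents >= 1 and self.children >= 1


@dataclass(frozen=True)
class WordlistEntry:
    """One Swadesh-list item. All field names are hypothetical."""
    gloss: str          # gloss of the word
    transcription: str  # transcription accompanying the recording
    tone_pattern: str   # tone pattern of the word
    audio_file: str     # path to the audio recording


def proper_households(responses):
    """Keep only responses that satisfy the proper-household rule."""
    return [r for r in responses if r.is_proper_household()]
```

A filter like `proper_households` would only be needed when working from unvalidated responses; the 543 responses in the released dataset are described as already validated.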

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12218391/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12218391/full.md

## References

4 references — full list in the complete paper: https://tomesphere.com/paper/PMC12218391/full.md

---
Source: https://tomesphere.com/paper/PMC12218391