A Hierarchical Approach to exploiting Multiple Datasets from TalkBank
Man Ho Wong

TL;DR
This paper presents a hierarchical pipeline framework for efficient data selection, integration, and analysis across multiple datasets in TalkBank, enhancing research capabilities beyond existing API limitations.
Contribution
It introduces a novel hierarchical search and data integration framework that improves data filtering, indexing, and cross-study analysis in TalkBank and similar platforms.
Findings
Enhanced data filtering and batch processing capabilities.
Facilitated integration of datasets through metadata standardization.
Improved access and analysis of large, complex linguistic datasets.
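To illustrate the metadata standardization mentioned in the findings, here is a minimal sketch of how records from two studies might be mapped onto one schema before merging. The field names, alias tables, and the assumption that both studies report age in months are hypothetical and are not taken from the paper or from TalkBank.

```python
# Hypothetical alias tables: map each study's field names and value
# spellings onto one canonical schema. None of these come from the paper.
KEY_ALIASES = {"gender": "sex", "sex": "sex", "age_months": "age", "age": "age"}
VALUE_ALIASES = {
    "sex": {"f": "female", "female": "female", "m": "male", "male": "male"},
}


def standardize(record):
    """Return a copy of `record` with canonical keys and cleaned values."""
    clean = {}
    for key, value in record.items():
        canonical_key = KEY_ALIASES.get(key.strip().lower())
        if canonical_key is None:
            continue  # drop fields that have no canonical mapping
        if isinstance(value, str):
            value = value.strip().lower()
            # Replace known spellings; leave unknown values unchanged.
            value = VALUE_ALIASES.get(canonical_key, {}).get(value, value)
        clean[canonical_key] = value
    return clean


# Two made-up studies that describe the same attributes differently.
# Assumption: both report age in months.
study_a = [{"Gender": "F", "age_months": 24}]
study_b = [{"sex": "male ", "age": 30, "notes": "n/a"}]

merged = [standardize(r) for r in study_a + study_b]
```

Keeping the alias tables as plain data means a new study can be integrated by extending the tables rather than changing the cleaning logic.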
Abstract
TalkBank is an online database that facilitates the sharing of linguistics research data. However, TalkBank's existing API has limited data filtering and batch processing capabilities. To overcome these limitations, this paper introduces a pipeline framework that employs a hierarchical search approach, enabling efficient selection of data by complex criteria. This approach first performs a quick preliminary screening of the corpora a researcher may need, and then performs an in-depth search for target data based on specific criteria. The identified files are then indexed, providing easier access for future analysis. Furthermore, the paper demonstrates how data from different studies curated with the framework can be integrated by standardizing and cleaning metadata, allowing researchers to extract insights from a large, integrated dataset. Although designed for TalkBank, the framework can…
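The hierarchical search described in the abstract can be sketched as three stages: a cheap screening over corpus-level metadata, a detailed search over files in the surviving corpora, and an index of the matches. This is an illustrative sketch only. The `Corpus` and `Transcript` classes, their fields, the filter criteria, and the file paths are all hypothetical, and the code runs on in-memory data rather than calling TalkBank's API.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Hypothetical per-file metadata; field names are illustrative only.
    path: str
    participant_age_months: int
    n_utterances: int


@dataclass
class Corpus:
    # Hypothetical corpus-level summary used for the cheap first pass.
    name: str
    language: str
    transcripts: list = field(default_factory=list)


def screen_corpora(corpora, language):
    """Stage 1: quick screening using corpus-level metadata only."""
    return [c for c in corpora if c.language == language]


def search_files(corpora, min_age, max_age, min_utterances):
    """Stage 2: in-depth search over files in the screened corpora."""
    hits = []
    for corpus in corpora:
        for t in corpus.transcripts:
            in_age_range = min_age <= t.participant_age_months <= max_age
            if in_age_range and t.n_utterances >= min_utterances:
                hits.append((corpus.name, t))
    return hits


def build_index(hits):
    """Stage 3: index matched file paths by corpus for later access."""
    index = {}
    for corpus_name, t in hits:
        index.setdefault(corpus_name, []).append(t.path)
    return index


# Made-up corpora for demonstration.
corpora = [
    Corpus("CorpusA", "eng", [Transcript("CorpusA/01.cha", 24, 300),
                              Transcript("CorpusA/02.cha", 60, 500)]),
    Corpus("CorpusB", "fra", [Transcript("CorpusB/01.cha", 30, 400)]),
    Corpus("CorpusC", "eng", [Transcript("CorpusC/01.cha", 36, 50),
                              Transcript("CorpusC/02.cha", 30, 250)]),
]

candidates = screen_corpora(corpora, language="eng")
hits = search_files(candidates, min_age=18, max_age=48, min_utterances=100)
file_index = build_index(hits)
```

The point of the two-level design is that the second, more expensive pass only touches files in corpora that passed the first filter; here `CorpusB` is never opened at file level.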
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
