Datasets

Building the Foundation for Inclusive, Data-Driven Speech and Language Science

SEED (Speech Exemplar and Evaluation Database) and SPROUT (Speech Production Repository for Optimizing Use for AI Technologies) are research datasets developed for speech analysis and AI model development. Together, they provide a critical foundation for developing and validating fair, inclusive, and scalable tools for early speech and language evaluation.

SEED: The Speech Exemplar and Evaluation Database

SEED includes thousands of audio recordings from children and adults, along with associated demographic and developmental information.

It supports:

  • Training and evaluation of AI models for speech assessment

  • Research on intelligibility, articulation, and linguistic variation

  • Development of screening tools aligned with clinical decision-making

What makes SEED unique?

  • Includes speech from children with speech disorders

  • Captures natural speech variation across dialects and developmental stages

  • Includes expert classifications

More about the dataset here

 

Researchers & Teachers Can Request Access through the Credentialed Dataset Here

SPROUT: Speech Production Repository for Optimizing Use for AI Technologies

SPROUT is a growing repository designed for training foundational models of young children’s speech. It expands on SEED by focusing on:

  • Controlled speech production tasks (word lists, phrases, sentences)
  • High-quality, multi-microphone audio capture
  • Representative datasets 
  • Alignment with vocal biomarker research and early screening tools

 

The SPROUT dataset includes approximately 300 children from diverse backgrounds (Black, Latine, White, and children living in low socioeconomic environments).

SPROUT enables:

  • Fine-tuning and benchmarking of child-specific ASR models
  • Exploration of acoustic biomarkers related to developmental concerns
  • Cross-site collaboration across eight research sites
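
Benchmarking an ASR model usually means comparing its transcripts against human reference transcripts, most often with word error rate (WER). The Python sketch below shows one common way to compute WER using word-level edit distance. It is an illustration only, not the SPROUT evaluation protocol: the function name `word_error_rate` and the example transcripts are hypothetical and do not come from the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; not part of any SPROUT tooling.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented transcripts, used only to exercise the function.
    print(word_error_rate("the big dog", "the dog"))
```

In practice, text normalization (casing, punctuation, handling of partial words) strongly affects WER on young children's speech, so any benchmark should state the normalization rules it applies.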

 

 

More about the dataset here: Zenodo

The dataset will become available in March 2026. Apply now to join the SPROUT user community and be notified when it is released. The application can be found under the Data Access and Governance tab.

Go to the SPROUT dashboard

Why These Datasets Matter

Preschool Speech Is a Low-Resource Domain

Most speech technologies are trained on tens of thousands of hours of adult speech. In contrast, preschool-aged speech is scarce, highly variable, and rarely included in training datasets.