Datasets

Building the Foundation for Inclusive, Data-Driven Speech and Language Science

SEED (Speech Exemplar and Evaluation Database) and SPROUT (Speech Production Repository for Optimizing Use for AI Technologies) are research datasets developed for speech analysis and AI model development. Together, they provide a critical foundation for developing and validating fair, inclusive, and scalable tools for early speech and language evaluation.

SEED: The Speech Exemplar and Evaluation Database

SEED includes thousands of audio recordings from children and adults, along with associated demographic and developmental information.

It supports:

Training and evaluation of AI models for speech assessment
Research on intelligibility, articulation, and linguistic variation
Development of screening tools aligned with clinical decision-making

What makes SEED unique?

Includes speech from children with speech disorders
Captures natural speech variation across dialects and developmental stages
Includes expert classifications

More about the dataset here

Researchers & Teachers Can Request Access through the Credentialed Dataset Here

SPROUT: Speech Production Repository for Optimizing Use for AI Technologies

SPROUT is a growing repository designed for training foundational models of young children’s speech. It expands on SEED by focusing on:

Controlled speech production tasks (word lists, phrases, sentences)
High-quality, multi-microphone audio capture
Representative datasets
Alignment with vocal biomarker research and early screening tools

The SPROUT dataset includes ~300 children from different backgrounds (Black, Latine, White, and those living in low socioeconomic environments).

SPROUT enables:

Fine-tuning and benchmarking of child-specific ASR models
Exploration of acoustic biomarkers related to developmental concerns
Cross-site collaboration across 8 research sites

More about the dataset here: Zenodo

The dataset will become available in March 2026. Apply now to join the SPROUT user community and be notified when it is released. The application can be found under the Data Access and Governance tab.

Go to the SPROUT dashboard

Why These Datasets Matter

Preschool Speech Is a Low-Resource Domain

Most speech technologies are trained on tens of thousands of hours of adult speech. In contrast, preschool-aged speech is scarce, highly variable, and rarely included in training datasets.
