Question 1

What recording types are supported?

Accepted Answer

Supported recording types are scripted utterances (phonetically balanced sentences read from a prompt), prompted voice-phrase capture (wake-words, short voice-control phrases), conversational dialog between two or more speakers, read-speech for transcription pair training, and in-the-wild spontaneous capture with consent. The recording type is confirmed during program scoping so that the audio specification and the speaker recruitment match what the model team actually needs to train against.

Question 2

How is speaker recruitment handled for rare languages and accents?

Accepted Answer

Speaker recruitment for rare and refugee-resettlement languages and for under-represented accent pairs (regional accents within a language, L2 speakers learning the language, code-switching speakers fluent across two locales) is scoped per program with a target speaker count per demographic cell. For ultra-rare pairs, the available speaker pool size and the target completion window are recorded in writing rather than promised generically.
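As an illustration of tracking a target speaker count per demographic cell, here is a minimal sketch. The cell keys (locale, accent, age band), the locale code, and the counts are hypothetical values chosen for the example, not figures from any real program.

```python
from collections import Counter

# Hypothetical targets: (locale, accent, age_band) -> required speaker count.
targets = {
    ("sw-KE", "coastal", "18-29"): 3,
    ("sw-KE", "coastal", "30-49"): 2,
}

def remaining_per_cell(targets, recruited):
    """Return how many speakers each target cell still needs (never negative)."""
    have = Counter(recruited)
    return {cell: max(need - have[cell], 0) for cell, need in targets.items()}

# One tuple per recruited speaker, using the same cell keys as the targets.
recruited = [
    ("sw-KE", "coastal", "18-29"),
    ("sw-KE", "coastal", "18-29"),
    ("sw-KE", "coastal", "30-49"),
]

gaps = remaining_per_cell(targets, recruited)
for cell, missing in gaps.items():
    print(cell, "still needs", missing)
```

A per-cell gap report of this kind makes it easy to see which cells are at risk when the available speaker pool is small.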

Question 3

What audio specifications are supported?

Accepted Answer

Common recording specifications cover sample rate (16 kHz, 22.05 kHz, 44.1 kHz, or 48 kHz), bit depth (16-bit or 24-bit), channel layout (mono or stereo), microphone class (broadcast condenser for studio recording, calibrated headset for prompted voice-phrase capture, smartphone microphone for in-the-wild capture), and a target ambient noise floor. The audio specification is confirmed during scoping so the recorded clips match what the model training pipeline expects.
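A team receiving clips could check the header-level parts of the specification with a short script. The sketch below assumes uncompressed PCM WAV delivery and uses Python's standard `wave` module; `AudioSpec` and `check_wav` are illustrative names, and microphone class and noise floor cannot be verified from the file header.

```python
import io
import wave
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioSpec:
    sample_rate_hz: int
    bit_depth: int
    channels: int

def check_wav(source, spec):
    """Return a list of header mismatches; an empty list means the header matches."""
    with wave.open(source, "rb") as wav:
        found = AudioSpec(wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels())
    problems = []
    for field in ("sample_rate_hz", "bit_depth", "channels"):
        got, want = getattr(found, field), getattr(spec, field)
        if got != want:
            problems.append(f"{field}: expected {want}, found {got}")
    return problems

# Demo: build a 0.1 s silent 16 kHz / 16-bit / mono clip in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 1600)

buf.seek(0)
print(check_wav(buf, AudioSpec(16000, 16, 1)))  # compared against the matching spec
buf.seek(0)
print(check_wav(buf, AudioSpec(48000, 24, 2)))  # compared against a different spec
```

`check_wav` also accepts a file path, since `wave.open` takes either a path or a file-like object.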

Question 4

How are consent and licensing handled for commercial training use?

Accepted Answer

Each speaker signs a consent and license form confirming commercial training use, redistribution scope (internal training only, or licensed redistribution to a partner), and the opt-out process. The consent record is timestamped and archived for compliance review. When a speaker requests opt-out, their audio is removed from the corpus, and a replacement clip is recorded at no additional cost when the program scope requires demographic balance.
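On the consuming side, a consent record can be used to filter the corpus before training. The following is a sketch only: the `ConsentRecord` fields, the redistribution labels, and the dates are assumptions made for illustration, not the delivered record format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    speaker_id: str                 # pseudonymized speaker ID
    commercial_training: bool       # consent to commercial training use
    redistribution: str             # assumed labels: "internal_only" or "partner_licensed"
    signed_at: datetime             # consent timestamp
    opted_out_at: Optional[datetime] = None

def usable_clips(clips, consents):
    """Keep clips whose speaker consented to commercial training and has not opted out."""
    active = {
        c.speaker_id
        for c in consents
        if c.commercial_training and c.opted_out_at is None
    }
    return [clip for clip in clips if clip["speaker_id"] in active]

# Hypothetical records: spk_002 has opted out, so that speaker's clips are dropped.
consents = [
    ConsentRecord("spk_001", True, "internal_only",
                  datetime(2024, 1, 5, tzinfo=timezone.utc)),
    ConsentRecord("spk_002", True, "partner_licensed",
                  datetime(2024, 1, 6, tzinfo=timezone.utc),
                  opted_out_at=datetime(2024, 3, 1, tzinfo=timezone.utc)),
]
clips = [
    {"clip_id": "c1", "speaker_id": "spk_001"},
    {"clip_id": "c2", "speaker_id": "spk_002"},
]

kept = usable_clips(clips, consents)
print([clip["clip_id"] for clip in kept])
```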

Question 5

What metadata is delivered alongside the audio files?

Accepted Answer

Standard speech corpus metadata includes a speaker ID (pseudonymized), locale code, demographic fields (age band, gender, regional accent), prompt text, transcription pair when the program scope requires it, audio file checksum for integrity verification, and the consent timestamp. The metadata schema can be adjusted during scoping to match the model team's data pipeline expectations.
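To show how the checksum field supports integrity verification, here is a minimal sketch of one metadata record. The field names, the example values, and the choice of SHA-256 are assumptions for illustration; the actual schema and hash algorithm are whatever is agreed during scoping.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Hex digest used here as the integrity checksum (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()

# Placeholder bytes standing in for the contents of a delivered audio file.
audio_bytes = b"placeholder audio payload"

# Hypothetical record covering the fields listed above.
record = {
    "speaker_id": "spk_001",
    "locale": "en-GB",
    "age_band": "30-49",
    "gender": "female",
    "regional_accent": "Scottish",
    "prompt_text": "Turn on the kitchen light.",
    "transcript": "Turn on the kitchen light.",
    "checksum_sha256": sha256_hex(audio_bytes),
    "consent_timestamp": "2024-01-05T10:30:00Z",
}

def verify(record, data: bytes) -> bool:
    """True when the file contents still match the checksum in the metadata."""
    return sha256_hex(data) == record["checksum_sha256"]

print(json.dumps(record, indent=2))
print("intact:", verify(record, audio_bytes))
```

Running `verify` on ingest catches files that were truncated or altered in transfer before they reach the training pipeline.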

Question 6

Is transcription delivered alongside the audio?

Accepted Answer

Yes, when the program scope requires it. Transcription pairs (audio file plus verbatim transcript in the source language) are produced with reviewer-level QA against the audio. For STT and ASR training, the transcript is verified character-by-character against the recording. For TTS training, the transcript matches the prompt source. When the original scope covers the audio corpus only, transcription is scoped as a separate item.
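A character-level comparison between two versions of a transcript (for example, the prompt text and a reviewer's verbatim transcript) can be sketched with Python's standard `difflib`. This only illustrates a text-to-text check; verifying a transcript against the recording itself still depends on a human reviewer listening to the audio. The function name `char_diffs` is hypothetical.

```python
from difflib import SequenceMatcher

def char_diffs(reference, candidate):
    """List (operation, reference_span, candidate_span) for every run of differing characters."""
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    return [
        (op, reference[i1:i2], candidate[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

prompt = "turn on the light"
transcript = "turn of the light"

for op, ref_span, cand_span in char_diffs(prompt, transcript):
    print(op, repr(ref_span), "->", repr(cand_span))
```

An empty result means the two strings are identical character for character.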

Question 7

How is audio QA done on every clip?

Accepted Answer

Audio QA covers signal level (no clipping, no excessively low signal), ambient noise floor (within the agreed range), speaker identification (clip attributed to the correct speaker ID), prompt match (the audio matches the source prompt when the recording type is scripted), and clip integrity (no truncation, no pops, no gain artifacts). Clips that fail QA are re-recorded at no additional speaker cost rather than shipped as-is.
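The signal-level part of such a check can be approximated in a few lines. This sketch assumes 16-bit signed PCM samples already decoded into a list of integers; the thresholds (`clip_ratio_max`, `min_rms_dbfs`) are illustrative defaults, not agreed QA limits, and the other checks (noise floor, speaker attribution, prompt match) are out of scope here.

```python
import math

FULL_SCALE = 32767  # largest positive value for 16-bit signed PCM

def signal_level_report(samples, clip_ratio_max=0.001, min_rms_dbfs=-40.0):
    """Flag clipping and low signal for a sequence of 16-bit PCM sample values."""
    if not samples:
        return {"clipping": False, "too_quiet": True, "passed": False}
    # Samples at or beyond full scale are treated as clipped.
    clipped = sum(1 for s in samples if abs(s) >= FULL_SCALE)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    rms_dbfs = 20 * math.log10(rms / FULL_SCALE) if rms > 0 else float("-inf")
    clipping = clipped / len(samples) > clip_ratio_max
    too_quiet = rms_dbfs < min_rms_dbfs
    return {"clipping": clipping, "too_quiet": too_quiet,
            "passed": not (clipping or too_quiet)}

def tone(amplitude, n=16000, freq=440.0, rate=16000):
    """Generate n samples of a sine tone as a stand-in for recorded speech."""
    return [int(amplitude * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)]

good = signal_level_report(tone(16000))
quiet = signal_level_report(tone(50))
hot = signal_level_report([FULL_SCALE] * 1000)
print(good, quiet, hot, sep="\n")
```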

Question 8

Can existing audio sets be augmented or re-recorded?

Accepted Answer

Yes. Existing audio sets can be augmented with additional speakers, accents, or recording types to match a wider data balance requirement, or re-recorded against a tighter audio specification when the model training pipeline changes (sample rate change, channel layout change, ambient noise tightening). Augmentation programs match the source set's speaker profile so the combined corpus remains statistically consistent with the original.
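One simple way to check that added speakers keep the source set's profile is to compare category shares before and after augmentation. The sketch below is an assumption about how a team might do this; the field name, the speaker records, and the idea of a single "drift" number are illustrative, and both speaker lists are assumed to be non-empty.

```python
from collections import Counter

def profile_shares(speakers, field):
    """Share of speakers per category for one demographic field."""
    counts = Counter(s[field] for s in speakers)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def max_share_drift(source, combined, field):
    """Largest absolute change in any category's share between the two sets."""
    before = profile_shares(source, field)
    after = profile_shares(combined, field)
    categories = set(before) | set(after)
    return max(abs(before.get(c, 0.0) - after.get(c, 0.0)) for c in categories)

# Hypothetical speaker lists.
source = [{"gender": "female"}, {"gender": "female"},
          {"gender": "male"}, {"gender": "male"}]
balanced_add = [{"gender": "female"}, {"gender": "male"}]
skewed_add = [{"gender": "male"}, {"gender": "male"}]

print(max_share_drift(source, source + balanced_add, "gender"))
print(max_share_drift(source, source + skewed_add, "gender"))
```

A drift near zero suggests the augmentation preserved the original balance for that field; a larger value points to a cell that needs more recruitment.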

Scope multilingual speech data collection with language, accent, and audio specification settled first.

What DD can show before a buyer commits.

How the work runs

Scope the program

Recruit speakers

Record against specification

Run audio QA on every clip

Release the corpus

What this page helps you send

What you receive

Questions teams ask first

What recording types are supported?

How is speaker recruitment handled for rare languages and accents?

What audio specifications are supported?

How are consent and licensing handled for commercial training use?

What metadata is delivered alongside the audio files?

Is transcription delivered alongside the audio?

How is audio QA done on every clip?

Can existing audio sets be augmented or re-recorded?

Get the right scope in writing.
