Understanding the Gladia Transcription Process
Introduction
When using Gladia’s transcription API, understanding how your audio is processed helps you interpret response times and optimize performance.
Each transcription request passes through multiple internal stages before the final text output is generated.
Knowing these steps helps you understand where time is spent and how to optimize your requests — for example, by defining languages in advance or managing concurrency according to your plan.
See also: Optimizing transcription performance by specifying probable languages
The Transcription Pipeline: Step by Step
Each transcription request follows a predictable sequence of stages.
1. Queue (Job Scheduling)
Before processing begins, each transcription request enters a queue.
This initial waiting time depends on:
The number of active transcriptions currently running on Gladia’s clusters
The overall system load
Your plan’s concurrency limits
This queuing mechanism ensures fair resource allocation across users.
During high-demand periods or when account limits are reached, short queue delays may appear — especially visible on short audio files where queue time can represent a significant share of the total duration.
For more information, see the official documentation on Concurrency and Rate Limits.
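If you want to see how long a job actually waits in the queue, you can poll the result endpoint after submitting a request and note when the status changes. Below is a minimal sketch in Python; the endpoint URL, header, response fields, and status values are assumptions based on the Gladia v2 API and should be checked against the current API reference.

import time

import requests

API_KEY = "YOUR_GLADIA_API_KEY"  # placeholder
HEADERS = {"x-gladia-key": API_KEY}

# Submit a pre-recorded job (audio_url must be publicly reachable).
submit = requests.post(
    "https://api.gladia.io/v2/pre-recorded",
    headers=HEADERS,
    json={"audio_url": "https://example.com/meeting.mp3"},
)
submit.raise_for_status()
result_url = submit.json()["result_url"]  # assumed response field

# Poll the result and note when the job leaves the queue.
submitted_at = time.monotonic()
queue_time = None
while True:
    job = requests.get(result_url, headers=HEADERS).json()
    status = job.get("status")  # e.g. "queued", "processing", "done", "error"
    if status != "queued" and queue_time is None:
        queue_time = time.monotonic() - submitted_at
        print(f"Left the queue after {queue_time:.1f}s")
    if status in ("done", "error"):
        break
    time.sleep(2)

On short files, comparing this queue time with the total turnaround shows how much of the delay comes from scheduling rather than from transcription itself.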
2. Pre-processing
Once a job is dispatched, the audio is prepared for transcription.
This step includes:
Audio normalization (format, codec, bitrate, sample rate). The type of audio format can slightly impact processing time, as some formats require longer conversion before transcription begins. You can find detailed estimates here: Conversion Time by Audio Format
Voice Activity Detection (VAD) to isolate speech from silence
Pre-processing ensures consistent and clean input for accurate transcription.
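If conversion time matters for your use case, you can also normalize audio on your side before uploading it. The sketch below shows one possible approach, assuming ffmpeg is installed locally; the 16 kHz mono WAV target is only an example, not an official Gladia requirement.

import subprocess

def normalize(src, dst="normalized.wav"):
    # Convert any input file to 16 kHz mono PCM WAV before uploading it.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst],
        check=True,
    )
    return dst

normalize("interview.m4a")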
3. Language Discovery
If no language is specified in your request, the system automatically detects which language(s) are spoken in the audio.
This step ensures accuracy but can increase total processing time.
If you already know the language, you can skip this phase entirely by defining it in your request.
Learn more here: Specifying probable languages
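For example, a request for audio you already know is in English could look like the sketch below. The language_config structure is an assumption based on the Gladia v2 documentation; verify the exact schema in the API reference.

import requests

payload = {
    "audio_url": "https://example.com/podcast-episode.mp3",
    "language_config": {
        "languages": ["en"],      # the audio is known to be English
        "code_switching": False,  # no switching between languages expected
    },
}
response = requests.post(
    "https://api.gladia.io/v2/pre-recorded",
    headers={"x-gladia-key": "YOUR_GLADIA_API_KEY"},
    json=payload,
)
print(response.json())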
4. Inference
This is the core phase where speech is converted into text.
The inference time scales naturally with audio length and content complexity and is typically the most stable and predictable part of the process.
5. Post-processing and Formatting
After inference, the transcription is refined and formatted.
Depending on your configuration, this may include:
Diarization (speaker separation)
Sentence structuring
Custom vocabulary / custom spelling
Other enabled add-ons
These enhancements improve readability and deliver a clear, well-structured result.
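As an illustration, the request below enables diarization and a custom vocabulary on a pre-recorded job. The parameter names are assumptions based on the Gladia v2 documentation and should be confirmed in the API reference before use.

import requests

payload = {
    "audio_url": "https://example.com/sales-call.wav",
    "diarization": True,  # speaker separation
    "custom_vocabulary": True,
    "custom_vocabulary_config": {
        "vocabulary": ["Gladia", "diarization", "VAD"],
    },
}
response = requests.post(
    "https://api.gladia.io/v2/pre-recorded",
    headers={"x-gladia-key": "YOUR_GLADIA_API_KEY"},
    json=payload,
)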
Account-Level Concurrency and Plan Limits
Processing speed and queue time can also vary depending on your account type.
Here’s an overview of concurrency limits per plan:
Enterprise plan → Unlimited usage, on-demand concurrency
Paid plan (Self Serve, Scaling) → Unlimited usage, up to 25 concurrent pre-recorded transcriptions
Free plan → 10 hours per month, 3 concurrent pre-recorded transcriptions, and 1 live session
When concurrency limits are reached, additional requests are queued until processing slots become available.
Full details are available in the Gladia Docs – Concurrency and Rate Limits.
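One practical way to work within these limits is to cap concurrency on the client side, so extra files wait locally instead of piling up in the server-side queue. The sketch below uses a worker pool sized to the Free plan's 3 concurrent pre-recorded transcriptions; the endpoint and response fields are assumptions, and the limit should be adjusted to match your own plan.

from concurrent.futures import ThreadPoolExecutor
import time

import requests

API_KEY = "YOUR_GLADIA_API_KEY"
AUDIO_URLS = [
    "https://example.com/call-01.mp3",
    "https://example.com/call-02.mp3",
    "https://example.com/call-03.mp3",
    "https://example.com/call-04.mp3",
]

def transcribe(audio_url):
    # Submit one file and wait for it to finish before freeing the worker slot,
    # so the pool size really does bound the number of jobs running at once.
    response = requests.post(
        "https://api.gladia.io/v2/pre-recorded",
        headers={"x-gladia-key": API_KEY},
        json={"audio_url": audio_url},
    )
    response.raise_for_status()
    result_url = response.json()["result_url"]  # assumed response field
    while True:
        job = requests.get(result_url, headers={"x-gladia-key": API_KEY}).json()
        if job.get("status") in ("done", "error"):
            return job
        time.sleep(5)

# At most 3 jobs in flight at a time, matching the Free plan figure above.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(transcribe, AUDIO_URLS))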
Multi-language and Code-Switching Support
If your audio contains multiple speakers or languages, Gladia can automatically detect and transcribe them.
You can also define several probable languages manually to improve accuracy and consistency.
Learn more in this article: Handling audio with multi-language speakers
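For instance, a request for a recording that mixes English, French, and Spanish might hint all three languages and enable code-switching, as in the sketch below. Field names are assumptions based on the Gladia v2 documentation.

import requests

payload = {
    "audio_url": "https://example.com/bilingual-interview.mp3",
    "language_config": {
        "languages": ["en", "fr", "es"],  # probable languages in the recording
        "code_switching": True,           # speakers may switch mid-conversation
    },
}
response = requests.post(
    "https://api.gladia.io/v2/pre-recorded",
    headers={"x-gladia-key": "YOUR_GLADIA_API_KEY"},
    json=payload,
)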
Conclusion
Every transcription request follows the same pipeline, from queue management to post-processing.
A short queue at the start is normal and depends on both system demand and your plan’s concurrency limits.
By specifying probable languages, understanding concurrency behavior, and optimizing your configuration, you can achieve faster, more predictable, and consistent transcription performance.