Gladia Help Center home

Understanding the Gladia Transcription Process Breakdown

Introduction

When using Gladia’s transcription API, understanding how your audio is processed helps you interpret response times and optimize performance.
Each transcription request passes through multiple internal stages before the final text output is generated.

Knowing these steps helps you understand where time is spent and how to optimize your requests, for example by defining languages in advance or by managing concurrency according to your plan.

See also: Optimizing transcription performance by specifying probable languages

The Transcription Pipeline: Step by Step

Each transcription request follows a predictable sequence of stages.

1. Queue (Job Scheduling)

Before processing begins, each transcription request enters a queue.
This initial waiting time depends on:

  • The number of active transcriptions currently running on Gladia’s clusters

  • The overall system load

  • Your plan’s concurrency limits

This queuing mechanism ensures fair resource allocation across users.
During high-demand periods, or when your account limits are reached, short queue delays may appear. These are especially visible on short audio files, where queue time can represent a significant share of the total duration.

For more information, see the official documentation on Concurrency and Rate Limits.
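While a job sits in the queue, clients typically poll the result endpoint rather than block. A minimal sketch of a capped exponential backoff schedule for that polling loop (the function name and timing values are illustrative, not from Gladia's docs):

```python
def backoff_schedule(base: float = 1.0, factor: float = 2.0,
                     cap: float = 30.0, attempts: int = 6) -> list[float]:
    """Return the wait times (in seconds) between successive polls.

    Starts at `base` seconds and multiplies by `factor` each attempt,
    capped at `cap`, so short queue waits resolve quickly without
    hammering the API during longer waits.
    """
    delays = []
    wait = base
    for _ in range(attempts):
        delays.append(min(wait, cap))
        wait *= factor
    return delays

# A client would sleep for each delay between GETs on the job's result URL.
print(backoff_schedule())  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

This keeps request volume low while a job is queued, which matters most on busy clusters or near your concurrency limit.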

2. Pre-processing

Once a job is dispatched, the audio is prepared for transcription.
This step includes:

  • Audio normalization (format, codec, bitrate, sample rate). The type of audio format can slightly impact processing time, as some formats require longer conversion before transcription begins.
    You can find detailed estimates here: Conversion Time by Audio Format

  • Voice Activity Detection (VAD) to isolate speech from silence

Pre-processing ensures consistent and clean input for accurate transcription.
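If you want to reduce server-side conversion time yourself, you can normalize audio before upload, for instance with ffmpeg. A sketch that builds the conversion command (16 kHz mono 16-bit PCM WAV is a common, conversion-friendly target; this assumes ffmpeg is installed locally):

```python
def ffmpeg_normalize_cmd(src: str, dst: str,
                         sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg command converting `src` to mono PCM WAV at
    `sample_rate` Hz, a format that needs little further conversion."""
    return [
        "ffmpeg", "-y",           # overwrite output without asking
        "-i", src,                # input file in any supported format
        "-ac", "1",               # mix down to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # 16-bit PCM codec
        dst,
    ]

cmd = ffmpeg_normalize_cmd("meeting.mp3", "meeting.wav")
# Run with: subprocess.run(cmd, check=True)
print(cmd)
```

Pre-converting is optional; the API accepts many formats directly, but a simple WAV input shortens this stage.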

3. Language Discovery

If no language is specified in your request, the system automatically detects which language(s) are spoken in the audio.
This step ensures accuracy but can increase total processing time.

If you already know the language, you can skip this phase entirely by defining it in your request.
Learn more here: Specifying probable languages

4. Inference

This is the core phase where speech is converted into text.
Inference time scales with audio length and content complexity, and it is typically the most stable and predictable part of the process.

5. Post-processing and Formatting

After inference, the transcription is refined and formatted.
Depending on your configuration, this may include:

  • Diarization (speaker separation)

  • Sentence structuring

  • Custom Vocabulary / Custom Spelling

  • Other enabled add-ons

These enhancements improve readability and deliver a clear, well-structured result.

Account-Level Concurrency and Plan Limits

Processing speed and queue time can also vary depending on your account type.
Here’s an overview of concurrency limits per plan:

  • Enterprise plan → Unlimited usage, on-demand concurrency

  • Paid plan (Self Serve, Scaling) → Unlimited usage, up to 25 concurrent pre-recorded transcriptions

  • Free plan → 10 hours per month, 3 concurrent pre-recorded transcriptions, and 1 live

When concurrency limits are reached, additional requests are queued until processing slots become available.
Full details are available in the Gladia Docs – Concurrency and Rate Limits.

Multi-language and Code-Switching Support

If your audio contains multiple speakers or languages, Gladia can automatically detect and transcribe them.
You can also define several probable languages manually to improve accuracy and consistency.
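Defining several probable languages usually takes the form of a list in the request body. A sketch (the `language_config`, `languages`, and `code_switching` field names are assumptions; verify them in the API reference):

```python
def multilingual_payload(audio_url: str,
                         probable_languages: list[str]) -> dict:
    """Build a request body that narrows automatic detection to a set
    of probable languages, which helps with code-switching audio."""
    return {
        "audio_url": audio_url,
        "detect_language": True,              # assumed flag name
        "language_config": {                  # assumed structure
            "languages": probable_languages,  # e.g. ["en", "fr", "es"]
            "code_switching": True,           # allow mid-audio switches
        },
    }

body = multilingual_payload("https://example.com/call.mp3", ["en", "fr"])
print(body)
```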

Learn more in this article: Handling audio with multi-language speakers

Conclusion

Every transcription request follows the same pipeline, from queue management to post-processing.
A short queue at the start is normal and depends on both system demand and your plan’s concurrency limits.

By specifying probable languages, understanding concurrency behavior, and optimizing your configuration, you can achieve faster, more predictable, and consistent transcription performance.