Speech to Text
EnableX Speech to Text (STT) converts spoken audio from phone calls into accurate text transcripts. STT is already live as an integrated capability within the EnableX Voice platform — you configure it from the Portal, link it to a phone number, and transcripts are delivered to your application over webhook. A standalone STT API — for submitting any audio source independently of a call — is under development and coming soon.
STT in Voice Calls — Available Now
If you are running inbound call flows through EnableX Voice, STT is available to you today without any custom integration work. The setup is entirely Portal-driven: you configure an STT provider against a phone number, and EnableX handles transcription automatically when a call arrives on that number.
How it works
When an incoming call arrives on a phone number that has STT configured, EnableX captures the caller's speech during the call and passes it through the linked STT provider. Once transcription is complete, EnableX delivers the transcript to your application via a webhook notification. Your server receives the text — along with the call context — and can act on it immediately: log it, route it, analyse it, or trigger a workflow.
This means transcription runs in the background as a platform service. You do not need to record the call yourself, extract the audio, or call a third-party transcription API. The pipeline from call audio to webhook-delivered transcript is managed entirely within EnableX.
Portal setup
STT is configured per phone number from the EnableX Portal under your project's AI Services settings. You select the STT provider, set the language, and associate it with one or more phone numbers. Once configured, every call arriving on those numbers triggers transcription automatically, with no code changes required in your Voice API integration.
What you receive over webhook
When transcription is ready, EnableX sends a webhook notification to your configured event URL. The payload includes the transcript text, the associated call reference, and language metadata. Your application processes this exactly as it would any other EnableX webhook — authenticate the request, read the payload, and route the transcript to wherever it needs to go in your system.
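The three steps above (authenticate, read, route) can be sketched in Python as follows. This is a minimal illustration, not the documented EnableX payload: the field names `call_ref`, `language`, and `transcript` are hypothetical, and the HMAC-SHA256 signature check is an assumed authentication scheme chosen only to make the example concrete. Use whatever authentication and payload keys your EnableX webhook configuration actually provides.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    call_ref: str
    language: str
    text: str


def verify_signature(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Check a hex HMAC-SHA256 signature (assumed scheme, for illustration only)."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature_hex)


def parse_transcript_event(raw_body: bytes) -> Transcript:
    """Pull out the transcript text, call reference, and language metadata."""
    event = json.loads(raw_body)
    # Hypothetical key names; replace with the keys in the real payload.
    return Transcript(
        call_ref=event["call_ref"],
        language=event["language"],
        text=event["transcript"],
    )


def handle_webhook(raw_body: bytes, signature_hex: str, secret: bytes) -> Optional[Transcript]:
    """Return the parsed transcript, or None when authentication fails."""
    if not verify_signature(raw_body, signature_hex, secret):
        return None
    return parse_transcript_event(raw_body)
```

In a real service, `handle_webhook` would sit behind your HTTP framework's request handler, and the returned `Transcript` would be passed on to logging, routing, or analysis code.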
Interactive Speech Capture with asr: true
Beyond passive full-call transcription, the Voice API Play method supports a targeted speech capture mode using the asr: true flag. This is the mechanism for building interactive spoken conversations over a live call — where the caller responds verbally to a prompt and your application receives the spoken input as text.
How it works
When you call the Play API with asr: true, EnableX plays the prompt to the caller and then enters a listening state. It waits for the caller to speak. The speech is captured, passed through the configured ASR (speech recognition) engine, and the resulting transcript is delivered to your webhook. Your application reads the transcribed text, decides what to do next, and plays the next prompt — with or without another asr: true — to continue the conversation.
This is fundamentally different from passive transcription. Passive transcription captures the entire call audio in the background and delivers one transcript at the end. asr: true is targeted — it listens specifically after a prompt, captures the caller's response to that prompt, and delivers the text to your application in near-real-time so the conversation can continue.
Timeout control
Because the system is waiting for the caller to speak, a timeout is configurable. If the caller does not respond within the timeout window, EnableX fires the webhook with a no-input event rather than waiting indefinitely. Your application handles this just like a DTMF timeout — play a reprompt, try again, or route accordingly. The timeout prevents stalled calls when a caller is silent or has stepped away.
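A rough sketch of how an application might react to a no-input event is shown below. The event key `event_type`, the value `no_input`, and the `NextStep` structure are all hypothetical names introduced for this example, and the limit of two reprompts is an assumed policy rather than an EnableX default.

```python
from dataclasses import dataclass

MAX_REPROMPTS = 2  # assumed application policy, not a platform setting


@dataclass
class NextStep:
    action: str  # "play" or "transfer"
    prompt: str
    asr: bool    # whether to listen for speech after the prompt


def on_asr_result(event: dict, attempts_so_far: int) -> NextStep:
    """Decide what to do after a speech capture attempt."""
    # "event_type" and "no_input" are hypothetical names for the timeout event.
    if event.get("event_type") == "no_input":
        if attempts_so_far < MAX_REPROMPTS:
            # Caller was silent: reprompt and listen again.
            return NextStep("play", "Sorry, I didn't hear anything. Please say the department you need.", True)
        # Too many silent attempts: stop looping and hand the call off.
        return NextStep("transfer", "Connecting you to an agent.", False)
    # Speech was captured: confirm it back to the caller without listening again.
    return NextStep("play", f"You said: {event.get('transcript', '')}", False)
```

Capping the number of reprompts keeps a silent caller from holding the call in a loop, which mirrors the usual handling of DTMF timeouts.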
The interactive conversation loop
Combining Play with asr: true and TTS in webhook responses creates a self-contained interactive voice conversation on a live phone call:
- EnableX calls your webhook when the call connects
- Your webhook responds: play a prompt, set asr: true
- EnableX plays the prompt, then listens for the caller's spoken response
- The caller speaks — ASR transcribes the speech
- EnableX delivers the transcript to your webhook
- Your application reads the transcript, processes it (NLU, LLM, rule logic, database lookup), and responds with the next prompt
- Loop continues for as many turns as the conversation requires
The asr: true pattern is what makes spoken IVR and voice bot conversations possible on EnableX. Instead of pressing keys, the caller speaks, and your application receives their words as text to act on. The caller never needs to touch their keypad.
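The conversation loop can be modelled as a small state machine on the application side. The sketch below simulates it without any network calls. The state names, the prompts, and the `{"play": ..., "asr": ...}` response shape are hypothetical and stand in for whatever your webhook actually returns to EnableX.

```python
from typing import Dict, List, Optional, Tuple

# Each state maps to (prompt to play, whether to listen afterwards).
PROMPTS: Dict[str, Tuple[str, bool]] = {
    "ask_name": ("Please say your name.", True),
    "ask_reason": ("Thanks. What are you calling about?", True),
    "done": ("Thank you. Goodbye.", False),
}

TRANSITIONS = {"ask_name": "ask_reason", "ask_reason": "done"}


def next_state(state: str, transcript: Optional[str]) -> str:
    """Advance the dialogue, or stay put when nothing usable was heard."""
    if transcript is None or not transcript.strip():
        return state  # repeat the same prompt
    return TRANSITIONS.get(state, "done")


def webhook_response(state: str) -> dict:
    """Build the (hypothetical) instruction returned for the current state."""
    prompt, listen = PROMPTS[state]
    return {"play": prompt, "asr": listen}


def simulate_call(caller_turns: List[str]) -> List[dict]:
    """Replay a list of caller transcripts and collect each response."""
    state = "ask_name"
    log = [webhook_response(state)]  # response sent when the call connects
    for spoken in caller_turns:
        state = next_state(state, spoken)
        log.append(webhook_response(state))
        if state == "done":
            break
    return log
```

Calling `simulate_call(["Asha", "", "billing"])` walks through one full conversation, including a repeated prompt for the empty second turn. In production the `next_state` step is where NLU, an LLM, or a database lookup would be plugged in.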
Common use cases
Spoken IVR: Instead of "Press 1 for sales", the caller hears "Say the department you need." EnableX captures the response, ASR transcribes it, and your webhook routes accordingly. No keypad required.
Voice bot conversations: A full multi-turn spoken conversation — greet the caller, ask questions, process their spoken answers, respond with information, confirm details, and close the interaction — all driven by the asr: true loop over a live call.
Guided data collection: Collect account numbers, reference codes, addresses, or any spoken input — transcribed and delivered as structured text to your application. Works alongside DTMF collection for hybrid flows where some inputs are spoken and others are keypad.
Contact centre transcription: For passive full-call transcription — where you want the entire conversation captured rather than individual responses — the Portal-configured STT on the phone number handles this without asr: true. Transcripts are delivered to your webhook at the end of the call.
Compliance and audit: Many regulated industries require retained, searchable records of customer interactions. Transcripts delivered via webhook can be indexed and stored in your own systems to satisfy retention requirements.
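To make the spoken IVR use case listed above more concrete, here is a minimal keyword-based router that maps a transcribed response to a department. The department names and keyword sets are invented for illustration, and simple keyword matching is a stand-in for whatever NLU or intent model a real deployment would use.

```python
import re
from typing import Optional

# Illustrative keyword sets; a real deployment would tune or replace these.
DEPARTMENTS = {
    "sales": {"sales", "buy", "purchase", "pricing"},
    "support": {"support", "help", "problem", "issue"},
    "billing": {"billing", "invoice", "payment", "refund"},
}


def route_department(transcript: str) -> Optional[str]:
    """Return the department with the most keyword matches, or None if none match."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    best, best_hits = None, 0
    for dept, keywords in DEPARTMENTS.items():
        hits = len(words & keywords)
        if hits > best_hits:  # ties keep the earlier department
            best, best_hits = dept, hits
    return best
```

A `None` result is a natural trigger for a reprompt or a fallback to DTMF collection.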
Standalone STT API — Coming Soon
The Voice-integrated STT described above works only within the context of an active EnableX Voice call. The standalone STT service will extend transcription to any audio source, such as uploaded recordings, archived files, or audio from third-party systems, without requiring an EnableX Voice call.
This is intended for teams that need transcription as a general-purpose service: post-processing call recordings made on other platforms, transcribing meeting recordings, building compliance archives from mixed audio sources, or running batch transcription jobs against large libraries of historical audio.
What the standalone API will enable
File and URL submission: Submit audio by uploading a file directly or passing a publicly accessible URL. Supported formats will include common telephony and media formats (MP3, WAV, OGG, M4A).
Webhook delivery: Transcripts are delivered asynchronously to your webhook URL when processing completes — the same pattern already used for Voice-integrated transcription.
Speaker diarisation: Identify and separate speakers in multi-party recordings, so the transcript attributes each segment to a distinct speaker rather than producing a single undifferentiated stream.
Word-level timestamps: Each word is tagged with its position in the audio timeline. Useful for building audio search, playback synchronisation, or clip extraction tools.
Multi-language support: Language coverage will reflect the markets where EnableX Voice is most active — English (India and US), Hindi, Arabic, Bahasa Indonesia, Tamil, Telugu, Bengali, and Tagalog, with broader coverage added over time.
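As an illustration of what word-level timestamps make possible, the sketch below finds the time range of a spoken phrase so that a clip could be extracted or playback synchronised. The standalone API is not yet released, so the `(text, start_seconds, end_seconds)` word format used here is purely an assumption and not the eventual response shape.

```python
from typing import List, Optional, Tuple

# Assumed representation of one timestamped word: (text, start_seconds, end_seconds).
Word = Tuple[str, float, float]


def find_phrase(words: List[Word], phrase: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds of the first occurrence of phrase, or None."""
    target = phrase.lower().split()
    if not target:
        return None
    tokens = [w[0].lower() for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            # Start of the first matched word, end of the last matched word.
            return words[i][1], words[i + len(target) - 1][2]
    return None
```

The returned pair can be handed to any audio tool that cuts by time offsets.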
Get notified at launch
To register interest or request early access to the standalone STT API:
- Sign in to the EnableX Portal and look for the AI Services section
- Contact your EnableX account team or reach out via the website