Add AI Voice to Your App | EnableX Text to Speech API

EnableX Text to Speech (TTS) synthesises natural-sounding speech from text and plays it directly into live phone calls. TTS is already live as an integrated capability of the EnableX Voice platform: you configure a TTS provider against a phone number in the Portal, and your Play API calls can then pass plain text or SSML for real-time speech synthesis during the call. A standalone TTS API, for generating audio files independently of a call, is under development.

TTS in Voice Calls — Available Now

If you are building IVR flows, voice bots, or outbound call experiences with EnableX Voice, TTS is available to you today. Setup is Portal-driven: you configure a TTS provider and link it to your phone number. Once configured, the Play API on active calls can synthesise and play any text or SSML string you pass — without pre-recorded audio files.

Two ways to use TTS in a call

TTS in the Voice platform works in two modes. Both require a TTS provider to be configured against the phone number in the Portal, but they differ in when and how the speech is triggered.

Developer-controlled playback via webhook: This is the primary mode. During an active call, your application controls exactly what gets spoken and when. Your webhook response instructs EnableX to play a piece of text or SSML — EnableX synthesises it through the configured TTS provider and plays it to the caller. Every prompt, menu option, confirmation, and dynamic value is spoken at the explicit direction of your application. Nothing plays unless your code says so.

Portal-configured auto-welcome: Optionally, a welcome message can be configured directly in the Portal and associated with a phone number. When an incoming call arrives on that number, EnableX automatically plays the configured welcome message before any webhook processing begins. This is useful for a fixed greeting — "Thank you for calling, please hold" — that you want to play immediately on answer without a webhook round-trip. It is not mandatory. Developers who prefer to control the welcome message through their own webhook response can do so and skip the auto-welcome configuration entirely.

ℹ️ Auto-welcome is optional. The Portal-configured welcome message is a convenience feature for a fixed greeting on call arrival. All other TTS playback in the call (prompts, menus, responses) is controlled through your webhook, which is always in control of the call flow after the initial answer.

How developer-controlled playback works

When a call is active and you want to speak something to the caller, your webhook response passes the text or SSML content to EnableX through the Play action. EnableX routes it through the TTS provider configured for that phone number, synthesises the audio in real time, and plays it into the call. The caller hears natural speech. Your application supplies only the text — EnableX handles the synthesis, buffering, and playback.
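
The flow above can be sketched as a small webhook handler that decides what to say and returns a Play instruction. This is only an illustration: the event shape (`state`, `digit`) and the response fields (`action`, `type`, `content`) are hypothetical placeholders, not the documented EnableX schema, so check the Voice API reference for the real field names.

```python
import json


def build_play_response(content: str, is_ssml: bool = False) -> str:
    """Return a JSON body asking the platform to speak `content`.

    The keys "action", "type" and "content" are assumed names used for
    illustration only; they are not taken from the EnableX documentation.
    """
    if not content.strip():
        raise ValueError("content must not be empty")
    payload = {
        "action": "play",
        "type": "ssml" if is_ssml else "text",
        "content": content,
    }
    return json.dumps(payload)


def handle_call_event(event: dict) -> str:
    """Pick the next prompt for a webhook event (hypothetical event shape)."""
    if event.get("state") == "answered":
        return build_play_response("Welcome. Press 1 for sales or 2 for support.")
    digit = event.get("digit")
    if digit == "1":
        return build_play_response("Connecting you to sales.")
    if digit == "2":
        return build_play_response("Connecting you to support.")
    # Any other input falls through to a re-prompt.
    return build_play_response("Sorry, I did not catch that. Please try again.")
```

Because the application only ever produces text, changing a prompt means editing a string in `handle_call_event`; no audio asset is touched.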

This eliminates the audio file management problem. There is no recording studio, no audio hosting, no file format conversion, and no re-recording cycle when the script changes. Updating what the caller hears is a text change in your application.

Portal setup

TTS is configured per phone number from the EnableX Portal under your project's AI Services settings. You select the TTS provider, language, and voice — male or female — and associate the configuration with one or more phone numbers. All calls on those numbers will use the configured voice for any text playback instruction your application sends. The optional auto-welcome message is configured in the same section.

💡 Already using the Play API? If your Voice API integration uses the Play action to speak content to callers, enabling TTS through the Portal is the step that activates text and SSML playback on that number. Your API call structure does not change.

Text and SSML support

The Play API accepts both plain text and SSML (Speech Synthesis Markup Language). Plain text is synthesised with the voice engine's default rendering. SSML gives you explicit control over how the speech sounds — inserting pauses, controlling reading speed for specific phrases, spelling out characters individually (useful for OTPs and reference numbers), and adjusting emphasis on key words.

SSML is particularly important for telephony use cases where clarity matters: reading a six-digit code digit by digit, announcing a currency amount with correct inflection, or slowing down a date to ensure the caller catches it correctly.
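
As a minimal sketch of the digit-by-digit case, the helper below wraps each digit of a code in a standard SSML `say-as` element and separates digits with `break` pauses. The function name and the default pause length are illustrative choices, and support for `interpret-as="characters"` varies between TTS providers, so verify the output against the provider configured for your number.

```python
from xml.sax.saxutils import escape


def otp_to_ssml(code: str, intro: str = "Your verification code is",
                pause_ms: int = 400) -> str:
    """Build an SSML string that reads `code` one digit at a time."""
    if not (code.isascii() and code.isdigit()):
        raise ValueError("code must be a non-empty string of ASCII digits")
    pause = f'<break time="{pause_ms}ms"/>'
    # One say-as element per digit, with a pause between consecutive digits.
    digits = pause.join(
        f'<say-as interpret-as="characters">{d}</say-as>' for d in code
    )
    # escape() keeps characters such as "&" in the intro from breaking the XML.
    return f"<speak>{escape(intro)} {pause}{digits}.</speak>"
```

Calling `otp_to_ssml("482913")` yields a single `<speak>` document with six `say-as` elements; the resulting string is what you would pass as the SSML content of a Play instruction.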

Common use cases

IVR menus and prompts: Welcome messages, menu options, error prompts, and confirmation messages are all synthesised from text at call time. Changing the script requires no audio re-recording — edit the text, and the next call hears the updated version.

OTP and verification calls: Security codes are passed as SSML with character-by-character reading instructions, ensuring each digit is spoken clearly and individually.

Personalised outbound messages: Outbound alert calls that include the customer's name, appointment time, or account balance cannot use static audio. TTS synthesises the full message dynamically from a template with runtime values filled in.

Voice bot responses: In a live voice bot, TTS is the output half of the conversation loop. Your NLU or LLM layer generates a text response; TTS speaks it back to the caller in real time. The caller experiences a fluid spoken conversation.

Voice Campaigns: The EnableX Voice Campaign tool in the Portal already uses TTS to synthesise welcome messages, main pitch content, and IVR menu prompts for outbound campaigns — using the same underlying TTS infrastructure available through the Voice API.
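
The personalised outbound message case above comes down to filling a text template at call time. The sketch below uses Python's standard `string.Template`; the template wording and the placeholder names (`name`, `date`, `time`) are invented for illustration.

```python
from string import Template

# Hypothetical reminder script; the $placeholders are filled per call.
REMINDER = Template(
    "Hello $name, this is a reminder of your appointment on $date at $time."
)


def render_message(template: Template, **values: str) -> str:
    """Fill every placeholder in `template` with the supplied values.

    substitute() raises KeyError when a placeholder has no value, so a
    half-filled message is never handed to the TTS engine.
    """
    return template.substitute(**values)


message = render_message(REMINDER, name="Asha", date="12 March", time="3 PM")
# message == "Hello Asha, this is a reminder of your appointment on 12 March at 3 PM."
```

The rendered string is plain text, so it can be passed to the Play action as-is; failing fast on a missing value is a deliberate choice, since a caller hearing a literal "$name" is worse than a skipped call.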

Standalone TTS API — Coming Soon

🚧 Under development. The standalone TTS API, for synthesising audio files outside of a live call, is not yet generally available. Full documentation will be published at launch.

The Voice-integrated TTS described above synthesises speech within the context of an active EnableX Voice call. The standalone TTS service extends synthesis to any context — generating reusable audio files from text that can be hosted, distributed, or injected into systems outside the EnableX call flow.

This is intended for teams that need text-to-speech as a general-purpose generation service: pre-building a library of prompt audio for use across multiple call flows, generating audio content for non-telephony applications, producing voice-over assets, or synthesising audio in batch ahead of a campaign rather than at call time.

What the standalone API will enable

Audio file generation: Submit text or SSML and receive a synthesised audio file (MP3 or WAV) as output. The file can be downloaded, hosted on your CDN, and referenced by URL in Voice API playback instructions.

Reusable prompt audio: Pre-synthesise static prompts (hold messages, fixed announcements, onboarding audio) once, store them, and serve them repeatedly without re-synthesising on every call. This reduces per-call synthesis overhead for content that does not change.

Voice selection: Choose from a catalogue of neural voices across multiple languages, with male and female options per language. The catalogue is managed through the Portal and will expand over time.

Multi-language support: Language coverage will reflect the markets where EnableX Voice is most active — English (India and US), Hindi, Arabic, Bahasa Indonesia, Tamil, Telugu, Bengali, and Tagalog.
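
The reusable prompt audio idea above amounts to a synthesise-once cache keyed by text, language, and voice. Since the standalone API is not yet published, the sketch below takes the synthesis step as a caller-supplied function (`synthesise`, a hypothetical stand-in) and only illustrates the caching logic; the `.wav` extension and the cache layout are likewise assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable

# (text, language, voice) -> audio bytes; a placeholder for a future TTS call.
Synthesiser = Callable[[str, str, str], bytes]


def prompt_cache_key(text: str, language: str, voice: str) -> str:
    """Derive a stable file name from everything that affects the audio."""
    raw = "\x1f".join((language, voice, text)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def get_or_synthesise(text: str, language: str, voice: str,
                      cache_dir: Path, synthesise: Synthesiser) -> Path:
    """Return the cached audio file, synthesising it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{prompt_cache_key(text, language, voice)}.wav"
    if not path.exists():
        path.write_bytes(synthesise(text, language, voice))
    return path
```

Including language and voice in the key means that changing the voice for a number produces new files instead of silently reusing audio recorded with the old one.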

Get notified at launch

To register interest or request early access to the standalone TTS API:

Sign in to the EnableX Portal and look for the AI Services section
Contact your EnableX account team or reach out via the website