Voice API Technical Specifications
This page covers the full technical profile of the EnableX Voice platform — the protocols it speaks, the codecs it supports, how it handles security, and the AI services it integrates for speech synthesis and recognition. Use this reference when planning your infrastructure, configuring firewalls, or choosing the right TTS/STT provider for your use case.
Protocol & Transport
EnableX Voice is built on industry-standard signalling and media protocols. Whether you place a call through the REST API or connect via SIP, the platform relies on well-established standards that existing telephony infrastructure already supports.
| Protocol | Standard | Purpose |
|---|---|---|
| SIP | RFC 3261 | Session initiation and teardown for voice calls |
| HTTPS | TLS 1.2+ | REST API transport — all management API calls |
| SIPS | RFC 3261 §26 | Secure SIP signalling over TLS |
| SRTP | RFC 3711 | Encrypted real-time media transport |
| DTMF | RFC 2833 (obsoleted by RFC 4733) | DTMF digits carried as RTP telephone events for IVR and keypad input |
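As a minimal sketch of the HTTPS row above, a Python client can refuse anything older than TLS 1.2 by configuring its own SSL context. Only standard-library calls are used; the helper name `make_tls12_context` is illustrative and is not part of any EnableX SDK.

```python
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Return a client-side context that rejects TLS versions below 1.2."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls12_context()
```

The resulting context can be passed to `urllib.request.urlopen(url, context=context)` or `http.client.HTTPSConnection(host, context=context)` when calling the REST API.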
Audio Codecs
EnableX Voice supports a range of audio codecs. G.711, in both its A-law and μ-law (U-law) variants, is the preferred codec for standard telephony: it delivers toll-quality narrowband audio at a fixed 64 kbps and is natively supported by virtually all SIP endpoints and PSTN gateways.
| Codec | Variant / Notes | Typical Use |
|---|---|---|
| G.711 | A-law (PCMA) and U-law (PCMU) — preferred | Standard PSTN-quality telephony; broadest compatibility |
| G.722 | Wideband (7 kHz audio) | HD voice over IP networks |
| G.723 | Low-bitrate narrowband | Low-bandwidth constrained environments |
| iLBC | Internet Low Bitrate Codec | Resilient to packet loss over lossy links |
| Speex | Open-source narrowband/wideband | VoIP clients with variable bitrate needs |
| Opus | Narrowband to fullband, adaptive bitrate, 6–510 kbps | WebRTC clients, Voice Bot SDK, streaming use cases |
If your SIP infrastructure requires a codec not listed here, or specific codec priority ordering, contact EnableX support to discuss custom configuration options for your account.
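The 64 kbps figure quoted for G.711 follows from 8,000 samples per second at 8 bits per sample, and it covers the audio payload only. The sketch below estimates the rate per call direction at the IP layer, assuming 20 ms packetisation and uncompressed IPv4, UDP, and RTP headers (20 + 8 + 12 bytes). Both assumptions are common defaults rather than documented EnableX settings, and SRTP authentication tags or layer-2 framing would add more.

```python
G711_SAMPLE_RATE_HZ = 8000
G711_BITS_PER_SAMPLE = 8
HEADER_BYTES = 20 + 8 + 12  # assumed IPv4 + UDP + RTP headers, no SRTP tag


def g711_payload_kbps() -> float:
    """Codec bitrate: samples per second times bits per sample."""
    return G711_SAMPLE_RATE_HZ * G711_BITS_PER_SAMPLE / 1000


def g711_ip_kbps(ptime_ms: int = 20) -> float:
    """Approximate IP-layer bitrate for one direction of a G.711 call."""
    bytes_per_second = G711_SAMPLE_RATE_HZ * G711_BITS_PER_SAMPLE // 8
    payload_bytes = bytes_per_second * ptime_ms // 1000  # audio bytes per packet
    packets_per_second = 1000 / ptime_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_second / 1000
```

Under these assumptions a G.711 call needs about 80 kbps in each direction, which is a more realistic figure than 64 kbps when sizing SIP trunk bandwidth.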
Security
The EnableX Voice platform applies encryption at every layer — from the REST API down to the media stream. Here is how each surface is protected.
| Surface | Mechanism | Details |
|---|---|---|
| REST API | HTTPS (TLS 1.2+) | All management API calls are made over HTTPS. Plain HTTP is not supported. |
| SIP Signalling | SIPS | SIP sessions can be established over SIPS (SIP over TLS) for encrypted call signalling. |
| Media | SRTP (RFC 3711) | Audio streams are encrypted end-to-end using SRTP. |
| Webhooks | Encrypted delivery | Webhook payloads are delivered over HTTPS. Your webhook endpoint must accept HTTPS. |
| WebSocket / Voice Bot SDK | JWT authentication | Browser and bot clients authenticate using short-lived JWT tokens issued by your server. The App Key is never exposed to the client. |
Your webhook receiver must be reachable over HTTPS on a publicly accessible URL. EnableX does not deliver webhooks to plain HTTP endpoints or to private/local addresses.
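The webhook requirements above can be checked locally before a URL is registered. The following is a rough pre-flight sketch using only the Python standard library; the function name and the specific rules (rejecting `localhost`, `.local` names, and non-global IP literals) are assumptions made for illustration and do not describe how EnableX itself validates endpoints. DNS names are not resolved.

```python
import ipaddress
from urllib.parse import urlsplit


def webhook_url_problems(url: str) -> list[str]:
    """Return reasons the URL looks unsuitable as a webhook target."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme != "https":
        problems.append("scheme must be https")
    host = parts.hostname or ""
    if not host:
        problems.append("missing host")
    elif host == "localhost" or host.endswith(".local"):
        problems.append("host is a local name")
    else:
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            address = None  # a DNS name; resolution is not checked here
        if address is not None and not address.is_global:
            problems.append("address is not publicly routable")
    return problems
```

An empty list means the URL passed these basic checks; it does not guarantee that the endpoint is actually reachable from the public internet.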
Connectivity
EnableX Voice supports both SIP-based connectivity for existing telephony systems and REST API access for application-level integration. IP allowlisting is available for accounts that require firewall-controlled access.
| Connectivity Type | Details |
|---|---|
| SIP Trunking | SIPConnect-compliant. Connect your SIP infrastructure directly to the EnableX Voice platform. |
| IP Allowlisting | Restrict API access to specific IP addresses or CIDR ranges via the EnableX Portal project settings. |
| REST API Base URL | https://api.enablex.io/voice/v1/ — all Voice API endpoints are under this base. |
SIP trunk configuration, including authentication credentials, SIP domain, and port assignments, is provisioned through the EnableX Portal. Contact EnableX support if you need custom SIP routing or failover configuration.
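Allowlist entries are either single addresses or CIDR ranges. To illustrate how such entries match a source address, here is a small sketch using Python's `ipaddress` module. The example entries are reserved documentation addresses, and the function is hypothetical; the real allowlist is configured in the EnableX Portal, not in code.

```python
import ipaddress


def is_allowed(source_ip: str, allowlist: list[str]) -> bool:
    """Return True if source_ip matches any address or CIDR range in allowlist."""
    address = ipaddress.ip_address(source_ip)
    for entry in allowlist:
        # A bare address such as "203.0.113.7" becomes a single-host network.
        network = ipaddress.ip_network(entry, strict=False)
        if address.version == network.version and address in network:
            return True
    return False


# Placeholder entries for illustration only.
ALLOWLIST = ["198.51.100.0/24", "203.0.113.7"]
```

Running candidate addresses through a check like this before saving the allowlist helps catch a CIDR range that is narrower or wider than intended.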
Recording
EnableX Voice supports call recording in two modes: on-demand and automatic. Recordings are stored on the EnableX platform and are accessible via the API after the call ends. See the Call Recording page for the full API reference.
| Attribute | Details |
|---|---|
| Supported Formats | WAV, MP3 |
| On-Demand Recording | Start and stop recording at any point during an active call using the record action in your Voice API response. |
| Automatic Recording | Configure automatic recording for all calls in a project via the EnableX Portal. Recording begins as soon as the call is connected. |
Scalability
The EnableX Voice platform is designed to scale with your application load. Capacity scales horizontally across the infrastructure — there is no single-node bottleneck for call handling or API throughput.
If you are planning a large-scale deployment — such as a voice broadcast campaign with thousands of simultaneous calls, or an always-on IVR with sustained traffic — contact EnableX support to discuss dedicated capacity, rate limit adjustments, and SLA options for your account.
Text-to-Speech (TTS)
EnableX Voice integrates with multiple TTS providers. You select a provider at the API level when using the play action with a text field. Each provider has different strengths — from ultra-low latency to deep regional language coverage — so the right choice depends on your application's geography, voice quality requirements, and latency budget.
The table below summarises each available provider.
| Provider | Identifier | Strengths | Languages / Voices | Notes |
|---|---|---|---|---|
| Azure Cognitive Speech | azure | Industry-leading neural voices; consistent prosody; broad language coverage | 500+ voices, 140+ languages | Good default choice for global, multilingual deployments |
| ElevenLabs | elevenlabs | Ultra-realistic neural voices; natural-sounding speech with fine emotional control | 29 languages | Best for customer-facing voice experiences where naturalness is critical |
| Shunyata / Shunya Labs | — | Indian language specialist; high accuracy on regional scripts and pronunciation | Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, Gujarati, and more | Recommended for India-focused deployments requiring regional language quality |
| Deepgram Aura | — | Fast, low-latency TTS optimised for real-time conversational applications | English-optimised | Well-suited for AI voice agents and bots where first-byte latency matters |
| Murf AI | — | Studio-quality voices; wide selection of styles and accents | 120+ voices, 20+ languages | Good fit for branded voice experiences and IVR systems requiring polished audio |
| EnableX Native TTS | — | Built-in engine; no external dependency; available by default on all accounts | 117 languages, Male and Female voice types | Reliable fallback; broad language reach without additional provider setup |
Not all TTS providers are enabled by default on every account. If a provider you need is not available in your project, contact EnableX support to request access.
Speech-to-Text (STT)
EnableX Voice supports multiple STT engines for real-time transcription during calls. The engine you choose affects transcription accuracy, latency, language coverage, and your ability to handle domain-specific vocabulary. Match the provider to your use case: latency-sensitive voice bots benefit from low-latency engines, while post-call analytics can prioritise accuracy over speed.
| Provider | Accuracy | Languages | Latency | Use Case |
|---|---|---|---|---|
| Azure Cognitive Speech | High; supports custom acoustic and language models | 100+ languages | Real-time and batch modes | Enterprise deployments needing custom vocabulary or domain adaptation |
| Deepgram Nova-2 | High; optimised for telephony audio | 30+ languages | Low (~300 ms) | Real-time voice bots and conversational AI where response speed is critical |
| Google Speech-to-Text | High; strong on diverse accents and noisy environments | 125+ languages | Real-time streaming | Broad multilingual support; good for global customer service deployments |
| AWS Transcribe | High; supports medical and custom vocabulary | 70+ languages | Real-time streaming | Healthcare applications, compliance recordings, domain-specific transcription |
| EnableX Native STT | Standard | Varies by configuration | Real-time | Built-in default; available on all accounts without additional provider setup |
Third-party STT providers (Azure, Deepgram, Google, AWS) may require separate account credentials or enablement on your EnableX project. Contact EnableX support to configure a specific STT provider for your account.
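One way to apply the table above is to encode it as data and filter on your requirements. The sketch below does that for two criteria, language count and custom vocabulary. The figures are the lower bounds quoted in the table, `custom_vocabulary` only records whether the table mentions the feature, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SttProvider:
    name: str
    min_languages: int          # lower bound from the table; 0 means unspecified
    custom_vocabulary: bool     # True if the table mentions custom models or vocabulary
    latency_ms: Optional[int]   # only Deepgram has a quoted figure


PROVIDERS = [
    SttProvider("Azure Cognitive Speech", 100, True, None),
    SttProvider("Deepgram Nova-2", 30, False, 300),
    SttProvider("Google Speech-to-Text", 125, False, None),
    SttProvider("AWS Transcribe", 70, True, None),
    SttProvider("EnableX Native STT", 0, False, None),
]


def shortlist(min_languages: int = 0, custom_vocabulary: bool = False) -> list[str]:
    """Return provider names that meet the stated requirements, in table order."""
    return [
        p.name
        for p in PROVIDERS
        if p.min_languages >= min_languages
        and (p.custom_vocabulary or not custom_vocabulary)
    ]
```

For example, `shortlist(min_languages=100, custom_vocabulary=True)` narrows the table to Azure Cognitive Speech. Treat the result as a starting point and confirm provider availability for your project with EnableX support.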
Custom Voice Prompts
Instead of generating speech via TTS on every call, you can upload pre-recorded audio files as named prompts and reference them by name in your Voice API responses. This gives you full control over audio quality, timing, and delivery — and avoids TTS generation latency for frequently played messages.
Format Requirements
Custom prompts must be uploaded as WAV files. The platform does not accept MP3 or other formats for prompt uploads. Ensure your audio is recorded at a sample rate appropriate for telephony (typically 8 kHz mono for PSTN calls, or 16 kHz for wideband).
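A file can be checked against these requirements before upload. The sketch below uses Python's standard `wave` module to confirm that a file is a readable WAV at 8 kHz or 16 kHz. Requiring mono for both rates is an assumption drawn from the "typically" guidance above, not a documented platform rule, and the helper names are illustrative.

```python
import io
import wave

TELEPHONY_RATES_HZ = {8000, 16000}


def check_prompt_wav(source) -> list[str]:
    """Check a WAV file (path or binary file object) and return any problems found."""
    problems = []
    try:
        with wave.open(source, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            if rate not in TELEPHONY_RATES_HZ:
                problems.append(f"sample rate {rate} Hz is not 8000 or 16000")
            if channels != 1:  # assumption: prompts should be mono
                problems.append(f"{channels} channels, expected mono")
    except (wave.Error, EOFError) as exc:
        problems.append(f"not a readable PCM WAV file: {exc}")
    return problems


def make_silent_wav(rate: int, channels: int, frames: int = 80) -> io.BytesIO:
    """Build a short in-memory 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * frames)
    buffer.seek(0)
    return buffer
```

Calling `check_prompt_wav("welcome_message.wav")` on a real file returns an empty list when the file passes; an MP3 renamed to `.wav` is reported as unreadable.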
Using a Prompt in the Voice API
Once a prompt is uploaded via the Portal or API, you reference it by its prompt_name in the play action of your Voice API response. The platform streams the pre-recorded audio to the caller instead of invoking a TTS engine.
```json
{
  "play": {
    "prompt_name": "welcome_message"
  }
}
```
The prompt_name value must exactly match the name assigned when the prompt was uploaded. Prompt names are case-sensitive.
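Because matching is exact and case-sensitive, it can help to validate the name before building the response. The sketch below produces the play action shown above as a JSON string. The `uploaded_names` set is a hypothetical local copy of your project's prompt names; how you obtain it (for example, from the prompt management endpoints) is not covered here.

```python
import json


def play_prompt_action(prompt_name: str, uploaded_names: set[str]) -> str:
    """Return the play action as JSON, rejecting names that do not match exactly."""
    if prompt_name not in uploaded_names:
        raise ValueError(f"unknown prompt {prompt_name!r}; names are case-sensitive")
    return json.dumps({"play": {"prompt_name": prompt_name}})


# Hypothetical prompt names, for illustration only.
UPLOADED = {"welcome_message", "legal_disclaimer"}
response_body = play_prompt_action("welcome_message", UPLOADED)
```

With this check, a typo such as `Welcome_Message` fails in your application rather than surfacing as a playback problem during a live call.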
Use custom prompts for messages that are played on every call — greeting messages, hold music, legal disclaimers, and menu options. Pre-recording these avoids TTS latency on the critical call-connection path and gives you precise control over voice, tone, and pacing.
Prompts are scoped to your EnableX project. You can upload, list, and delete prompts through the EnableX Portal or the prompt management endpoints in the Voice API. See the API Reference for full details of those endpoints.