Voice API Technical Specifications

This page covers the full technical profile of the EnableX Voice platform — the protocols it speaks, the codecs it supports, how it handles security, and the AI services it integrates for speech synthesis and recognition. Use this reference when planning your infrastructure, configuring firewalls, or choosing the right TTS/STT provider for your use case.

Protocol & Transport

EnableX Voice is built on industry-standard signalling and media protocols. Whether you place a call through the REST API or connect via SIP, the platform relies on published standards that existing telephony infrastructure already supports.

Protocol	Standard	Purpose
SIP	RFC 3261	Session initiation and teardown for voice calls
HTTPS	TLS 1.2+	REST API transport — all management API calls
SIPS	RFC 3261 §26	Secure SIP signalling over TLS
SRTP	RFC 3711	Encrypted real-time media transport
DTMF	RFC 2833 (obsoleted by RFC 4733)	DTMF digits carried as RTP telephone-event packets for IVR and keypad input
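To illustrate what RFC 2833 / RFC 4733 DTMF looks like on the wire, here is a minimal sketch that decodes the 4-byte telephone-event payload carried inside an RTP packet. The class and function names are illustrative and not part of any EnableX SDK; on the platform itself, digits normally reach your application through the Voice API rather than as raw RTP.

```python
import struct
from dataclasses import dataclass

# Event codes 0-15 map to the DTMF keys in this order (RFC 4733).
DTMF_EVENTS = "0123456789*#ABCD"


@dataclass(frozen=True)
class TelephoneEvent:
    digit: str      # key that was pressed
    end: bool       # True on packets that mark the end of the key press
    volume: int     # tone power level, expressed as 0-63 (-dBm0)
    duration: int   # duration so far, in RTP timestamp units


def parse_telephone_event(payload: bytes) -> TelephoneEvent:
    """Decode a telephone-event RTP payload (RTP header already stripped)."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    # Layout: event (8 bits) | E, R, volume (8 bits) | duration (16 bits)
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_EVENTS):
        raise ValueError(f"event code {event} is not a DTMF digit")
    return TelephoneEvent(
        digit=DTMF_EVENTS[event],
        end=bool(flags & 0x80),   # E bit
        volume=flags & 0x3F,      # low 6 bits; 0x40 is the reserved R bit
        duration=duration,
    )
```

Because the digit travels as a named event rather than as audio, it survives compression by low-bitrate codecs that would distort an audible tone.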

Audio Codecs

EnableX Voice supports a range of audio codecs. G.711 (both A-law and μ-law variants) is the preferred codec for standard telephony: it delivers toll-quality narrowband audio at a fixed 64 kbps bitrate and is natively supported by virtually all SIP endpoints and PSTN gateways.

Codec	Variant / Notes	Typical Use
G.711	A-law (PCMA) and μ-law (PCMU), preferred	Standard PSTN-quality telephony; broadest compatibility
G.722	Wideband (7 kHz audio)	HD voice over IP networks
G.723	Low-bitrate narrowband	Low-bandwidth constrained environments
iLBC	Internet Low Bitrate Codec	Resilient to packet loss over lossy links
Speex	Open-source narrowband/wideband	VoIP clients with variable bitrate needs
Opus	Adaptive narrowband to fullband, 6–510 kbps	WebRTC clients, Voice Bot SDK, streaming use cases
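For capacity planning, note that the codec bitrate is only the audio payload; each RTP packet also carries IP, UDP, and RTP headers. The sketch below estimates the per-direction bandwidth of one call. The 20 ms packetisation interval and the 40-byte IPv4 + UDP + RTP header size are common defaults assumed here, not values confirmed for EnableX; SRTP adds a few more bytes per packet for its authentication tag.

```python
def rtp_stream_kbps(codec_kbps: float, ptime_ms: int = 20, header_bytes: int = 40) -> float:
    """Estimate the on-the-wire bitrate of one RTP stream, in kbps.

    codec_kbps   -- audio payload bitrate (64 for G.711)
    ptime_ms     -- audio carried per packet, in milliseconds (assumed default: 20)
    header_bytes -- IPv4 (20) + UDP (8) + RTP (12) headers per packet
    """
    packets_per_second = 1000 / ptime_ms
    payload_bytes = codec_kbps * 1000 / 8 * ptime_ms / 1000
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000


def trunk_mbps(concurrent_calls: int, codec_kbps: float = 64) -> float:
    """Rough one-way bandwidth for a number of simultaneous calls, in Mbps."""
    return concurrent_calls * rtp_stream_kbps(codec_kbps) / 1000
```

Under these assumptions, a G.711 call needs about 80 kbps in each direction rather than 64 kbps, so 100 concurrent calls need roughly 8 Mbps each way before any link-layer overhead.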

Custom Codec Negotiation

If your SIP infrastructure requires a codec not listed here, or specific codec priority ordering, contact EnableX support to discuss custom configuration options for your account.
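In SIP, codec priority is expressed in the SDP offer: the payload types on the `m=audio` line are listed in order of preference. As a rough sketch of what a priority ordering looks like, the helper below builds that line and its `a=rtpmap` attributes from the static payload types in RFC 3551. The function name and port number are illustrative; dynamic-payload codecs such as Opus, iLBC, and Speex are left out because their payload type numbers are negotiated per session.

```python
# Static RTP payload types and clock rates from RFC 3551.
STATIC_PAYLOAD_TYPES = {
    "PCMU": (0, 8000),   # G.711 mu-law
    "PCMA": (8, 8000),   # G.711 A-law
    "G722": (9, 8000),   # RFC 3551 keeps an 8000 Hz RTP clock for G.722
    "G723": (4, 8000),
}


def build_audio_media_lines(preferred: list[str], port: int = 49170) -> list[str]:
    """Return SDP lines offering the given codecs, most preferred first."""
    payload_types = [str(STATIC_PAYLOAD_TYPES[name][0]) for name in preferred]
    lines = [f"m=audio {port} RTP/AVP " + " ".join(payload_types)]
    for name in preferred:
        pt, clock = STATIC_PAYLOAD_TYPES[name]
        lines.append(f"a=rtpmap:{pt} {name}/{clock}")
    return lines
```

Calling `build_audio_media_lines(["PCMA", "PCMU"])` puts payload type 8 ahead of 0, which signals a preference for A-law over μ-law.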

Security

The EnableX Voice platform protects every surface, from the REST API down to the media stream, with transport encryption or token-based authentication. Here is how each surface is protected.

Surface	Mechanism	Details
REST API	HTTPS (TLS 1.2+)	All management API calls are made over HTTPS. Plain HTTP is not supported.
SIP Signalling	SIPS	SIP sessions can be established over SIPS (SIP over TLS) for encrypted call signalling.
Media	SRTP (RFC 3711)	Audio streams are encrypted in transit using SRTP.
Webhooks	Encrypted delivery	Webhook payloads are delivered over HTTPS. Your webhook endpoint must accept HTTPS.
WebSocket / Voice Bot SDK	JWT authentication	Browser and bot clients authenticate using short-lived JWTs issued by your server. The App Key is never exposed to the client.

Webhook Endpoint Requirements

Your webhook receiver must be reachable over HTTPS on a publicly accessible URL. EnableX does not deliver webhooks to plain HTTP endpoints or to private/local addresses.
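Before registering a webhook URL, it can help to check it against the requirements above. The following is a minimal sketch of such a pre-flight check; the function name is hypothetical and the rules are an approximation of the stated requirements, not EnableX's actual validation logic. It does not resolve DNS names, so a public hostname that points at a private address would still pass.

```python
import ipaddress
from urllib.parse import urlsplit


def check_webhook_url(url: str) -> list[str]:
    """Return a list of problems with a webhook URL; an empty list means none found."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme != "https":
        problems.append("scheme must be https")
    host = parts.hostname
    if not host:
        problems.append("URL has no host")
        return problems
    if host == "localhost" or host.endswith(".local"):
        problems.append("host is a local name")
        return problems
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return problems  # a DNS name; resolution is not checked in this sketch
    if not ip.is_global:
        problems.append("IP address is not publicly routable")
    return problems
```

A URL such as `https://192.168.1.10/hook` is reported as not publicly routable, while `http://example.com/hook` is flagged only for its scheme.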

Connectivity

EnableX Voice supports both SIP-based connectivity for existing telephony systems and REST API access for application-level integration. IP allowlisting is available for accounts that require firewall-controlled access.

Connectivity Type	Details
SIP Trunking	SIPconnect-compliant. Connect your SIP infrastructure directly to the EnableX Voice platform.
IP Allowlisting	Restrict API access to specific IP addresses or CIDR ranges via the EnableX Portal project settings.
REST API Base URL	`https://api.enablex.io/voice/v1/` — all Voice API endpoints are under this base.
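When preparing an allowlist, it is worth confirming that the addresses your servers actually call from fall inside the CIDR ranges you plan to enter. The sketch below does that check locally with Python's standard `ipaddress` module; the function name and the example addresses are illustrative, and the Portal's own matching rules may differ in detail.

```python
import ipaddress


def is_allowed(client_ip: str, allowlist: list[str]) -> bool:
    """Return True if client_ip matches any address or CIDR range in allowlist."""
    ip = ipaddress.ip_address(client_ip)
    # strict=False accepts entries such as "203.0.113.7/24" whose host bits are set.
    return any(ip in ipaddress.ip_network(entry, strict=False) for entry in allowlist)


# Example: one office range plus a single build server (documentation addresses).
ALLOWLIST = ["203.0.113.0/24", "192.0.2.10"]
```

With this list, `is_allowed("203.0.113.7", ALLOWLIST)` is true and `is_allowed("198.51.100.1", ALLOWLIST)` is false.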

SIPconnect Integration

SIP trunk configuration, including authentication credentials, SIP domain, and port assignments, is provisioned through the EnableX Portal. Contact EnableX support if you need custom SIP routing or failover configuration.

Recording

EnableX Voice supports call recording in two modes: on-demand and automatic. Recordings are stored on the EnableX platform and are accessible via the API after the call ends. See the Voice API Reference for the full recording API.

Attribute	Details
Supported Formats	WAV, MP3
On-Demand Recording	Start and stop recording at any point during an active call using the `record` action in your Voice API response.
Automatic Recording	Configure automatic recording for all calls in a project via the EnableX Portal. Recording begins as soon as the call is connected.

Scalability

The EnableX Voice platform is designed to scale with your application load. Capacity scales horizontally across the infrastructure — there is no single-node bottleneck for call handling or API throughput.

High-Volume Use Cases

If you are planning a large-scale deployment — such as a voice broadcast campaign with thousands of simultaneous calls, or an always-on IVR with sustained traffic — contact EnableX support to discuss dedicated capacity, rate limit adjustments, and SLA options for your account.

Text-to-Speech (TTS)

EnableX Voice integrates with multiple TTS providers. You select a provider at the API level when using the play action with a text field. Each provider has different strengths — from ultra-low latency to deep regional language coverage — so the right choice depends on your application's geography, voice quality requirements, and latency budget.

The table below summarises each available provider.

Provider	Identifier	Strengths	Languages / Voices	Notes
Azure Cognitive Speech	`azure`	Industry-leading neural voices; consistent prosody; broad language coverage	500+ voices, 140+ languages	Good default choice for global, multilingual deployments
ElevenLabs	`elevenlabs`	Ultra-realistic neural voices; natural-sounding speech with fine emotional control	29 languages	Best for customer-facing voice experiences where naturalness is critical
Shunyata / Shunya Labs	—	Indian language specialist; high accuracy on regional scripts and pronunciation	Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, Gujarati, and more	Recommended for India-focused deployments requiring regional language quality
Deepgram Aura	—	Fast, low-latency TTS optimised for real-time conversational applications	English-optimised	Well-suited for AI voice agents and bots where first-byte latency matters
Murf AI	—	Studio-quality voices; wide selection of styles and accents	120+ voices, 20+ languages	Good fit for branded voice experiences and IVR systems requiring polished audio
EnableX Native TTS	—	Built-in engine; no external dependency; available by default on all accounts	117 languages, Male and Female voice types	Reliable fallback; broad language reach without additional provider setup
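One way to turn the table above into application logic is a small preference function that returns providers in the order you would try them. This is only a sketch of one possible policy: the ordering rules, the function name, and the use of ISO 639-1 language codes are assumptions made for illustration, and the provider names are taken from the table rather than from API identifiers.

```python
# ISO 639-1 codes for the Indian languages named in the table (assumed mapping).
INDIAN_LANGUAGES = {"hi", "ta", "te", "kn", "bn", "mr", "gu"}


def suggest_tts_providers(language: str, low_latency: bool = False) -> list[str]:
    """Return TTS provider names to try, most suitable first.

    language    -- a tag such as "hi-IN" or "en-US"
    low_latency -- True for conversational bots where first-byte latency matters
    """
    lang = language.lower().split("-")[0]
    candidates = []
    if lang in INDIAN_LANGUAGES:
        candidates.append("Shunyata / Shunya Labs")
    if low_latency and lang == "en":
        candidates.append("Deepgram Aura")
    candidates.append("Azure Cognitive Speech")   # general multilingual default
    candidates.append("EnableX Native TTS")       # built-in fallback
    return candidates
```

In practice you would filter the result against the providers actually enabled on your project before picking the first entry.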

Provider Availability

Not all TTS providers are enabled by default on every account. If a provider you need is not available in your project, contact EnableX support to request access.

Speech-to-Text (STT)

EnableX Voice supports multiple STT engines for real-time transcription during calls. The engine you choose affects transcription accuracy, latency, language coverage, and your ability to handle domain-specific vocabulary. Match the provider to your use case: latency-sensitive voice bots benefit from low-latency engines, while post-call analytics can prioritise accuracy over speed.

Provider	Accuracy	Languages	Latency	Use Case
Azure Cognitive Speech	High; supports custom acoustic and language models	100+ languages	Real-time and batch modes	Enterprise deployments needing custom vocabulary or domain adaptation
Deepgram Nova-2	High; optimised for telephony audio	30+ languages	Low (~300 ms)	Real-time voice bots and conversational AI where response speed is critical
Google Speech-to-Text	High; strong on diverse accents and noisy environments	125+ languages	Real-time streaming	Broad multilingual support; good for global customer service deployments
AWS Transcribe	High; supports medical and custom vocabulary	70+ languages	Real-time streaming	Healthcare applications, compliance recordings, domain-specific transcription
EnableX Native STT	Standard	Varies by configuration	Real-time	Built-in default; available on all accounts without additional provider setup

STT Provider Access

Third-party STT providers (Azure, Deepgram, Google, AWS) may require separate account credentials or enablement on your EnableX project. Contact EnableX support to configure a specific STT provider for your account.

Custom Voice Prompts

Instead of generating speech via TTS on every call, you can upload pre-recorded audio files as named prompts and reference them by name in your Voice API responses. This gives you full control over audio quality, timing, and delivery — and avoids TTS generation latency for frequently played messages.

Format Requirements

Custom prompts must be uploaded as WAV files. The platform does not accept MP3 or other formats for prompt uploads. Ensure your audio is recorded at a sample rate appropriate for telephony (typically 8 kHz mono for PSTN calls, or 16 kHz for wideband).
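A quick local check can catch format problems before upload. The sketch below uses Python's standard `wave` module to confirm that a file is a readable PCM WAV and to flag channel counts or sample rates outside the typical telephony values mentioned above. The function name is illustrative, and treating 8 kHz and 16 kHz mono as the only acceptable settings is an assumption based on that guidance, not a documented platform limit.

```python
import wave

TELEPHONY_SAMPLE_RATES = (8000, 16000)  # narrowband PSTN and wideband


def check_prompt_wav(source) -> list[str]:
    """Return a list of problems with a prompt file; an empty list means none found.

    source -- a file path or a binary file-like object
    """
    try:
        with wave.open(source, "rb") as wav:
            channels = wav.getnchannels()
            rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if rate not in TELEPHONY_SAMPLE_RATES:
        problems.append(f"sample rate {rate} Hz is not 8000 or 16000 Hz")
    return problems
```

An MP3 file renamed to `.wav` fails the first check, because the `wave` module validates the RIFF header rather than the file extension.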

Using a Prompt in the Voice API

Once a prompt is uploaded via the Portal or API, you reference it by its `prompt_name` in the `play` action of your Voice API response. The platform streams the pre-recorded audio to the caller instead of invoking a TTS engine.

```json
{
  "play": {
    "prompt_name": "welcome_message"
  }
}
```

The `prompt_name` value must exactly match the name assigned when the prompt was uploaded. Prompt names are case-sensitive.
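Because prompt names are case-sensitive, a mistyped name is easy to miss until a call fails to play audio. As a minimal sketch, a server-side helper can build the response body and reject names that do not exactly match a known prompt. The helper name and the idea of keeping a local set of uploaded names are assumptions for illustration; only the `play` / `prompt_name` structure comes from the example above.

```python
import json


def play_prompt_response(prompt_name: str, uploaded: set[str]) -> str:
    """Return the JSON body for a play action, or raise if the name is unknown."""
    if prompt_name not in uploaded:
        # Look for a name that differs only by letter case to give a useful hint.
        near = sorted(n for n in uploaded if n.lower() == prompt_name.lower())
        hint = f" (did you mean {near[0]!r}?)" if near else ""
        raise ValueError(f"unknown prompt {prompt_name!r}{hint}")
    return json.dumps({"play": {"prompt_name": prompt_name}})
```

Passing `"Welcome_Message"` when only `"welcome_message"` was uploaded raises a `ValueError` that names the correctly cased prompt.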

When to Use Custom Prompts

Use custom prompts for messages that are played on every call — greeting messages, hold music, legal disclaimers, and menu options. Pre-recording these avoids TTS latency on the critical call-connection path and gives you precise control over voice, tone, and pacing.

Prompt Management

Prompts are scoped to your EnableX project. You can upload, list, and delete prompts through the EnableX Portal or through the prompt management endpoints in the Voice API. See the API Reference for full endpoint details.