Voice Media Streaming API
Overview
The EnableX Audio Streaming API lets your application tap into a live voice call and receive the raw audio in real time — and send audio back — using a standard WebSocket connection. The communication is fully bidirectional: EnableX streams what the caller says to your server, and your server can stream synthesized or bot-generated audio back into the call.
This unlocks a class of integrations that are impossible with post-call recordings alone. Because you receive audio while the call is still in progress, you can react instantly — feeding a live transcription engine, routing the conversation through an AI voice bot, or flagging suspicious speech patterns before the call ends.
What you can build
- Real-Time Voice Analytics — analyse sentiment, tone, or keywords as the conversation unfolds.
- Voice Authentication & Fraud Detection — compare voiceprints or detect scripted fraud patterns in real time.
- AI-Powered Voice Bot Integration — route call audio through a conversational AI model and inject its speech back into the call.
- Call Recording and Monitoring — capture and store audio streams for compliance or quality assurance.
- Real-Time Transcription & Sentiment Analysis — pipe audio to a speech-to-text engine and act on the transcript immediately.
Key capabilities
- Bidirectional WebSocket Streaming — a single persistent WSS connection handles both inbound and outbound audio.
- Real-Time Audio Access — audio is delivered as it is captured with no buffering delay on the EnableX side.
- Secure Authentication — JWT token-based WebSocket authentication prevents unauthorised connections to your server.
- REST API Control — start and stop streaming with standard HTTP calls from your application server.
- RTP Encapsulation — audio is packaged in an industry-compatible format (G.711 ulaw, 8000 Hz, mono) for wide ecosystem compatibility.
Prerequisites & Subscription
Before you can use Media Streaming, make sure the following conditions are met.
Subscription required
The Web Streaming Service is a subscription-based feature. It is not available on the default Voice plan. You must have an active streaming subscription on your EnableX account before the stream API will accept requests. Contact EnableX support or your account manager to enable the feature.
Transactional calls only
EnableX supports streaming exclusively on transactional calls — one-to-one conversations where a real interaction is taking place. Promotional or burst calling campaigns are not supported and will be rejected by the streaming service.
A connected call is required
Media streaming can only be started on a call that is already connected. You must first establish the voice call (either outbound or inbound) and wait for the call-connected webhook event before requesting a stream. Attempting to stream on a call that is still in the ringing or initiated state will fail.
Your WebSocket server must be ready
EnableX connects to you, not the other way around. Your application must be running a secure WebSocket (WSS) server that is publicly reachable before you invoke the Stream API. EnableX will immediately attempt to open a connection to the host you provide in the request payload.
How It Works
The full media streaming flow combines REST API calls, WebSocket events, and webhook notifications. Here is every step in the order they occur during a live call.
- A call is established. Your application either places an outbound call via POST /voice/v1/call, or an inbound call arrives and you accept it via PUT /voice/v1/call/{voice_id}/accept. Either way, the goal is a connected call with a voice_id you can reference in all subsequent requests.
- EnableX sends webhook notifications for call state changes. As the call progresses (initiated → ringing → connected), EnableX posts event payloads to your event_url. Your server handles these to know when the call is ready for streaming.
- Your application requests WebSocket streaming. Once the call is connected, your server calls PUT /voice/v1/call/{voice_id}/stream with the address of your WSS server. This tells EnableX where to open the WebSocket connection.
- EnableX connects to your WebSocket server. EnableX opens a secure WebSocket connection to the host you provided. Your server accepts this connection, optionally validating a JWT token passed in the query string.
- EnableX sends connected, start_media, and media events. First, EnableX confirms the WebSocket session with a connected event. Then it sends start_media with call identifiers. After that, a continuous stream of media events arrives, each carrying a chunk of the caller's audio encoded as Base64 ulaw.
- Your application processes audio and can send audio back. Your server receives each media event, decodes the audio, and passes it to your pipeline (transcription, AI model, analytics). When you want to inject audio back into the call (for example, a TTS response from a voice bot), you send a media event in the reverse direction using the same ulaw/Base64 format.
- Streaming stops when the call ends. When the call terminates, EnableX stops streaming and sends a stream_stopped event to your webhook URL, followed by a disconnected event with call duration and disconnect reason.
- EnableX sends a stop_media event over WebSocket. As the final step, EnableX sends stop_media on the WebSocket to signal that no further audio will arrive. Your server should gracefully close the connection after receiving this event.
Note: You can also act before the call ends, either by sending a clear_media event on the WebSocket or by calling DELETE /voice/v1/call/{voice_id}/stream. See the Stop Media Streaming section for details.
Authentication
All REST API calls in the media streaming flow use the same HTTP Basic Authentication
scheme as the rest of the EnableX Voice API. Your credentials are your App ID and App Key, Base64-encoded
and passed in the Authorization header on every request.
For full details on obtaining your credentials and constructing the Authorization header, see the Voice Authentication guide.
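As a minimal sketch of how such a header can be constructed, the snippet below assumes the conventional HTTP Basic form (the App ID and App Key joined by a colon, then Base64-encoded). The function name and the placeholder credentials are illustrative only; the Voice Authentication guide remains the reference for the exact format.

```python
import base64


def basic_auth_header(app_id: str, app_key: str) -> str:
    """Build an HTTP Basic Authorization header value.

    Assumption: credentials follow the conventional "id:key" layout
    used by HTTP Basic authentication.
    """
    raw = f"{app_id}:{app_key}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder credentials, for illustration only.
headers = {
    "Authorization": basic_auth_header("my-app-id", "my-app-key"),
    "Content-Type": "application/json",
}
```

The resulting headers dictionary can be reused for every REST call in the streaming flow.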
Step 1 — Establish a Voice Call
Media streaming requires an active, connected call. There are two paths to get there: your application places an outbound call, or EnableX receives an inbound call and your application accepts it. Both paths are shown below.
Placing an Outbound Call
To initiate an outbound call, send a POST request to the call endpoint. Provide the
destination number, your caller ID, and the URL where EnableX should send webhook notifications. The
from field must match a number configured on your EnableX account.
The request body carries the following fields:
| Field | Description |
|---|---|
| name | Optional label for this service instance. Useful for identifying calls in logs. |
| owner_ref | A free-form reference string echoed back as-is in every webhook event for this call. Use it to correlate EnableX events with your internal records. |
| from | The caller ID (CLI) to present on the outbound leg. Must match a number configured on your account. |
| to | The destination phone number to dial. |
| event_url | The HTTPS URL on your server that will receive webhook notifications for call state changes. |
POST https://api.enablex.io/voice/v1/call
Authorization: Basic xxxxxx
Content-Type: application/json
{
"name": "TEST_APP",
"owner_ref": "XYZ",
"to": "91XXXXXXXXXX",
"from": "91XXXXXXXXXX",
"event_url": "https://your-server/event"
}
A successful response confirms the call has been initiated. Save the voice_id — you will
need it in every subsequent API call, including the stream request.
{
"voice_id": "<uuid>",
"state": "initiated",
"from": "Originating CLI",
"to": "Called Number",
"timestamp": "2020-02-16T10:52:00Z"
}
Note: Wait for the connected webhook event before requesting streaming. The call must reach the connected state first. The connected webhook payload is shown below in the incoming call flow.
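The outbound call request can be sketched in Python using only the standard library. This is an illustration under assumptions: the helper names (build_call_request, extract_voice_id) and the credential layout (App ID and App Key joined by a colon) are not taken from the EnableX documentation, and the request is only constructed here, not sent.

```python
import base64
import json
import urllib.request

API_BASE = "https://api.enablex.io/voice/v1"


def build_call_request(app_id, app_key, to, from_, event_url,
                       name="TEST_APP", owner_ref="XYZ"):
    """Build (but do not send) the POST request that places an outbound call."""
    body = {
        "name": name,
        "owner_ref": owner_ref,
        "to": to,
        "from": from_,          # must match a number on your account
        "event_url": event_url,
    }
    # Assumption: conventional "id:key" Basic credentials.
    credentials = base64.b64encode(f"{app_id}:{app_key}".encode()).decode()
    return urllib.request.Request(
        f"{API_BASE}/call",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
    )


def extract_voice_id(response_body: str) -> str:
    """Pull the voice_id out of the call-initiated response."""
    return json.loads(response_body)["voice_id"]
```

Passing the request to urllib.request.urlopen would send it; store the voice_id returned by extract_voice_id for the later stream request.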
Handling an Incoming Call
When a call arrives on one of your EnableX numbers, EnableX posts an incomingcall webhook
event to your configured event_url. This payload contains the voice_id and
call identifiers you need to accept the call and later start streaming.
Webhook notification — incoming call
This event fires the moment an inbound call arrives. Your server receives it and decides whether to
accept. The voice_id here is the identifier you use for all subsequent API calls related
to this call.
{
"voice_id": "<uuid>",
"state": "incomingcall",
"from": "<from>",
"to": "<to>",
"channel_id": "<uuid>",
"endpoint_type": "PSTN",
"timestamp": "<ts>",
"media_type": "TEXT"
}
Accept the call
Send a PUT request to the accept endpoint, embedding the voice_id from the
webhook payload in the URL path. No request body is required.
PUT https://api.enablex.io/voice/v1/call/{voice_id}/accept
Authorization: Basic xxxxxx
Content-Type: application/json
Webhook notification — call connected
Once the call is answered and both legs are connected, EnableX sends a connected webhook
event to your event_url. This is your signal that the call is ready for media streaming.
The call_answered_by field indicates whether a human or an automated system picked up.
{
"voice_id": "<uuid>",
"from": "<from>",
"to": "<to>",
"timestamp": "<ts>",
"state": "connected",
"call_answered_by": "HUMAN"
}
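A webhook receiver mostly needs to map the state field to the next thing your application should do. The sketch below shows one possible mapping; the function name and the action labels are hypothetical, and the HTTP server that receives the POST is left out.

```python
import json


def next_action(webhook_body: str) -> str:
    """Map a webhook payload to an illustrative action label."""
    event = json.loads(webhook_body)
    state = event.get("state")
    if state == "incomingcall":
        return "accept_call"       # PUT /voice/v1/call/{voice_id}/accept
    if state == "connected":
        return "start_stream"      # PUT /voice/v1/call/{voice_id}/stream
    if state in ("stream_stopped", "disconnected"):
        return "cleanup"           # release per-call resources
    return "wait"                  # e.g. initiated or ringing


sample = json.dumps({
    "voice_id": "<uuid>",
    "state": "connected",
    "call_answered_by": "HUMAN",
})
action = next_action(sample)
```

Keeping the decision in a pure function like this makes it easy to unit-test separately from the web framework you use to receive the webhooks.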
Step 2 — Start Media Streaming
With a connected call in hand, instruct EnableX to open a WebSocket connection to your streaming server.
You do this with a single PUT call to the stream endpoint. EnableX will immediately attempt
to connect to the WSS host you provide and, once connected, begin sending audio events.
Media Streaming API
The request body contains one field: wss_host, which is the full address of your WebSocket
server including port. EnableX connects to this address — your server must already be running
and accepting connections when you make this call.
| Field | Description |
|---|---|
| wss_host | The fully qualified WSS address of your streaming server, including hostname (or IP) and port. EnableX connects outbound to this endpoint to deliver audio events. |
PUT https://api.enablex.io/voice/v1/call/{voice_id}/stream
Authorization: Basic xxxxxx
Content-Type: application/json
{
"wss_host": "wss://your-host:port"
}
Note: To authenticate the connection, append a token to the address as a query parameter, for example wss://your-host:9091?token=$TOKEN. EnableX passes this token when it connects, and your server can validate it before accepting the connection. See Securing the WebSocket Connection for the full implementation guide.
Response — Streaming Started
A successful response means EnableX has connected to your WebSocket server and streaming has begun.
The state field confirms this with the value stream_started.
{
"voice_id": "<uuid>",
"from": "<from>",
"to": "<to>",
"timestamp": "<ts>",
"state": "stream_started"
}
Response — Streaming Failed
If EnableX could not connect to your WebSocket server — because it was unreachable, refused the
connection, or failed token validation — the response carries stream_failed.
{
"voice_id": "<uuid>",
"from": "<from>",
"to": "<to>",
"timestamp": "<ts>",
"state": "stream_failed"
}
Note: If you receive stream_failed, check that your WebSocket server is publicly accessible, is using a valid TLS certificate, and is listening on the exact port specified in wss_host. Common causes are firewall rules blocking inbound connections from EnableX's IP range and self-signed certificates that EnableX's client cannot verify.
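The stream request and the two possible outcomes can be sketched as follows. The helper names are hypothetical, the Authorization value is passed in ready-made, and the request is only built, not sent.

```python
import json
import urllib.request


def build_stream_request(voice_id: str, wss_host: str, auth_header: str):
    """Build (but do not send) the PUT request that starts media streaming."""
    url = f"https://api.enablex.io/voice/v1/call/{voice_id}/stream"
    return urllib.request.Request(
        url,
        data=json.dumps({"wss_host": wss_host}).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": auth_header,
            "Content-Type": "application/json",
        },
    )


def stream_started(response_body: str) -> bool:
    """True for stream_started; False for stream_failed or anything else."""
    return json.loads(response_body).get("state") == "stream_started"
```

When stream_started returns False, the call itself is still up, so the application can fix the WebSocket endpoint and issue the request again.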
WebSocket Events — From EnableX to Your App
Once the WebSocket connection is open, EnableX sends a defined sequence of events. Your server must
handle each event type. They always arrive in this order:
connected → start_media → (repeated) media → stop_media.
Event: connected
This is the first event EnableX sends after the WebSocket connection is successfully established. It carries no call-specific data — it simply confirms that the WebSocket session is live and EnableX is ready to start sending audio. Your server should use this as a readiness signal before initialising any downstream audio pipelines.
{
"event": "connected"
}
Event: start_media
Immediately after connected, EnableX sends start_media. This event carries
the call identifiers that link this WebSocket stream to the specific voice call. Use the
voice_id and stream_id from this payload when sending events back to
EnableX — both fields are required in outbound media and clear_media events.
The payload has two levels of metadata:
- The top-level stream_id identifies this streaming session.
- The nested start object repeats the stream_id, adds the voice_id, and includes the from and to call numbers for correlation with your records.
{
"event": "start_media",
"stream_id": "<uuid>",
"start": {
"stream_id": "<uuid>",
"voice_id": "<uuid>",
"from": "<from>",
"to": "<to>"
}
}
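One way to handle the event sequence on the server side is a small per-connection state object that captures the identifiers from start_media and rejects events that arrive out of order. The class below is a sketch: its name, the phase labels, and the choice to raise on unexpected events are all assumptions, and the WebSocket library that delivers each text frame is left out.

```python
import json


class StreamSession:
    """Tracks one WebSocket stream: waiting -> connected -> streaming -> stopped."""

    def __init__(self):
        self.phase = "waiting"
        self.stream_id = None
        self.voice_id = None
        self.frames = 0

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "connected" and self.phase == "waiting":
            self.phase = "connected"
        elif event == "start_media" and self.phase == "connected":
            # Keep both identifiers; they are needed for outbound events.
            self.stream_id = msg["stream_id"]
            self.voice_id = msg["start"]["voice_id"]
            self.phase = "streaming"
        elif event == "media" and self.phase == "streaming":
            self.frames += 1
        elif event == "stop_media" and self.phase == "streaming":
            self.phase = "stopped"
        else:
            raise ValueError(f"unexpected event {event!r} in phase {self.phase!r}")
```

A server would create one StreamSession per accepted connection and call handle for each incoming text message.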
Event: media (audio from EnableX to your app)
This is the core event — you receive it continuously throughout the call, one packet per audio chunk. Each event contains a single frame of raw audio from the active call, encoded as a Base64 string inside the JSON payload.
Understanding the audio format:
- Encoding: ulaw (G.711 μ-law) — a standard telephony codec widely supported by speech-to-text engines and audio processing libraries. Decode the Base64 payload first, then decode ulaw to PCM if your pipeline requires it.
- Sample rate: 8000 Hz — standard narrowband telephony rate. 8,000 samples per second.
- Channels: 1 (mono) — a single audio channel carrying the mixed call audio.
- Payload: Base64-encoded — the raw ulaw bytes are Base64-encoded for safe transport inside JSON. Always decode the Base64 string before passing to your audio pipeline.
The seq field is a monotonically increasing sequence number. Use it to detect dropped
packets or reorder frames if your pipeline introduces latency. The timestamp records the
capture time of that audio frame.
{
"event": "media",
"voice_id": "<uuid>",
"stream_id": "<uuid>",
"media": {
"seq": 239,
"timestamp": "<ts>",
"format": {
"encoding": "ulaw",
"sample_rate": 8000,
"channels": 1
},
"payload": "<base64 audio>"
}
}
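Decoding an inbound media event takes two steps: Base64 to raw ulaw bytes, then ulaw to linear PCM if the downstream engine expects it. The sketch below implements the standard G.711 mu-law expansion by hand so that it depends only on the standard library; the function names are illustrative.

```python
import base64
import json
import struct


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_event(raw: str) -> tuple[int, bytes]:
    """Return (seq, little-endian PCM16 bytes) for one inbound media event."""
    media = json.loads(raw)["media"]
    ulaw = base64.b64decode(media["payload"])
    samples = [ulaw_byte_to_pcm16(b) for b in ulaw]
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return media["seq"], pcm
```

Because the stream is 8000 Hz mono with one byte per sample, a payload of N decoded bytes represents N / 8000 seconds of audio, which is useful when comparing seq gaps against expected timing.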
WebSocket Events — From Your App to EnableX
The WebSocket connection is bidirectional. Your application can send audio back into the live call —
for example, the speech output from a voice bot, a TTS response, or any synthesised audio you want the
caller to hear. You do this by sending media events over the same WebSocket connection,
in the direction from your server to EnableX.
Event: media (audio from your app to EnableX)
The format is identical to the inbound media event. You must use the same audio
specification: ulaw encoding, 8000 Hz sample rate, mono channel, Base64-encoded payload. EnableX
decodes your audio and injects it into the active call so the other party hears it immediately.
The voice_id and stream_id must match the values you received in the
start_media event. The seq number should increment with each frame you send —
EnableX uses it to play audio in the correct order.
{
"event": "media",
"voice_id": "<uuid>",
"stream_id": "<uuid>",
"media": {
"seq": 101,
"timestamp": 1763112284802,
"format": {
"encoding": "ulaw",
"sample_rate": 8000,
"channels": 1
},
"payload": "<base64-encoded-ulaw>"
}
}
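Going the other way, linear PCM from a TTS engine has to be compressed to ulaw and Base64-encoded before it is wrapped in a media event. The sketch below uses the standard G.711 mu-law compression; the function names are illustrative, and the millisecond epoch timestamp is an assumption based on the sample value shown above.

```python
import base64
import json
import time

BIAS = 0x84     # standard G.711 mu-law bias
CLIP = 32635    # largest magnitude that still fits after adding the bias


def pcm16_to_ulaw_byte(sample: int) -> int:
    """Compress one signed 16-bit linear sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def build_media_event(voice_id: str, stream_id: str, seq: int, samples) -> str:
    """Wrap 8000 Hz mono PCM16 samples in an outbound media event."""
    ulaw = bytes(pcm16_to_ulaw_byte(s) for s in samples)
    return json.dumps({
        "event": "media",
        "voice_id": voice_id,
        "stream_id": stream_id,
        "media": {
            "seq": seq,
            "timestamp": int(time.time() * 1000),  # assumed: epoch milliseconds
            "format": {"encoding": "ulaw", "sample_rate": 8000, "channels": 1},
            "payload": base64.b64encode(ulaw).decode("ascii"),
        },
    })
```

Audio produced at another sample rate would need resampling to 8000 Hz before it reaches build_media_event; that step is outside this sketch.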
Stop Media Streaming
Streaming stops automatically when the call ends. However, you may also need to stop streaming before the call terminates — for example, to clear queued bot audio after a user interrupts, or to terminate the stream while keeping the call active. Two mechanisms are available.
WebSocket Event: clear_media (your app to EnableX)
Send clear_media over the WebSocket when you want EnableX to stop playing any audio
already queued and clear its internal buffers. This is useful when a user interrupts a bot response
mid-sentence — you can stop the current audio immediately and begin sending fresh audio straight away.
Include the stream_id and voice_id from the start_media event
so EnableX can identify which stream to clear:
{
"event": "clear_media",
"stream_id": "<uuid>",
"voice_id": "<uuid>"
}
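For the interruption case, the application usually has to drop its own unsent frames as well as ask EnableX to clear what is already queued. The sketch below shows one way to pair the two; the OutboundQueue class and its method names are hypothetical.

```python
import json
from collections import deque


class OutboundQueue:
    """Holds bot audio frames that have not yet been written to the WebSocket."""

    def __init__(self, stream_id: str, voice_id: str):
        self.stream_id = stream_id
        self.voice_id = voice_id
        self.pending = deque()

    def enqueue(self, media_event_json: str) -> None:
        self.pending.append(media_event_json)

    def interrupt(self) -> str:
        """Drop local frames and return the clear_media message to send."""
        self.pending.clear()
        return json.dumps({
            "event": "clear_media",
            "stream_id": self.stream_id,
            "voice_id": self.voice_id,
        })
```

On a detected interruption, the server would send the string returned by interrupt over the WebSocket and then start enqueueing the new response.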
Stop Media Stream API
To stop streaming entirely (not just clear buffers), call the stream endpoint with the
DELETE method. EnableX will close the WebSocket connection and stop sending audio events.
The call itself remains active; only the media stream is terminated.
DELETE https://api.enablex.io/voice/v1/call/{voice_id}/stream
Authorization: Basic xxxxxx
Content-Type: application/json
A successful response confirms the stream has been stopped:
{
"voice_id": "<voice id>",
"state": "success",
"timestamp": "2021-06-28T12:16:08.578Z"
}
Webhook Notification — stream_stopped
After streaming is stopped (whether via the DELETE API or because the call ended), EnableX sends a
stream_stopped webhook event to your event_url. This event signals that no
further audio will arrive from this stream session. Your backend should use this to clean up any
resources associated with the stream.
{
"voice_id": "<voice id>",
"from": "<from>",
"to": "<to>",
"timestamp": "<ts>",
"state": "stream_stopped"
}
Webhook Notification — disconnected
When the call itself ends, EnableX sends a disconnected webhook event with full call
metadata. This includes the total call duration, the textual disconnect reason, and the standard SIP
cause code. Use this event for billing reconciliation, analytics, and releasing any resources tied
to this call.
{
"voice_id": "<voice id>",
"from": "<from>",
"to": "<to>",
"timestamp": "<ts>",
"state": "disconnected",
"call_duration": "<cd>",
"disconnect_reason": "Normal Clearing",
"disconnect_cause_code": 16,
"call_answered_by": "HUMAN"
}
WebSocket Event: stop_media (EnableX to your app)
As the final event on the WebSocket connection, EnableX sends stop_media to signal that
the stream has ended and no further media events will follow. Treat this as the graceful
shutdown signal — close the WebSocket connection cleanly on your side after receiving it.
{
"event": "stop_media",
"stream_id": "<uuid>",
"stop": {
"voice_id": "<uuid>"
}
}
Securing the WebSocket Connection
When you provide a wss_host to EnableX, you are instructing it to open an outbound
connection to your server. Without additional security, any client that knows your WebSocket address
could connect and receive call audio. EnableX supports a token-based authentication mechanism to
prevent this.
Why standard Authorization headers do not work
HTTP-based APIs protect endpoints with an Authorization header. WebSocket connections
start as an HTTP upgrade request, but after the upgrade is complete the WebSocket protocol itself
provides no built-in mechanism for carrying standard Authorization headers. Any security must therefore
be embedded in the connection URL or handled at the application layer during the upgrade handshake.
EnableX solves this with the following approach:
- Token as query string: the JWT token is passed as a URL query parameter directly in the wss_host value. Your server can read it during the HTTP upgrade request before completing the WebSocket handshake.
- Signed JWT: the token is cryptographically signed with a secret known only to your server. This proves the connection request originated from an operation your server authorised, not from an external party.
- Single-use token: each token is generated for one streaming session and is invalidated after it has been used once, preventing replay attacks.
- Expiring tokens: the token carries an expiry claim and becomes invalid after a defined time window, automatically limiting exposure if a token is ever intercepted.
- Optional IP verification: your server can additionally check that the connecting IP belongs to EnableX's known range as a second layer of validation.
How to use it
When calling the stream API, embed the signed JWT as a query parameter in the wss_host
URL value:
PUT https://api.enablex.io/voice/v1/call/{voice_id}/stream
Authorization: Basic xxxxxx
Content-Type: application/json
{
"wss_host": "wss://your-websocket-server:9091?token=$TOKEN"
}
Replace $TOKEN with the actual signed JWT your application server generated for this
streaming session immediately before making this call.
Authentication flow
Here is the complete sequence of events when token-based WebSocket authentication is in use:
- Your application server generates a signed JWT token. The token encodes a unique session identifier, an expiry timestamp, and optionally the allowed source IP. It is signed with a secret held only by your server.
- Your server passes the token in the Stream API call. The wss_host URL includes the token in its query string: wss://your-host:9091?token=eyJhbGci...
- EnableX Voice Server attempts to connect to the wss_host. It opens an HTTP upgrade request to your server. The token travels in the query string of this upgrade request exactly as you provided it.
- Your server validates the token before accepting the upgrade. Your WebSocket server extracts the token from the query string and verifies:
  - The cryptographic signature: confirms the token was issued by your server and has not been tampered with.
  - The expiry claim: rejects tokens that have passed their expiry time.
  - Single-use enforcement: marks the token as consumed, preventing replay.
  - Source IP (optional): confirms the request originates from EnableX's IP range.
- Connection accepted or denied. If all checks pass, your server completes the WebSocket upgrade and audio streaming begins. If any check fails, your server rejects the connection. EnableX receives a failed handshake and returns stream_failed in the API response.
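The token checks in this flow can be sketched with the standard library alone. The example assumes an HS256 (HMAC-SHA256) JWT whose jti claim carries the session identifier and whose exp claim carries the expiry; those claim choices, the function names, and the in-memory set used for single-use enforcement are illustrative rather than EnableX requirements. The optional source IP check is omitted.

```python
import base64
import hashlib
import hmac
import json
import time
from urllib.parse import parse_qs, urlsplit


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: bytes, session_id: str, ttl_seconds: int = 60, now=None) -> str:
    """Create a short-lived HS256 JWT for one streaming session."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64url(json.dumps({"jti": session_id, "exp": now + ttl_seconds}).encode())
    signing_input = f"{header}.{claims}"
    return f"{signing_input}.{_b64url(_sign(secret, signing_input))}"


def token_from_path(path: str):
    """Extract ?token=... from the path of the HTTP upgrade request."""
    values = parse_qs(urlsplit(path).query).get("token", [])
    return values[0] if values else None


class TokenValidator:
    """Checks signature, expiry, and single use (tracked in memory)."""

    def __init__(self, secret: bytes):
        self.secret = secret
        self.used = set()

    def validate(self, token, now=None) -> bool:
        now = int(time.time()) if now is None else now
        try:
            header, claims_b64, signature = token.split(".")
            expected = _sign(self.secret, f"{header}.{claims_b64}")
            if not hmac.compare_digest(expected, _b64url_decode(signature)):
                return False
            claims = json.loads(_b64url_decode(claims_b64))
        except (ValueError, AttributeError):
            return False
        if not isinstance(claims, dict):
            return False
        exp, jti = claims.get("exp"), claims.get("jti")
        if not isinstance(exp, int) or now >= exp:
            return False
        if not isinstance(jti, str) or jti in self.used:
            return False
        self.used.add(jti)   # consume the token so a replay is rejected
        return True
```

In a deployment with more than one server process, the set of consumed identifiers would need to live in a shared store, and a maintained JWT library may be preferable to hand-rolled signing.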