Voice Media Streaming API

Overview

The EnableX Audio Streaming API lets your application tap into a live voice call and receive the raw audio in real time — and send audio back — using a standard WebSocket connection. The communication is fully bidirectional: EnableX streams what the caller says to your server, and your server can stream synthesized or bot-generated audio back into the call.

This unlocks a class of integrations that are impossible with post-call recordings alone. Because you receive audio while the call is still in progress, you can react instantly — feeding a live transcription engine, routing the conversation through an AI voice bot, or flagging suspicious speech patterns before the call ends.

What you can build

  • Real-Time Voice Analytics — analyse sentiment, tone, or keywords as the conversation unfolds.
  • Voice Authentication & Fraud Detection — compare voiceprints or detect scripted fraud patterns in real time.
  • AI-Powered Voice Bot Integration — route call audio through a conversational AI model and inject its speech back into the call.
  • Call Recording and Monitoring — capture and store audio streams for compliance or quality assurance.
  • Real-Time Transcription & Sentiment Analysis — pipe audio to a speech-to-text engine and act on the transcript immediately.

Key capabilities

  • Bidirectional WebSocket Streaming — a single persistent WSS connection handles both inbound and outbound audio.
  • Real-Time Audio Access — audio is delivered as it is captured with no buffering delay on the EnableX side.
  • Secure Authentication — JWT token-based WebSocket authentication prevents unauthorised connections to your server.
  • REST API Control — start and stop streaming with standard HTTP calls from your application server.
  • RTP Encapsulation — audio is packaged in an industry-compatible format (G.711 ulaw, 8000 Hz, mono) for wide ecosystem compatibility.

Prerequisites & Subscription

Before you can use Media Streaming, make sure the following conditions are met.

Subscription required

The Web Streaming Service is a subscription-based feature. It is not available on the default Voice plan. You must have an active streaming subscription on your EnableX account before the stream API will accept requests. Contact EnableX support or your account manager to enable the feature.

Important: Calling the stream API without an active subscription will result in an error response. Verify your subscription status in the EnableX Portal before testing.

Transactional calls only

EnableX supports streaming exclusively on transactional calls — one-to-one conversations where a real interaction is taking place. Promotional or burst calling campaigns are not supported and will be rejected by the streaming service.

A connected call is required

Media streaming can only be started on a call that is already connected. You must first establish the voice call (either outbound or inbound) and wait for the call-connected webhook event before requesting a stream. Attempting to stream on a call that is still in the ringing or initiated state will fail.

Your WebSocket server must be ready

EnableX connects to you, not the other way around. Your application must be running a secure WebSocket (WSS) server that is publicly reachable before you invoke the Stream API. EnableX will immediately attempt to open a connection to the host you provide in the request payload.

How It Works

The full media streaming flow combines REST API calls, WebSocket events, and webhook notifications. Here is every step in the order they occur during a live call.

  1. A call is established. Your application either places an outbound call via POST /voice/v1/call, or an inbound call arrives and you accept it via PUT /voice/v1/call/{voice_id}/accept. Either way, the goal is a connected call with a voice_id you can reference in all subsequent requests.
  2. EnableX sends webhook notifications for call state changes. As the call progresses (initiated → ringing → connected), EnableX posts event payloads to your event_url. Your server handles these to know when the call is ready for streaming.
  3. Your application requests WebSocket streaming. Once the call is connected, your server calls PUT /voice/v1/call/{voice_id}/stream with the address of your WSS server. This tells EnableX where to open the WebSocket connection.
  4. EnableX connects to your WebSocket server. EnableX opens a secure WebSocket connection to the host you provided. Your server accepts this connection, optionally validating a JWT token passed in the query string.
  5. EnableX sends connected, start_media, and media events. First, EnableX confirms the WebSocket session with a connected event. Then it sends start_media with call identifiers. After that, a continuous stream of media events arrives — each carrying a chunk of the caller's audio encoded as Base64 ulaw.
  6. Your application processes audio and can send audio back. Your server receives each media event, decodes the audio, and passes it to your pipeline (transcription, AI model, analytics). When you want to inject audio back into the call — for example, a TTS response from a voice bot — you send a media event in the reverse direction using the same ulaw/Base64 format.
  7. Streaming stops when the call ends. When the call terminates, EnableX stops streaming and sends a stream_stopped event to your webhook URL, followed by a disconnected event with call duration and disconnect reason.
  8. EnableX sends a stop_media event over WebSocket. As the final step, EnableX sends stop_media on the WebSocket to signal that no further audio will arrive. Your server should gracefully close the connection after receiving this event.

Tip: To stop streaming before the call ends, call DELETE /voice/v1/call/{voice_id}/stream. To discard bot audio that is already queued without ending the stream, send a clear_media event on the WebSocket. See the Stop Media Streaming section for details.

Authentication

All REST API calls in the media streaming flow use the same HTTP Basic Authentication scheme as the rest of the EnableX Voice API. Your credentials are your App ID and App Key, Base64-encoded and passed in the Authorization header on every request.

For full details on obtaining your credentials and constructing the Authorization header, see the Voice Authentication guide.
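
As a minimal sketch of the header construction described above, the snippet below assumes the usual HTTP Basic convention of joining the App ID and App Key with a colon before Base64-encoding them. The Voice Authentication guide remains the authority on the exact format; the function name and the sample credentials are hypothetical.

```python
import base64


def basic_auth_header(app_id: str, app_key: str) -> str:
    """Return an HTTP Basic Authorization header value.

    Assumption: credentials follow the standard "id:key" Basic form.
    """
    raw = f"{app_id}:{app_key}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Hypothetical placeholder credentials, for illustration only.
headers = {
    "Authorization": basic_auth_header("my-app-id", "my-app-key"),
    "Content-Type": "application/json",
}
```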

Note: WebSocket connections use a separate token-based mechanism because the WebSocket protocol does not support standard HTTP Authorization headers after the upgrade handshake. This is covered in detail in the Securing the WebSocket Connection section.

Step 1 — Establish a Voice Call

Media streaming requires an active, connected call. There are two paths to get there: your application places an outbound call, or EnableX receives an inbound call and your application accepts it. Both paths are shown below.

Placing an Outbound Call

To initiate an outbound call, send a POST request to the call endpoint. Provide the destination number, your caller ID, and the URL where EnableX should send webhook notifications. The from field must match a number configured on your EnableX account.

The request body carries the following fields:

Field | Description
name | Optional label for this service instance. Useful for identifying calls in logs.
owner_ref | A free-form reference string echoed back as-is in every webhook event for this call. Use it to correlate EnableX events with your internal records.
from | The caller ID (CLI) to present on the outbound leg. Must match a number configured on your account.
to | The destination phone number to dial.
event_url | The HTTPS URL on your server that will receive webhook notifications for call state changes.

POST https://api.enablex.io/voice/v1/call
Authorization: Basic xxxxxx
Content-Type: application/json

{
    "name": "TEST_APP",
    "owner_ref": "XYZ",
    "to": "91XXXXXXXXXX",
    "from": "91XXXXXXXXXX",
    "event_url": "https://your-server/event"
}

A successful response confirms the call has been initiated. Save the voice_id — you will need it in every subsequent API call, including the stream request.

{
    "voice_id": "<uuid>",
    "state": "initiated",
    "from": "Originating CLI",
    "to": "Called Number",
    "timestamp": "2020-02-16T10:52:00Z"
}

Important: Wait for the connected webhook event before requesting streaming. The call must reach the connected state first. The connected webhook payload is shown below in the incoming call flow.
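
The outbound call request can be assembled with the Python standard library, roughly as follows. This is a sketch only: it builds the request object without sending it, assumes the "id:key" form of Basic credentials, and the helper names build_call_request and extract_voice_id are hypothetical.

```python
import base64
import json
import urllib.request

CALL_URL = "https://api.enablex.io/voice/v1/call"


def build_call_request(app_id, app_key, to, from_, event_url,
                       name="TEST_APP", owner_ref="XYZ"):
    """Build (but do not send) the POST request that places an outbound call."""
    body = {
        "name": name,
        "owner_ref": owner_ref,
        "to": to,
        "from": from_,
        "event_url": event_url,
    }
    # Assumption: Basic credentials are "app_id:app_key", Base64-encoded.
    creds = base64.b64encode(f"{app_id}:{app_key}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        CALL_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Basic {creds}",
            "Content-Type": "application/json",
        },
    )


def extract_voice_id(response_body: str) -> str:
    """Pull the voice_id out of the JSON response so later calls can use it."""
    return json.loads(response_body)["voice_id"]
```

Passing the returned object to urllib.request.urlopen would send the request; the response body can then be handed to extract_voice_id.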

Handling an Incoming Call

When a call arrives on one of your EnableX numbers, EnableX posts an incomingcall webhook event to your configured event_url. This payload contains the voice_id and call identifiers you need to accept the call and later start streaming.

Webhook notification — incoming call

This event fires the moment an inbound call arrives. Your server receives it and decides whether to accept. The voice_id here is the identifier you use for all subsequent API calls related to this call.

{
    "voice_id": "<uuid>",
    "state": "incomingcall",
    "from": "<from>",
    "to": "<to>",
    "channel_id": "<uuid>",
    "endpoint_type": "PSTN",
    "timestamp": "<ts>",
    "media_type": "TEXT"
}

Accept the call

Send a PUT request to the accept endpoint, embedding the voice_id from the webhook payload in the URL path. No request body is required.

PUT https://api.enablex.io/voice/v1/call/{voice_id}/accept
Authorization: Basic xxxxxx
Content-Type: application/json

Webhook notification — call connected

Once the call is answered and both legs are connected, EnableX sends a connected webhook event to your event_url. This is your signal that the call is ready for media streaming. The call_answered_by field indicates whether a human or an automated system picked up.

{
    "voice_id": "<uuid>",
    "from": "<from>",
    "to": "<to>",
    "timestamp": "<ts>",
    "state": "connected",
    "call_answered_by": "HUMAN"
}
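
One way to act on these webhook states is a small routing function, sketched below. The action labels and the function name are hypothetical; only the state values (incomingcall, connected) and the endpoints named in the comments come from this guide.

```python
from typing import Optional, Tuple


def next_action(event: dict) -> Tuple[str, Optional[str]]:
    """Map a call-state webhook payload to the next step in the flow."""
    state = event.get("state")
    voice_id = event.get("voice_id")
    if state == "incomingcall":
        # Next step: PUT /voice/v1/call/{voice_id}/accept
        return ("accept_call", voice_id)
    if state == "connected":
        # The call is ready: PUT /voice/v1/call/{voice_id}/stream
        return ("start_stream", voice_id)
    # initiated, ringing and any other state need no streaming action yet.
    return ("ignore", voice_id)
```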

Step 2 — Start Media Streaming

With a connected call in hand, instruct EnableX to open a WebSocket connection to your streaming server. You do this with a single PUT call to the stream endpoint. EnableX will immediately attempt to connect to the WSS host you provide and, once connected, begin sending audio events.

Media Streaming API

The request body contains one field: wss_host, which is the full address of your WebSocket server including port. EnableX connects to this address — your server must already be running and accepting connections when you make this call.

Field | Description
wss_host | The fully qualified WSS address of your streaming server, including hostname (or IP) and port. EnableX connects outbound to this endpoint to deliver audio events.

PUT https://api.enablex.io/voice/v1/call/{voice_id}/stream
Authorization: Basic xxxxxx
Content-Type: application/json

{
    "wss_host": "wss://your-host:port"
}

Tip: To stop unauthorised clients from connecting to your streaming server, embed a signed JWT in the WSS URL as a query parameter, for example wss://your-host:9091?token=$TOKEN. EnableX passes this token when it connects, and your server can validate it before accepting the connection. See Securing the WebSocket Connection for the full implementation guide.

Response — Streaming Started

A successful response means EnableX has connected to your WebSocket server and streaming has begun. The state field confirms this with the value stream_started.

{
    "voice_id": "<uuid>",
    "from": "<from>",
    "to": "<to>",
    "timestamp": "<ts>",
    "state": "stream_started"
}

Response — Streaming Failed

If EnableX could not connect to your WebSocket server — because it was unreachable, refused the connection, or failed token validation — the response carries stream_failed.

{
    "voice_id": "<uuid>",
    "from": "<from>",
    "to": "<to>",
    "timestamp": "<ts>",
    "state": "stream_failed"
}

Important: If you receive stream_failed, check that your WebSocket server is publicly accessible, is using a valid TLS certificate, and is listening on the exact port specified in wss_host. Common causes are firewall rules blocking inbound connections from EnableX's IP range and self-signed certificates that EnableX's client cannot verify.
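
The request body and the two response states can be handled with a couple of small helpers, sketched here. The function names are hypothetical, and the sketch assumes the token (if any) is appended as a token query parameter as described in this guide.

```python
import json
from urllib.parse import urlencode


def build_stream_body(host: str, port: int, token: str = "") -> str:
    """Return the JSON body for PUT /voice/v1/call/{voice_id}/stream."""
    wss_host = f"wss://{host}:{port}"
    if token:
        # JWT characters (letters, digits, "-", "_", ".") pass through unchanged.
        wss_host += "?" + urlencode({"token": token})
    return json.dumps({"wss_host": wss_host})


def stream_started(response_body: str) -> bool:
    """True when the response reports stream_started, False otherwise."""
    return json.loads(response_body).get("state") == "stream_started"
```

A False result from stream_started is the point at which to run through the reachability, certificate, and port checks listed above.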

WebSocket Events — From EnableX to Your App

Once the WebSocket connection is open, EnableX sends a defined sequence of events. Your server must handle each event type. They always arrive in this order: connected → start_media → media (repeated) → stop_media.
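
If you want your server to notice a message that arrives out of this order, a small state check along the following lines may help. It is a sketch: the class name is hypothetical, and allowing stop_media directly after start_media (a call that ends before any audio arrives) is an assumption rather than documented behaviour.

```python
import json

# Events allowed to follow each event; None means nothing received yet.
_NEXT = {
    None: {"connected"},
    "connected": {"start_media"},
    "start_media": {"media", "stop_media"},  # assumption: audio may never arrive
    "media": {"media", "stop_media"},
    "stop_media": set(),
}


class EventOrderChecker:
    """Tracks the last inbound event and rejects anything out of sequence."""

    def __init__(self):
        self.last = None

    def accept(self, raw_message: str) -> str:
        event = json.loads(raw_message).get("event")
        if not isinstance(event, str) or event not in _NEXT[self.last]:
            raise ValueError(f"unexpected event {event!r} after {self.last!r}")
        self.last = event
        return event
```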

Event: connected

This is the first event EnableX sends after the WebSocket connection is successfully established. It carries no call-specific data — it simply confirms that the WebSocket session is live and EnableX is ready to start sending audio. Your server should use this as a readiness signal before initialising any downstream audio pipelines.

{
    "event": "connected"
}

Event: start_media

Immediately after connected, EnableX sends start_media. This event carries the call identifiers that link this WebSocket stream to the specific voice call. Use the voice_id and stream_id from this payload when sending events back to EnableX — both fields are required in outbound media and clear_media events.

The payload has two levels of metadata:

  • The top-level stream_id identifies this streaming session.
  • The nested start object repeats the stream_id, adds the voice_id, and includes the from and to call numbers for correlation with your records.

{
    "event": "start_media",
    "stream_id": "<uuid>",
    "start": {
        "stream_id": "<uuid>",
        "voice_id": "<uuid>",
        "from": "<from>",
        "to": "<to>"
    }
}

Event: media (audio from EnableX to your app)

This is the core event — you receive it continuously throughout the call, one packet per audio chunk. Each event contains a single frame of raw audio from the active call, encoded as a Base64 string inside the JSON payload.

Understanding the audio format:

  • Encoding: ulaw (G.711 μ-law) — a standard telephony codec widely supported by speech-to-text engines and audio processing libraries. Decode the Base64 payload first, then decode ulaw to PCM if your pipeline requires it.
  • Sample rate: 8000 Hz — standard narrowband telephony rate. 8,000 samples per second.
  • Channels: 1 (mono) — a single audio channel carrying the mixed call audio.
  • Payload: Base64-encoded — the raw ulaw bytes are Base64-encoded for safe transport inside JSON. Always decode the Base64 string before passing to your audio pipeline.

The seq field is a monotonically increasing sequence number. Use it to detect dropped packets or reorder frames if your pipeline introduces latency. The timestamp records the capture time of that audio frame.

{
    "event": "media",
    "voice_id": "<uuid>",
    "stream_id": "<uuid>",
    "media": {
        "seq": 239,
        "timestamp": "<ts>",
        "format": {
            "encoding": "ulaw",
            "sample_rate": 8000,
            "channels": 1
        },
        "payload": "<base64 audio>"
    }
}
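
The decoding steps above (Base64 to ulaw bytes, then ulaw to PCM) can be sketched in pure Python as shown below. The expansion follows the standard G.711 μ-law formula; the function names are hypothetical, and the gap check assumes seq advances by exactly 1 per frame, which this guide does not state explicitly.

```python
import base64
import json
from typing import List, Optional, Tuple


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                      # mu-law bytes are stored inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def decode_media_event(raw_message: str) -> Tuple[int, List[int]]:
    """Return (seq, PCM samples) for one inbound media event."""
    media = json.loads(raw_message)["media"]
    ulaw = base64.b64decode(media["payload"])
    return media["seq"], [ulaw_byte_to_pcm16(b) for b in ulaw]


def missing_between(prev_seq: Optional[int], seq: int) -> int:
    """Count frames skipped between two seq values (assumes a step of 1)."""
    if prev_seq is None:
        return 0
    return max(0, seq - prev_seq - 1)
```

At 8000 Hz mono with one byte per sample, a payload of N decoded bytes covers N / 8000 seconds of audio.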

WebSocket Events — From Your App to EnableX

The WebSocket connection is bidirectional. Your application can send audio back into the live call — for example, the speech output from a voice bot, a TTS response, or any synthesised audio you want the caller to hear. You do this by sending media events over the same WebSocket connection, in the direction from your server to EnableX.

Event: media (audio from your app to EnableX)

The format is identical to the inbound media event. You must use the same audio specification: ulaw encoding, 8000 Hz sample rate, mono channel, Base64-encoded payload. EnableX decodes your audio and injects it into the active call so the other party hears it immediately.

The voice_id and stream_id must match the values you received in the start_media event. The seq number should increment with each frame you send — EnableX uses it to play audio in the correct order.

{
    "event": "media",
    "voice_id": "<uuid>",
    "stream_id": "<uuid>",
    "media": {
        "seq": 101,
        "timestamp": 1763112284802,
        "format": {
            "encoding": "ulaw",
            "sample_rate": 8000,
            "channels": 1
        },
        "payload": "<base64-encoded-ulaw>"
    }
}

Important: EnableX only accepts audio encoded as ulaw at 8000 Hz mono. If your source audio is in a different format (PCM, MP3, or a different sample rate), you must transcode it before sending. Sending audio in an unsupported format will result in the injected audio being unintelligible or silently dropped.
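
For source audio that is already 8000 Hz mono signed 16-bit PCM, the remaining transcoding step is μ-law compression. The sketch below uses the standard G.711 μ-law companding algorithm and wraps the result in a media event; the function names are hypothetical, and resampling from other rates is outside its scope.

```python
import base64
import json
from typing import Iterable

_BIAS = 0x84    # standard mu-law bias
_CLIP = 32635   # largest magnitude that fits after the bias is added


def pcm16_to_ulaw_byte(sample: int) -> int:
    """Compress one signed 16-bit linear sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), _CLIP) + _BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def build_media_event(voice_id: str, stream_id: str, seq: int,
                      timestamp_ms: int, samples: Iterable[int]) -> str:
    """Serialize one outbound media event from 8000 Hz mono PCM samples."""
    ulaw = bytes(pcm16_to_ulaw_byte(s) for s in samples)
    return json.dumps({
        "event": "media",
        "voice_id": voice_id,      # from the start_media event
        "stream_id": stream_id,    # from the start_media event
        "media": {
            "seq": seq,
            "timestamp": timestamp_ms,
            "format": {"encoding": "ulaw", "sample_rate": 8000, "channels": 1},
            "payload": base64.b64encode(ulaw).decode("ascii"),
        },
    })
```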

Stop Media Streaming

Streaming stops automatically when the call ends. However, you may also need to stop streaming before the call terminates — for example, to clear queued bot audio after a user interrupts, or to terminate the stream while keeping the call active. Two mechanisms are available.

WebSocket Event: clear_media (your app to EnableX)

Send clear_media over the WebSocket when you want EnableX to stop playing any audio already queued and clear its internal buffers. This is useful when a user interrupts a bot response mid-sentence — you can stop the current audio immediately and begin sending fresh audio straight away.

Include the stream_id and voice_id from the start_media event so EnableX can identify which stream to clear:

{
    "event": "clear_media",
    "stream_id": "<uuid>",
    "voice_id": "<uuid>"
}
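
A barge-in handler might pair the clear_media event with dropping any frames your own server has not sent yet, roughly as sketched here. The class and its pending queue are hypothetical application-side constructs; only the clear_media payload shape comes from this guide.

```python
import json
from collections import deque


class OutboundAudio:
    """Holds media events waiting to be sent and supports interruption."""

    def __init__(self, stream_id: str, voice_id: str):
        self.stream_id = stream_id
        self.voice_id = voice_id
        self.pending = deque()  # serialized media events not yet sent

    def interrupt(self) -> str:
        """Drop unsent frames locally and return the clear_media message."""
        self.pending.clear()
        return json.dumps({
            "event": "clear_media",
            "stream_id": self.stream_id,
            "voice_id": self.voice_id,
        })
```

The string returned by interrupt() is what you would write to the WebSocket before queuing the fresh audio.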

Stop Media Stream API

To stop streaming entirely (not just clear buffers), call the stream endpoint with the DELETE method. EnableX will close the WebSocket connection and stop sending audio events. The call itself remains active; only the media stream is terminated.

DELETE https://api.enablex.io/voice/v1/call/{voice_id}/stream
Authorization: Basic xxxxxx
Content-Type: application/json

A successful response confirms the stream has been stopped:

{
    "voice_id": "<voice id>",
    "state": "success",
    "timestamp": "2021-06-28T12:16:08.578Z"
}

Webhook Notification — stream_stopped

After streaming is stopped (whether via the DELETE API or because the call ended), EnableX sends a stream_stopped webhook event to your event_url. This event signals that no further audio will arrive from this stream session. Your backend should use this to clean up any resources associated with the stream.

{
    "voice_id": "<voice id>",
    "from": "<from>",
    "to": "<to>",
    "timestamp": "<ts>",
    "state": "stream_stopped"
}

Webhook Notification — disconnected

When the call itself ends, EnableX sends a disconnected webhook event with full call metadata. This includes the total call duration, the textual disconnect reason, and the numeric disconnect cause code (the sample value 16, "Normal Clearing", matches the Q.850 cause code for normal call clearing). Use this event for billing reconciliation, analytics, and releasing any resources tied to this call.

{
    "voice_id": "<voice id>",
    "from": "<from>",
    "to": "<to>",
    "timestamp": "<ts>",
    "state": "disconnected",
    "call_duration": "<cd>",
    "disconnect_reason": "Normal Clearing",
    "disconnect_cause_code": 16,
    "call_answered_by": "HUMAN"
}
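
Because stream_stopped and disconnected can both arrive for the same call, cleanup is easiest when it is safe to run twice. The following sketch shows one way to do that; the registry, the call log, and the function name are hypothetical application-side structures.

```python
from typing import Dict, List


def handle_teardown(event: dict, active_streams: Dict[str, object],
                    call_log: List[dict]) -> None:
    """Release per-stream and per-call resources based on webhook state."""
    state = event.get("state")
    voice_id = event.get("voice_id")
    if state == "stream_stopped":
        # No more audio will arrive for this stream session.
        active_streams.pop(voice_id, None)
    elif state == "disconnected":
        # pop() with a default is harmless if stream_stopped came first.
        active_streams.pop(voice_id, None)
        call_log.append({
            "voice_id": voice_id,
            "call_duration": event.get("call_duration"),
            "disconnect_reason": event.get("disconnect_reason"),
            "disconnect_cause_code": event.get("disconnect_cause_code"),
        })
```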

WebSocket Event: stop_media (EnableX to your app)

As the final event on the WebSocket connection, EnableX sends stop_media to signal that the stream has ended and no further media events will follow. Treat this as the graceful shutdown signal — close the WebSocket connection cleanly on your side after receiving it.

{
    "event": "stop_media",
    "stream_id": "<uuid>",
    "stop": {
        "voice_id": "<uuid>"
    }
}

Securing the WebSocket Connection

When you provide a wss_host to EnableX, you are instructing it to open an outbound connection to your server. Because that server must be publicly reachable, any client that discovers its address could attempt to connect and pose as EnableX unless your server authenticates incoming connections. EnableX supports a token-based authentication mechanism to prevent this.

Why standard Authorization headers do not work

HTTP-based APIs protect endpoints with an Authorization header. WebSocket connections start as an HTTP upgrade request, but after the upgrade is complete the WebSocket protocol itself provides no built-in mechanism for carrying standard Authorization headers. Any security must therefore be embedded in the connection URL or handled at the application layer during the upgrade handshake.

EnableX solves this with the following approach:

  • Token as QueryString — the JWT token is passed as a URL query parameter directly in the wss_host value. Your server can read it during the HTTP upgrade request before completing the WebSocket handshake.
  • Signed JWT — the token is cryptographically signed with a secret known only to your server. This proves the connection request originated from an operation your server authorised, not from an external party.
  • Single-use token — each token is generated for one streaming session and is invalidated after it has been used once, preventing replay attacks.
  • Expiring tokens — the token carries an expiry claim and becomes invalid after a defined time window, automatically limiting exposure if a token is ever intercepted.
  • Optional IP verification — your server can additionally check that the connecting IP belongs to EnableX's known range as a second layer of validation.

How to use it

When calling the stream API, embed the signed JWT as a query parameter in the wss_host URL value:

PUT https://api.enablex.io/voice/v1/call/{voice_id}/stream
Authorization: Basic xxxxxx
Content-Type: application/json

{
    "wss_host": "wss://your-websocket-server:9091?token=$TOKEN"
}

Replace $TOKEN with the actual signed JWT your application server generated for this streaming session immediately before making this call.

Authentication flow

Here is the complete sequence of events when token-based WebSocket authentication is in use:

  1. Your application server generates a signed JWT token. The token encodes a unique session identifier, an expiry timestamp, and optionally the allowed source IP. It is signed with a secret held only by your server.
  2. Your server passes the token in the Stream API call. The wss_host URL includes the token in its query string: wss://your-host:9091?token=eyJhbGci...
  3. EnableX Voice Server attempts to connect to the wss_host. It opens an HTTP upgrade request to your server. The token travels in the query string of this upgrade request exactly as you provided it.
  4. Your server validates the token before accepting the upgrade. Your WebSocket server extracts the token from the query string and verifies:
    • The cryptographic signature — confirms the token was issued by your server and has not been tampered with.
    • The expiry claim — rejects tokens that have passed their expiry time.
    • Single-use enforcement — marks the token as consumed, preventing replay.
    • Source IP (optional) — confirms the request originates from EnableX's IP range.
  5. Connection accepted or denied. If all checks pass, your server completes the WebSocket upgrade and audio streaming begins. If any check fails, your server rejects the connection. EnableX receives a failed handshake and returns stream_failed in the API response.

Tip: Keep JWT expiry windows short: 30 to 60 seconds is usually sufficient given the time between generating the token and EnableX connecting to your server. A longer expiry window increases your exposure if the token is ever captured in a log.

Note: Token validation logic lives entirely on your server. EnableX does not inspect or interpret the token value; it simply passes it in the query string when connecting. You are free to use any JWT library and signing algorithm supported by your technology stack (HS256, RS256, etc.).
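
The issue-and-validate cycle from the authentication flow can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a required implementation: it uses HS256, stores consumed token IDs in an in-memory set (a shared store would be needed across multiple server processes), omits the optional source IP check, and all function names are hypothetical. In production a maintained JWT library is usually the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid
from typing import Optional, Set
from urllib.parse import parse_qs, urlsplit


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: bytes, ttl_seconds: int = 60,
                now: Optional[float] = None) -> str:
    """Create an HS256 JWT with a unique ID (jti) and an expiry (exp)."""
    issued = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    claims = _b64url(json.dumps(
        {"jti": uuid.uuid4().hex, "exp": issued + ttl_seconds}).encode("utf-8"))
    signing_input = f"{header}.{claims}"
    return f"{signing_input}.{_b64url(_sign(secret, signing_input))}"


def token_from_path(path: str) -> Optional[str]:
    """Extract ?token=... from the request target of the upgrade request."""
    values = parse_qs(urlsplit(path).query).get("token")
    return values[0] if values else None


def validate_token(token: str, secret: bytes, used_ids: Set[str],
                   now: Optional[float] = None) -> bool:
    """Check signature, expiry, and single use; record the jti on success."""
    current = time.time() if now is None else now
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = _sign(secret, f"{header_b64}.{claims_b64}")
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return False
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
    except ValueError:  # wrong part count, bad Base64, bad JSON, non-ASCII
        return False
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return False
    if not isinstance(claims, dict):
        return False
    exp, jti = claims.get("exp"), claims.get("jti")
    if not isinstance(exp, int) or current >= exp:
        return False
    if not isinstance(jti, str) or jti in used_ids:
        return False
    used_ids.add(jti)  # single-use enforcement
    return True
```

Your WebSocket server would call token_from_path on the upgrade request target and refuse the handshake whenever validate_token returns False.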