How to build an AI voice agent with OpenAI Realtime API + Asterisk SIP (2025) using Python

I’ve deployed AI call assistants for a few organizations, and realized how scattered and incomplete most resources are. So, I decided to build a replicable framework, something you can actually use without spending hours searching for missing pieces. In this tutorial, I’ll show you the full AI call agent setup using Asterisk and the OpenAI Realtime API, ready for you to build on.

Goal: a real phone number rings → Asterisk answers → audio is streamed (RTP) to your Python service → your service talks to OpenAI Realtime over WebSocket → the model replies in voice, you inject the audio back into the live call → you log, monitor, and fail safely.


The full blueprint and GitHub repo for this project are provided below.

This is production-ready: NAT-aware media handling, barge-in, DTMF fallback, guardrails, and observability.

If you’ve been burned by cute demos that die the second you add a SIP trunk or NAT, this one’s for you. We’re going through every moving part – Asterisk ARI, ExternalMedia, RTP framing, server-VAD, guardrails, and Prometheus metrics – so you can ship, not just tweet.

And, feel free to ask anything about this project.

What we’re actually wiring

Core Architecture

At its core, the system works like this:

  1. Asterisk receives a call.
  2. The ARI app connects and creates a channel.
  3. The channel is bridged to an external RTP session.
  4. The Python Realtime Agent handles live audio and sends responses.
  5. Asterisk plays those responses back to the caller.

Why this approach?
Asterisk’s ExternalMedia channel is purpose-built for plugging outside media engines into a live bridge. It sets UNICASTRTP_LOCAL_ADDRESS / UNICASTRTP_LOCAL_PORT so you know exactly where to send the model’s audio back. It’s the documented, supported path, not a hack.

OpenAI Realtime gives you bidirectional audio over WebSocket: append PCM with input_audio_buffer.append; the server uses server-VAD to decide turns and streams back response.output_audio.delta chunks that you can inject into the live call as RTP.

Tooling you need (no, not optional)

  • Ubuntu 22.04/24.04 server with public IP (or NAT + correct port rules).
  • Asterisk 20/22+ (PJSIP + ARI). We’ll use ARI over HTTP.
  • Python 3.10+ with: websockets, aiohttp, python-dotenv, prometheus_client.
  • SIP trunk from any reasonable carrier.
  • OpenAI API key with Realtime enabled (server-side; never expose to clients). Realtime supports voices like alloy/echo/shimmer and the newer ash/ballad/coral/sage/verse; pick one.
  • Grafana + Prometheus (or your favorite telemetry) to actually observe this thing.

Repo layout (drop-in)

voice-agent-py/
├─ asterisk/
│ ├─ pjsip.conf
│ ├─ extensions.conf
│ ├─ http.conf
│ └─ ari.conf
├─ app/
│ ├─ main.py # entrypoint: ARI session → bridge → ext media → Realtime
│ ├─ ari.py # minimal ARI client (REST + events)
│ ├─ realtime.py # OpenAI Realtime WS session (send/recv audio, barge-in)
│ ├─ rtp.py # RTP packetizer/depacketizer for PCM16@16k (20ms frames)
│ ├─ guardrails.py # prompts, moderation gate, DTMF/handoff policies
│ ├─ tools.py # optional Realtime tool/function schema + handlers
│ ├─ observability.py # Prometheus metrics + HTTP server (/metrics, /healthz)
│ └─ settings.py # env loading, constants
├─ requirements.txt
├─ docker-compose.yml # optional: run Python service + Prometheus
└─ .env.example

Asterisk configs

asterisk/http.conf

[general]
enabled = yes
bindaddr = 0.0.0.0
bindport = 8088

asterisk/ari.conf

[general]
enabled = yes
pretty = yes
[voiceapp]
type = user
read_only = no
password = CHANGE_ME

asterisk/pjsip.conf (trim; adapt to your SIP provider)

[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060
; If behind NAT, also set:
;local_net=10.0.0.0/24
;external_media_address=203.0.113.10
;external_signaling_address=203.0.113.10

[mytrunk]
type=registration
outbound_auth=mytrunk
server_uri=sip:sip.example.com
client_uri=sip:USER@sip.example.com
retry_interval=60

[mytrunk]
type=auth
auth_type=userpass
username=USER
password=SECRET

[mytrunk]
type=aor
contact=sip:sip.example.com:5060

[mytrunk]
type=endpoint
context=inbound
disallow=all
allow=ulaw,alaw,g722
outbound_auth=mytrunk
aors=mytrunk

NAT variables keep RTP honest in real networks. Set external_media_address & friends if you’re behind NAT.

asterisk/extensions.conf

[inbound]
exten => _X.,1,NoOp(Inbound → AI agent)
same => n,Answer()
same => n,Stasis(voice-agent,${CALLERID(num)},${EXTEN})
same => n,Hangup()

[internal]
exten => 1001,1,NoOp(Operator / fallback queue)
same => n,Dial(PJSIP/1001)
same => n,Hangup()

rtp.conf reminder (open firewall ports):

[general]
rtpstart=10000
rtpend=20000
strictrtp=yes

Why ExternalMedia? Because it exposes where to send audio back via UNICASTRTP_LOCAL_ADDRESS/PORT, and you can fetch these through ARI and inject your model’s PCM in real time. Exactly what we need.
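If you want to sanity-check that path by hand, you can ask ARI for those variables directly. A quick example against this tutorial’s configs (host, credentials, and the channel id are yours to adjust):

curl -u voiceapp:CHANGE_ME \
  "http://127.0.0.1:8088/ari/channels/EM_CHANNEL_ID/variable?variable=UNICASTRTP_LOCAL_ADDRESS"
# → {"value":"127.0.0.1"}   (repeat with variable=UNICASTRTP_LOCAL_PORT)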

NOTE: At the time of writing this tutorial, OpenAI has added the functionality to connect to SIP trunks directly; you can read about it in the OpenAI Platform docs.

Full Repository: https://github.com/thevysh/AsteriskOpenAIRealtimeAssistant/

app/rtp.py — dead-simple RTP for PCM16@16k

We’ll use 20 ms frames (320 samples per channel @16k → 640 bytes payload).

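Here’s a minimal sketch of the file (the repo has the full version; the payload type and randomized SSRC here are illustrative defaults, not negotiated values):

# app/rtp.py
# Minimal RTP packetizer/depacketizer for PCM16 @ 16 kHz (slin16), 20 ms frames.

import random
import struct

SAMPLE_RATE = 16000
FRAME_MS = 20
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples
BYTES_PER_FRAME = SAMPLES_PER_FRAME * 2             # 640 bytes (16-bit mono)
RTP_HEADER_LEN = 12

class RtpPacketizer:
    def __init__(self, payload_type: int = 118):  # dynamic PT; verify with rtp set debug
        self.payload_type = payload_type
        self.seq = random.randint(0, 0xFFFF)
        self.timestamp = random.randint(0, 0xFFFFFFFF)
        self.ssrc = random.randint(0, 0xFFFFFFFF)

    def packetize(self, pcm: bytes) -> bytes:
        """Wrap one 640-byte PCM16 frame in a 12-byte RTP header."""
        header = struct.pack(
            "!BBHII",
            0x80,                      # version 2, no padding/extension/CSRC
            self.payload_type & 0x7F,  # marker bit clear
            self.seq,
            self.timestamp & 0xFFFFFFFF,
            self.ssrc,
        )
        self.seq = (self.seq + 1) & 0xFFFF
        self.timestamp += SAMPLES_PER_FRAME
        return header + pcm

def depacketize(packet: bytes) -> bytes:
    """Strip the RTP header (and CSRC list, if any); return the PCM16 payload."""
    if len(packet) < RTP_HEADER_LEN:
        return b""
    cc = packet[0] & 0x0F              # CSRC count lives in the low bits of byte 0
    return packet[RTP_HEADER_LEN + 4 * cc:]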

app/observability.py — Prometheus + health

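A sketch of the metrics plus the tiny aiohttp server behind /metrics and /healthz (metric names match the dashboards discussed later):

# app/observability.py
# Prometheus metrics plus a small aiohttp server exposing /metrics and /healthz.

from aiohttp import web
from prometheus_client import (CONTENT_TYPE_LATEST, Counter, Gauge,
                               Histogram, generate_latest)

CALLS_TOTAL = Counter("calls_total", "Calls handled by the agent")
ACTIVE_CALLS = Gauge("active_calls", "Calls currently bridged")
RTP_IN_BYTES = Counter("rtp_in_bytes", "RTP payload bytes received from Asterisk")
RTP_OUT_BYTES = Counter("rtp_out_bytes", "RTP payload bytes sent back to Asterisk")
WS_SEND_MS = Histogram("realtime_ws_send_ms", "Realtime WS send latency in ms")

async def metrics(_request: web.Request) -> web.Response:
    # Prometheus text exposition format, ready to scrape.
    return web.Response(body=generate_latest(),
                        headers={"Content-Type": CONTENT_TYPE_LATEST})

async def healthz(_request: web.Request) -> web.Response:
    return web.json_response({"ok": True})

async def start_observability(port: int = 9100) -> web.AppRunner:
    app = web.Application()
    app.router.add_get("/metrics", metrics)
    app.router.add_get("/healthz", healthz)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "0.0.0.0", port).start()
    return runner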

Prometheus client patterns are standard; expose /metrics and scrape. (GitHub)


app/guardrails.py — prompts + moderation + DTMF policy

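A sketch, assuming an ACME-style support bot; the prompt wording, the moderation model name, and the handoff target are placeholders to adapt:

# app/guardrails.py
# System prompt, moderation gate, and DTMF/handoff policy in one place.

import os
import aiohttp

SYSTEM_PROMPT = (
    "You are a phone assistant for ACME Support. Keep answers short and easy to speak. "
    "Never collect payment card details. If the caller asks for a human, sounds "
    "frustrated, or you are unsure twice in a row, say you are transferring and stop."
)

HANDOFF_DTMF = "#"                        # "Press # to reach a human"
HANDOFF_TARGET = ("internal", "1001", 1)  # dialplan target for the ARI continue call

EMERGENCY_TOKENS = ("911", "112", "999")

def is_emergency(transcript: str) -> bool:
    """Cheap keyword check; emergencies always go straight to a human."""
    return any(tok in transcript for tok in EMERGENCY_TOKENS)

async def is_flagged(text: str) -> bool:
    """Gate transcripts through OpenAI's moderation endpoint before acting on them."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.openai.com/v1/moderations",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={"model": "omni-moderation-latest", "input": text},
        ) as resp:
            data = await resp.json()
            return bool(data["results"][0]["flagged"])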

For stricter cases, run moderation on transcripts before forwarding them; OpenAI’s moderation endpoint is commonly used for that gate.


app/tools.py — optional function tools (server-exec)

Realtime supports “tools” (function calling) so the model can call your backend hooks. Keep them minimal and idempotent.

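A sketch with one hypothetical tool — lookup_order stands in for whatever your backend actually exposes. The flattened schema shape is what Realtime’s session.update expects:

# app/tools.py
# Optional Realtime function tools. Keep them minimal and idempotent.

import json

TOOL_SCHEMAS = [
    {
        "type": "function",
        "name": "lookup_order",
        "description": "Look up the status of an order by its number.",
        "parameters": {
            "type": "object",
            "properties": {"order_number": {"type": "string"}},
            "required": ["order_number"],
        },
    }
]

async def lookup_order(order_number: str) -> dict:
    # Placeholder: swap in a real (idempotent!) CRM or database call.
    return {"order_number": order_number, "status": "shipped"}

async def dispatch(name: str, arguments: str) -> str:
    """Run one tool call; return a JSON string to hand back as the tool output."""
    args = json.loads(arguments or "{}")
    if name == "lookup_order":
        return json.dumps(await lookup_order(**args))
    return json.dumps({"error": f"unknown tool: {name}"})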

app/realtime.py — OpenAI Realtime session (Python WebSockets)

Key flow:

  • session.update with server-VAD, voice, PCM formats, and system prompt.
  • Stream inbound PCM via input_audio_buffer.append.
  • On outbound, read response.output_audio.delta chunks, repacketize to RTP, and send to Asterisk’s UNICASTRTP_LOCAL_*.
  • On barge-in: send response.cancel.

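A sketch of the session half. The model name, the header kwarg, and the exact event/field names track the Realtime WebSocket API as documented at the time of writing; pin a version and verify against its reference. The 16 kHz PCM assumption mirrors the slin16 leg — if your API version negotiates a different rate, resample at this boundary:

# app/realtime.py
# OpenAI Realtime WS session: push caller PCM up, pull model PCM down, cancel on barge-in.

import base64
import json
import os

import websockets  # older websockets versions name the header kwarg extra_headers

from .guardrails import SYSTEM_PROMPT
from .tools import TOOL_SCHEMAS

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

class RealtimeSession:
    def __init__(self, voice: str = "alloy"):
        self.voice = voice
        self.ws = None

    async def connect(self):
        self.ws = await websockets.connect(
            REALTIME_URL,
            additional_headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        )
        await self.ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": SYSTEM_PROMPT,
                "voice": self.voice,
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad"},
                "tools": TOOL_SCHEMAS,
            },
        }))
        # Warm up with ~200 ms of silence so server-VAD levels settle
        # (this is the "first word eaten" fix mentioned below).
        await self.send_pcm(b"\x00" * 640 * 10)

    async def send_pcm(self, pcm: bytes):
        """Append one PCM16 frame; server-VAD decides when the turn ends."""
        await self.ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode(),
        }))

    async def cancel(self):
        """Barge-in: cut the in-flight response."""
        await self.ws.send(json.dumps({"type": "response.cancel"}))

    async def audio_out(self):
        """Yield raw PCM16 chunks from response.output_audio.delta events."""
        async for raw in self.ws:
            event = json.loads(raw)
            if event.get("type") == "response.output_audio.delta":
                yield base64.b64decode(event["delta"])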

app/ari.py — tiny ARI wrapper (REST + events)

We only need: listen for StasisStart, create a mixing bridge, add the caller, create ExternalMedia, fetch UNICASTRTP_LOCAL_*, then spin up RealtimeBridge.

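A sketch against the documented ARI routes (bridges, externalMedia, channel variables, continue); base URL and credentials come from settings.py:

# app/ari.py
# Tiny ARI wrapper: REST over HTTP plus the event WebSocket.

import aiohttp

class Ari:
    def __init__(self, base_url: str, username: str, password: str, app: str):
        self.base = base_url.rstrip("/")  # e.g. http://127.0.0.1:8088/ari
        self.app = app
        self.http = aiohttp.ClientSession(auth=aiohttp.BasicAuth(username, password))

    async def events(self):
        """Yield decoded ARI events (StasisStart, ChannelDtmfReceived, ...)."""
        url = self.base.replace("http", "ws", 1) + f"/events?app={self.app}"
        async with self.http.ws_connect(url) as ws:
            async for msg in ws:
                if msg.type == aiohttp.WSMsgType.TEXT:
                    yield msg.json()

    async def create_bridge(self) -> str:
        async with self.http.post(f"{self.base}/bridges",
                                  params={"type": "mixing"}) as r:
            return (await r.json())["id"]

    async def add_channel(self, bridge_id: str, channel_id: str):
        await self.http.post(f"{self.base}/bridges/{bridge_id}/addChannel",
                             params={"channel": channel_id})

    async def external_media(self, host: str, fmt: str = "slin16") -> dict:
        """Create the ExternalMedia channel (encapsulation=rtp, transport=udp defaults)."""
        async with self.http.post(f"{self.base}/channels/externalMedia",
                                  params={"app": self.app, "external_host": host,
                                          "format": fmt}) as r:
            return await r.json()

    async def get_var(self, channel_id: str, name: str) -> str:
        async with self.http.get(f"{self.base}/channels/{channel_id}/variable",
                                 params={"variable": name}) as r:
            return (await r.json())["value"]

    async def continue_dialplan(self, channel_id: str, context: str,
                                exten: str, priority: int = 1):
        """Hand the live channel back to the dialplan (e.g. internal,1001)."""
        await self.http.post(f"{self.base}/channels/{channel_id}/continue",
                             params={"context": context, "extension": exten,
                                     "priority": str(priority)})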

app/settings.py

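A sketch; variable names mirror .env.example and the configs above:

# app/settings.py
# Env loading and constants.

import os

from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]      # server-side only; never ship to clients
ARI_URL = os.getenv("ARI_URL", "http://127.0.0.1:8088/ari")
ARI_USER = os.getenv("ARI_USER", "voiceapp")       # matches ari.conf [voiceapp]
ARI_PASSWORD = os.environ["ARI_PASSWORD"]
ARI_APP = os.getenv("ARI_APP", "voice-agent")      # matches Stasis(voice-agent,...)
UDP_PORT = int(os.getenv("UDP_PORT", "40000"))     # where we receive Asterisk's RTP
METRICS_PORT = int(os.getenv("METRICS_PORT", "9100"))
VOICE = os.getenv("VOICE", "alloy")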

app/main.py — glue it together

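A sketch of the choreography. RealtimeBridge below stands for the UDP↔WebSocket pump you assemble from rtp.py and realtime.py; the repo wires it fully:

# app/main.py
# Glue: ARI events → mixing bridge → ExternalMedia → Realtime bridge.

import asyncio

from . import settings
from .ari import Ari
from .observability import ACTIVE_CALLS, CALLS_TOTAL, start_observability

async def handle_call(ari: Ari, channel_id: str):
    CALLS_TOTAL.inc()
    ACTIVE_CALLS.inc()
    try:
        bridge_id = await ari.create_bridge()
        await ari.add_channel(bridge_id, channel_id)

        # Asterisk streams caller audio to 127.0.0.1:UDP_PORT as slin16...
        em = await ari.external_media(f"127.0.0.1:{settings.UDP_PORT}")
        await ari.add_channel(bridge_id, em["id"])

        # ...and these variables tell us where to send the model's audio back.
        host = await ari.get_var(em["id"], "UNICASTRTP_LOCAL_ADDRESS")
        port = int(await ari.get_var(em["id"], "UNICASTRTP_LOCAL_PORT"))

        # Hypothetical pump: RTP in → input_audio_buffer.append;
        # response.output_audio.delta → RTP out to (host, port).
        # await RealtimeBridge(settings.UDP_PORT, (host, port)).run()
    finally:
        ACTIVE_CALLS.dec()

async def main():
    await start_observability(settings.METRICS_PORT)
    ari = Ari(settings.ARI_URL, settings.ARI_USER,
              settings.ARI_PASSWORD, settings.ARI_APP)
    async for event in ari.events():
        # Dialplan calls enter Stasis with args; our own ExternalMedia
        # channels also enter the app, but without args, so skip those.
        if event.get("type") == "StasisStart" and event.get("args"):
            asyncio.create_task(handle_call(ari, event["channel"]["id"]))

if __name__ == "__main__":
    asyncio.run(main())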

What just happened:

  1. Inbound hits Stasis(voice-agent,…).
  2. We create a mixing bridge and add the caller.
  3. We create ExternalMedia pointing at 127.0.0.1:UDP_PORT with format=slin16.
  4. We ask Asterisk for UNICASTRTP_LOCAL_ADDRESS/PORT of that ExternalMedia channel—this is where we must send the model’s audio back.
  5. We start a RealtimeBridge, which:
    • Reads incoming RTP from Asterisk, strips headers, forwards PCM to input_audio_buffer.append.
    • Reads response.output_audio.delta, packetizes as RTP, and sends back to the Asterisk address/port.
    • Uses server-VAD for reliable turn-taking; you can barge in.

Running it

  1. Install Asterisk; drop the provided configs; set SIP trunk; reload.
    (Use pjsip set logger on and rtp set debug on when debugging.)
  2. python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
  3. cp .env.example .env and fill values.
  4. python -m app.main
  5. Call your DID. You should hear the model’s voice within ~500–800 ms after your first phrase (thanks, server-VAD and buffered warmup). If the first second is wobbly, send a short silence chunk at session start (already done above) to stabilize levels.

Guardrails you’ll actually keep on

  • System prompt: scope your bot, forbid payment details, define escalation after ambiguity or user frustration.
  • Moderation gate: scan transcripts (and/or outputs) before acting; block risky flows or flip to human.
  • DTMF out: teach callers “Press # to reach a human”; in ARI, you can move the live channel back to the dialplan (continue) to internal,1001. (I’ve kept it minimal in this article; add an ARI event listener for ChannelDtmfReceived to wire it in, as sketched after this list.) The ARI “continue” call is the canonical way to hand back to dialplan.
  • No emergencies: detect “911/112/999” and immediately transfer to a human; do not AI this.
  • PII hygiene: don’t log raw audio or sensitive strings; mask and rotate keys.
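The DTMF wiring is a few lines, assuming the Ari wrapper sketched earlier:

# Hypothetical DTMF handoff: watch for ChannelDtmfReceived in the ARI event
# loop and hand the channel back to the dialplan target from extensions.conf.
async def on_dtmf(ari, event):
    if event.get("type") == "ChannelDtmfReceived" and event.get("digit") == "#":
        await ari.continue_dialplan(event["channel"]["id"], "internal", "1001")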

Observability: what to graph

  • calls_total, active_calls → sanity.
  • rtp_in_bytes vs rtp_out_bytes → media flow.
  • realtime_ws_send_ms → event send latency (proxy for backpressure).
  • Interrupts/turns (log response.completed); handoff rate, mean call duration.

Prometheus + Grafana examples abound; the prom-client pattern used in Node translates 1:1 to Python’s prometheus_client.

Production checklist (a.k.a. how not to page yourself at 03:00)

  • NAT: if the box is behind NAT, set external_media_address & external_signaling_address in pjsip.conf, and open your RTP range (10000–20000 by default). This is the #1 cause of “there is no audio again.”
  • Codec alignment: use slin16 on ExternalMedia so you’re feeding/receiving 16k PCM; mismatches cause artifacts.
  • First-packet weirdness: warm up with a touch of silence to settle VAD—fixes the infamous “first word eaten” effect.
  • Avoid loops: don’t chain ExternalMedia/Snoop incorrectly; circular creation = channel storms. (People do this, ask me how I know.)
  • Fallbacks: DTMF to agent/queue; time-of-day routing; rate-limit tool calls.
  • Compliance: consent lines for recording/transcripts; retention policy; no card collection via bot.

Realtime mode tool list (what you’ll actually use)

When you build with OpenAI Realtime over WebSocket, these are the core levers you’ll exercise:

  • session.update — set instructions (system prompt), voice, turn_detection: server_vad, audio formats, enable input transcription, and register tools (function calling).
  • input_audio_buffer.append — append base64-encoded PCM16@16k; server-VAD commits when it detects end-of-speech. No ack is sent for each append (by design).
  • response.output_audio.delta — streamed base64 PCM16 from model; you re-packetize into RTP.
  • response.cancel — cut model speech on barge-in. (Send when you detect inbound speech or DTMF.)
  • response.function_call_arguments.done → conversation.item.create (function_call_output) — the function-calling round-trip for CRM tickets, lookups, etc.

(There are more events, such as transcript deltas and response-lifecycle notifications, but those five carry most real-time telephony apps.)


Testing

I know you hate this, but we cannot risk not doing it!

  • Asterisk CLI: pjsip set logger on, rtp set debug on.
  • SIPp load tests for multiple concurrent calls; it’s the standard SIP traffic generator (see the example after this list).
  • Fail NAT intentionally (block RTP) and confirm alerts fire.
  • Talk over the bot; confirm barge-in cancels TTS promptly.
  • Hammer it with noise; server-VAD should still chunk sanely.
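For SIPp, the stock uac scenario is enough to exercise signaling; the target below is a placeholder, and you should always point it at a test context, never your production trunk (-m total calls, -r calls/sec, -l concurrency cap):

sipp -sn uac 203.0.113.10:5060 -m 50 -r 2 -l 10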

Troubleshooting

  • “No return audio” → You didn’t fetch or you ignored UNICASTRTP_LOCAL_ADDRESS/PORT; or firewall/NAT ate the path.
  • “Output is slow/low” → check 16k PCM framing and 20 ms chunking; ensure you’re not stuffing 2 s frames.
  • “ExternalMedia doesn’t expose UNICASTRTP vars” → double-check you created ARI ExternalMedia with encapsulation=rtp/transport=udp (defaults), and you’re querying the right channel id. There are known misconfig gotchas in community threads.

Appendix — Asterisk + ExternalMedia references (so you don’t have 37 tabs open)

  • External Media + ARI—explains the variables and injection path (UNICASTRTP_LOCAL_*). (docs.asterisk.org)
  • ExternalMedia example—community snippet showing format=slin16, direction=both. (Asterisk Community)
  • ARI externalMedia API args—app, channelId, external_host, format, direction. (asterisk.ctiapps.pro)
  • Asterisk external-media sample repo—end-to-end pattern (different backend). (GitHub)
  • OpenAI Realtime docs—events, flow, server-VAD, voice. (OpenAI Platform)
  • Voices roundup—realtime voices available (and newer ones announced). (OpenAI Developer Community)
  • Prometheus metrics—Node patterns you’ll mirror in Python; general guidance. (GitHub)

Closing

If your plan was “just WebRTC in a browser,” great for a demo. The second you need SIP trunks, PSTN, and the glorious mess that is NAT, you need Asterisk + ARI + Realtime. Add your CRM tools, your moderation policy, and your dashboards—and ship.
