How to Build an AI Voice Agent for Debt Collection
Written by engineers who build voice AI systems for production. Not theory - architecture decisions, technology choices, and the tradeoffs nobody talks about.
6 layers, 5 critical design decisions, and an honest build-vs-buy analysis.
The 6-Layer Architecture
Every production voice AI system has these six layers. Skip one and you have a demo, not a product. Each layer has its own latency budget, failure modes, and technology choices.
Telephony Layer
SIP Trunking, WebRTC & Number Provisioning
The foundation. Your AI needs a way to make and receive phone calls. This means SIP trunking (Telnyx, Twilio, Vonage), WebRTC for browser-based testing, and programmatic number provisioning across area codes. You need media forking to send audio to your AI pipeline while maintaining the call. Telnyx TeXML or Twilio Media Streams give you raw audio via WebSocket - that is your entry point.
Voice Activity Detection
Endpointing, Barge-in & Silence Detection
VAD determines when the debtor starts and stops speaking. Get this wrong and your AI either cuts people off or waits too long before responding. You need configurable silence thresholds (typically 500-800ms for collections), barge-in detection so the debtor can interrupt the AI mid-sentence, and background noise handling. Silero VAD is the open-source standard. OpenAI Realtime API handles this internally with server_vad mode.
Speech-to-Text (ASR)
Deepgram, Whisper, Google STT & Latency
Converting speech to text fast enough for real-time conversation. Latency is the constraint - batch transcription (Whisper) gives you accuracy but adds 1-3 seconds. Streaming ASR (Deepgram Nova-2, Google Chirp) gives you partial results in 200-400ms. For debt collection, you need high accuracy on names, account numbers, and dollar amounts. Deepgram is the industry standard for real-time voice AI. If you use OpenAI Realtime API or Gemini Live, ASR is built into the model.
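Streaming ASR APIs share a common interim/final pattern: interim results overwrite each other, final results are committed and never change. A generic sketch of assembling them into a running transcript (not tied to any vendor's SDK):

```python
class TranscriptAssembler:
    """Combine streaming ASR results into one running transcript.

    Generic sketch of the interim/final pattern used by streaming ASR
    services: interim results replace the previous interim, final
    results are appended permanently.
    """

    def __init__(self):
        self.committed: list[str] = []
        self.interim = ""

    def push(self, text: str, is_final: bool) -> str:
        if is_final:
            self.committed.append(text)
            self.interim = ""
        else:
            self.interim = text
        return self.current()

    def current(self) -> str:
        parts = self.committed + ([self.interim] if self.interim else [])
        return " ".join(parts)

asm = TranscriptAssembler()
asm.push("I can", is_final=False)
asm.push("I can pay", is_final=False)
text = asm.push("I can pay two hundred dollars", is_final=True)
```

The interim results are what let the LLM start thinking early; only committed text should be logged or used for compliance review.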
LLM Orchestration
OpenAI Realtime API vs Gemini Live vs Cascaded Pipeline
The brain. Three architectures: (1) OpenAI Realtime API - audio-in, audio-out with native function calling, lowest latency (~300ms), but expensive and limited voice options. (2) Gemini Live - Google's native audio model with interruption handling, good for high-volume. (3) Cascaded STT + LLM + TTS - most flexible, use any combination, but you own the latency budget. For collections, the LLM needs rock-solid prompt engineering: Mini-Miranda injection, debt context management, objection handling, and strict guardrails against promising outcomes the agent cannot deliver.
Text-to-Speech (TTS)
ElevenLabs, Google TTS, Voice Cloning & Latency
If you are running a cascaded pipeline, TTS is your final latency bottleneck. ElevenLabs gives the most natural voices with streaming output (~150ms to first byte). Google Cloud TTS and Azure Neural TTS are cheaper at scale. For collections, voice tone matters enormously - you need a voice that is professional, calm, and authoritative without being aggressive. Voice cloning can create consistent brand voices. If you use OpenAI Realtime API or Gemini Live, TTS is built into the model output.
Integration Layer
CRM APIs, Payment Processing & Compliance Engine
Where the AI connects to the real world. Function calling lets the LLM look up debtor accounts in your CRM mid-conversation, check payment status, initiate payment links, and update dispositions in real time. Your compliance engine sits here too - timezone checking, DNC lookup, consent verification, and frequency cap enforcement all happen before and during the call. This layer is what separates a demo from a production system.
5 Design Decisions That Define Your System
These are not features on a roadmap. These are architectural choices you make in week one that determine whether your system works in production.
Streaming vs. Batch Processing
In debt collection calls, perceived latency directly affects debtor engagement. If your AI takes 2+ seconds to respond, the debtor hangs up. A cascaded pipeline (STT -> LLM -> TTS) with batch processing at each step accumulates 3-5 seconds of latency. Streaming changes the game: Deepgram streams partial transcripts, the LLM generates token-by-token, and ElevenLabs starts speaking from the first sentence fragment. Target: under 800ms from end of debtor speech to start of AI speech. OpenAI Realtime API achieves ~300ms because the entire pipeline is one model.
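"Starts speaking from the first sentence fragment" means splitting the LLM token stream at speakable boundaries before handing it to TTS. A simplified sketch (real segmentation needs care with abbreviations and decimals):

```python
def chunk_for_tts(tokens):
    """Yield speakable fragments from an LLM token stream.

    Flushes at sentence-ending punctuation so TTS can start speaking
    before generation finishes; any trailing partial fragment is
    flushed at end of stream. Illustrative sketch only.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()

stream = ["Your ", "balance ", "is ", "$240.",
          " Would ", "you ", "like ", "a ", "payment ", "link?"]
fragments = list(chunk_for_tts(stream))
```

The first fragment can be in the TTS socket while the second is still being generated, which is where most of the perceived latency win comes from.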
Barge-in Handling
Debtors interrupt. They disagree, they ask questions mid-sentence, they get emotional. Your AI must handle this gracefully. When VAD detects the debtor speaking while the AI is outputting audio, you need to: (1) immediately stop TTS playback, (2) flush the current LLM generation, (3) capture what the debtor said, and (4) generate a contextually appropriate response that acknowledges the interruption. OpenAI Realtime API handles this natively. With a cascaded pipeline, you need to manage the cancel/flush logic yourself across three services.
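The cancel/flush logic in a cascaded pipeline can be sketched with asyncio. This is a toy model: real playback would stream audio to the call leg, and "fragments" stand in for queued TTS output. Each fragment "plays" for 200 ms so the cancel path is observable:

```python
import asyncio

async def speak(fragments: list[str], cancel: asyncio.Event) -> list[str]:
    """Stream TTS fragments, stopping as soon as barge-in cancels playback."""
    played = []
    for frag in fragments:
        if cancel.is_set():        # debtor spoke: stop TTS, flush the rest
            break
        played.append(frag)
        await asyncio.sleep(0.2)   # stand-in for writing audio to the call
    return played

async def demo() -> list[str]:
    cancel = asyncio.Event()
    tts = asyncio.create_task(speak(["one", "two", "three", "four"], cancel))
    await asyncio.sleep(0.3)       # VAD detects debtor speech mid-utterance
    cancel.set()                   # stop playback; a real system also flushes
                                   # the in-flight LLM generation here
    return await tts

played = asyncio.run(demo())
```

In a real system the same event would also abort the LLM stream and the ASR would already be capturing what the debtor is saying, so the next turn has the interruption in context.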
Context Window Management
Debt collection conversations require history. The AI needs to know: who the debtor is, what they owe, their payment history, previous call outcomes, any disputes or complaints, and current account status. That is a lot of context to fit into a prompt. Strategy: load a structured debtor profile at call start (500-1000 tokens), keep rolling conversation history (last 10-15 turns), and use function calling for on-demand CRM lookups rather than pre-loading everything. For multi-debt consumers, only load the relevant debt context - do not dump all accounts into the prompt.
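The strategy above (pinned profile, rolling window, lookups on demand) reduces to a small prompt-assembly function. A sketch, using an OpenAI-style message list; older turns are simply dropped rather than summarized here:

```python
def build_messages(profile: str, history: list[dict],
                   max_turns: int = 12) -> list[dict]:
    """Assemble the LLM prompt: system profile + rolling history window.

    The structured debtor profile stays pinned in the system message;
    only the most recent turns of conversation are kept.
    """
    recent = history[-max_turns:]
    return [{"role": "system", "content": profile}] + recent

history = [
    {"role": "user" if i % 2 else "assistant", "content": f"turn {i}"}
    for i in range(30)
]
msgs = build_messages("Debtor: J. Doe, balance $240, debt #A-1",
                      history, max_turns=12)
```

Anything the dropped turns contained that still matters (a promise to pay, a dispute) should live in the profile or come back via a function call, not in raw history.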
Function Calling for Real-Time CRM Lookups
The LLM should not guess at account balances or payment deadlines. When the debtor asks "how much do I owe?" the AI calls a function that queries your CRM API in real time. This requires: (1) a well-defined function schema the LLM understands, (2) sub-second API response times from your CRM, (3) error handling when the CRM is slow or down, and (4) caching for repeated lookups in the same call. OpenAI Realtime API and Gemini Live both support native function calling. In a cascaded pipeline, you intercept the LLM output, detect function calls, execute them, and feed results back.
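Requirements (1) and (4) look roughly like this in practice. A sketch using an OpenAI-tools-style schema; the CRM call is a stub and the function name, fields, and endpoint are hypothetical:

```python
import json

# Hypothetical function schema in the OpenAI tools style.
BALANCE_TOOL = {
    "name": "get_balance",
    "description": "Look up the current balance for a debt account.",
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}

_cache: dict[str, dict] = {}   # per-call cache for repeated lookups

def crm_get_balance(account_id: str) -> dict:
    """Stub for the real CRM API call (hypothetical endpoint)."""
    return {"account_id": account_id, "balance_usd": 240.00}

def dispatch(name: str, arguments: str) -> dict:
    """Execute a model-requested function call, caching repeated lookups
    so the CRM is not hit twice for the same question in one call."""
    key = f"{name}:{arguments}"
    if key not in _cache:
        if name == "get_balance":
            _cache[key] = crm_get_balance(**json.loads(arguments))
        else:
            _cache[key] = {"error": f"unknown function {name}"}
    return _cache[key]

result = dispatch("get_balance", '{"account_id": "A-1"}')
```

Timeouts and a fallback utterance ("let me check that for you") belong in `dispatch` too, so a slow CRM degrades the conversation instead of freezing it.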
Conference Bridge Architecture
What happens when the AI needs to transfer to a human agent? A naive approach disconnects the AI and cold-transfers. A better approach: conference bridge. The AI stays on the line, briefs the human agent (either via whisper or a pre-call summary pushed to their screen), and then the three parties are on one call. The AI can continue to listen, take notes, fill CRM fields, and re-engage if needed. This requires SIP-level conference control - the Telnyx Call Control API supports this natively with conference commands. The architecture is three WebSocket sessions sharing one bridge.

Compliance Engine Architecture
Six subsystems that must exist before your AI makes a single collection call. These are not features you add later - they are pre-dial gates and in-call enforcement. Build them first or do not build at all.
Timezone Checking
Before dialing, resolve the debtor's time zone from their area code or address. Enforce FDCPA's 8 AM - 9 PM window in the consumer's local time. Account for daylight saving transitions. Use a timezone database (IANA/Olson) and carrier lookup API - never rely on the collector's local time. Block the call at the API level if outside permitted hours.
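The pre-dial gate itself is a few lines once the IANA zone is resolved upstream. A sketch using Python's `zoneinfo`, which handles daylight saving transitions automatically:

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

def call_allowed(debtor_tz: str, now_utc: datetime) -> bool:
    """Pre-dial gate: is it 8 AM-9 PM in the debtor's local time?

    debtor_tz is an IANA zone name resolved upstream from area code
    or address; DST is handled by the timezone database.
    """
    local = now_utc.astimezone(ZoneInfo(debtor_tz))
    return time(8, 0) <= local.time() < time(21, 0)

# The same UTC instant lands inside the window in one zone, outside in another.
now = datetime(2024, 6, 3, 22, 0, tzinfo=ZoneInfo("UTC"))    # 6 PM EDT
ok_east = call_allowed("America/New_York", now)              # inside window
late = datetime(2024, 6, 4, 2, 0, tzinfo=ZoneInfo("UTC"))    # 10 PM EDT
ok_late = call_allowed("America/New_York", late)             # blocked
```

When area code and address disagree on the zone, the conservative implementation only dials in the intersection of both windows.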
DNC Lookup
Scrub every number against the Federal Do Not Call Registry (refreshed every 31 days), your internal DNC list (immediate effect), state-specific DNC lists, and the FCC Reassigned Numbers Database. Implement this as a pre-dial check in your telephony layer - the call should never be initiated if the number is on any list. Cache results but set short TTLs for internal lists.
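A sketch of the pre-dial scrub with per-list cache TTLs. The list contents are stubbed in-memory sets; in production each lookup would hit the federal registry export, your internal DNC table, state lists, and the Reassigned Numbers Database, with the internal list given a short TTL so suppressions take effect quickly:

```python
import time

class DncScrubber:
    """Pre-dial DNC check across several lists with per-list cache TTLs."""

    def __init__(self, lists: dict[str, set[str]], ttls: dict[str, float]):
        self.lists = lists          # list name -> blocked numbers (stubbed)
        self.ttls = ttls            # list name -> cache TTL in seconds
        self.cache: dict[tuple[str, str], tuple[float, bool]] = {}

    def blocked(self, number: str) -> bool:
        for name, numbers in self.lists.items():
            key = (name, number)
            hit = self.cache.get(key)
            if hit is None or time.monotonic() - hit[0] > self.ttls[name]:
                hit = (time.monotonic(), number in numbers)
                self.cache[key] = hit
            if hit[1]:
                return True          # on any list: never initiate the call
        return False

scrubber = DncScrubber(
    lists={"federal": {"+15551234567"}, "internal": set()},
    ttls={"federal": 24 * 3600, "internal": 60},
)
```

The key property: `blocked()` runs before the telephony layer ever sees the number, so a cache miss on a slow list delays the dial rather than letting it through.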
Consent Tracking
Maintain a per-consumer, per-channel, per-debt consent record. Store how consent was obtained (written, verbal, web form), when, the exact language used, and whether it has been revoked. The TCPA requires prior express consent for automated calls to cell phones. Your system must refuse to dial if consent status is missing, expired, or revoked. Treat this as a hard gate, not a soft warning.
Mini-Miranda Injection
The LLM's system prompt must include the Mini-Miranda disclosure as a non-negotiable first utterance: identify the company, state that the call is an attempt to collect a debt, and that any information obtained will be used for that purpose. This cannot be optional, skippable, or buried. Engineer the prompt so that the AI delivers this clearly and at a natural pace before any other conversation. Validate in your test suite that it never gets dropped.
Recording & Transcription Pipeline
Record every call for compliance and QA. In all-party consent states (California, Florida, Illinois), disclose recording at call start. Store recordings encrypted (AES-256) with role-based access. Run async transcription (Deepgram or Whisper) after the call ends. Index transcripts for searchability - regulators will ask for specific calls. Retain per your policy, typically 3-7 years for debt collection. Build the pipeline to handle thousands of concurrent recordings.
Frequency Cap Enforcement
Regulation F limits collectors to 7 call attempts per debt within a rolling 7-day window. After a live conversation, no further calls for 7 days on that debt. Track attempts at the per-debt level, not per-consumer. Aggregate counts across all channels and agents - AI and human. Implement as a pre-dial database check: query attempt history, calculate rolling window, and block if at limit. This is a hard technical constraint, not a guideline.
Build vs. Buy vs. Custom-Built
There is no universally correct answer. The right choice depends on your engineering capacity, call volume, compliance requirements, and timeline. Here is the honest breakdown.
Build In-House
When it makes sense
Choose this when
- You have a team of 5+ engineers with voice AI experience
- You need deep customization that no platform supports
- Call volume exceeds 100K+ minutes/month (cost savings justify investment)
- You have regulatory requirements that demand full code ownership
- You are building voice AI as a core competency, not a side feature
Tradeoffs
- 6-12 months to production-ready MVP
- $500K-$2M+ in engineering investment before first call
- Ongoing maintenance: model updates, carrier changes, compliance patches
- You own the latency budget - and the debugging when it breaks at 3 AM
Use a Platform (Retell, Vapi, Bland)
When it makes sense
Choose this when
- You need to go live in weeks, not months
- Your call flows are relatively standard (inbound/outbound, simple branching)
- Volume is under 50K minutes/month
- You do not need custom ASR models or specialized voice training
- Your engineering team is small and focused on your core product
Tradeoffs
- Per-minute pricing gets expensive at scale ($0.07-$0.15/min + carrier costs)
- Limited control over latency, voice quality, and model behavior
- Platform outages affect your operations directly
- Compliance responsibility is still yours - the platform does not indemnify you
Custom-Built by Specialists (AInora Approach)
When it makes sense
Choose this when
- You need production-quality voice AI without building an engineering team
- Your use case requires deep CRM integration and custom compliance rules
- You want to own the system but not build it from scratch
- You need European-grade privacy (GDPR/EU AI Act) built into the architecture
- You need ongoing optimization - not just deployment and handoff
Tradeoffs
- Higher upfront cost than a platform, lower than building in-house
- You depend on the builder for deep customization (though you own the code)
- Timeline: 4-8 weeks to production vs 6-12 months in-house
- You get opinionated architecture choices - which is usually a feature, not a bug
Your Latency Budget: 800ms
In a natural phone conversation, the gap between one person finishing and the other starting is roughly 200-500ms. Your AI gets slightly more leeway because callers expect automated systems to be a bit slower, but the ceiling is around 800ms. Beyond that, the conversation feels broken.
Engineering takeaway:
If you choose a cascaded pipeline, every millisecond counts. Use streaming at every stage - streaming ASR, streaming LLM generation, streaming TTS. Pre-warm your connections. Cache debtor profiles. Co-locate your services. The difference between 700ms and 1200ms is the difference between a natural conversation and a debtor who hangs up.
Related Resources
Technical FAQ
Common engineering questions about building AI voice agents for debt collection.
We Already Built This System
Everything in this guide - the 6-layer architecture, the compliance engine, the barge-in handling, the conference bridge - is running in production today. Call the demo number and hear it for yourself.