AInora
Technical Architecture Guide

How to Build an AI Voice Agent for Debt Collection

Written by engineers who build voice AI systems for production. Not theory - architecture decisions, technology choices, and the tradeoffs nobody talks about.

6 layers, 5 critical design decisions, and an honest build-vs-buy analysis.

The 6-Layer Architecture

Every production voice AI system has these six layers. Skip one and you have a demo, not a product. Each layer has its own latency budget, failure modes, and technology choices.

Layer 01

Telephony Layer

SIP Trunking, WebRTC & Number Provisioning

The foundation. Your AI needs a way to make and receive phone calls. This means SIP trunking (Telnyx, Twilio, Vonage), WebRTC for browser-based testing, and programmatic number provisioning across area codes. You need media forking to send audio to your AI pipeline while maintaining the call. Telnyx TeXML or Twilio Media Streams give you raw audio via WebSocket - that is your entry point.

Telnyx SIP · Twilio Media Streams · WebRTC · RTP/SRTP · WebSocket audio streams

Layer 02

Voice Activity Detection

Endpointing, Barge-in & Silence Detection

VAD determines when the debtor starts and stops speaking. Get this wrong and your AI either cuts people off or waits too long before responding. You need configurable silence thresholds (typically 500-800ms for collections), barge-in detection so the debtor can interrupt the AI mid-sentence, and background noise handling. Silero VAD is the open-source standard. OpenAI Realtime API handles this internally with server_vad mode.
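The silence-threshold endpointing described above can be sketched as a small state machine. This is a simplified model, assuming an upstream VAD (such as Silero) emits one speech/non-speech decision per audio frame:

```python
class Endpointer:
    """Silence-based endpointing: signal end-of-utterance after the
    speaker has been silent for `silence_ms` (500-800 ms is typical for
    collections). Feed it one VAD decision per audio frame."""

    def __init__(self, silence_ms: int = 600, frame_ms: int = 20):
        self.silence_frames_needed = silence_ms // frame_ms
        self.silent_frames = 0
        self.in_speech = False

    def feed(self, is_speech: bool) -> bool:
        """Returns True exactly once, when the utterance ends."""
        if is_speech:
            self.in_speech = True
            self.silent_frames = 0
            return False
        if not self.in_speech:
            return False  # silence before any speech is not an endpoint
        self.silent_frames += 1
        if self.silent_frames >= self.silence_frames_needed:
            self.in_speech = False
            self.silent_frames = 0
            return True
        return False
```

Tuning `silence_ms` is the tradeoff: too low and you cut off debtors mid-thought, too high and the AI feels sluggish.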

Silero VAD · WebRTC VAD · Server-side endpointing · Configurable silence thresholds

Layer 03

Speech-to-Text (ASR)

Deepgram, Whisper, Google STT & Latency

Converting speech to text fast enough for real-time conversation. Latency is the constraint - batch transcription (Whisper) gives you accuracy but adds 1-3 seconds. Streaming ASR (Deepgram Nova-2, Google Chirp) gives you partial results in 200-400ms. For debt collection, you need high accuracy on names, account numbers, and dollar amounts. Deepgram is the industry standard for real-time voice AI. If you use OpenAI Realtime API or Gemini Live, ASR is built into the model.
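Streaming ASR emits unstable interim hypotheses followed by finalized segments. A minimal sketch of assembling them into one transcript (the interim/final distinction mirrors Deepgram-style streaming results, but the class itself is illustrative):

```python
class TranscriptAssembler:
    """Merge streaming ASR results into a stable transcript. Interim
    results overwrite each other; final results are committed and never
    revised."""

    def __init__(self):
        self.committed = []   # finalized segments
        self.interim = ""     # latest unstable hypothesis

    def feed(self, text: str, is_final: bool) -> str:
        if is_final:
            self.committed.append(text)
            self.interim = ""
        else:
            self.interim = text
        return self.current()

    def current(self) -> str:
        parts = self.committed + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

The interim text is what you feed to early LLM prefetching; only committed text should drive account-number or dollar-amount logic.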

Deepgram Nova-2 · Google Chirp · OpenAI Whisper · Azure Speech · Streaming vs batch

Layer 04

LLM Orchestration

OpenAI Realtime API vs Gemini Live vs Cascaded Pipeline

The brain. Three architectures: (1) OpenAI Realtime API - audio-in, audio-out with native function calling, lowest latency (~300ms), but expensive and limited voice options. (2) Gemini Live - Google's native audio model with interruption handling, good for high-volume. (3) Cascaded STT + LLM + TTS - most flexible, use any combination, but you own the latency budget. For collections, the LLM needs rock-solid prompt engineering: Mini-Miranda injection, debt context management, objection handling, and strict guardrails against promising outcomes the agent cannot deliver.

OpenAI Realtime API · Gemini 2.5 Flash Live · GPT-4o / Claude · Function calling · Prompt engineering

Layer 05

Text-to-Speech (TTS)

ElevenLabs, Google TTS, Voice Cloning & Latency

If you are running a cascaded pipeline, TTS is your final latency bottleneck. ElevenLabs gives the most natural voices with streaming output (~150ms to first byte). Google Cloud TTS and Azure Neural TTS are cheaper at scale. For collections, voice tone matters enormously - you need a voice that is professional, calm, and authoritative without being aggressive. Voice cloning can create consistent brand voices. If you use OpenAI Realtime API or Gemini Live, TTS is built into the model output.
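"Starts speaking from the first sentence fragment" means splitting the LLM token stream at sentence boundaries and handing each fragment to streaming TTS immediately. A minimal sketch; real boundary detection must handle abbreviations and dollar amounts, which this regex does not:

```python
import re

def sentence_chunks(token_stream):
    """Yield speakable fragments as soon as a sentence boundary appears
    in the LLM token stream, so streaming TTS can start before the full
    reply is generated."""
    buf = ""
    for token in token_stream:
        buf += token
        # Split on ., ?, ! followed by whitespace (naive on purpose).
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()
```

Each yielded fragment goes straight to the TTS stream, which is how a cascaded pipeline claws back most of its latency budget.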

ElevenLabs Turbo v2.5 · Google Cloud TTS · Azure Neural TTS · OpenAI TTS · Streaming audio

Layer 06

Integration Layer

CRM APIs, Payment Processing & Compliance Engine

Where the AI connects to the real world. Function calling lets the LLM look up debtor accounts in your CRM mid-conversation, check payment status, initiate payment links, and update dispositions in real time. Your compliance engine sits here too - timezone checking, DNC lookup, consent verification, and frequency cap enforcement all happen before and during the call. This layer is what separates a demo from a production system.

REST/GraphQL APIs · Webhook event system · Payment gateway SDKs · Compliance rule engine

Where Projects Succeed or Fail

5 Design Decisions That Define Your System

These are not features on a roadmap. These are architectural choices you make in week one that determine whether your system works in production.

Streaming vs. Batch Processing

In debt collection calls, perceived latency directly affects debtor engagement. If your AI takes 2+ seconds to respond, the debtor hangs up. A cascaded pipeline (STT -> LLM -> TTS) with batch processing at each step accumulates 3-5 seconds of latency. Streaming changes the game: Deepgram streams partial transcripts, the LLM generates token-by-token, and ElevenLabs starts speaking from the first sentence fragment. Target: under 800ms from end of debtor speech to start of AI speech. OpenAI Realtime API achieves ~300ms because the entire pipeline is one model.

Barge-in Handling

Debtors interrupt. They disagree, they ask questions mid-sentence, they get emotional. Your AI must handle this gracefully. When VAD detects the debtor speaking while the AI is outputting audio, you need to: (1) immediately stop TTS playback, (2) flush the current LLM generation, (3) capture what the debtor said, and (4) generate a contextually appropriate response that acknowledges the interruption. OpenAI Realtime API handles this natively. With a cascaded pipeline, you need to manage the cancel/flush logic yourself across three services.
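The "stop TTS the moment VAD fires" part of that cancel/flush logic can be sketched with asyncio: race the playback task against an interruption event and cancel the loser. `tts_task_body` stands in for your streaming TTS playback coroutine; the VAD loop sets the event when the debtor starts talking:

```python
import asyncio

async def speak(tts_task_body, interrupt_event: asyncio.Event) -> bool:
    """Play TTS output, but stop the moment VAD signals barge-in.
    Returns True if the AI finished its utterance, False if cut off."""
    playback = asyncio.ensure_future(tts_task_body())
    interrupted = asyncio.ensure_future(interrupt_event.wait())
    done, pending = await asyncio.wait(
        {playback, interrupted}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()            # stop playback, or stop waiting
    return playback in done

async def demo():
    ev = asyncio.Event()
    async def long_tts():
        await asyncio.sleep(10)  # stands in for a long TTS utterance
    # Simulate the debtor interrupting 10 ms in:
    asyncio.get_running_loop().call_later(0.01, ev.set)
    return await speak(long_tts, ev)
```

In the full pipeline, the `False` return is what triggers flushing the LLM generation and re-prompting with the interruption context.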

Context Window Management

Debt collection conversations require history. The AI needs to know: who the debtor is, what they owe, their payment history, previous call outcomes, any disputes or complaints, and current account status. That is a lot of context to fit into a prompt. Strategy: load a structured debtor profile at call start (500-1000 tokens), keep rolling conversation history (last 10-15 turns), and use function calling for on-demand CRM lookups rather than pre-loading everything. For multi-debt consumers, only load the relevant debt context - do not dump all accounts into the prompt.
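The strategy above reduces to a small prompt-assembly function: pin the structured profile as the system message, keep only the trailing turns, and leave everything else to function calls. A minimal sketch:

```python
def build_prompt(profile: str, history: list[dict], max_turns: int = 15) -> list[dict]:
    """Assemble the LLM message list: the structured debtor profile as
    the system message plus only the last `max_turns` conversation
    turns. Older turns fall off; on-demand data comes in via function
    calling instead of being pre-loaded."""
    return [{"role": "system", "content": profile}] + history[-max_turns:]
```

For multi-debt consumers, `profile` should contain only the debt under discussion, per the guidance above.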

Function Calling for Real-Time CRM Lookups

The LLM should not guess at account balances or payment deadlines. When the debtor asks "how much do I owe?" the AI calls a function that queries your CRM API in real time. This requires: (1) a well-defined function schema the LLM understands, (2) sub-second API response times from your CRM, (3) error handling when the CRM is slow or down, and (4) caching for repeated lookups in the same call. OpenAI Realtime API and Gemini Live both support native function calling. In a cascaded pipeline, you intercept the LLM output, detect function calls, execute them, and feed results back.
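Requirements (1)-(4) can be sketched as a schema plus a dispatcher. The tool name, schema fields, and CRM callable here are illustrative, not a fixed API:

```python
import json

# Hypothetical function schema the LLM sees (names are illustrative):
BALANCE_TOOL = {
    "name": "get_balance",
    "description": "Look up the current balance for a debt account.",
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}

def dispatch(call: dict, crm_lookup, cache: dict) -> str:
    """Execute a function call emitted by the LLM. Results are cached
    per call so repeated lookups in the same conversation are free, and
    CRM failures degrade to an honest 'unavailable' answer instead of
    letting the model guess."""
    key = (call["name"], json.dumps(call["arguments"], sort_keys=True))
    if key in cache:
        return cache[key]
    try:
        result = crm_lookup(**call["arguments"])   # sub-second budget
    except Exception:
        result = json.dumps({"error": "account system unavailable"})
    cache[key] = result
    return result
```

The error path matters: a slow or down CRM must produce an explicit "I can't access that right now," never a hallucinated balance.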

Conference Bridge Architecture

What happens when the AI needs to transfer to a human agent? A naive approach disconnects the AI and cold-transfers. A better approach: conference bridge. The AI stays on the line, briefs the human agent (either via whisper or a pre-call summary pushed to their screen), and then the three parties are on one call. The AI can continue to listen, take notes, fill CRM fields, and re-engage if needed. This requires SIP-level conference control - Telnyx call control API supports this natively with conference commands. The architecture is three WebSocket sessions sharing one bridge.

Non-Negotiable Infrastructure

Compliance Engine Architecture

Six subsystems that must exist before your AI makes a single collection call. These are not features you add later - they are pre-dial gates and in-call enforcement. Build them first or do not build at all.

Timezone Checking

Before dialing, resolve the debtor's time zone from their area code or address. Enforce FDCPA's 8 AM - 9 PM window in the consumer's local time. Account for daylight saving transitions. Use a timezone database (IANA/Olson) and carrier lookup API - never rely on the collector's local time. Block the call at the API level if outside permitted hours.

DNC Lookup

Scrub every number against the Federal Do Not Call Registry (refreshed every 31 days), your internal DNC list (immediate effect), state-specific DNC lists, and the FCC Reassigned Numbers Database. Implement this as a pre-dial check in your telephony layer - the call should never be initiated if the number is on any list. Cache results but set short TTLs for internal lists.
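The multi-list scrub with per-list TTLs can be sketched as below. The loader callables stand in for whatever fetches each list (federal registry export, state lists, your internal table); the structure is the point, not the names:

```python
import time

class DncScrubber:
    """Pre-dial DNC gate. `list_loaders` maps a list name to a callable
    returning the current set of blocked numbers. Internal lists get a
    short TTL so revocations take effect almost immediately; the
    federal registry can be cached for longer."""

    def __init__(self, list_loaders: dict, ttls: dict, clock=time.monotonic):
        self.loaders = list_loaders
        self.ttls = ttls          # seconds per list name
        self.clock = clock
        self.cache = {}           # name -> (fetched_at, numbers)

    def _numbers(self, name: str) -> set:
        cached = self.cache.get(name)
        if cached is None or self.clock() - cached[0] > self.ttls[name]:
            self.cache[name] = (self.clock(), self.loaders[name]())
        return self.cache[name][1]

    def may_dial(self, number: str) -> bool:
        """The call is never initiated if the number is on any list."""
        return not any(number in self._numbers(n) for n in self.loaders)
```

Wire this into the telephony layer so a blocked number fails before any SIP INVITE is sent.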

Consent Tracking

Maintain a per-consumer, per-channel, per-debt consent record. Store how consent was obtained (written, verbal, web form), when, the exact language used, and whether it has been revoked. The TCPA requires prior express consent for automated calls to cell phones. Your system must refuse to dial if consent status is missing, expired, or revoked. Treat this as a hard gate, not a soft warning.

Mini-Miranda Injection

The LLM's system prompt must include the Mini-Miranda disclosure as a non-negotiable first utterance: identify the company, state that the call is an attempt to collect a debt, and that any information obtained will be used for that purpose. This cannot be optional, skippable, or buried. Engineer the prompt so that the AI delivers this clearly and at natural pace before any other conversation. Validate in your test suite that it never gets dropped.
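The "validate in your test suite" step might look like the check below. The substring matching is deliberately simplified; a production check should tolerate paraphrase and ASR noise, but the shape of the assertion is the same:

```python
def delivered_mini_miranda(transcript: list[dict], company: str) -> bool:
    """QA check for the test suite: verify the AI's first utterance
    contains the Mini-Miranda elements - the company name, the
    attempt-to-collect statement, and the information-use statement.
    Substring matching is a simplification for illustration."""
    first_ai = next((t["text"] for t in transcript if t["speaker"] == "ai"), "")
    text = first_ai.lower()
    return (company.lower() in text
            and "attempt to collect a debt" in text
            and "information obtained will be used for that purpose" in text)
```

Run it against every recorded transcript: a single failure is a prompt engineering bug to fix, not an acceptable miss rate.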

Recording & Transcription Pipeline

Record every call for compliance and QA. In all-party consent states (California, Florida, Illinois), disclose recording at call start. Store recordings encrypted (AES-256) with role-based access. Run async transcription (Deepgram or Whisper) after the call ends. Index transcripts for searchability - regulators will ask for specific calls. Retain per your policy, typically 3-7 years for debt collection. Build the pipeline to handle thousands of concurrent recordings.

Frequency Cap Enforcement

Regulation F limits collectors to 7 call attempts per debt within a rolling 7-day window. After a live conversation, no further calls for 7 days on that debt. Track attempts at the per-debt level, not per-consumer. Aggregate counts across all channels and agents - AI and human. Implement as a pre-dial database check: query attempt history, calculate rolling window, and block if at limit. This is a hard technical constraint, not a guideline.
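The pre-dial check described above reduces to two rolling-window tests per debt. A minimal sketch, assuming attempt and conversation timestamps are already queryable per debt:

```python
from datetime import datetime, timedelta

def may_attempt(attempts: list[datetime], conversations: list[datetime],
                now: datetime) -> bool:
    """Regulation F presumption, enforced per debt: no more than 7 call
    attempts within a rolling 7-day window, and no further calls within
    7 days after a telephone conversation. Both are hard pre-dial
    gates."""
    week_ago = now - timedelta(days=7)
    recent_attempts = [t for t in attempts if t > week_ago]
    if len(recent_attempts) >= 7:
        return False                 # at the 7-in-7 cap
    if any(c > week_ago for c in conversations):
        return False                 # spoke with the consumer recently
    return True
```

Remember the aggregation requirement: `attempts` must include every channel and every agent, AI and human, or the count is wrong.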

Honest Analysis

Build vs. Buy vs. Custom-Built

There is no universally correct answer. The right choice depends on your engineering capacity, call volume, compliance requirements, and timeline. Here is the honest breakdown.

Build In-House

When it makes sense

Choose this when

  • You have a team of 5+ engineers with voice AI experience
  • You need deep customization that no platform supports
  • Call volume exceeds 100K minutes/month (cost savings justify the investment)
  • You have regulatory requirements that demand full code ownership
  • You are building voice AI as a core competency, not a side feature

Tradeoffs

  • 6-12 months to production-ready MVP
  • $500K-$2M+ in engineering investment before first call
  • Ongoing maintenance: model updates, carrier changes, compliance patches
  • You own the latency budget - and the debugging when it breaks at 3 AM

Use a Platform (Retell, Vapi, Bland)

When it makes sense

Choose this when

  • You need to go live in weeks, not months
  • Your call flows are relatively standard (inbound/outbound, simple branching)
  • Volume is under 50K minutes/month
  • You do not need custom ASR models or specialized voice training
  • Your engineering team is small and focused on your core product

Tradeoffs

  • Per-minute pricing gets expensive at scale ($0.07-$0.15/min + carrier costs)
  • Limited control over latency, voice quality, and model behavior
  • Platform outages affect your operations directly
  • Compliance responsibility is still yours - the platform does not indemnify you

Custom-Built by Specialists (AInora Approach)

When it makes sense

Choose this when

  • You need production-quality voice AI without building an engineering team
  • Your use case requires deep CRM integration and custom compliance rules
  • You want to own the system but not build it from scratch
  • You need European-grade privacy (GDPR/EU AI Act) built into the architecture
  • You need ongoing optimization - not just deployment and handoff

Tradeoffs

  • Higher upfront cost than a platform, lower than building in-house
  • You depend on the builder for deep customization (though you own the code)
  • Timeline: 4-8 weeks to production vs 6-12 months in-house
  • You get opinionated architecture choices - which is usually a feature, not a bug

The Number That Matters Most

Your Latency Budget: 800ms

In a natural phone conversation, the gap between one person finishing and the other starting is roughly 200-500ms. Your AI gets slightly more leeway because callers expect automated systems to be a bit slower, but the ceiling is around 800ms. Beyond that, the conversation feels broken.

OpenAI Realtime API                          ~300ms
Gemini Live                                  ~400-600ms
Cascaded (Deepgram + GPT-4o + ElevenLabs)    ~700-1200ms
Cascaded with batch Whisper                  ~1500-3000ms

Engineering takeaway:

If you choose a cascaded pipeline, every millisecond counts. Use streaming at every stage - streaming ASR, streaming LLM generation, streaming TTS. Pre-warm your connections. Cache debtor profiles. Co-locate your services. The difference between 700ms and 1200ms is the difference between a natural conversation and a debtor who hangs up.

Related Resources

Technical FAQ

Common engineering questions about building AI voice agents for debt collection.

What latency can I realistically achieve?

With OpenAI Realtime API, you can achieve approximately 300ms end-to-end latency (debtor finishes speaking to AI starts responding). With a cascaded pipeline (Deepgram STT + GPT-4o + ElevenLabs TTS), expect 700-1200ms depending on your infrastructure. Gemini Live falls in between at roughly 400-600ms. For debt collection, anything under 800ms feels natural. Over 1.5 seconds and debtors start hanging up or talking over the AI.

Should I use OpenAI Realtime API or a cascaded pipeline?

It depends on your priorities. OpenAI Realtime API gives you the lowest latency and simplest architecture - audio in, audio out, with native function calling and barge-in handling. But you are locked to OpenAI models, voice options are limited, and cost is higher per minute. A cascaded pipeline (e.g., Deepgram + Claude/GPT-4o + ElevenLabs) gives you full control over each component, more voice options, and the ability to swap providers. But you own the latency budget and barge-in logic. For debt collection specifically, we recommend starting with OpenAI Realtime API for speed to market, then evaluating a cascaded pipeline if you need cost optimization at scale.

How do I handle barge-in when the debtor interrupts?

Barge-in requires coordinating three systems: VAD must detect the debtor speaking while TTS is playing, the TTS audio stream must be immediately stopped, and the current LLM generation must be cancelled and re-prompted with the interruption context. OpenAI Realtime API handles this natively - it detects speech, stops its output, and processes the new input. In a cascaded pipeline, you need to: (1) run VAD continuously during TTS playback, (2) send a cancel signal to your TTS stream, (3) flush the LLM generation buffer, and (4) feed the new transcription to the LLM with context about what was interrupted. This is one of the hardest engineering challenges in voice AI.

Which telephony provider should I use?

Telnyx and Twilio are the two dominant choices. Telnyx offers lower per-minute rates, a call control API that supports conference bridging and media forking natively, and WebSocket-based audio streaming via TeXML. Twilio has a larger ecosystem, Media Streams for audio access, and more documentation. For debt collection specifically, Telnyx's call control API is better suited for conference bridge architectures where the AI stays on the line during transfers. Both support programmatic number provisioning, which you need for scaling across area codes.

How do I guarantee the Mini-Miranda is delivered on every call?

Engineer the Mini-Miranda into the system prompt as an absolute first-utterance requirement. The AI's opening must: identify the company, state that this is an attempt to collect a debt, and state that any information obtained will be used for that purpose. Make this a non-negotiable instruction at the top of your prompt, above all other behavioral instructions. In your test suite, run automated checks against every call transcript to verify the Mini-Miranda was delivered within the first 30 seconds. If the AI ever skips it, that is a prompt engineering bug, not a one-off error.

What does the compliance engine actually check?

The compliance engine is a pre-dial and in-call middleware layer. Pre-dial: timezone check (resolve consumer's local time from area code/address, block if outside 8AM-9PM), DNC scrub (federal registry + state lists + internal list), consent verification (check per-consumer, per-channel, per-debt consent status), and frequency cap check (query attempt history against Regulation F's 7-in-7 rule). In-call: Mini-Miranda injection in the prompt, cease-and-desist detection via NLU, recording disclosure for all-party consent states, and real-time call logging. Post-call: async transcription, disposition update, frequency counter increment, and audit trail storage. All of these are hard gates - if any check fails, the call does not happen.

What does it cost to build and run?

Realistic costs for a production debt collection voice agent: Engineering time for an MVP is 6-12 months with a team of 3-5 engineers ($500K-$2M). Per-call costs break down to: telephony ($0.01-0.03/min via Telnyx/Twilio), ASR ($0.005-0.01/min via Deepgram), LLM ($0.01-0.06/min depending on model), TTS ($0.01-0.03/min via ElevenLabs), and infrastructure ($0.001-0.005/min). Total per-minute COGS: $0.04-0.13 for a cascaded pipeline, $0.06-0.10 for OpenAI Realtime API. Using a platform like Retell or Vapi adds their margin on top, typically $0.07-0.15/min total. The build-vs-buy math depends on volume: at 10K minutes/month the platform is cheaper; at 100K+ minutes/month building starts to pay off.

Can I build this with open-source models?

Yes, but with significant tradeoffs. For ASR, Whisper (open-source) gives excellent accuracy but adds latency in real-time scenarios - you need to self-host with GPU infrastructure. For the LLM, open models like Llama or Mistral can work in a cascaded pipeline, but you lose native function calling reliability and voice-specific fine-tuning. For TTS, open-source options like Coqui TTS exist but do not match ElevenLabs quality. The real challenge is latency: self-hosted models require GPU infrastructure ($2-5K/month per GPU instance) and careful optimization. For debt collection, where compliance and reliability are paramount, we recommend starting with commercial APIs and moving to open-source only for cost optimization at very high volume.

Skip the Build. Keep the Control.

We Already Built This System

Everything in this guide - the 6-layer architecture, the compliance engine, the barge-in handling, the conference bridge - is running in production today. Call the demo number and hear it for yourself.