AInora
Technical Architecture Guide

How to Build an AI Voice Agent for Debt Collection

Written by engineers who build voice AI systems for production. Not theory - architecture decisions, technology choices, and the tradeoffs nobody talks about.

6 layers, 5 critical design decisions, and an honest build-vs-buy analysis.

6
Architecture Layers
<800ms
Latency Target
5
Critical Decisions
6
Compliance Gates

The 6-Layer Architecture

Every production voice AI system has these six layers. Skip one and you have a demo, not a product. Each layer has its own latency budget, failure modes, and technology choices.

Layer 01

Telephony Layer

SIP Trunking, WebRTC & Number Provisioning

The foundation. Your AI needs a way to make and receive phone calls. This means SIP trunking with a CPaaS provider, WebRTC for browser-based testing, and programmatic number provisioning across area codes. You need media forking to send audio to your AI pipeline while maintaining the call. Modern CPaaS APIs expose raw audio via WebSocket through TeXML or Media Streams patterns - that is your entry point.

SIP trunkingCPaaS Media StreamsWebRTCRTP/SRTPWebSocket audio streams
Layer 02

Voice Activity Detection

Endpointing, Barge-in & Silence Detection

VAD determines when the debtor starts and stops speaking. Get this wrong and your AI either cuts people off or waits too long before responding. You need configurable silence thresholds (typically 500-800ms for collections), barge-in detection so the debtor can interrupt the AI mid-sentence, and background noise handling. Silero VAD is the open-source standard. Modern realtime voice APIs handle this internally with server-side VAD modes.

Silero VADWebRTC VADServer-side endpointingConfigurable silence thresholds
Layer 03

Speech-to-Text (ASR)

Streaming ASR, Batch Whisper-Class Models & Latency

Converting speech to text fast enough for real-time conversation. Latency is the constraint - batch transcription with Whisper-class models gives you accuracy but adds 1-3 seconds. Streaming ASR providers give you partial results in 200-400ms. For debt collection, you need high accuracy on names, account numbers, and dollar amounts. Streaming ASR is the industry standard for real-time voice AI. If you use a unified realtime voice API, ASR is built into the model.

Streaming ASRWhisper-class batch ASRCloud STT APIsDomain vocabulary tuningStreaming vs batch
Layer 04

LLM Orchestration

Unified Realtime Voice APIs vs Cascaded Pipeline

The brain. Three architectures: (1) Lowest-latency unified realtime voice API - audio-in, audio-out with native function calling, lowest latency (~300ms), but higher per-minute cost and limited voice options. (2) Native multimodal realtime API with interruption handling, good for high-volume. (3) Cascaded STT + LLM + TTS - most flexible, use any combination, but you own the latency budget. For collections, the LLM needs rock-solid prompt engineering: Mini-Miranda injection, debt context management, objection handling, and strict guardrails against promising outcomes the agent cannot deliver.

Unified realtime voice APIMultimodal realtime APIFrontier text LLMFunction callingPrompt engineering
Layer 05

Text-to-Speech (TTS)

Neural TTS, Voice Cloning & Latency

If you are running a cascaded pipeline, TTS is your final latency bottleneck. Premium neural TTS providers deliver the most natural voices with streaming output (~150ms to first byte). Big-cloud neural TTS engines are cheaper at scale. For collections, voice tone matters enormously - you need a voice that is professional, calm, and authoritative without being aggressive. Voice cloning can create consistent brand voices. If you use a unified realtime voice API, TTS is built into the model output.

Premium neural TTSCloud neural TTSVoice cloningFirst-byte streamingStreaming audio
Layer 06

Integration Layer

CRM APIs, Payment Processing & Compliance Engine

Where the AI connects to the real world. Function calling lets the LLM look up debtor accounts in your CRM mid-conversation, check payment status, initiate payment links, and update dispositions in real time. Your compliance engine sits here too - timezone checking, DNC lookup, consent verification, and frequency cap enforcement all happen before and during the call. This layer is what separates a demo from a production system.

REST/GraphQL APIsWebhook event systemPayment gateway SDKsCompliance rule engine
Where Projects Succeed or Fail

5 Design Decisions That Define Your System

These are not features on a roadmap. These are architectural choices you make in week one that determine whether your system works in production.

Streaming vs. Batch Processing

In debt collection calls, perceived latency directly affects debtor engagement. If your AI takes 2+ seconds to respond, the debtor hangs up. A cascaded pipeline (STT -> LLM -> TTS) with batch processing at each step accumulates 3-5 seconds of latency. Streaming changes the game: streaming ASR emits partial transcripts, the LLM generates token-by-token, and streaming TTS starts speaking from the first sentence fragment. Target: under 800ms from end of debtor speech to start of AI speech. Unified realtime voice APIs achieve ~300ms because the entire pipeline is one model.

Barge-in Handling

Debtors interrupt. They disagree, they ask questions mid-sentence, they get emotional. Your AI must handle this gracefully. When VAD detects the debtor speaking while the AI is outputting audio, you need to: (1) immediately stop TTS playback, (2) flush the current LLM generation, (3) capture what the debtor said, and (4) generate a contextually appropriate response that acknowledges the interruption. Unified realtime voice APIs handle this natively. With a cascaded pipeline, you need to manage the cancel/flush logic yourself across three services.

Context Window Management

Debt collection conversations require history. The AI needs to know: who the debtor is, what they owe, their payment history, previous call outcomes, any disputes or complaints, and current account status. That is a lot of context to fit into a prompt. Strategy: load a structured debtor profile at call start (500-1000 tokens), keep rolling conversation history (last 10-15 turns), and use function calling for on-demand CRM lookups rather than pre-loading everything. For multi-debt consumers, only load the relevant debt context - do not dump all accounts into the prompt.

Function Calling for Real-Time CRM Lookups

The LLM should not guess at account balances or payment deadlines. When the debtor asks "how much do I owe?" the AI calls a function that queries your CRM API in real time. This requires: (1) a well-defined function schema the LLM understands, (2) sub-second API response times from your CRM, (3) error handling when the CRM is slow or down, and (4) caching for repeated lookups in the same call. Modern realtime voice APIs all support native function calling. In a cascaded pipeline, you intercept the LLM output, detect function calls, execute them, and feed results back.

Conference Bridge Architecture

What happens when the AI needs to transfer to a human agent? A naive approach disconnects the AI and cold-transfers. A better approach: conference bridge. The AI stays on the line, briefs the human agent (either via whisper or a pre-call summary pushed to their screen), and then the three parties are on one call. The AI can continue to listen, take notes, fill CRM fields, and re-engage if needed. This requires SIP-level conference control - modern CPaaS call control APIs support this natively with conference commands. The architecture is three WebSocket sessions sharing one bridge.

Powered by industry-leading technology

OpenAIAnthropicAWSGoogleElevenLabsTelnyxOpenAIAnthropicAWSGoogleElevenLabsTelnyxOpenAIAnthropicAWSGoogleElevenLabsTelnyx
Non-Negotiable Infrastructure

Compliance Engine Architecture

Six subsystems that must exist before your AI makes a single collection call. These are not features you add later - they are pre-dial gates and in-call enforcement. Build them first or do not build at all.

Timezone Checking

Before dialing, resolve the debtor's time zone from their area code or address. Enforce FDCPA's 8 AM - 9 PM window in the consumer's local time. Account for daylight saving transitions. Use a timezone database (IANA/Olson) and carrier lookup API - never rely on the collector's local time. Block the call at the API level if outside permitted hours.

DNC Lookup

Scrub every number against the Federal Do Not Call Registry (refreshed every 31 days), your internal DNC list (immediate effect), state-specific DNC lists, and the FCC Reassigned Numbers Database. Implement this as a pre-dial check in your telephony layer - the call should never be initiated if the number is on any list. Cache results but set short TTLs for internal lists.

Consent Tracking

Maintain a per-consumer, per-channel, per-debt consent record. Store how consent was obtained (written, verbal, web form), when, the exact language used, and whether it has been revoked. The TCPA requires prior express consent for automated calls to cell phones. Your system must refuse to dial if consent status is missing, expired, or revoked. Treat this as a hard gate, not a soft warning.

Mini-Miranda Injection

The LLM's system prompt must include the Mini-Miranda disclosure as a non-negotiable first utterance: identify the company, state that the call is an attempt to collect a debt, and that any information obtained will be used for that purpose. This cannot be optional, skippable, or buried. Engineer the prompt so that the AI delivers this clearly and at natural pace before any other conversation. Validate in your test suite that it never gets dropped.

Recording & Transcription Pipeline

Record every call for compliance and QA. In all-party consent states (California, Florida, Illinois), disclose recording at call start. Store recordings encrypted (AES-256) with role-based access. Run async transcription (Deepgram or Whisper) after the call ends. Index transcripts for searchability - regulators will ask for specific calls. Retain per your policy, typically 3-7 years for debt collection. Build the pipeline to handle thousands of concurrent recordings.

Frequency Cap Enforcement

Regulation F limits collectors to 7 call attempts per debt within a rolling 7-day window. After a live conversation, no further calls for 7 days on that debt. Track attempts at the per-debt level, not per-consumer. Aggregate counts across all channels and agents - AI and human. Implement as a pre-dial database check: query attempt history, calculate rolling window, and block if at limit. This is a hard technical constraint, not a guideline.

Honest Analysis

Build vs. Buy vs. Custom-Built

There is no universally correct answer. The right choice depends on your engineering capacity, call volume, compliance requirements, and timeline. Here is the honest breakdown.

Build In-House

When it makes sense

Choose this when

  • You have a team of 5+ engineers with voice AI experience
  • You need deep customization that no platform supports
  • Call volume exceeds 100K+ minutes/month (cost savings justify investment)
  • You have regulatory requirements that demand full code ownership
  • You are building voice AI as a core competency, not a side feature

Tradeoffs

  • 6-12 months to production-ready MVP
  • $500K-$2M+ in engineering investment before first call
  • Ongoing maintenance: model updates, carrier changes, compliance patches
  • You own the latency budget - and the debugging when it breaks at 3 AM

Use a Platform (Retell, Vapi, Bland)

When it makes sense

Choose this when

  • You need to go live in weeks, not months
  • Your call flows are relatively standard (inbound/outbound, simple branching)
  • Volume is under 50K minutes/month
  • You do not need custom ASR models or specialized voice training
  • Your engineering team is small and focused on your core product

Tradeoffs

  • Per-minute pricing gets expensive at scale ($0.07-$0.15/min + carrier costs)
  • Limited control over latency, voice quality, and model behavior
  • Platform outages affect your operations directly
  • Compliance responsibility is still yours - the platform does not indemnify you

Custom-Built by Specialists (AInora Approach)

When it makes sense

Choose this when

  • You need production-quality voice AI without building an engineering team
  • Your use case requires deep CRM integration and custom compliance rules
  • You want to own the system but not build it from scratch
  • You need European-grade privacy (GDPR/EU AI Act) built into the architecture
  • You need ongoing optimization - not just deployment and handoff

Tradeoffs

  • Higher upfront cost than a platform, lower than building in-house
  • You depend on the builder for deep customization (though you own the code)
  • Timeline: 4-8 weeks to production vs 6-12 months in-house
  • You get opinionated architecture choices - which is usually a feature, not a bug
The Number That Matters Most

Your Latency Budget: 800ms

In a natural phone conversation, the gap between one person finishing and the other starting is roughly 200-500ms. Your AI gets slightly more leeway because callers expect automated systems to be a bit slower, but the ceiling is around 800ms. Beyond that, the conversation feels broken.

Unified realtime voice API~300ms
Multimodal realtime voice API~400-600ms
Cascaded (streaming ASR + frontier LLM + premium TTS)~700-1200ms
Cascaded with batch ASR~1500-3000ms

Engineering takeaway:

If you choose a cascaded pipeline, every millisecond counts. Use streaming at every stage - streaming ASR, streaming LLM generation, streaming TTS. Pre-warm your connections. Cache debtor profiles. Co-locate your services. The difference between 700ms and 1200ms is the difference between a natural conversation and a debtor who hangs up.

Hear the Architecture in Action

Talk to our AI voice agent and experience the 6-layer system live.

JessicaJessica·English

Click to start a conversation

Call our debt collection demo

24/7 live AI - call anytime

Technical FAQ

Common engineering questions about building AI voice agents for debt collection.

A unified realtime voice API achieves roughly 300ms end-to-end, multimodal realtime APIs sit at 400-600ms, and a cascaded streaming ASR + frontier LLM + premium TTS pipeline lands at 700-1200ms. For debt collection, anything under 800ms feels natural - over 1.5 seconds and debtors hang up.
A unified realtime voice API gives the lowest latency and simplest architecture with native barge-in, but locks you to one vendor at a higher per-minute cost. A cascaded pipeline offers full control and provider flexibility, but you own the latency budget and barge-in logic. Start with a unified API for speed to market, then evaluate cascaded for cost optimization at scale.
Barge-in requires coordinating VAD detection, immediate TTS stoppage, LLM generation flush, and re-prompting with interruption context. Unified realtime voice APIs handle this natively. In a cascaded pipeline, you must manage cancel/flush logic across three services yourself - it is one of the hardest engineering challenges in voice AI.
A handful of large CPaaS providers dominate. Look for: a call control API with native conference bridging, WebSocket audio streaming, programmatic number provisioning across area codes, and mature SIP trunking. For debt collection conference bridge architectures, a provider with first-class call control APIs is the stronger fit than one optimized for messaging.
Engineer the Mini-Miranda as an absolute first-utterance requirement at the top of your system prompt, above all other instructions. Run automated checks against every call transcript to verify delivery within the first 30 seconds. If the AI ever skips it, that is a prompt engineering bug.
A pre-dial and in-call middleware layer. Pre-dial gates: timezone check, DNC scrub, consent verification, and Regulation F frequency cap. In-call enforcement: Mini-Miranda injection, cease-and-desist detection, recording disclosure, and real-time logging. All are hard gates - if any check fails, the call does not happen.
Engineering MVP takes 6-12 months with 3-5 engineers ($500K-$2M). Per-minute COGS: $0.04-0.13 for cascaded, $0.06-0.10 for unified realtime voice APIs. Voice AI platforms run $0.07-0.15/min total. The build-vs-buy math flips around 100K minutes/month.
Yes, but with tradeoffs. Open-source ASR adds latency without GPU infrastructure, open LLMs lose native function calling reliability, and open TTS does not match premium neural TTS quality. Self-hosting requires $2-5K/month per GPU instance. Start with commercial APIs and move to open-source only at very high volume.
Skip the Build. Keep the Control.

We Already Built This System

Everything in this guide - the 6-layer architecture, the compliance engine, the barge-in handling, the conference bridge - is running in production today. Call the demo number and hear it for yourself.

JB
Justas Butkus

Founder & CEO, AInora

Building AI digital administrators that replace front-desk overhead for service businesses across Europe. Previously built voice AI systems for dental clinics, hotels, and restaurants.

View all articles