Quality Review

Quality Review automatically evaluates your production calls against custom criteria. Define what a successful call looks like, run evaluations, and track quality trends over time.

Concepts

  • QA Config — a named evaluation setup: which calls to evaluate (cohort) plus how to score them (criteria)
  • AI Criteria — custom prompts evaluated by AI on each call transcript
  • Performance Metrics — built-in quantitative thresholds (latency, engagement, etc.)
  • Evaluation — the result of scoring a single call against all criteria
  • Calibration — manual override of AI scores by a human reviewer

Creating a QA Config

The creation wizard guides you through 3 steps:

Step 1: Define the Cohort

Choose which calls to evaluate:
  • Cohort Name — a descriptive name (e.g. “Support Calls - Weekly QA”)
  • Agents — select specific agents, or leave empty for all agents
  • Rolling Period — how many days back to look (e.g. last 7 days)
  • Sampling % — percentage of matching calls to evaluate (controls cost)
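
The sampling step amounts to a random draw from the filtered call list. A minimal sketch, assuming the matching calls are already queried (function and parameter names are illustrative, not the product's actual code):

```python
import random

def sample_cohort(calls, sampling_pct, weekly_cap=None):
    # Pick a random subset of matching calls according to Sampling %,
    # with an optional weekly cap to bound evaluation cost.
    k = round(len(calls) * sampling_pct / 100)
    if weekly_cap is not None:
        k = min(k, weekly_cap)
    return random.sample(calls, k)

# e.g. evaluate roughly 30% of 200 matching calls -> 60 calls
picked = sample_cohort(list(range(200)), sampling_pct=30)
```

A lower sampling percentage trades coverage for cost: fewer calls are scored, but trends remain visible as long as the sample is random.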

Step 2: Define Resolution Criteria

Two types of criteria:

AI Evaluated Conditions

Custom prompts evaluated by AI on each call transcript. Each condition has:
  • Name — short identifier (e.g. “Call resolved”)
  • Prompt — detailed description for the LLM evaluator (e.g. “The AI agent was able to fully resolve the user’s query without needing to transfer”)
  • Weight — relative importance (1-10)
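
As a rough illustration, a single AI evaluated condition could be represented like this (the field names are assumptions for the sketch, not the product's schema; real conditions are created through the wizard):

```python
# Hypothetical representation of one AI evaluated condition.
criterion = {
    "name": "Call resolved",
    "prompt": ("The AI agent was able to fully resolve the user's query "
               "without needing to transfer"),
    "weight": 7,  # relative importance on the 1-10 scale
}
```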

Performance Metrics

Built-in quantitative metrics with configurable thresholds:
  • LLM Latency — average response time (example threshold: < 1000ms)
  • TTS Latency — voice synthesis time (example threshold: < 500ms)
  • Call Duration — total call length (example threshold: > 30s)
  • Interactions — number of exchanges (example threshold: > 3)
  • Engagement — whether the caller was engaged (example threshold: = true)
  • Transfer Rate — whether the call was transferred (example threshold: = false)
  • LLM Tokens — total tokens consumed (example threshold: < 50000)
  • TTS Cache Hit Rate — percentage of TTS responses served from cache (example threshold: > 80%)
Each metric uses an operator (less than, greater than, etc.) and a threshold value.
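
The operator-plus-threshold check is simple to picture. A minimal sketch, assuming a small fixed operator set (names are illustrative, not the product's actual code):

```python
import operator

# Map each configurable operator to a comparison function.
OPS = {"<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge, "=": operator.eq}

def metric_passes(value, op, threshold):
    # A metric passes when the call's measured value satisfies
    # the configured operator/threshold pair.
    return OPS[op](value, threshold)

metric_passes(850, "<", 1000)  # LLM latency of 850ms vs "< 1000ms" -> True
metric_passes(25, ">", 30)     # a 25s call vs "> 30s" duration -> False
```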

Step 3: Review and Create

Review your configuration and click Save & Run QA to create the config and immediately run the first evaluation.

Running Evaluations

Click Run Evaluation on any QA config to evaluate new calls that match the cohort filters. The system:
  1. Queries calls matching the cohort filters (agents, date range, duration, call analysis fields)
  2. Excludes already-evaluated calls
  3. Applies sampling (percentage and weekly cap)
  4. For each call, sends the transcript to the AI evaluator for criteria scoring
  5. Computes performance metric pass/fail from call data
  6. Calculates weighted overall score
  7. Stores results
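
Step 6 above (the weighted overall score) amounts to a weight-normalized average of the per-criterion scores. A minimal sketch, with dict keys and criterion names assumed for illustration:

```python
def overall_score(criterion_results):
    # Weighted average of per-criterion scores (0-10), using each
    # criterion's configured weight (1-10).
    total_weight = sum(c["weight"] for c in criterion_results)
    if total_weight == 0:
        return 0.0
    return sum(c["score"] * c["weight"] for c in criterion_results) / total_weight

results = [
    {"name": "Call resolved", "score": 8, "weight": 5},
    {"name": "No unnecessary transfer", "score": 10, "weight": 2},
]
overall_score(results)  # (8*5 + 10*2) / 7, roughly 8.57
```

Higher-weight criteria pull the overall score harder, which is why weights should reflect how much each criterion matters to you.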

Dashboard

The dashboard provides a comprehensive view of quality metrics:

KPIs (10+)

Calls Analyzed, Average Score, Pass Rate, Resolution Rate, Failed Count, LLM Latency, TTS Latency, Average Duration, Engagement Rate, Transfer Rate.

Charts

  • Score & Resolution Trend — area chart showing score and pass rate over time
  • Score by Agent — horizontal bar chart comparing agents
  • AI Criteria Scores — progress bars showing average score per criterion with pass rate
  • Performance Metrics — cards showing average value and pass rate per metric

Evaluated Calls

The Evaluated Calls tab lists all scored calls. Each entry shows:
  • Pass/fail status with overall score
  • Expandable detail with:
    • AI Evaluation — per-criterion score (0-10) with explanation
    • Performance Metrics — actual value vs threshold with pass/fail
    • Call Metrics — duration, latency, interactions, engagement, transfer status

Calibration

Calibration allows human reviewers to override AI scores. This is useful for handling edge cases the AI misjudges and improving evaluation accuracy over time.
Use the calibration API to manually mark individual criteria as passed or failed, with optional notes explaining the override.
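
Conceptually, a calibration override replaces the AI's verdict for one criterion with the reviewer's. A sketch of the effect (the dict layout and field names are assumptions for illustration, not the calibration API's actual schema):

```python
def apply_calibration(evaluation, criterion_name, passed, note=None):
    # Replace the AI's pass/fail for the named criterion with the
    # human reviewer's verdict, recording that it was calibrated.
    for c in evaluation["criteria"]:
        if c["name"] == criterion_name:
            c["passed"] = passed
            c["calibrated"] = True
            if note:
                c["calibration_note"] = note
    return evaluation

evaluation = {"criteria": [{"name": "Call resolved", "passed": False}]}
apply_calibration(evaluation, "Call resolved", True,
                  note="Agent resolved the issue via a scheduled callback")
```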

Best Practices

  • Start with 3-5 AI criteria covering your most important quality dimensions
  • Use performance metrics for objective, measurable thresholds
  • Set sampling to 20-50% initially to control evaluation costs
  • Review failed calls to identify patterns and improve agent prompts
  • Calibrate regularly to catch cases where the AI evaluator is wrong