LLM-as-a-Judge: Evaluating AI SDR Quality Beyond Simple Rules
How to use LLM judges to evaluate AI SDR campaigns with human-like reasoning, going beyond rule-based metrics to assess prospect relevance, personalization, and professional tone.
Vicente Reig
Fractional Engineering Lead
You’ve built an AI system that finds prospects and writes sales emails. Now you need to evaluate the quality before sending. Simple keyword matching and rule-based metrics miss important nuances like tone, authenticity, and contextual relevance.
LLM-as-a-Judge uses one language model to evaluate another’s output before sending. Instead of hardcoded rules, you get contextual evaluation that considers nuance. DSPy.rb makes this approach straightforward to implement.
Why Current Evaluation Approaches Fall Short
Most AI SDR evaluation falls into two categories:
Rule-based evaluation relies on keyword matching and pattern detection:
# Rule-based approach - limited and brittle
def evaluate_personalization(email, prospect)
  score = 0.0
  score += 0.2 if email.include?(prospect.first_name)
  score += 0.3 if email.include?(prospect.company)
  score += 0.5 if email.match?(/recent|news|announcement/i)
  score
end
This approach misses crucial nuances:
- Context sensitivity: “John” appearing in an email might be genuine personalization or pure coincidence
- Authenticity: Does the personalization feel natural or forced?
- Professional tone: Is the language appropriate for the prospect’s seniority level?
- Cultural awareness: Does the approach match the prospect’s business culture?
Engagement-based evaluation measures post-send metrics (open rates, reply rates, conversions). Many AI SDR platforms, especially early-stage startups, rely heavily on these metrics:
# Engagement-based approach - reactive, not predictive
def evaluate_campaign_performance(campaign)
  {
    open_rate: campaign.opens.count.to_f / campaign.sends.count,
    reply_rate: campaign.replies.count.to_f / campaign.sends.count,
    positive_replies: campaign.positive_replies.count.to_f / campaign.replies.count
  }
end
The problem with engagement-only evaluation:
- Delayed feedback: You learn about quality issues after sending
- Sender reputation risk: Poor campaigns can damage deliverability
- Limited insight: Metrics don’t tell you why something didn’t work
- Volume dependency: Need significant send volume for statistical significance
The LLM Judge Approach
LLM judges can evaluate context and nuance that simple rules miss:
# Define structured types for better type safety
class TargetCriteria < T::Struct
  const :role, String
  const :company, String
  const :industry, String
  const :seniority_level, T.nilable(String)
  const :department, T.nilable(String)
end

class ProspectProfile < T::Struct
  const :first_name, String
  const :last_name, String
  const :title, String
  const :company, String
  const :industry, String
  const :linkedin_url, T.nilable(String)
  const :company_size, T.nilable(String)
end

class EmailCampaign < T::Struct
  const :subject, String
  const :body, String
  const :sender_name, String
  const :sender_email, String
  const :signature, T.nilable(String)
end

class SenderContext < T::Struct
  const :company, String
  const :value_proposition, String
  const :industry_focus, T.nilable(String)
  const :case_studies, T.nilable(T::Array[String])
end

# Judge evaluation result structures
class ProspectRelevanceEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class PersonalizationEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class ValuePropositionEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class ProfessionalismEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class ComplianceEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

# Send recommendation enum
class SendRecommendation < T::Enum
  enums do
    Send = new('SEND')
    Revise = new('REVISE')
    Reject = new('REJECT')
  end
end
class AISDRJudge < DSPy::Signature
  description "Evaluate AI-generated sales outreach like an experienced sales manager"

  input do
    const :target_criteria, TargetCriteria, description: "Target prospect requirements"
    const :prospect_profile, ProspectProfile, description: "Found prospect details"
    const :email_campaign, EmailCampaign, description: "Generated email subject and body"
    const :sender_context, SenderContext, description: "Sender company and value proposition"
  end

  output do
    const :prospect_relevance, ProspectRelevanceEvaluation, description: "Prospect-target fit assessment"
    const :personalization, PersonalizationEvaluation, description: "Email personalization evaluation"
    const :value_proposition, ValuePropositionEvaluation, description: "Value proposition assessment"
    const :professionalism, ProfessionalismEvaluation, description: "Professional tone evaluation"
    const :compliance, ComplianceEvaluation, description: "Legal and ethical compliance check"
    const :overall_quality_score, Float, description: "Overall campaign quality (0-1)"
    const :send_recommendation, SendRecommendation, description: "Final recommendation: SEND, REVISE, or REJECT"
  end
end
Building the LLM Judge Metric
The structured approach using T::Struct and T::Enum types provides several advantages:
# ✅ Type-safe access with structured reasoning
judgment.prospect_relevance.score     # Float (0.0-1.0)
judgment.prospect_relevance.reasoning # String with detailed explanation
judgment.send_recommendation          # SendRecommendation enum (SEND/REVISE/REJECT)

# ✅ Prevents common errors
case judgment.send_recommendation
when SendRecommendation::Send   # Compile-time type checking
when SendRecommendation::Revise # IDE autocomplete support
when SendRecommendation::Reject # No typos like "REJCT"
end

# ❌ Old string-based approach (error-prone)
if judgment.send_recommendation == "SEND" # Typo-prone
  # Could be "Send", "send", "SEND", etc.
end
Here’s how to implement an LLM judge as a DSPy.rb custom metric:
# Configure the LLM judge outside the metric for better performance
judge_lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])

judge = DSPy::ChainOfThought.new(AISDRJudge)
judge.configure do |c|
  c.lm = judge_lm
end
# Collect each full judgment so its reasoning can be inspected after evaluation
judgments = []

# LLM-as-a-Judge custom metric for AI SDR evaluation
ai_sdr_llm_judge_metric = ->(example, prediction) do
  return 0.0 unless prediction

  sdr_output = prediction.sdr_output
  campaign_request = example

  # Create structured inputs for the LLM judge
  target_criteria = TargetCriteria.new(
    role: campaign_request.target_role,
    company: campaign_request.target_company,
    industry: campaign_request.target_industry,
    seniority_level: campaign_request.seniority_level,
    department: campaign_request.department
  )

  # Use the structured prospect and email from the prediction
  prospect_profile = sdr_output.prospect
  email_campaign = sdr_output.email

  sender_context = SenderContext.new(
    company: sdr_output.sender_company,
    value_proposition: campaign_request.value_proposition,
    industry_focus: campaign_request.industry_focus,
    case_studies: campaign_request.case_studies
  )

  begin
    # Get comprehensive judgment from the LLM
    judgment = judge.call(
      target_criteria: target_criteria,
      prospect_profile: prospect_profile,
      email_campaign: email_campaign,
      sender_context: sender_context
    )

    weights = {
      prospect_relevance: 0.25, # Right person, right company
      personalization: 0.25,    # Authentic, not generic
      value_proposition: 0.20,  # Clear benefit
      professionalism: 0.15,    # Builds trust
      compliance: 0.15          # Legal/ethical standards
    }

    weighted_score = (
      judgment.prospect_relevance.score * weights[:prospect_relevance] +
      judgment.personalization.score * weights[:personalization] +
      judgment.value_proposition.score * weights[:value_proposition] +
      judgment.professionalism.score * weights[:professionalism] +
      judgment.compliance.score * weights[:compliance]
    )

    judgments << judgment
    return weighted_score
  rescue => e
    # Graceful fallback if the LLM judge fails - return a plain score so the
    # metric's return type stays consistent
    DSPy.logger.warn("LLM Judge evaluation failed: #{e.message}")
    0.0
  end
end
Real-World Example: Evaluating AI SDR Campaigns
Let’s see the LLM judge in action with a complete evaluation workflow:
# Define campaign request structure
class CampaignRequest < T::Struct
  const :target_role, String
  const :target_company, String
  const :target_industry, String
  const :value_proposition, String
  const :seniority_level, T.nilable(String)
  const :department, T.nilable(String)
  const :industry_focus, T.nilable(String)
  const :case_studies, T.nilable(T::Array[String])
end

# Complete SDR output structure
class SDRCampaignOutput < T::Struct
  const :prospect, ProspectProfile
  const :email, EmailCampaign
  const :sender_company, String
  const :confidence_score, Float
  const :reasoning, T.nilable(String)
end

# Define your AI SDR signature
class AISDRSignature < DSPy::Signature
  description "Generate targeted prospect and personalized email campaign"

  input do
    const :campaign_request, CampaignRequest
  end

  output do
    const :sdr_output, SDRCampaignOutput
  end
end
# Create SDR program and evaluator
sdr_program = DSPy::Predict.new(AISDRSignature)
evaluator = DSPy::Evaluate.new(sdr_program, metric: ai_sdr_llm_judge_metric)
# Test campaigns using structured input
test_campaigns = [
  CampaignRequest.new(
    target_role: "VP of Engineering",
    target_company: "TechCorp",
    target_industry: "Software",
    value_proposition: "Reduce deployment time by 40% with our DevOps platform",
    seniority_level: "Executive",
    department: "Engineering",
    industry_focus: "Enterprise Software",
    case_studies: ["TechStartup saved 60% deployment time", "Enterprise Corp reduced incidents by 80%"]
  ),
  CampaignRequest.new(
    target_role: "Head of Sales",
    target_company: "StartupCo",
    target_industry: "FinTech",
    value_proposition: "Increase lead conversion by 25% with AI-powered qualification",
    seniority_level: "Director",
    department: "Sales",
    industry_focus: "Financial Technology",
    case_studies: ["FinanceFlow increased conversions 35%", "PaymentCorp doubled qualified leads"]
  )
]
# Run comprehensive evaluation
result = evaluator.evaluate(test_campaigns, display_progress: true)
puts "🤖 LLM Judge Evaluation Results:"
puts "Overall Quality Score: #{(result.pass_rate * 100).round(1)}%"
puts "Campaigns Ready to Send: #{result.passed_examples}/#{result.total_examples}"
Analyzing Judge Feedback
The real power comes from the detailed reasoning the LLM judge provides:
# Analyze detailed feedback for campaign improvement
result.results.each_with_index do |campaign_result, i|
  campaign = test_campaigns[i]
  judgment = judgments[i] # collected by the metric in evaluation order

  puts "\n📧 Campaign #{i + 1}: #{campaign.target_role} at #{campaign.target_company}"
  puts "Status: #{campaign_result.passed? ? '✅ APPROVED' : '❌ NEEDS REVISION'}"
  next unless judgment

  puts "Recommendation: #{judgment.send_recommendation.serialize}"
  puts "\nDetailed Analysis:"
  puts "• Prospect Fit (#{judgment.prospect_relevance.score}): #{judgment.prospect_relevance.reasoning}"
  puts "• Personalization (#{judgment.personalization.score}): #{judgment.personalization.reasoning}"
  puts "• Value Proposition (#{judgment.value_proposition.score}): #{judgment.value_proposition.reasoning}"
  puts "• Professional Tone (#{judgment.professionalism.score}): #{judgment.professionalism.reasoning}"
  puts "• Compliance (#{judgment.compliance.score}): #{judgment.compliance.reasoning}"
end
Advanced: Multi-Judge Consensus
For critical campaigns, you can use multiple judges for consensus:
# Define specialized judges for different aspects
class ProspectRelevanceJudge < DSPy::Signature
  description "Evaluate prospect-to-target fit like a sales operations manager"
  # ... specialized inputs/outputs
end

class PersonalizationJudge < DSPy::Signature
  description "Evaluate email personalization like a copywriting expert"
  # ... specialized inputs/outputs
end
# Multi-judge consensus metric
multi_judge_consensus = ->(example, prediction) do
  judges = [
    DSPy::ChainOfThought.new(ProspectRelevanceJudge),
    DSPy::ChainOfThought.new(PersonalizationJudge),
    DSPy::ChainOfThought.new(AISDRJudge)
  ]

  # prepare_inputs builds the keyword arguments each judge expects (see the sketch below);
  # each judge signature is assumed to expose a numeric score output
  scores = judges.map { |judge| judge.call(**prepare_inputs(example, prediction)).score }

  # Require consensus (majority agreement)
  passing_scores = scores.count { |s| s >= 0.7 }
  consensus_threshold = judges.length / 2.0

  passing_scores > consensus_threshold ? scores.sum / scores.length : 0.0
end
Integration with Production Systems
Here’s how to integrate LLM judges into your SDR workflow using Sidekiq for background processing:
require 'sidekiq'

# Active Record model for campaign requests
class CampaignRequest < ApplicationRecord
  validates :target_role, :target_company, :target_industry, :value_proposition, presence: true

  enum status: { pending: 0, processing: 1, completed: 2, failed: 3 }

  # Convert AR model to DSPy struct for type safety
  def to_dspy_struct
    CampaignRequestStruct.new(
      target_role: target_role,
      target_company: target_company,
      target_industry: target_industry,
      value_proposition: value_proposition,
      seniority_level: seniority_level,
      department: department,
      industry_focus: industry_focus,
      case_studies: case_studies&.split(',') # Assuming comma-separated storage
    )
  end
end
# Rename the T::Struct to avoid naming collision
class CampaignRequestStruct < T::Struct
  const :target_role, String
  const :target_company, String
  const :target_industry, String
  const :value_proposition, String
  const :seniority_level, T.nilable(String)
  const :department, T.nilable(String)
  const :industry_focus, T.nilable(String)
  const :case_studies, T.nilable(T::Array[String])
end
class SDRCampaignProcessor
  include Sidekiq::Worker
  sidekiq_options queue: 'sdr_evaluation', retry: 3

  def perform(campaign_request_id)
    # Load from database
    ar_request = CampaignRequest.find(campaign_request_id)
    ar_request.update!(status: :processing)

    # Convert to type-safe struct for DSPy
    campaign_request = ar_request.to_dspy_struct

    # Generate campaign and judge quality in async context
    # DSPy's LM#chat uses Sync blocks internally for non-blocking I/O
    result = Async do |task|
      # Generate campaign (non-blocking)
      sdr_generator = DSPy::Predict.new(AISDRSignature)
      campaign = sdr_generator.call(campaign_request: campaign_request)

      # Judge quality (non-blocking)
      judgment = judge.call(
        target_criteria: build_target_criteria(campaign_request),
        prospect_profile: campaign.sdr_output.prospect,
        email_campaign: campaign.sdr_output.email,
        sender_context: build_sender_context(campaign_request, campaign)
      )

      { campaign: campaign, judgment: judgment }
    end.wait # Wait for completion before the worker finishes

    campaign = result[:campaign]
    judgment = result[:judgment]

    # Route based on judgment
    case judgment.send_recommendation
    when SendRecommendation::Send
      ar_request.update!(status: :completed)
      EmailSender.perform_async(campaign.sdr_output.to_h)
      log_approved_campaign(campaign, judgment)
    when SendRecommendation::Revise
      HumanReviewWorker.perform_async(campaign.sdr_output.to_h, judgment.to_h)
    when SendRecommendation::Reject
      ar_request.update!(status: :failed)
      log_rejected_campaign(campaign, judgment)
    end
  rescue => e
    ar_request.update!(status: :failed, error_message: e.message)
    raise # Let Sidekiq handle retry logic
  end

  private

  def judge
    @judge ||= begin
      judge_lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
      judge = DSPy::ChainOfThought.new(AISDRJudge)
      judge.configure { |c| c.lm = judge_lm }
      judge
    end
  end

  def build_target_criteria(campaign_request)
    TargetCriteria.new(
      role: campaign_request.target_role,
      company: campaign_request.target_company,
      industry: campaign_request.target_industry,
      seniority_level: campaign_request.seniority_level,
      department: campaign_request.department
    )
  end

  def build_sender_context(campaign_request, campaign)
    SenderContext.new(
      company: campaign.sdr_output.sender_company,
      value_proposition: campaign_request.value_proposition,
      industry_focus: campaign_request.industry_focus,
      case_studies: campaign_request.case_studies
    )
  end
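
  # Hypothetical logging helpers (not defined in the original worker) - a minimal
  # sketch assuming you only want structured log lines for observability.
  def log_approved_campaign(campaign, judgment)
    DSPy.logger.info(
      "Campaign approved: #{campaign.sdr_output.email.subject} " \
      "(score: #{judgment.overall_quality_score})"
    )
  end

  def log_rejected_campaign(campaign, judgment)
    DSPy.logger.warn(
      "Campaign rejected: #{campaign.sdr_output.email.subject} " \
      "(compliance: #{judgment.compliance.reasoning})"
    )
  end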
end
# Separate workers for different actions
class EmailSender
  include Sidekiq::Worker
  sidekiq_options queue: 'email_sending'

  def perform(email_data)
    # Send via SendGrid, Mailgun, etc.
    email_service = EmailService.new
    email_service.send_campaign(email_data)
  end
end

class HumanReviewWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'human_review'

  def perform(campaign_data, judgment_data)
    # Queue for human review with AI feedback
    ReviewDashboard.add_campaign_for_review(campaign_data, judgment_data)
  end
end
# Usage: Process campaign requests asynchronously
def process_campaign_batch(campaign_request_ids)
  campaign_request_ids.each do |request_id|
    SDRCampaignProcessor.perform_async(request_id)
  end
end

# Example: Create and process campaign requests
campaign_requests = [
  CampaignRequest.create!(
    target_role: "VP Engineering",
    target_company: "TechCorp",
    target_industry: "Software",
    value_proposition: "Reduce deployment time by 40%",
    status: :pending
  ),
  CampaignRequest.create!(
    target_role: "Head of Sales",
    target_company: "StartupCo",
    target_industry: "FinTech",
    value_proposition: "Increase lead conversion by 25%",
    status: :pending
  )
]

# Queue for background processing
process_campaign_batch(campaign_requests.map(&:id))
Concurrent Judge Evaluation
You can even run multiple judges in parallel:
def perform(campaign_id)
  # ... setup code

  Async do |task|
    # Generate campaign first
    campaign = sdr_generator.call(campaign_request: campaign_request)

    # Run multiple judges concurrently
    # campaign_inputs is a hash of the judges' keyword inputs (built e.g. via prepare_inputs above)
    relevance_task = task.async { relevance_judge.call(**campaign_inputs) }
    compliance_task = task.async { compliance_judge.call(**campaign_inputs) }
    personalization_task = task.async { personalization_judge.call(**campaign_inputs) }

    # Wait for all judgments
    relevance = relevance_task.wait
    compliance = compliance_task.wait
    personalization = personalization_task.wait

    # Combine results
    final_decision = combine_judgments(relevance, compliance, personalization)
    process_final_decision(ar_request, campaign, final_decision)
  end.wait
end
This approach significantly improves throughput: instead of blocking for 6+ seconds while the judges run one after another, they evaluate concurrently and your workers spend far less time idle.
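The combine_judgments helper isn't defined above; how you merge the verdicts is up to you. Here's a minimal sketch that assumes each specialized judge exposes a numeric score output, applies a compliance veto, and maps a weighted average onto the SendRecommendation enum:
# Hypothetical combiner: a compliance failure vetoes sending outright;
# otherwise a weighted average of the judges decides send vs. revise vs. reject.
def combine_judgments(relevance, compliance, personalization)
  return SendRecommendation::Reject if compliance.score < 0.5

  combined = (relevance.score * 0.4) +
             (compliance.score * 0.2) +
             (personalization.score * 0.4)

  if combined >= 0.7
    SendRecommendation::Send
  elsif combined >= 0.5
    SendRecommendation::Revise
  else
    SendRecommendation::Reject
  end
end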
Best Practices
1. Judge Calibration
Regularly validate your judges against human evaluations:
# Compare LLM judge to human ratings
human_ratings = load_human_evaluations
llm_ratings = campaigns.map do |campaign_inputs|
  # campaign_inputs: the judge's keyword inputs (target criteria, prospect, email, sender context)
  judge.call(**campaign_inputs).overall_quality_score
end

correlation = calculate_correlation(human_ratings, llm_ratings)
puts "Judge-Human correlation: #{correlation}" # Aim for > 0.8
2. Prompt Engineering for Judges
Craft judge descriptions that reflect your quality standards:
class CalibratedSDRJudge < DSPy::Signature
  description <<~DESC
    You are an experienced B2B sales manager evaluating AI-generated outreach campaigns.

    Your standards:
    - Personalization should feel authentic, not templated
    - Value propositions must be specific and quantifiable
    - Professional tone builds trust without being overly formal
    - Compliance includes CAN-SPAM, GDPR considerations

    Be critical but fair. A score of 0.7+ means ready to send.
  DESC

  # ... rest of signature
end
3. Continuous Improvement
Use judge feedback to improve your SDR system:
# Analyze common failure patterns in the collected judgments
def analyze_judge_feedback(judgments)
  rejected = judgments.select { |j| j.send_recommendation == SendRecommendation::Reject }

  puts "Common Issues:"
  rejected.each do |judgment|
    puts "- #{judgment.personalization.reasoning}"
  end
end
Key Takeaways
LLM-as-a-Judge offers an alternative to rigid rule-based evaluation:
- Natural language reasoning can assess subjective qualities like tone
- Detailed feedback provides specific suggestions for improvement
- Consistent evaluation applies the same criteria across all campaigns
- Contextual assessment adapts to different industries and communication styles
This approach requires calibration and ongoing monitoring. When implemented carefully, it can help improve campaign quality and reduce manual review overhead.
Beyond Manual Judge Configuration: Optimization
While this article shows how to manually craft judge signatures and configure evaluation criteria, you don’t have to write these prompts by hand. DSPy.rb’s optimization framework can automatically improve both your SDR generator AND your judge prompts.
Instead of manually tuning the AISDRJudge signature description and examples, you can:
# Let DSPy optimize your judge prompts automatically
judge_optimizer = DSPy::Teleprompt::MIPROv2.new(
  metric: human_validation_metric # Use human ratings as ground truth
)

optimized_judge = judge_optimizer.compile(
  DSPy::ChainOfThought.new(AISDRJudge),
  trainset: human_rated_campaigns
)

# The optimized judge often performs better than hand-crafted prompts
puts "Optimized judge accuracy: #{optimized_judge.evaluation_score}"
This is especially powerful for complex evaluation tasks where the optimal prompt isn’t obvious. The optimization process discovers better ways to instruct the judge, often finding prompt improvements humans miss.
For high-stakes evaluation like compliance or legal review, consider optimizing your judges against human expert ratings to ensure they align with professional standards.
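The human_validation_metric passed to the optimizer above is something you define yourself. One simple option, sketched here under the assumption that each hand-rated example carries a human_score field between 0 and 1, rewards the judge whenever its overall score lands close to the human rating:
# Hypothetical metric: the judge "passes" an example when its overall score
# falls within 0.15 of the stored human rating (human_score is an assumed field).
human_validation_metric = ->(example, prediction) do
  return 0.0 unless prediction

  (prediction.overall_quality_score - example.human_score).abs <= 0.15 ? 1.0 : 0.0
end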
Ready to implement LLM judges in your AI SDR pipeline? Start with the examples above, then explore prompt optimization to let the machine improve your evaluation prompts automatically.
Want to dive deeper into DSPy.rb’s evaluation capabilities? Check out our comprehensive evaluation guide and custom metrics documentation.