
LLM-as-a-Judge: Evaluating AI SDR Quality Beyond Simple Rules

How to use LLM judges to evaluate AI SDR campaigns with human-like reasoning, going beyond rule-based metrics to assess prospect relevance, personalization, and professional tone.

Vicente Reig

Fractional Engineering Lead

You’ve built an AI system that finds prospects and writes sales emails. Now you need to evaluate the quality before sending. Simple keyword matching and rule-based metrics miss important nuances like tone, authenticity, and contextual relevance.

LLM-as-a-Judge uses one language model to evaluate another’s output before sending. Instead of hardcoded rules, you get contextual evaluation that considers nuance. DSPy.rb makes this approach straightforward to implement.

Why Current Evaluation Approaches Fall Short

Most AI SDR evaluation falls into two categories:

Rule-based evaluation relies on keyword matching and pattern detection:

# Rule-based approach - limited and brittle
def evaluate_personalization(email, prospect)
  score = 0.0
  score += 0.2 if email.include?(prospect.first_name)
  score += 0.3 if email.include?(prospect.company)
  score += 0.5 if email.match?(/recent|news|announcement/i)
  score
end

This approach misses crucial nuances:

  • Context sensitivity: “John” might be personalized or just a coincidence
  • Authenticity: Does the personalization feel natural or forced?
  • Professional tone: Is the language appropriate for the prospect’s seniority level?
  • Cultural awareness: Does the approach match the prospect’s business culture?

Engagement-based evaluation measures post-send metrics (open rates, reply rates, conversions). Many AI SDR platforms, especially early-stage startups, rely heavily on these metrics:

# Engagement-based approach - reactive, not predictive
def evaluate_campaign_performance(campaign)
  {
    open_rate: campaign.opens.count.to_f / campaign.sends.count,
    reply_rate: campaign.replies.count.to_f / campaign.sends.count,
    positive_replies: campaign.positive_replies.count.to_f / campaign.replies.count
  }
end

The problem with engagement-only evaluation:

  • Delayed feedback: You learn about quality issues after sending
  • Sender reputation risk: Poor campaigns can damage deliverability
  • Limited insight: Metrics don’t tell you why something didn’t work
  • Volume dependency: You need significant send volume before the numbers are statistically meaningful (see the quick check below)
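
To put numbers on that last point, the standard error of an observed rate shrinks only with the square root of the send count, so small batches produce estimates too noisy to act on. A quick plain-Ruby check:

# Rough uncertainty of an observed reply rate: sqrt(p * (1 - p) / n)
def reply_rate_standard_error(replies, sends)
  p = replies.to_f / sends
  Math.sqrt(p * (1 - p) / sends)
end

puts reply_rate_standard_error(5, 100)    # => ~0.022 (5% ± 2.2pp — too noisy to compare variants)
puts reply_rate_standard_error(50, 1000)  # => ~0.007 (5% ± 0.7pp — differences start to mean something)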

The LLM Judge Approach

LLM judges can evaluate context and nuance that simple rules miss:

# Define structured types for better type safety
class TargetCriteria < T::Struct
  const :role, String
  const :company, String 
  const :industry, String
  const :seniority_level, T.nilable(String)
  const :department, T.nilable(String)
end

class ProspectProfile < T::Struct
  const :first_name, String
  const :last_name, String
  const :title, String
  const :company, String
  const :industry, String
  const :linkedin_url, T.nilable(String)
  const :company_size, T.nilable(String)
end

class EmailCampaign < T::Struct
  const :subject, String
  const :body, String
  const :sender_name, String
  const :sender_email, String
  const :signature, T.nilable(String)
end

class SenderContext < T::Struct
  const :company, String
  const :value_proposition, String
  const :industry_focus, T.nilable(String)
  const :case_studies, T.nilable(T::Array[String])
end

# Judge evaluation result structures
class ProspectRelevanceEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class PersonalizationEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class ValuePropositionEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class ProfessionalismEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

class ComplianceEvaluation < T::Struct
  const :score, Float
  const :reasoning, String
end

# Send recommendation enum
class SendRecommendation < T::Enum
  enums do
    Send = new('SEND')
    Revise = new('REVISE') 
    Reject = new('REJECT')
  end
end

class AISDRJudge < DSPy::Signature
  description "Evaluate AI-generated sales outreach like an experienced sales manager"
  
  input do
    const :target_criteria, TargetCriteria, description: "Target prospect requirements"
    const :prospect_profile, ProspectProfile, description: "Found prospect details" 
    const :email_campaign, EmailCampaign, description: "Generated email subject and body"
    const :sender_context, SenderContext, description: "Sender company and value proposition"
  end
  
  output do
    const :prospect_relevance, ProspectRelevanceEvaluation, description: "Prospect-target fit assessment"
    const :personalization, PersonalizationEvaluation, description: "Email personalization evaluation"
    const :value_proposition, ValuePropositionEvaluation, description: "Value proposition assessment" 
    const :professionalism, ProfessionalismEvaluation, description: "Professional tone evaluation"
    const :compliance, ComplianceEvaluation, description: "Legal and ethical compliance check"
    const :overall_quality_score, Float, description: "Overall campaign quality (0-1)"
    const :send_recommendation, SendRecommendation, description: "Final recommendation: SEND, REVISE, or REJECT"
  end
end

Building the LLM Judge Metric

The structured approach using T::Structs and T::Enums provides several advantages:

# ✅ Type-safe access with structured reasoning
judgment.prospect_relevance.score      # Float (0.0-1.0)
judgment.prospect_relevance.reasoning  # String with detailed explanation
judgment.send_recommendation          # SendRecommendation enum (SEND/REVISE/REJECT)

# ✅ Prevents common errors
case judgment.send_recommendation
when SendRecommendation::Send    # Compile-time type checking
when SendRecommendation::Revise  # IDE autocomplete support
when SendRecommendation::Reject  # No typos like "REJCT"
end

# ❌ Old string-based approach (error-prone)
if judgment.send_recommendation == "SEND"  # Typo-prone
  # Could be "Send", "send", "SEND", etc.
end
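
Because SendRecommendation is a T::Enum, the strings the model produces map onto enum values through Sorbet's standard serialization helpers:

SendRecommendation.deserialize('SEND')       # => SendRecommendation::Send
SendRecommendation::Revise.serialize         # => "REVISE"
SendRecommendation.try_deserialize('REJCT')  # => nil — invalid values surface immediately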

Here’s how to implement an LLM judge as a DSPy.rb custom metric:

# Configure the LLM judge outside the metric for better performance
judge_lm = DSPy::LM.new('openai/gpt-4o-mini', 
                        api_key: ENV['OPENAI_API_KEY']) 

judge = DSPy::ChainOfThought.new(AISDRJudge)
judge.configure do |c|
  c.lm = judge_lm
end

judgements = []

# LLM-as-a-Judge custom metric for AI SDR evaluation
ai_sdr_llm_judge_metric = ->(example, prediction) do
  return 0.0 unless prediction
  
  sdr_output = prediction.sdr_output
  campaign_request = example
  
  # Create structured inputs for the LLM judge
  target_criteria = TargetCriteria.new(
    role: campaign_request.target_role,
    company: campaign_request.target_company,
    industry: campaign_request.target_industry,
    seniority_level: campaign_request.seniority_level,
    department: campaign_request.department
  )
  
  # Use the structured prospect and email from prediction
  prospect_profile = sdr_output.prospect
  email_campaign = sdr_output.email
  
  sender_context = SenderContext.new(
    company: sdr_output.sender_company,
    value_proposition: campaign_request.value_proposition,
    industry_focus: campaign_request.industry_focus,
    case_studies: campaign_request.case_studies
  )
  
  begin
    # Get comprehensive judgment from LLM
    judgment = judge.call(
      target_criteria: target_criteria,
      prospect_profile: prospect_profile, 
      email_campaign: email_campaign,
      sender_context: sender_context
    )
    
    weights = {
      prospect_relevance: 0.25,  # Right person, right company
      personalization: 0.25,     # Authentic, not generic
      value_proposition: 0.20,   # Clear benefit
      professionalism: 0.15,     # Builds trust
      compliance: 0.15           # Legal/ethical standards
    }
    
    weighted_score = (
      judgment.prospect_relevance.score * weights[:prospect_relevance] +
      judgment.personalization.score * weights[:personalization] +
      judgment.value_proposition.score * weights[:value_proposition] +
      judgment.professionalism.score * weights[:professionalism] +
      judgment.compliance.score * weights[:compliance]
    )
    
    judgements << judgment
    
    return weighted_score
  rescue => e
    # Graceful fallback if the LLM judge fails: log and score the campaign 0.0
    DSPy.logger.warn("LLM Judge evaluation failed: #{e.message}")
    0.0
  end
end
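
If you prefer a hard pass/fail signal over a raw float, you can wrap the metric with a threshold and the judge's own verdict. A minimal sketch; the 0.7 bar is an assumption to tune against human review:

# Pass/fail wrapper: approve only above a quality bar, and never on a REJECT verdict.
ai_sdr_pass_fail_metric = ->(example, prediction) do
  score = ai_sdr_llm_judge_metric.call(example, prediction)
  judgment = judgements.last  # most recently recorded judgment (the one for this prediction when the judge succeeded)

  score >= 0.7 &&
    !judgment.nil? &&
    judgment.send_recommendation != SendRecommendation::Reject
end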

Real-World Example: Evaluating AI SDR Campaigns

Let’s see the LLM judge in action with a complete evaluation workflow:

# Define campaign request structure
class CampaignRequest < T::Struct
  const :target_role, String
  const :target_company, String
  const :target_industry, String
  const :value_proposition, String
  const :seniority_level, T.nilable(String)
  const :department, T.nilable(String)
  const :industry_focus, T.nilable(String)
  const :case_studies, T.nilable(T::Array[String])
end

# Complete SDR output structure
class SDRCampaignOutput < T::Struct
  const :prospect, ProspectProfile
  const :email, EmailCampaign
  const :sender_company, String
  const :confidence_score, Float
  const :reasoning, T.nilable(String)
end

# Define your AI SDR signature
class AISDRSignature < DSPy::Signature
  description "Generate targeted prospect and personalized email campaign"
  
  input do
    const :campaign_request, CampaignRequest
  end
  
  output do
    const :sdr_output, SDRCampaignOutput
  end
end

# Create SDR program and evaluator
sdr_program = DSPy::Predict.new(AISDRSignature)
evaluator = DSPy::Evaluate.new(sdr_program, metric: ai_sdr_llm_judge_metric)

# Test campaigns using structured input
test_campaigns = [
  CampaignRequest.new(
    target_role: "VP of Engineering",
    target_company: "TechCorp",
    target_industry: "Software", 
    value_proposition: "Reduce deployment time by 40% with our DevOps platform",
    seniority_level: "Executive",
    department: "Engineering",
    industry_focus: "Enterprise Software",
    case_studies: ["TechStartup saved 60% deployment time", "Enterprise Corp reduced incidents by 80%"]
  ),
  CampaignRequest.new(
    target_role: "Head of Sales",
    target_company: "StartupCo",
    target_industry: "FinTech",
    value_proposition: "Increase lead conversion by 25% with AI-powered qualification",
    seniority_level: "Director",
    department: "Sales",
    industry_focus: "Financial Technology",
    case_studies: ["FinanceFlow increased conversions 35%", "PaymentCorp doubled qualified leads"]
  )
]

# Run comprehensive evaluation
result = evaluator.evaluate(test_campaigns, display_progress: true)

puts "🤖 LLM Judge Evaluation Results:"
puts "Overall Quality Score: #{(result.pass_rate * 100).round(1)}%"
puts "Campaigns Ready to Send: #{result.passed_examples}/#{result.total_examples}"

Analyzing Judge Feedback

The real power comes from the detailed reasoning the LLM judge provides:

# Analyze detailed feedback for campaign improvement
result.results.each_with_index do |campaign_result, i|
  campaign = test_campaigns[i]
  
  puts "\n📧 Campaign #{i+1}: #{campaign.target_role} at #{campaign.target_company}"
  puts "Status: #{campaign_result.passed? ? '✅ APPROVED' : '❌ NEEDS REVISION'}"
  
  # Access the judgment the metric recorded for this campaign
  judgment = judgements[i]
  if judgment
    puts "Recommendation: #{judgment.send_recommendation.serialize}"
    puts "\nDetailed Analysis:"
    puts "• Prospect Fit (#{judgment.prospect_relevance.score}): #{judgment.prospect_relevance.reasoning}"
    puts "• Personalization (#{judgment.personalization.score}): #{judgment.personalization.reasoning}"
    puts "• Value Proposition (#{judgment.value_proposition.score}): #{judgment.value_proposition.reasoning}"
    puts "• Professional Tone (#{judgment.professionalism.score}): #{judgment.professionalism.reasoning}"
    puts "• Compliance (#{judgment.compliance.score}): #{judgment.compliance.reasoning}"
  end
end
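
Because the metric collects every judgment into judgements, you can also average each dimension across the batch to spot systematic weaknesses. A small sketch using the structs defined earlier:

# Average each judged dimension across the batch
dimensions = %i[prospect_relevance personalization value_proposition professionalism compliance]

if judgements.any?
  puts "\n📊 Average scores across #{judgements.size} campaigns:"
  dimensions.each do |dimension|
    average = judgements.sum { |j| j.public_send(dimension).score } / judgements.size.to_f
    puts "• #{dimension}: #{average.round(2)}"
  end
end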

Advanced: Multi-Judge Consensus

For critical campaigns, you can use multiple judges for consensus:

# Define specialized judges for different aspects
class ProspectRelevanceJudge < DSPy::Signature
  description "Evaluate prospect-to-target fit like a sales operations manager"
  # ... specialized inputs/outputs
end

class PersonalizationJudge < DSPy::Signature  
  description "Evaluate email personalization like a copywriting expert"
  # ... specialized inputs/outputs
end

# Multi-judge consensus metric
multi_judge_consensus = ->(example, prediction) do
  judges = [
    DSPy::ChainOfThought.new(ProspectRelevanceJudge),
    DSPy::ChainOfThought.new(PersonalizationJudge),
    DSPy::ChainOfThought.new(AISDRJudge)
  ]
  
  # prepare_inputs (sketched after this block) builds each judge's keyword inputs;
  # this assumes every judge exposes a numeric `score` output
  scores = judges.map { |judge| judge.call(**prepare_inputs(example, prediction)).score }
  
  # Require consensus (majority agreement)
  passing_scores = scores.count { |s| s >= 0.7 }
  consensus_threshold = judges.length / 2.0
  
  passing_scores > consensus_threshold ? scores.sum / scores.length : 0.0
end
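
The prepare_inputs helper is left open above; one possible shape, reusing the struct builders from the single-judge metric (hypothetical — slice the hash down to whatever each specialized judge's signature actually accepts):

# Hypothetical helper: build the keyword arguments the judges expect.
def prepare_inputs(example, prediction)
  sdr_output = prediction.sdr_output
  {
    target_criteria: TargetCriteria.new(
      role: example.target_role,
      company: example.target_company,
      industry: example.target_industry,
      seniority_level: example.seniority_level,
      department: example.department
    ),
    prospect_profile: sdr_output.prospect,
    email_campaign: sdr_output.email,
    sender_context: SenderContext.new(
      company: sdr_output.sender_company,
      value_proposition: example.value_proposition,
      industry_focus: example.industry_focus,
      case_studies: example.case_studies
    )
  }
end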

Integration with Production Systems

Here’s how to integrate LLM judges into your SDR workflow using Sidekiq for background processing:

require 'sidekiq'
require 'async'  # the worker below uses Async blocks directly

# Active Record model for campaign requests
class CampaignRequest < ApplicationRecord
  validates :target_role, :target_company, :target_industry, :value_proposition, presence: true
  
  enum status: { pending: 0, processing: 1, completed: 2, failed: 3 }
  
  # Convert AR model to DSPy struct for type safety
  def to_dspy_struct
    CampaignRequestStruct.new(
      target_role: target_role,
      target_company: target_company,
      target_industry: target_industry,
      value_proposition: value_proposition,
      seniority_level: seniority_level,
      department: department,
      industry_focus: industry_focus,
      case_studies: case_studies&.split(',') # Assuming comma-separated storage
    )
  end
end

# The CampaignRequest T::Struct from earlier, renamed to avoid colliding with the ActiveRecord model
class CampaignRequestStruct < T::Struct
  const :target_role, String
  const :target_company, String
  const :target_industry, String
  const :value_proposition, String
  const :seniority_level, T.nilable(String)
  const :department, T.nilable(String)
  const :industry_focus, T.nilable(String)
  const :case_studies, T.nilable(T::Array[String])
end

class SDRCampaignProcessor
  include Sidekiq::Worker
  sidekiq_options queue: 'sdr_evaluation', retry: 3
  
  def perform(campaign_request_id)
    # Load from database
    ar_request = CampaignRequest.find(campaign_request_id)
    ar_request.update!(status: :processing)
    
    # Convert to type-safe struct for DSPy
    campaign_request = ar_request.to_dspy_struct
    
    # Generate campaign and judge quality in async context
    # DSPy's LM#chat uses Sync blocks internally for non-blocking I/O
    result = Async do |task|
      # Generate campaign (non-blocking)
      sdr_generator = DSPy::Predict.new(AISDRSignature)
      campaign = sdr_generator.call(campaign_request: campaign_request)
      
      # Judge quality (non-blocking)
      judgment = judge.call(
        target_criteria: build_target_criteria(campaign_request),
        prospect_profile: campaign.sdr_output.prospect,
        email_campaign: campaign.sdr_output.email,
        sender_context: build_sender_context(campaign_request, campaign)
      )
      
      { campaign: campaign, judgment: judgment }
    end.wait  # Wait for completion before worker finishes
    
    campaign = result[:campaign]
    judgment = result[:judgment]
    
    # Route based on judgment
    case judgment.send_recommendation
    when SendRecommendation::Send
      ar_request.update!(status: :completed)
      EmailSender.perform_async(campaign.sdr_output.to_h)
      log_approved_campaign(campaign, judgment)
    when SendRecommendation::Revise
      HumanReviewWorker.perform_async(campaign.sdr_output.to_h, judgment.to_h)
    when SendRecommendation::Reject
      ar_request.update!(status: :failed)
      log_rejected_campaign(campaign, judgment)
    end
    
  rescue => e
    ar_request.update!(status: :failed, error_message: e.message)
    raise  # Let Sidekiq handle retry logic
  end
  
  private
  
  def judge
    @judge ||= begin
      judge_lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
      judge = DSPy::ChainOfThought.new(AISDRJudge)
      judge.configure { |c| c.lm = judge_lm }
      judge
    end
  end
  
  def build_target_criteria(campaign_request)
    TargetCriteria.new(
      role: campaign_request.target_role,
      company: campaign_request.target_company,
      industry: campaign_request.target_industry,
      seniority_level: campaign_request.seniority_level,
      department: campaign_request.department
    )
  end
  
  def build_sender_context(campaign_request, campaign)
    SenderContext.new(
      company: campaign.sdr_output.sender_company,
      value_proposition: campaign_request.value_proposition,
      industry_focus: campaign_request.industry_focus,
      case_studies: campaign_request.case_studies
    )
  end
end

# Separate workers for different actions
class EmailSender
  include Sidekiq::Worker
  sidekiq_options queue: 'email_sending'
  
  def perform(email_data)
    # Send via SendGrid, Mailgun, etc.
    email_service = EmailService.new
    email_service.send_campaign(email_data)
  end
end

class HumanReviewWorker
  include Sidekiq::Worker  
  sidekiq_options queue: 'human_review'
  
  def perform(campaign_data, judgment_data)
    # Queue for human review with AI feedback
    ReviewDashboard.add_campaign_for_review(campaign_data, judgment_data)
  end
end

# Usage: Process campaign requests asynchronously
def process_campaign_batch(campaign_request_ids)
  campaign_request_ids.each do |request_id|
    SDRCampaignProcessor.perform_async(request_id)
  end
end

# Example: Create and process campaign requests
campaign_requests = [
  CampaignRequest.create!(
    target_role: "VP Engineering", 
    target_company: "TechCorp",
    target_industry: "Software",
    value_proposition: "Reduce deployment time by 40%",
    status: :pending
  ),
  CampaignRequest.create!(
    target_role: "Head of Sales",
    target_company: "StartupCo", 
    target_industry: "FinTech",
    value_proposition: "Increase lead conversion by 25%",
    status: :pending
  )
]

# Queue for background processing
process_campaign_batch(campaign_requests.map(&:id))
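
The ActiveRecord model above assumes a campaign_requests table roughly like the following. This is a hypothetical migration sketch (Rails 7.1 shown) — align the column names and types with your own schema:

class CreateCampaignRequests < ActiveRecord::Migration[7.1]
  def change
    create_table :campaign_requests do |t|
      t.string  :target_role,       null: false
      t.string  :target_company,    null: false
      t.string  :target_industry,   null: false
      t.text    :value_proposition, null: false
      t.string  :seniority_level
      t.string  :department
      t.string  :industry_focus
      t.text    :case_studies       # comma-separated, as assumed in to_dspy_struct
      t.integer :status, default: 0, null: false
      t.text    :error_message
      t.timestamps
    end
  end
end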

Concurrent Judge Evaluation

You can even run multiple judges in parallel:

def perform(campaign_id)
  # ... setup code
  
  Async do |task|
    # Generate campaign first
    campaign = sdr_generator.call(campaign_request: campaign_request)
    
    # Run multiple judges concurrently
    relevance_task = task.async { relevance_judge.call(campaign_inputs) }
    compliance_task = task.async { compliance_judge.call(campaign_inputs) }
    personalization_task = task.async { personalization_judge.call(campaign_inputs) }
    
    # Wait for all judgments
    relevance = relevance_task.wait
    compliance = compliance_task.wait  
    personalization = personalization_task.wait
    
    # Combine results
    final_decision = combine_judgments(relevance, compliance, personalization)
    process_final_decision(ar_request, campaign, final_decision)
  end.wait
end
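
The combine_judgments helper is intentionally left open; one possible policy is a weighted average with a compliance veto. This sketch assumes each specialized judge returns a numeric score field — adjust weights and thresholds to your own risk tolerance:

# Hypothetical combiner: weighted average with a compliance veto.
def combine_judgments(relevance, compliance, personalization)
  # Any compliance failure blocks the send regardless of other scores.
  return { score: 0.0, recommendation: SendRecommendation::Reject } if compliance.score < 0.5

  score = (relevance.score * 0.4) + (compliance.score * 0.2) + (personalization.score * 0.4)
  recommendation =
    if score >= 0.7
      SendRecommendation::Send
    elsif score >= 0.5
      SendRecommendation::Revise
    else
      SendRecommendation::Reject
    end

  { score: score, recommendation: recommendation }
end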

This approach improves throughput: instead of blocking on each judge call in sequence, the worker waits roughly as long as the slowest judge, which keeps Sidekiq workers busy instead of idle on I/O.

Best Practices

1. Judge Calibration

Regularly validate your judges against human evaluations:

# Compare LLM judge to human ratings
human_ratings = load_human_evaluations
# Each entry in `campaigns` is assumed to be a hash of the judge's keyword inputs
llm_ratings = campaigns.map { |c| judge.call(**c).overall_quality_score }

correlation = calculate_correlation(human_ratings, llm_ratings)
puts "Judge-Human correlation: #{correlation}" # Aim for > 0.8

2. Prompt Engineering for Judges

Craft judge descriptions that reflect your quality standards:

class CalibratedSDRJudge < DSPy::Signature
  description <<~DESC
    You are an experienced B2B sales manager evaluating AI-generated outreach campaigns.
    
    Your standards:
    - Personalization should feel authentic, not templated
    - Value propositions must be specific and quantifiable
    - Professional tone builds trust without being overly formal
    - Compliance includes CAN-SPAM, GDPR considerations
    
    Be critical but fair. A score of 0.7+ means ready to send.
  DESC
  
  # ... rest of signature
end

3. Continuous Improvement

Use judge feedback to improve your SDR system:

# Analyze common failure patterns
def analyze_judge_feedback(evaluations)
  feedback_patterns = evaluations.group_by { |e| e.judge_feedback[:recommendation] }
  
  puts "Common Issues:"
  feedback_patterns["REJECT"].each do |rejection|
    puts "- #{rejection.judge_feedback[:personalization_reasoning]}"
  end
end

Key Takeaways

LLM-as-a-Judge offers an alternative to rigid rule-based evaluation:

  • Natural language reasoning can assess subjective qualities like tone
  • Detailed feedback provides specific suggestions for improvement
  • Consistent evaluation applies the same criteria across all campaigns
  • Contextual assessment adapts to different industries and communication styles

This approach requires calibration and ongoing monitoring. When implemented carefully, it can help improve campaign quality and reduce manual review overhead.

Beyond Manual Judge Configuration: Optimization

While this article shows how to manually craft judge signatures and configure evaluation criteria, you don’t have to write these prompts by hand. DSPy.rb’s optimization framework can automatically improve both your SDR generator AND your judge prompts.

Instead of manually tuning the AISDRJudge signature description and examples, you can:

# Let DSPy optimize your judge prompts automatically
judge_optimizer = DSPy::Teleprompt::MIPROv2.new(
  metric: human_validation_metric  # Use human ratings as ground truth
)

optimized_judge = judge_optimizer.compile(
  DSPy::ChainOfThought.new(AISDRJudge),
  trainset: human_rated_campaigns
)

# The optimized judge often performs better than hand-crafted prompts
puts "Optimized judge accuracy: #{optimized_judge.evaluation_score}"

This is especially powerful for complex evaluation tasks where the optimal prompt isn’t obvious. The optimization process can discover better ways to instruct the judge, often surfacing prompt improvements humans would miss.

For high-stakes evaluation like compliance or legal review, consider optimizing your judges against human expert ratings to ensure they align with professional standards.

Ready to implement LLM judges in your AI SDR pipeline? Start with the examples above, then explore prompt optimization to let the machine improve your evaluation prompts automatically.


Want to dive deeper into DSPy.rb’s evaluation capabilities? Check out our comprehensive evaluation guide and custom metrics documentation.