Evaluator Loops in Ruby: Ship Sales Pitches with Confidence
A quick walkthrough of how to turn LLM calls into composable building blocks in Ruby. With evaluator loops, you get cheap iterations, clear critiques, and real observability into each step. Great for shipping better sales pitches without guessing what the model is doing (or overspending on tokens).
Vicente Reig
Fractional Engineering Lead • 5 min read
Outbound copy rarely ships on the first LLM pass. You write a prompt, get something decent, tweak it, try again… and burn tokens guessing what “good enough” means. DSPy.rb turns this into a structured loop—propose, critique, refine—with a clear stopping condition.
Why Evaluator Loops?
The evaluator-optimizer pattern is a two-model handshake: a generator drafts content, an evaluator grades it against criteria and prescribes fixes, and the loop repeats until the rubric is met or the budget runs out. Anthropic recommends this pattern when “LLM responses can be demonstrably improved when a human articulates their feedback.”1
The key insight: use a cheap, fast model for drafting and a smarter model for critique. You get quality feedback without paying premium prices for every token.
```mermaid
flowchart LR
    In((Requirements\n+ Persona + Offer))
    Gen["LLM Call 1\nGenerator (Haiku)\nDSPy::Predict"]
    Eval["LLM Call 2\nEvaluator (Sonnet CoT)\nDSPy::ChainOfThought"]
    Budget["Token Guardrail\n10k-cap Tracker"]
    Out((Approved Post))
    In --> Gen --> Eval --> Out
    Eval -.feedback/recs.-> Gen
    Eval --> Budget --> Out
    style In fill:#ffe4e1,stroke:#d4a5a5,stroke-width:2px
    style Out fill:#ffe4e1,stroke:#d4a5a5,stroke-width:2px
    style Gen fill:#e8f5e9,stroke:#81c784,stroke-width:2px
    style Eval fill:#e8f5e9,stroke:#81c784,stroke-width:2px
    style Budget fill:#e8f5e9,stroke:#81c784,stroke-width:2px
```
Signatures as Functions
DSPy Signatures turn LLM calls into typed, callable functions. Our running example defines two signatures—one for generating drafts, one for evaluating them:
```ruby
class GenerateLinkedInArticle < DSPy::Signature
  description "Draft a concise sales pitch that embraces a persona's preferences."

  input do
    const :topic_seed, TopicSeed
    const :vibe_toggles, VibeToggles
    const :structure_template, StructureTemplate
    const :recommendations, T::Array[Recommendation], default: []
  end

  output do
    const :post, String
    const :hooks, T::Array[String]
  end
end
```
```ruby
class EvaluateLinkedInArticle < DSPy::Signature
  description <<~DESC.strip
    You are a SKEPTICAL editor who rarely approves drafts on the first attempt.
    Your role is to find genuine flaws and push for excellence. Default to
    'needs_revision' unless the post is truly exceptional.
  DESC

  input do
    const :post, String
    const :topic_seed, TopicSeed
    const :vibe_toggles, VibeToggles
    const :recommendations, T::Array[Recommendation]
    const :hooks, T::Array[String]
    const :attempt, Integer
  end

  output do
    const :decision, EvaluationDecision # Approved or NeedsRevision
    const :recommendations, T::Array[Recommendation]
    const :self_score, Float
  end
end
```
Notice the description on the evaluator signature. This is where you set the evaluator’s mindset. More on that below.
Loop Mechanics: draft → critique within a guardrail
SalesPitchWriterLoop pairs a cheap Haiku generator with a smarter Sonnet evaluator (using Chain-of-Thought for reasoning). The loop continues until either the evaluator approves or the token budget runs out, unlike DSPy::ReAct, which caps iterations.
```ruby
class SalesPitchWriterLoop < DSPy::Module
  subscribe 'lm.tokens', :count_tokens # Track token usage

  def forward(**input_values)
    tracker = TokenBudgetTracker.new(limit: @token_budget_limit)
    recommendations = []
    draft = nil

    while tracker.remaining.positive?
      # Cheap model drafts
      draft = @generator.call(**input_values.merge(recommendations: recommendations))

      # Smart model critiques (named `evaluation` to avoid shadowing Kernel#eval)
      evaluation = @evaluator.call(
        post: draft.post,
        hooks: draft.hooks,
        **input_values
      )

      # Feed recommendations back into the next iteration
      recommendations = evaluation.recommendations || []
      break if evaluation.decision == EvaluationDecision::Approved
    end

    draft # The last draft, whether approved or budget-capped
  end
end
```
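The stopping condition leans on TokenBudgetTracker. The real class ships with the example repo; here is a minimal, gem-free sketch of the shape the loop needs (an assumed interface; only `limit:` and `remaining` appear in the article):

```ruby
# Hypothetical sketch of TokenBudgetTracker. It accumulates usage
# reported by lm.tokens events and answers "how much budget is left?".
class TokenBudgetTracker
  attr_reader :limit, :used

  def initialize(limit:)
    @limit = limit
    @used = 0
  end

  # Called with the token count of each lm.tokens event.
  def add(tokens)
    @used += tokens
  end

  def remaining
    [limit - used, 0].max
  end
end

tracker = TokenBudgetTracker.new(limit: 10_000)
tracker.add(5_926)  # one full draft + critique round
tracker.remaining   # => 4074
```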
The evaluator asks questions like: “Did we quantify the pain cost?” “Is the CTA a single action?” Then returns actionable fixes: “Add a percentage proof metric,” “Retune tone to consultative.”
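To watch the control flow without spending tokens, here is a plain-Ruby simulation of the same loop, with hypothetical stubs standing in for the Haiku generator and Sonnet evaluator:

```ruby
# Stubbed simulation of the draft → critique loop. The fake evaluator
# approves once the draft has absorbed two rounds of recommendations.
Result = Struct.new(:decision, :recommendations, keyword_init: true)

budget = 10_000
used = 0
recommendations = []
attempts = 0

while used < budget
  attempts += 1
  draft = "pitch incorporating: #{recommendations.join(', ')}"
  used += 1_500 # pretend each draft + critique round costs ~1.5k tokens

  evaluation =
    if recommendations.size >= 2
      Result.new(decision: :approved, recommendations: [])
    else
      Result.new(decision: :needs_revision,
                 recommendations: recommendations + ["add proof metric"])
    end

  recommendations = evaluation.recommendations
  break if evaluation.decision == :approved
end

attempts # => 3
```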
Tuning the Evaluator Mindset
Here’s the catch: LLM evaluators tend to be sycophantic. They’ll approve mediocre content because they default to “yes.” If your loop finishes in one iteration every time, your evaluator is probably too easy to please.
The fix is simple—tune the signature description to set expectations:
```ruby
class EvaluateLinkedInArticle < DSPy::Signature
  # A neutral description leads to lenient evaluation:
  # description "Score a sales pitch and provide feedback."

  # A skeptical framing pushes for excellence:
  description <<~DESC.strip
    You are a SKEPTICAL editor who rarely approves drafts on the first attempt.
    Your role is to find genuine flaws and push for excellence. Default to
    'needs_revision' unless the post is truly exceptional.
  DESC
end
```
This adversarial framing makes the evaluator earn its approval. The loop runs more iterations, but each draft gets meaningfully better.
Other levers you can pull:
- Self-score threshold: The example requires `self_score >= 0.9` even if decision is "approved"
- Forced criticisms: Require the output to include at least N specific issues
- Rubric criteria: Add explicit pass/fail fields for each quality dimension
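The self-score lever combines with the decision into a single approval gate. A plain-Ruby sketch (hypothetical names; symbols stand in for EvaluationDecision here):

```ruby
# Combined gate: approval requires both the evaluator's verdict AND a
# self-score above the threshold, so a lenient "approved" with a shaky
# score still triggers another iteration.
APPROVAL_THRESHOLD = 0.9

def approved?(decision:, self_score:)
  decision == :approved && self_score >= APPROVAL_THRESHOLD
end

approved?(decision: :approved, self_score: 0.95)       # => true
approved?(decision: :approved, self_score: 0.7)        # => false (score too low)
approved?(decision: :needs_revision, self_score: 0.95) # => false
```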
O11y at a Glance
DSPy.rb ships observability out of the box: every lm.tokens event flows into Langfuse, so you can see whether the budget was burned on the draft or the critique without guesswork. Peek at the latest trace (Nov 21, 2025: Haiku draft, Sonnet CoT evaluator):
```
└─ SalesPitchWriterLoop.forward (ed89899bac229240)
└─ SalesPitchWriterLoop.forward (ee155baa7ea3c707)
└─ SalesPitchWriterLoop.forward (25d6c7cb5ce67556)
   ├─ DSPy::Predict.forward (886c35a6382591b6)
   │  └─ llm.generate (a19c643a7a7ebad2)
   └─ DSPy::ChainOfThought.forward (a4ae3f51d105e27e)
      ├─ DSPy::Predict.forward (2c09e511ef4112e3)
      │  └─ llm.generate (1693f7a4893de528)
      ├─ chain_of_thought.reasoning_complete (2f6cf25f6e671e4e)
      └─ chain_of_thought.reasoning_metrics (7bb07c8d57d3041b)
```
Outcome: 1 attempt; 5,926 / 10,000 tokens; Langfuse cost ≈ $0.0258. Generator and evaluator hops are labeled, so you can confirm the cheap model carried drafting while the expensive model handled critique. Prefer shell? lf-cli points at the same Langfuse project and gives you the tree from your coding terminal.
Run It
```bash
bundle exec ruby examples/evaluator_loop.rb
```
Requires ANTHROPIC_API_KEY in your .env. You can tune:
- `DSPY_SLOP_TOKEN_BUDGET`: how many tokens before the loop gives up
- `DSPY_SLOP_GENERATOR_MODEL`: the cheap drafting model
- `DSPY_SLOP_EVALUATOR_MODEL`: the smart critique model
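A typical way to read those knobs in the example script, sketched with assumed fallback values (the real defaults may differ):

```ruby
# Hypothetical defaults for illustration; the script reads its knobs
# from ENV and falls back when a variable is unset.
token_budget    = Integer(ENV.fetch('DSPY_SLOP_TOKEN_BUDGET', '10000'))
generator_model = ENV.fetch('DSPY_SLOP_GENERATOR_MODEL', 'claude-3-5-haiku-latest')
evaluator_model = ENV.fetch('DSPY_SLOP_EVALUATOR_MODEL', 'claude-sonnet-4-20250514')
```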
Takeaways
Evaluator loops beat single-shot prompts when quality matters. The pattern is simple:
- Signatures as functions — typed inputs and outputs, no prompt wrangling
- Cheap draft, smart critique — Haiku generates, Sonnet evaluates
- Budget as guardrail — token cap instead of iteration cap
- Skeptical by design — tune the evaluator description to push for excellence
The signature description is your main lever for controlling evaluator behavior. Don’t settle for sycophantic feedback—make the loop earn its approval.
1. Anthropic, “Building effective agents,” Workflow: Evaluator-optimizer, Dec 19, 2024. https://www.anthropic.com/engineering/building-effective-agents#workflow-evaluator-optimizer ↩