Does Chain of Thought Actually Improve Summaries? A Quick Experiment
Does 'think step by step' actually help for simple tasks? Instead of running on assumptions and endlessly debating, run a quick experiment. Here's how.
Vicente Reig
Fractional Engineering Lead • 4 min read
You’re pairing with a coworker. They claim Chain of Thought always produces better output. You’re skeptical: for simple tasks like summarization, does 'think step by step' really help? Instead of running on blind assumptions, the two of you decide to run an experiment.
With DSPy.rb, this takes about 50 lines of code.
Setting Things Up
We need three things:
- A summarization task to compare
- An LLM judge to score quality
- An evaluation harness to run them both
Let’s build each piece.
flowchart LR
subgraph Examples
E1["Wikipedia<br/>Articles"]
end
subgraph Predictors["Same Signature, Different Approaches"]
P["DSPy::Predict<br/>(direct)"]
C["DSPy::ChainOfThought<br/>(reasoning first)"]
end
subgraph Judge["LLM Judge (gpt-4.1)"]
J["EvaluateSummary<br/>faithfulness | relevance<br/>coherence | fluency"]
end
subgraph Results
R["Compare<br/>Scores"]
end
E1 --> P --> J
E1 --> C --> J
J --> R
style P fill:#e8f5e9,stroke:#81c784
style C fill:#e3f2fd,stroke:#64b5f6
style J fill:#fff3e0,stroke:#ffb74d
One Task, Two Approaches
A signature defines the task. Nothing fancy: just text in, summary out:
class Summarize < DSPy::Signature
  description "Summarize the given text concisely while preserving key concepts"

  input do
    const :text, String, description: "Text to summarize"
  end

  output do
    const :summary, String, description: "Concise summary preserving key concepts (2-3 sentences)"
  end
end
Now we can create two predictors from the same signature:
DSPy.configure do |config|
  config.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
end
# article_text holds whatever text you want summarized
predict = DSPy::Predict.new(Summarize)
prediction = predict.call(text: article_text)
puts prediction.summary

cot = DSPy::ChainOfThought.new(Summarize)
cot_prediction = cot.call(text: article_text)
puts cot_prediction.summary
puts cot_prediction.reasoning # the intermediate reasoning step ChainOfThought adds
Multi-Dimensional Quality Scoring
Here’s where it gets interesting. Instead of manually reviewing summaries, we use another LLM to evaluate them. This G-Eval style approach scores each summary along multiple dimensions: faithfulness, relevance, coherence, and fluency.
First, we define types to make our inputs explicit:
class EvaluatorMindset < T::Enum
  enums do
    Critical = new('critical')  # Most should score 3-4, not 5
    Balanced = new('balanced')  # Fair assessment across the range
    Generous = new('generous')  # Benefit of the doubt
  end
end

class GroundedSummary < T::Struct
  const :source_text, String
  const :summary, String
end
Then the judge signature uses these types:
class EvaluateSummary < DSPy::Signature
  description "Evaluate summary quality using G-Eval criteria according to the specified mindset."

  input do
    const :grounded_summary, GroundedSummary
    const :mindset, EvaluatorMindset
  end

  output do
    const :faithfulness, Integer,
      description: "Score 1-5: Is the summary factually accurate?"
    const :relevance, Integer,
      description: "Score 1-5: Does it capture the most important information?"
    const :coherence, Integer,
      description: "Score 1-5: Is it well-structured with logical flow?"
    const :fluency, Integer,
      description: "Score 1-5: Is it grammatically correct and readable?"
    const :overall_score, Float,
      description: "Overall quality score from 1.0 to 5.0"
  end
end
The EvaluatorMindset enum controls how critically the judge scores—the LLM receives this as a constrained choice. The GroundedSummary struct keeps the source and summary paired together. We use DSPy::ChainOfThought for the judge so it reasons through its evaluation.
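Before wiring the judge into an evaluation harness, you can sanity-check it with a one-off call. A minimal sketch, where article_text and candidate_summary are placeholders for values you already have:

judge = DSPy::ChainOfThought.new(EvaluateSummary)

result = judge.call(
  grounded_summary: GroundedSummary.new(
    source_text: article_text,    # the text being summarized (placeholder)
    summary: candidate_summary    # the summary under evaluation (placeholder)
  ),
  mindset: EvaluatorMindset::Critical
)

puts result.reasoning      # the judge's step-by-step evaluation
puts result.faithfulness   # Integer, 1-5
puts result.overall_score  # Float, 1.0-5.0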
Packaging the Judge for Evaluation
A metric is a lambda that takes an example and prediction, returning evaluation results:
def create_llm_judge_metric(judge_lm, mindset: EvaluatorMindset::Critical)
  judge = DSPy::ChainOfThought.new(EvaluateSummary)
  judge.configure { |c| c.lm = judge_lm }

  ->(example, prediction) do
    eval_result = judge.call(
      grounded_summary: GroundedSummary.new(
        source_text: example.input_values[:text],
        summary: prediction.summary
      ),
      mindset: mindset
    )

    {
      passed: eval_result.overall_score >= 3.5,
      score: eval_result.overall_score / 5.0, # Normalize to 0-1
      faithfulness: eval_result.faithfulness,
      relevance: eval_result.relevance,
      coherence: eval_result.coherence,
      fluency: eval_result.fluency
    }
  end
end
Running Both Side by Side
DSPy::Evals unifies predictors, examples, and metrics. The API is dead simple:
# wikipedia_articles is an array of hashes with the article text under :text
examples = wikipedia_articles.map do |doc|
  DSPy::Example.new(
    signature_class: Summarize,
    input: { text: doc[:text] },
    expected: { summary: "" } # LLM judge evaluates absolute quality
  )
end
judge_lm = DSPy::LM.new('openai/gpt-4.1', api_key: ENV['OPENAI_API_KEY'])
llm_judge_metric = create_llm_judge_metric(judge_lm)
predict_evaluator = DSPy::Evals.new(predict, metric: llm_judge_metric)
predict_result = predict_evaluator.evaluate(examples)
cot_evaluator = DSPy::Evals.new(cot, metric: llm_judge_metric)
cot_result = cot_evaluator.evaluate(examples)
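DSPy::Evals gives you the aggregate numbers. If you also want the per-dimension averages reported below, one option is to aggregate the metric's output by hand. A minimal sketch using only the pieces defined above (note it makes its own LLM calls, separate from the Evals runs):

def average_scores(predictor, examples, metric)
  rows = examples.map do |example|
    prediction = predictor.call(text: example.input_values[:text])
    metric.call(example, prediction)
  end

  %i[faithfulness relevance coherence fluency].to_h do |dim|
    [dim, rows.sum { |r| r[dim] } / rows.size.to_f]
  end
end

p average_scores(predict, examples, llm_judge_metric)
p average_scores(cot, examples, llm_judge_metric)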
The Results
We ran both predictors on 5 Wikipedia articles (Photosynthesis, Byzantine Empire, Machine Learning, Great Barrier Reef, French Revolution)
using gpt-4o-mini as the summarizer and gpt-4.1 as the judge.
Predict avg score: 93.0%
ChainOfThought avg score: 96.0%
Improvement: +3.0 percentage points
The per-dimension breakdown tells a richer story:
| Dimension | Predict | CoT | Δ |
|---|---|---|---|
| Faithfulness | 4.4/5 | 4.8/5 | +0.4 |
| Relevance | 4.4/5 | 4.4/5 | +0.0 |
| Coherence | 4.8/5 | 5.0/5 | +0.2 |
| Fluency | 5.0/5 | 5.0/5 | +0.0 |
ChainOfThought’s edge comes from faithfulness and coherence. The reasoning step seems to help the model avoid hallucinations and produce better-structured output. Relevance and fluency? Both approaches nail it.
3 Points Better—Worth It?
For summarization with a capable model like gpt-4o-mini, ChainOfThought provides a modest but real improvement—particularly in factual accuracy. The 3 percentage point gain might matter for production use cases where faithfulness is critical.
But here’s the thing: you don’t have to guess anymore. The experiment took under an hour to build. The pattern works for any comparison:
- Predict vs ChainOfThought
- Different models (Claude vs GPT; see the sketch after this list)
- Different prompt strategies
- Temperature variations
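For the model comparison, swapping models is just a different LM configuration before you build the predictor; the evaluation harness stays the same. A rough sketch (the Anthropic model ID is illustrative, and assumes your DSPy.rb version ships an Anthropic adapter):

# Point the summarizer at a different model. The judge keeps its own LM,
# because create_llm_judge_metric configures it explicitly.
DSPy.configure do |config|
  config.lm = DSPy::LM.new('anthropic/claude-3-5-sonnet-20241022', api_key: ENV['ANTHROPIC_API_KEY'])
end

claude_predict = DSPy::Predict.new(Summarize)
claude_result  = DSPy::Evals.new(claude_predict, metric: llm_judge_metric).evaluate(examples)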
Run It Yourself
The complete example is in the repo:
export OPENAI_API_KEY=your-key
bundle exec ruby examples/summarization_comparison.rb
Tweak the DSPY_SUMMARIZER_MODEL and DSPY_JUDGE_MODEL environment variables to experiment with different model combinations.
Takeaways
- Signatures define the task — same signature, different predictors
- Sorbet types model your domain — T::Enum for constrained choices, T::Struct for grouped inputs
- LLM judges scale evaluation — multi-dimensional scoring without manual review
- DSPy::Evals unifies the workflow — predictors, examples, and metrics in one API
- Data beats blind assumptions — 50 lines of code settles the ChainOfThought question
Next time a coworker claims one approach is “obviously better,” suggest an experiment. With DSPy.rb, you’ll have results before the coffee gets cold.