Benchmarking Raw Prompts
When migrating from monolithic prompts to modular DSPy implementations, it’s crucial to measure and compare their performance. DSPy.rb provides the raw_chat method specifically for this purpose, allowing you to run existing prompts through the same observability system as your DSPy modules.
Why Benchmark Raw Prompts?
- Fair Comparison: Compare apples-to-apples between monolithic and modular approaches
- Migration Path: Gradually migrate existing prompts while measuring impact
- Cost Analysis: Accurate token usage comparison for budget planning
- Performance Metrics: Measure latency, token efficiency, and quality
Using raw_chat
The raw_chat method supports two formats: an array format and a DSL format.
Array Format
lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])

# Run a raw prompt
result = lm.raw_chat([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' }
])

puts result # => "The capital of France is Paris."
DSL Format
result = lm.raw_chat do |m|
  m.system "You are a changelog generator. Format output as markdown."
  m.user "Generate a changelog for: feat: Add user auth, fix: Memory leak"
end

puts result # => "# Changelog\n\n## Features\n- Add user authentication..."
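If you call the same prompt repeatedly while benchmarking, a small timing wrapper keeps call sites tidy. This is a minimal sketch; measure_raw is an illustrative helper, not part of DSPy.rb:

# Hypothetical helper: times a single raw_chat call and returns the
# response together with the elapsed wall-clock seconds.
def measure_raw(lm, messages)
  started = Time.now
  response = lm.raw_chat(messages)
  { response: response, seconds: Time.now - started }
end

run = measure_raw(lm, [{ role: 'user', content: 'What is the capital of France?' }])
puts "#{run[:seconds].round(3)}s -> #{run[:response]}"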
Capturing Observability Data
Both raw_chat and regular DSPy modules emit the same log events with span tracking, making comparison straightforward:
# Capture events for analysis by processing logs
require 'json'
require 'tempfile'

log_file = Tempfile.new('dspy_benchmark')

DSPy.configure do |config|
  config.logger = Dry.Logger(:dspy, formatter: :json) do |logger|
    logger.add_backend(stream: log_file)
  end
end

# Run monolithic prompt
monolithic_result = lm.raw_chat do |m|
  m.system MONOLITHIC_CHANGELOG_PROMPT
  m.user commit_data
end
# Extract token usage from logs (JSON.parse yields string keys)
log_file.rewind
events = log_file.readlines.map { |line| JSON.parse(line) }
monolithic_tokens = events
  .select { |e| e["event"] == 'llm.generate' }
  .last

# Clear logs for next test
log_file.truncate(0)
log_file.rewind

# Run modular DSPy version
changelog_generator = DSPy::ChainOfThought.new(ChangelogSignature)
modular_result = changelog_generator.forward(commits: commit_data)

# Extract token usage the same way
log_file.rewind
events = log_file.readlines.map { |line| JSON.parse(line) }
modular_tokens = events
  .select { |e| e["event"] == 'llm.generate' }
  .last

# Compare results
puts "Monolithic: #{monolithic_tokens['total_tokens']} tokens"
puts "Modular: #{modular_tokens['total_tokens']} tokens"
puts "Savings: #{((1 - modular_tokens['total_tokens'].to_f / monolithic_tokens['total_tokens']) * 100).round(2)}%"
Complete Benchmarking Example
Here’s a complete example comparing a monolithic changelog generator with a modular DSPy implementation:
require 'dspy'
require 'json'
require 'tempfile'

# Route JSON logs to a temp file so token usage can be parsed back out
log_file = Tempfile.new('dspy_benchmark')

DSPy.configure do |config|
  config.logger = Dry.Logger(:dspy, formatter: :json) do |logger|
    logger.add_backend(stream: log_file)
  end
end
# Monolithic prompt (from legacy system)
MONOLITHIC_PROMPT = <<~PROMPT
You are an expert changelog generator. Given a list of git commits, you must:
1. Parse each commit message to understand the change type and description
2. Group commits by type (feat, fix, chore, docs, etc.)
3. Generate clear, user-friendly descriptions for each change
4. Format the output as a well-structured markdown changelog
5. Highlight any breaking changes prominently
6. Order sections by importance: Breaking Changes, Features, Fixes, Others
Be concise but informative. Focus on what users need to know.
PROMPT
# Modular DSPy signature
class ChangelogSignature < DSPy::Signature
  input do
    const :commits, T::Array[String], description: "List of git commit messages"
  end

  output do
    const :changelog, String, description: "Formatted markdown changelog"
    const :breaking_changes, T::Array[String], description: "List of breaking changes"
  end
end
# Benchmark function (log_file is the tempfile configured above)
def benchmark_approaches(commits_data, log_file)
  lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
  results = {}

  # Benchmark monolithic approach
  log_file.truncate(0)
  log_file.rewind
  start_time = Time.now

  monolithic_result = lm.raw_chat do |m|
    m.system MONOLITHIC_PROMPT
    m.user commits_data.join("\n")
  end
  monolithic_time = Time.now - start_time

  # Extract tokens from logs (JSON.parse yields string keys)
  log_file.rewind
  events = log_file.readlines.map { |line| JSON.parse(line) }
  monolithic_tokens = events.find { |e| e["event"] == 'llm.generate' }

  results[:monolithic] = {
    time: monolithic_time,
    tokens: monolithic_tokens,
    result: monolithic_result
  }

  # Benchmark modular approach
  log_file.truncate(0)
  log_file.rewind
  start_time = Time.now

  generator = DSPy::ChainOfThought.new(ChangelogSignature)
  modular_result = generator.forward(commits: commits_data)
  modular_time = Time.now - start_time

  log_file.rewind
  events = log_file.readlines.map { |line| JSON.parse(line) }
  modular_tokens = events.find { |e| e["event"] == 'llm.generate' }

  results[:modular] = {
    time: modular_time,
    tokens: modular_tokens,
    result: modular_result.changelog
  }

  results
end
# Run benchmark
commits = [
  "feat: Add user authentication system",
  "fix: Resolve memory leak in worker process",
  "feat!: Change API response format",
  "docs: Update installation guide",
  "chore: Upgrade dependencies"
]

results = benchmark_approaches(commits, log_file)
# Display results
puts "=== Benchmark Results ==="

puts "\nMonolithic Approach:"
puts "  Time: #{results[:monolithic][:time].round(3)}s"
puts "  Tokens: #{results[:monolithic][:tokens]['total_tokens']}"
puts "  Cost: $#{(results[:monolithic][:tokens]['total_tokens'] * 0.00015 / 1000).round(4)}"

puts "\nModular Approach:"
puts "  Time: #{results[:modular][:time].round(3)}s"
puts "  Tokens: #{results[:modular][:tokens]['total_tokens']}"
puts "  Cost: $#{(results[:modular][:tokens]['total_tokens'] * 0.00015 / 1000).round(4)}"

# Calculate improvements
token_reduction = ((1 - results[:modular][:tokens]['total_tokens'].to_f /
                    results[:monolithic][:tokens]['total_tokens']) * 100).round(2)

puts "\nImprovements:"
puts "  Token reduction: #{token_reduction}%"
puts "  Additional benefits: Type safety, testability, composability"
Advanced Benchmarking with Multiple Providers
Compare performance across different LLM providers:
def benchmark_providers(prompt_messages, log_file)
  providers = [
    { id: 'openai/gpt-4o-mini', key: ENV['OPENAI_API_KEY'] },
    { id: 'anthropic/claude-3-5-sonnet-20241022', key: ENV['ANTHROPIC_API_KEY'] }
  ]

  results = {}

  providers.each do |provider|
    lm = DSPy::LM.new(provider[:id], api_key: provider[:key])

    # Reset log file for each provider
    log_file.truncate(0)
    log_file.rewind

    start_time = Time.now
    result = lm.raw_chat(prompt_messages)
    elapsed = Time.now - start_time

    # Extract token usage from logs
    log_file.rewind
    events = log_file.readlines.map { |line| JSON.parse(line) }
    token_event = events.find { |e| e["event"] == 'llm.generate' }

    results[provider[:id]] = {
      response: result,
      time: elapsed,
      tokens: token_event
    }
  end

  results
end
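To try it out, pass an array-format prompt together with the tempfile-backed log_file configured earlier. The messages below are illustrative, and total_tokens is assumed to appear on the llm.generate event as in the previous examples:

messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Summarize: feat: Add user auth, fix: Memory leak' }
]

provider_results = benchmark_providers(messages, log_file)

provider_results.each do |provider_id, data|
  tokens = data[:tokens] ? data[:tokens]['total_tokens'] : 'n/a'
  puts "#{provider_id}: #{data[:time].round(3)}s, #{tokens} tokens"
end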
Integration with Observability Tools
The raw_chat method emits standard DSPy log events with span tracking, making it compatible with all observability integrations:
# Configure observability
DSPy.configure do |config|
  config.logger = Dry.Logger(:dspy, formatter: :json) do |logger|
    logger.add_backend(stream: "/var/log/dspy/benchmarks.json")
  end
end

# Both raw and modular prompts will be logged
lm.raw_chat([{ role: 'user', content: 'Hello' }])  # Logged as llm.generate
predictor.forward(input: 'Hello')                  # Logged as dspy.predict
Best Practices
- Use Consistent Test Data: Ensure both approaches receive identical inputs
- Multiple Runs: Average results across multiple runs to account for variance (see the averaging sketch after this list)
- Consider Quality: Token count isn’t everything; evaluate output quality too
- Track Over Time: Monitor performance as you migrate from monolithic to modular
- Use with CI/CD: Integrate benchmarks into your deployment pipeline
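For the multiple-runs point above, a minimal averaging harness could look like the following. average_runs and the iteration count are illustrative, and the sketch reuses benchmark_approaches, commits, and log_file from the complete example:

# Hypothetical helper: runs a benchmark block several times and averages
# the numeric values it returns.
def average_runs(iterations: 5)
  runs = Array.new(iterations) { yield }
  runs.first.keys.to_h { |key| [key, runs.sum { |r| r[key].to_f } / iterations] }
end

averages = average_runs(iterations: 3) do
  run = benchmark_approaches(commits, log_file)
  {
    monolithic_tokens: run[:monolithic][:tokens]['total_tokens'],
    modular_tokens: run[:modular][:tokens]['total_tokens']
  }
end

puts averages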
Migration Strategy
# Phase 1: Benchmark existing prompts
baseline = benchmark_raw_prompt(LEGACY_PROMPT, test_data)

# Phase 2: Create modular version
class ModularVersion < DSPy::Module
  # Implementation
end

# Phase 3: Compare and validate
comparison = benchmark_modular(ModularVersion.new, test_data)

# Phase 4: Deploy if metrics improve
if comparison[:tokens] < baseline[:tokens] * 0.9 # 10% improvement
  deploy_modular_version
else
  optimize_further
end
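The helper names in the phases above are placeholders. A phase-1 benchmark_raw_prompt could reuse the log-parsing pattern from earlier; this sketch adds lm: and log_file: keyword arguments and assumes the total_tokens field from the previous examples, so treat it as illustrative rather than a DSPy.rb API:

# Hypothetical phase-1 helper: runs a legacy prompt through raw_chat and
# returns timing plus the token count parsed from the JSON log.
def benchmark_raw_prompt(system_prompt, test_data, lm:, log_file:)
  log_file.truncate(0)
  log_file.rewind

  started = Time.now
  response = lm.raw_chat do |m|
    m.system system_prompt
    m.user test_data
  end
  elapsed = Time.now - started

  log_file.rewind
  events = log_file.readlines.map { |line| JSON.parse(line) }
  usage = events.find { |e| e["event"] == 'llm.generate' } || {}

  { response: response, time: elapsed, tokens: usage['total_tokens'] }
end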
Conclusion
The raw_chat method provides a crucial bridge for teams migrating from monolithic prompts to modular DSPy implementations. By enabling direct performance comparisons with full observability support, it helps you make data-driven decisions about when and how to modularize your prompts.
Remember: while token efficiency is important, the real benefits of DSPy’s modular approach include improved maintainability, testability, and the ability to optimize prompts programmatically.