Raw Chat API for Benchmarking and Migration
Learn how to use DSPy.rb's raw_chat API for benchmarking monolithic prompts and migrating to modular implementations
Vicente Reig
Fractional Engineering Lead
DSPy.rb 0.12.0 introduces the raw_chat API for benchmarking existing prompts and migrating to DSPy’s modular approach.
The Problem
Many teams have existing prompts they want to compare against DSPy modules. Without running both through the same observability system, you can’t get accurate comparisons.
The raw_chat API lets you:
- Run existing prompts through DSPy’s observability system
- Compare token usage between monolithic and modular approaches
- Measure performance across different providers
- Make data-driven migration decisions
API Overview
The raw_chat method provides a direct interface to language models without DSPy’s structured output features:
require 'dspy'

# Initialize a language model
lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])

# Array format
response = lm.raw_chat([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' }
])

# DSL format for cleaner syntax
response = lm.raw_chat do |m|
  m.system "You are a helpful assistant."
  m.user "What is the capital of France?"
end
Key Features
1. Full Instrumentation Support
Unlike calls that bypass DSPy entirely, raw_chat emits all standard log events with span tracking:
# These events are emitted for raw_chat:
# - dspy.lm.request (with signature_class: 'RawPrompt')
# - dspy.lm.tokens (with accurate token counts)
#
# NOT emitted:
# - dspy.lm.response.parsed (since there's no JSON parsing)
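You can confirm which events fire by pointing the JSON logger at a temporary file and listing the event names, following the same Tempfile pattern used in the benchmarking example later in this post. This is a minimal sketch, assuming the lm instance from the API overview above:
# Minimal sketch: capture the log events emitted by a raw_chat call.
require 'json'
require 'tempfile'

log_file = Tempfile.new('raw_chat_events')

DSPy.configure do |config|
  config.logger = Dry.Logger(:dspy, formatter: :json) do |logger|
    logger.add_backend(stream: log_file)
  end
end

lm.raw_chat([{ role: 'user', content: 'ping' }])

log_file.rewind
events = log_file.readlines.map { |line| JSON.parse(line) }
puts events.map { |e| e['event'] }.uniq  # names of the emitted events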
2. Message Builder DSL
The DSL provides a clean way to construct conversations:
lm.raw_chat do |m|
  m.user "My name is Alice"
  m.assistant "Nice to meet you, Alice!"
  m.user "What's my name?"
end
3. Streaming Support
Stream responses with a block:
lm.raw_chat(messages) do |chunk|
  print chunk
end
Real-World Example: Changelog Generation
Here’s how a team might compare their existing changelog generator with a DSPy implementation:
# Legacy monolithic prompt
LEGACY_PROMPT = <<~PROMPT
  You are an expert changelog generator. Given git commits:
  1. Parse each commit type and description
  2. Group by type (feat, fix, chore, etc.)
  3. Generate user-friendly descriptions
  4. Format as markdown
  5. Highlight breaking changes

  Be concise but informative.
PROMPT
# Configure logging to capture data
require 'json'
require 'tempfile'

log_file = Tempfile.new('dspy_benchmark')

DSPy.configure do |config|
  config.logger = Dry.Logger(:dspy, formatter: :json) do |logger|
    logger.add_backend(stream: log_file)
  end
end

# Benchmark legacy approach (`commits` is an array of raw commit messages)
legacy_result = lm.raw_chat do |m|
  m.system LEGACY_PROMPT
  m.user commits.join("\n")
end

# Extract legacy tokens
log_file.rewind
events = log_file.readlines.map { |line| JSON.parse(line) }
legacy_tokens = events.find { |e| e["event"] == 'llm.generate' }

# Clear log and benchmark modular approach
log_file.truncate(0)
log_file.rewind

# ChangelogSignature is defined elsewhere; a sketch follows this example
generator = DSPy::ChainOfThought.new(ChangelogSignature)
modular_result = generator.forward(commits: commits)

# Extract modular tokens
log_file.rewind
events = log_file.readlines.map { |line| JSON.parse(line) }
modular_tokens = events.find { |e| e["event"] == 'llm.generate' }

# Compare results
puts "Legacy: #{legacy_tokens["gen_ai.usage.total_tokens"]} tokens"
puts "Modular: #{modular_tokens["gen_ai.usage.total_tokens"]} tokens"
puts "Reduction: #{((1 - modular_tokens["gen_ai.usage.total_tokens"].to_f / legacy_tokens["gen_ai.usage.total_tokens"]) * 100).round(2)}%"
Integration with Observability
raw_chat uses the same observability system as regular DSPy calls:
# Configure observability
DSPy.configure do |config|
  config.logger = Dry.Logger(:dspy, formatter: :json) do |logger|
    logger.add_backend(stream: "/var/log/dspy/production.log")
  end
end

# Both calls are tracked identically
lm.raw_chat([{ role: 'user', content: 'Hello' }])
predictor.forward(input: 'Hello')
Migration Strategy
Use raw_chat for phased migration (a sketch of the benchmark_with_raw_chat helper used in Phase 1 appears after the phases):
Phase 1: Baseline
# Measure existing prompt performance
baseline = benchmark_with_raw_chat(LEGACY_PROMPT, test_dataset)
Phase 2: Prototype
# Build modular version
class ModularImplementation < DSPy::Module
  # ...
end
Phase 3: Compare
# Run side-by-side comparison
results = compare_approaches(baseline, ModularImplementation.new)
Phase 4: Migrate
# Deploy when metrics improve
if results[:modular][:tokens] < results[:legacy][:tokens] * 0.9
  deploy_modular_version
end
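benchmark_with_raw_chat and compare_approaches are placeholder helpers, not part of DSPy.rb. One possible shape for the baseline helper, assembled from the Tempfile-based logging pattern shown earlier (the model name, dataset format, and return shape are assumptions):
# Hypothetical Phase 1 helper: run a legacy prompt over a dataset and
# total the tokens reported in the JSON log.
def benchmark_with_raw_chat(system_prompt, dataset)
  require 'json'
  require 'tempfile'

  lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
  log_file = Tempfile.new('baseline')

  DSPy.configure do |config|
    config.logger = Dry.Logger(:dspy, formatter: :json) do |logger|
      logger.add_backend(stream: log_file)
    end
  end

  # dataset is assumed to be an array of input strings
  dataset.each do |input|
    lm.raw_chat do |m|
      m.system system_prompt
      m.user input
    end
  end

  log_file.rewind
  events = log_file.readlines.map { |line| JSON.parse(line) }
  token_events = events.select { |e| e["event"] == 'llm.generate' }
  { tokens: token_events.sum { |e| e["gen_ai.usage.total_tokens"].to_i } }
end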
Implementation Details
raw_chat:
- Bypasses JSON parsing - Returns raw strings (see the sketch below)
- Skips retry strategies - No structured output validation
- Direct adapter calls - Minimal overhead
- Preserves observability - Full span tracking and logging
This gives you fair comparisons with full monitoring.
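Concretely, the two call styles return different shapes. A rough illustration, assuming the ChangelogSignature sketch above; the exact result object depends on your signature:
# raw_chat hands back the model's text as-is; nothing is parsed or validated.
raw = lm.raw_chat([{ role: 'user', content: 'Summarize: feat: add raw_chat API' }])
raw.class  # => String (per the list above)

# A predictor parses the response into the signature's typed output fields.
generator = DSPy::ChainOfThought.new(ChangelogSignature)
result = generator.forward(commits: ['feat: add raw_chat API'])
result.changelog  # structured, typed output (per the sketch signature above)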
Best Practices
- Use identical test data for both approaches
- Run multiple times to account for variance
- Check quality, not just token count
- Start small with non-critical prompts
- Track production metrics after migration
Summary
The raw_chat API helps you compare existing prompts with DSPy modules using the same observability system. This lets you make informed decisions about migration based on actual data, not guesswork.
See the benchmarking guide for detailed examples.