Article

Under the Hood: How DSPy.rb Extracts JSON from Every LLM

A technical deep-dive into DSPy.rb's multi-strategy JSON extraction system, showing exactly how it handles OpenAI, Anthropic, and other providers

V

Vicente Reig

Fractional Engineering Lead •

DSPy.rb uses 4 different strategies to extract JSON from LLMs. Here’s how each one works.

When you call predict.forward(), DSPy.rb picks the best strategy for your LLM provider. Each strategy is designed to get reliable JSON output from different models.

How Strategy Selection Works

DSPy.rb ranks strategies by priority and picks the best one available:

# From lib/dspy/lm/strategy_selector.rb
STRATEGIES = [
  Strategies::OpenAIStructuredOutputStrategy,     # Priority: 100
  Strategies::AnthropicToolUseStrategy,          # Priority: 95
  Strategies::AnthropicExtractionStrategy,       # Priority: 90
  Strategies::EnhancedPromptingStrategy          # Priority: 50
].freeze

The selector checks your LLM provider and model, then uses the highest-priority strategy that works:

def select
  # Allow manual override via configuration
  if DSPy.config.structured_outputs.strategy
    # ... handle manual selection
  end

  # Select the highest priority available strategy
  available_strategies = @strategies.select(&:available?)
  selected = available_strategies.max_by(&:priority)
  
  DSPy.logger.debug("Selected JSON extraction strategy: #{selected.name}")
  selected
end

OpenAI: Native Structured Outputs (Priority 100)

For OpenAI models that support structured outputs (GPT-4o, GPT-4o-mini), DSPy.rb uses OpenAI’s built-in JSON feature:

# What DSPy.rb sends to OpenAI
request_params[:response_format] = {
  type: "json_schema",
  json_schema: {
    name: "ProductExtractor",
    strict: true,
    schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        price: { type: "number" },
        in_stock: { type: "boolean" }
      },
      required: ["name", "price", "in_stock"],
      additionalProperties: false
    }
  }
}

This guarantees valid JSON - OpenAI’s API won’t return invalid JSON when using structured outputs. The schema converter handles complex Ruby types:

# Converts T::Array[String] to JSON Schema
when T::Types::TypedArray
  {
    type: "array",
    items: type_to_json_schema(type_info.type)
  }

Anthropic: Two Ways to Get JSON

Tool Use Strategy (Priority 95)

DSPy.rb uses Anthropic’s tool calling feature to get structured JSON:

# From anthropic_tool_use_strategy.rb
def prepare_request(messages, request_params)
  # Convert signature to tool schema
  tool_schema = {
    name: "json_output",
    description: "Output the result in the required JSON format",
    input_schema: {
      type: "object",
      properties: build_properties_from_schema(output_schema),
      required: output_schema.keys.map(&:to_s)
    }
  }
  
  request_params[:tools] = [tool_schema]
  request_params[:tool_choice] = {
    type: "tool",
    name: "json_output"
  }
end

The response comes back with the JSON in a structured format:

# Extract JSON from tool use response
if response.metadata[:tool_calls]
  first_call = response.metadata[:tool_calls].first
  if first_call[:name] == "json_output"
    return JSON.generate(first_call[:input])
  end
end

4-Pattern Extraction Strategy (Priority 90)

When tool use isn’t available, DSPy.rb uses 4 patterns to extract JSON from Claude’s responses:

# From anthropic_adapter.rb
def extract_json_from_response(content)
  # Pattern 1: Look for ```json blocks
  if content.include?('```json')
    extracted = content[/```json\s*\n(.*?)\n```/m, 1]
    return extracted.strip if extracted
  end
  
  # Pattern 2: Look for ## Output values header
  if content.include?('## Output values')
    extracted = content.split('## Output values').last
                      .gsub(/```json\s*\n/, '')
                      .gsub(/\n```.*/, '')
                      .strip
    return extracted if extracted && !extracted.empty?
  end
  
  # Pattern 3: Check generic code blocks
  if content.include?('```')
    extracted = content[/```\s*\n(.*?)\n```/m, 1]
    return extracted.strip if extracted && looks_like_json?(extracted)
  end
  
  # Pattern 4: Already valid JSON
  content.strip
end

The adapter also adds special instructions for Claude:

# Special instruction added to Claude prompts
json_instruction = "\n\nIMPORTANT: Respond with ONLY valid JSON. " \
                  "No markdown formatting, no code blocks, no explanations. " \
                  "Start your response with '{' and end with '}'."

Enhanced Prompting: Works with Any Model (Priority 50)

For models without native JSON support, DSPy.rb adds clear instructions to the prompt:

def enhance_prompt_with_json_instructions(prompt, schema)
  json_example = generate_example_from_schema(schema)
  
  <<~ENHANCED
    #{prompt}

    IMPORTANT: You must respond with valid JSON that matches this structure:
    ```json
    #{JSON.pretty_generate(json_example)}
    ```

    Required fields: #{schema[:required]&.join(', ') || 'none'}
    
    Ensure your response:
    1. Is valid JSON (properly quoted strings, no trailing commas)
    2. Includes all required fields
    3. Uses the correct data types for each field
    4. Is wrapped in ```json``` markdown code blocks
  ENHANCED
end

The extraction then tries multiple patterns:

# Try markdown blocks first
if content.include?('```json')
  json_content = content.split('```json').last.split('```').first.strip
  return json_content if valid_json?(json_content)
end

# Check if entire response is JSON
return content if valid_json?(content)

# Look for JSON-like structures
json_match = content.match(/\{[\s\S]*\}|\[[\s\S]*\]/)

Real Examples: What Each Provider Receives

Here’s what happens when you use the same DSPy signature with different providers:

class ProductExtractor < DSPy::Signature
  input { const :description, String }
  output do
    const :name, String
    const :price, Float
  end
end

OpenAI Request:

{
  "model": "gpt-4o-mini",
  "messages": [{"role": "user", "content": "Extract: iPhone 15 Pro - $999"}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "ProductExtractor",
      "strict": true,
      "schema": { /* full schema */ }
    }
  }
}

Anthropic with Tool Use:

{
  "model": "claude-3-sonnet",
  "messages": [{"role": "user", "content": "Extract: iPhone 15 Pro - $999\n\nPlease use the json_output tool to provide your response."}],
  "tools": [{
    "name": "json_output",
    "input_schema": { /* schema */ }
  }],
  "tool_choice": {"type": "tool", "name": "json_output"}
}

Anthropic with Extraction:

{
  "model": "claude-3-haiku",
  "messages": [{
    "role": "user", 
    "content": "Extract: iPhone 15 Pro - $999\n\nIMPORTANT: Respond with ONLY valid JSON..."
  }]
}

Behind the Scenes

Retry Logic

DSPy.rb tries multiple times if JSON extraction fails:

# From retry_handler.rb
def execute_with_retry
  @strategies.each do |strategy|
    @attempt = 0
    while @attempt < max_attempts
      @attempt += 1
      
      begin
        # Try the strategy
        result = execute_strategy(strategy)
        return result if result
      rescue => e
        # Handle specific errors
        if strategy.handle_error(e)
          # Strategy handled it, try next strategy
          break
        end
        # Otherwise retry with backoff
      end
    end
  end
end

Performance Optimizations

DSPy.rb caches schemas and capability checks to speed things up:

# Schema caching (1 hour TTL)
cache_manager = DSPy::LM.cache_manager
cached_schema = cache_manager.get_schema(signature_class, "openai", cache_params)

if cached_schema
  DSPy.logger.debug("Using cached schema for #{signature_class.name}")
  return cached_schema
end

# Capability caching (24 hour TTL) 
cached_result = cache_manager.get_capability(model, "structured_outputs")

if !cached_result.nil?
  DSPy.logger.debug("Using cached capability check for #{model}")
  return cached_result
end

# Check and cache the result
result = STRUCTURED_OUTPUT_MODELS.any? { |supported| base_model.start_with?(supported) }
cache_manager.cache_capability(model, "structured_outputs", result)

Try It Yourself

Want to see which strategy your setup uses? Enable debug logging:

DSPy.configure do |config|
  config.logger = Dry.Logger(:dspy, level: :debug)
end

lm = DSPy::LM.new("openai/gpt-4o-mini", 
                  api_key: ENV["OPENAI_API_KEY"],
                  structured_outputs: true)

# Watch the logs to see:
# "Selected JSON extraction strategy: openai_structured_output"

Or inspect the strategy directly:

strategy_selector = DSPy::LM::StrategySelector.new(lm.adapter, MySignature)
strategy = strategy_selector.select
puts "Using strategy: #{strategy.name} (priority: #{strategy.priority})"

Key Takeaways

DSPy.rb’s multi-strategy approach makes JSON extraction work reliably across all major LLMs. Understanding these details helps you:

  • Pick the right model for JSON tasks
  • Debug extraction problems faster
  • Configure strategies for your needs
  • Contribute improvements to the project

The best part? You don’t need to worry about any of this complexity. Just define your signature and call forward() - DSPy.rb does the rest.


Want to dive deeper? Check out the source code or join the discussion on GitHub.