In this tutorial, we explore how to build an intelligent and self-correcting question-answering system using the DSPy framework, integrated with Google’s Gemini 1.5 Flash model. We begin by defining structured Signatures that clearly outline input-output behavior, which DSPy uses as its foundation for building reliable pipelines. With DSPy’s declarative programming approach, we construct composable modules, such as AdvancedQA and SimpleRAG, to answer questions using both context and retrieval-augmented generation. By combining DSPy’s modularity with Gemini’s powerful reasoning, we craft an AI system capable of delivering accurate, step-by-step answers. As we progress, we also leverage DSPy’s optimization tools, such as BootstrapFewShot, to automatically enhance performance based on training examples.
!pip install dspy-ai google-generativeai
import dspy
import google.generativeai as genai
import random
from typing import List, Optional
GOOGLE_API_KEY = "Use Your Own API Key"
genai.configure(api_key=GOOGLE_API_KEY)
dspy.configure(lm=dspy.LM(model="gemini/gemini-1.5-flash", api_key=GOOGLE_API_KEY))
We start by installing the required libraries: dspy-ai for declarative AI pipelines and google-generativeai for access to Google’s Gemini models. After importing the necessary modules, we configure Gemini using our API key. Finally, we set up DSPy to use the Gemini 1.5 Flash model as our language model backend.
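Before building anything on top of it, we can optionally confirm that the configuration works. The snippet below is a minimal sanity check, assuming a recent DSPy version in which the configured LM object is callable and returns a list of completions:
# Optional sanity check of the configured language model (assumes a callable LM, as in recent DSPy releases).
lm = dspy.settings.lm
print(lm("Reply with the single word: ready")[0])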
class QuestionAnswering(dspy.Signature):
    """Answer questions based on given context with reasoning."""
    context: str = dspy.InputField(desc="Relevant context information")
    question: str = dspy.InputField(desc="Question to answer")
    reasoning: str = dspy.OutputField(desc="Step-by-step reasoning")
    answer: str = dspy.OutputField(desc="Final answer")

class FactualityCheck(dspy.Signature):
    """Verify if an answer is factually correct given context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField(desc="True if answer is factually correct")
We define two DSPy Signatures to structure our system’s inputs and outputs. First, QuestionAnswering expects a context and a question, and it returns both reasoning and a final answer, allowing the model to explain its thought process. Next, FactualityCheck is designed to verify the truthfulness of an answer by returning a simple boolean, helping us build a self-correcting QA system.
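As a quick illustration (not part of the pipeline itself), a signature can be exercised directly by wrapping it in a predictor and passing the declared input fields; the returned prediction exposes the declared output fields:
# Illustrative check of the QuestionAnswering signature with a hand-written context.
qa = dspy.ChainOfThought(QuestionAnswering)
result = qa(
    context="The Eiffel Tower stands 330 meters tall including antennas.",
    question="How tall is the Eiffel Tower?"
)
print(result.reasoning)  # step-by-step reasoning generated by the model
print(result.answer)     # expected to be something like "330 meters"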
class AdvancedQA(dspy.Module):
    def __init__(self, max_retries: int = 2):
        super().__init__()
        self.max_retries = max_retries
        self.qa_predictor = dspy.ChainOfThought(QuestionAnswering)
        self.fact_checker = dspy.Predict(FactualityCheck)

    def forward(self, context: str, question: str) -> dspy.Prediction:
        # Generate an initial answer with chain-of-thought reasoning
        prediction = self.qa_predictor(context=context, question=question)

        for attempt in range(self.max_retries):
            # Verify the answer against the original context
            fact_check = self.fact_checker(
                context=context,
                question=question,
                answer=prediction.answer
            )
            if fact_check.is_correct:
                break
            # Feed the incorrect answer back as extra context and retry
            refined_context = f"{context}\n\nPrevious incorrect answer: {prediction.answer}\nPlease provide a more accurate answer."
            prediction = self.qa_predictor(context=refined_context, question=question)

        return prediction
We create an AdvancedQA module to add self-correction capability to our QA system. It first uses a Chain-of-Thought predictor to generate an answer with reasoning. Then, it checks the factual accuracy using a fact-checking predictor. If the answer is incorrect, we refine the context and retry, up to a specified number of times, to ensure more reliable outputs.
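The module can also be used on its own, outside the retrieval pipeline built later. A small usage sketch, with a context string adapted from the training examples further below:
# Standalone usage of the self-correcting QA module.
adv_qa = AdvancedQA(max_retries=2)
pred = adv_qa(
    context="Python is a high-level programming language created by Guido van Rossum. It was first released in 1991.",
    question="Who created Python?"
)
print(pred.answer)     # expected: "Guido van Rossum"
print(pred.reasoning)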
class SimpleRAG(dspy.Module):
    def __init__(self, knowledge_base: List[str]):
        super().__init__()
        self.knowledge_base = knowledge_base
        self.qa_system = AdvancedQA()

    def retrieve(self, question: str, top_k: int = 2) -> str:
        # Simple keyword-overlap retrieval (in practice, use vector embeddings)
        scored_docs = []
        question_words = set(question.lower().split())
        for doc in self.knowledge_base:
            doc_words = set(doc.lower().split())
            score = len(question_words.intersection(doc_words))
            scored_docs.append((score, doc))
        # Return the top-k most relevant documents, highest overlap first
        scored_docs.sort(key=lambda x: x[0], reverse=True)
        return "\n\n".join([doc for _, doc in scored_docs[:top_k]])

    def forward(self, question: str) -> dspy.Prediction:
        context = self.retrieve(question)
        return self.qa_system(context=context, question=question)
We build a SimpleRAG module to simulate Retrieval-Augmented Generation using DSPy. We provide a knowledge base and implement a basic keyword-based retriever to fetch the most relevant documents for a given question. These documents serve as context for the AdvancedQA module, which then performs reasoning and self-correction to produce an accurate answer.
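To see the keyword-overlap scoring in action before wiring it into the full demo, here is a small illustrative sketch with a two-document knowledge base (the documents are adapted from the training examples below):
# Illustrative retrieval check with a tiny knowledge base.
demo_kb = [
    "The Eiffel Tower is located in Paris, France and stands 330 meters tall including antennas.",
    "Python is a high-level programming language created by Guido van Rossum, first released in 1991.",
]
demo_rag = SimpleRAG(demo_kb)
# The Python document shares more keywords with the question, so it is ranked first.
print(demo_rag.retrieve("Who created Python programming language?", top_k=1))
result = demo_rag("Who created Python programming language?")
print(result.answer)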
knowledge_base = [
    "Use Your Context and Knowledge Base Here"
]
training_examples = [
    dspy.Example(
        question="What is the height of the Eiffel Tower?",
        context="The Eiffel Tower is located in Paris, France. It was constructed from 1887 to 1889 and stands 330 meters tall including antennas.",
        answer="330 meters"
    ).with_inputs("question", "context"),
    dspy.Example(
        question="Who created Python programming language?",
        context="Python is a high-level programming language created by Guido van Rossum. It was first released in 1991 and emphasizes code readability.",
        answer="Guido van Rossum"
    ).with_inputs("question", "context"),
    dspy.Example(
        question="What is machine learning?",
        context="ML focuses on algorithms that can learn from data without being explicitly programmed.",
        answer="Machine learning focuses on algorithms that learn from data without explicit programming."
    ).with_inputs("question", "context")
]
We define a knowledge base to serve as our context source for retrieval; here it is a placeholder that you replace with your own documents covering topics such as history, programming, and science. Alongside it, we prepare a set of training examples to guide DSPy’s optimization process. Each example includes a question, its relevant context, and the correct answer, helping our system learn how to respond more accurately.
def accuracy_metric(example, prediction, trace=None):
    """Simple accuracy metric for evaluation"""
    return example.answer.lower() in prediction.answer.lower()
print("🚀 Initializing DSPy QA System with Gemini...")
print("📝 Note: Using Google's Gemini 1.5 Flash (free tier)")
rag_system = SimpleRAG(knowledge_base)
basic_qa = dspy.ChainOfThought(QuestionAnswering)
print("n📊 Before Optimization:")
test_question = "What is the height of the Eiffel Tower?"
test_context = knowledge_base[0]
initial_prediction = basic_qa(context=test_context, question=test_question)
print(f"Q: {test_question}")
print(f"A: {initial_prediction.answer}")
print(f"Reasoning: {initial_prediction.reasoning}")
print("n🔧 Optimizing with BootstrapFewShot...")
optimizer = dspy.BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=2)
optimized_qa = optimizer.compile(basic_qa, trainset=training_examples)
print("n📈 After Optimization:")
optimized_prediction = optimized_qa(context=test_context, question=test_question)
print(f"Q: {test_question}")
print(f"A: {optimized_prediction.answer}")
print(f"Reasoning: {optimized_prediction.reasoning}")
We begin by defining a simple accuracy metric to check if the predicted answer contains the correct response. After initializing our SimpleRAG system and a baseline ChainOfThought QA module, we test it on a sample question before any optimization. Then, using DSPy’s BootstrapFewShot optimizer, we compile the QA module with our training examples. The optimizer automatically bootstraps effective few-shot demonstrations into the prompt, leading to improved accuracy, which we verify by comparing responses before and after optimization.
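If we want to reuse the compiled program in a later session, DSPy modules can be persisted to disk and reloaded. A brief sketch, assuming the standard save/load interface on DSPy programs (the file name is arbitrary):
# Persist the optimized program and reload it into a fresh module of the same shape.
optimized_qa.save("optimized_qa.json")
reloaded_qa = dspy.ChainOfThought(QuestionAnswering)
reloaded_qa.load("optimized_qa.json")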
def evaluate_system(qa_module, test_cases):
    """Evaluate QA system performance"""
    correct = 0
    total = len(test_cases)

    for example in test_cases:
        prediction = qa_module(context=example.context, question=example.question)
        if accuracy_metric(example, prediction):
            correct += 1

    return correct / total
print(f"n📊 Evaluation Results:")
print(f"Basic QA Accuracy: {evaluate_system(basic_qa, training_examples):.2%}")
print(f"Optimized QA Accuracy: {evaluate_system(optimized_qa, training_examples):.2%}")
print("n✅ Tutorial Complete! Key DSPy Concepts Demonstrated:")
print("1. 🔤 Signatures - Defined input/output schemas")
print("2. 🏗️ Modules - Built composable QA systems")
print("3. 🔄 Self-correction - Implemented iterative improvement")
print("4. 🔍 RAG - Created retrieval-augmented generation")
print("5. ⚡ Optimization - Used BootstrapFewShot to improve prompts")
print("6. 📊 Evaluation - Measured system performance")
print("7. 🆓 Free API - Powered by Google Gemini 1.5 Flash")
We run an Advanced RAG demo by asking multiple questions across different domains. For each question, the SimpleRAG system retrieves the most relevant context and then uses the self-correcting AdvancedQA module to generate a well-reasoned answer. We print the answers along with a preview of the reasoning, showcasing how DSPy combines retrieval and thoughtful generation to deliver reliable responses.
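The demo code itself is not reproduced above, but a minimal sketch of such a loop, reusing the rag_system defined earlier (the questions are illustrative and assume the knowledge base has been filled in), looks like this:
# Advanced RAG demo sketch: ask several questions and show answers with reasoning previews.
demo_questions = [
    "What is the height of the Eiffel Tower?",
    "Who created Python programming language?",
    "What is machine learning?",
]
print("\n🔍 Advanced RAG Demo:")
for q in demo_questions:
    pred = rag_system(q)
    print(f"\nQ: {q}")
    print(f"A: {pred.answer}")
    print(f"Reasoning preview: {pred.reasoning[:150]}...")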
In conclusion, we have successfully demonstrated the full potential of DSPy for building advanced QA pipelines. We see how DSPy simplifies the design of intelligent modules with clear interfaces, supports self-correction loops, integrates basic retrieval, and enables few-shot prompt optimization with minimal code. With just a few lines, we configure and evaluate our models using real-world examples, measuring performance gains. This hands-on experience shows how DSPy, when combined with Google’s Gemini API, empowers us to rapidly prototype, test, and scale sophisticated language applications without boilerplate or complex logic.