In this tutorial, we walk through setting up an advanced AI Agent using Microsoft’s Agent-Lightning framework. We run everything directly inside Google Colab, which means we can experiment with both the server and client components in one place. By defining a small QA agent, connecting it to a local Agent-Lightning server, and then training it with multiple system prompts, we can observe how the framework supports resource updates, task queuing, and automated evaluation.
!pip -q install agentlightning openai nest_asyncio python-dotenv > /dev/null
import os, threading, time, asyncio, nest_asyncio, random
from getpass import getpass
from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.server import AgentLightningServer
from agentlightning.types import PromptTemplate
import openai
if not os.getenv("OPENAI_API_KEY"):
    try:
        os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter OPENAI_API_KEY (leave blank if using a local/proxy base): ") or ""
    except Exception:
        pass
MODEL = os.getenv("MODEL", "gpt-4o-mini")
We begin by installing the required libraries and importing all the core modules we need for Agent-Lightning. We also set up our OpenAI API key securely and define the model we will use throughout the tutorial.
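If you route requests through a local or OpenAI-compatible proxy endpoint (as the getpass prompt hints), you also need to point the SDK at that base URL. Below is a minimal sketch, assuming the openai>=1.x Python SDK; the URL is a placeholder, not part of the original setup.

# Optional: route requests through a local or OpenAI-compatible proxy.
# The URL below is a placeholder -- replace it with your own gateway.
import os
import openai

os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:8000/v1")  # hypothetical endpoint
openai.base_url = os.environ["OPENAI_BASE_URL"]  # the module-level client picks this up before the first call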
class QAAgent(LitAgent):
    def training_rollout(self, task, rollout_id, resources):
        """Given a task {'prompt':..., 'answer':...}, ask the LLM using the
        server-provided system prompt and return a reward in [0, 1]."""
        sys_prompt = resources["system_prompt"].template
        user = task["prompt"]
        gold = task.get("answer", "").strip().lower()
        try:
            r = openai.chat.completions.create(
                model=MODEL,
                messages=[{"role": "system", "content": sys_prompt},
                          {"role": "user", "content": user}],
                temperature=0.2,
            )
            pred = r.choices[0].message.content.strip()
        except Exception as e:
            pred = f"[error]{e}"

        def score(pred, gold):
            P = pred.lower()
            base = 1.0 if gold and gold in P else 0.0          # exact containment of the gold answer
            gt = set(gold.split()); pr = set(P.split())
            inter = len(gt & pr); denom = (len(gt) + len(pr)) or 1
            overlap = 2 * inter / denom                        # token-overlap (F1-style) credit
            # NOTE: the original snippet is truncated from here; the 8-token cutoff and
            # the 0.7/0.2 weights below are an assumed reconstruction of the heuristic.
            brevity = 0.2 if base == 1.0 and len(P.split()) <= 8 else 0.0
            return min(1.0, 0.7 * base + 0.2 * overlap + brevity)

        return score(pred, gold)
We define a simple QAAgent by extending LitAgent, where we handle each training rollout by sending the user’s prompt to the LLM, collecting the response, and scoring it against the gold answer. We design the reward function to verify correctness, token overlap, and brevity, enabling the agent to learn to produce concise and accurate outputs.
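To see how the reward behaves before wiring it into the server loop, you can run a quick standalone sanity check. The demo_score helper below mirrors the score() logic above; as noted in the code, the brevity cutoff and weights are assumptions since the original snippet is truncated at that point.

# Quick sanity check of the reward heuristic outside the agent.
def demo_score(pred: str, gold: str) -> float:
    P, G = pred.lower(), gold.lower()
    base = 1.0 if G and G in P else 0.0                       # correctness: gold answer contained in prediction
    gt, pr = set(G.split()), set(P.split())
    overlap = 2 * len(gt & pr) / ((len(gt) + len(pr)) or 1)   # token-overlap credit
    brevity = 0.2 if base == 1.0 and len(P.split()) <= 8 else 0.0  # assumed brevity bonus
    return min(1.0, 0.7 * base + 0.2 * overlap + brevity)

print(demo_score("Paris", "Paris"))                            # short and correct -> 1.0
print(demo_score("The capital of France is Paris.", "Paris"))  # correct but wordier -> lower
print(demo_score("Lyon", "Paris"))                             # wrong -> 0.0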
TASKS = [
    {"prompt": "Capital of France?", "answer": "Paris"},
    {"prompt": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"},
    {"prompt": "2+2 = ?", "answer": "4"},
]

PROMPTS = [
    "You are a terse expert. Answer with only the final fact, no sentences.",
    "You are a helpful, knowledgeable AI. Prefer concise, correct answers.",
    "Answer as a rigorous evaluator; return only the canonical fact.",
    "Be a friendly tutor. Give the one-word answer if obvious.",
]
nest_asyncio.apply()
HOST, PORT = "127.0.0.1", 9997
We define a tiny benchmark with three QA tasks and curate multiple candidate system prompts to optimize. We then apply nest_asyncio and set our local server host and port, allowing us to run the Agent-Lightning server and clients within a single Colab runtime.
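The nest_asyncio step matters because the Colab/Jupyter kernel already runs an asyncio event loop, so a plain asyncio.run() inside a cell would normally raise "RuntimeError: asyncio.run() cannot be called from a running event loop". Applying nest_asyncio patches the loop so the blocking call later in the tutorial works. A minimal illustration of the pattern:

import asyncio
import nest_asyncio

nest_asyncio.apply()  # allow re-entrant use of the already-running notebook loop

async def ping():
    await asyncio.sleep(0.1)
    return "pong"

# Without nest_asyncio.apply(), this call would fail inside a notebook cell
# because Jupyter's own event loop is already running.
print(asyncio.run(ping()))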
async def run_server_and_search():
    server = AgentLightningServer(host=HOST, port=PORT)
    await server.start()
    print("✅ Server started")
    await asyncio.sleep(1.5)

    results = []
    for sp in PROMPTS:
        await server.update_resources({"system_prompt": PromptTemplate(template=sp, engine="f-string")})
        scores = []
        for t in TASKS:
            tid = await server.queue_task(sample=t, mode="train")
            rollout = await server.poll_completed_rollout(tid, timeout=40)  # waits for a worker
            if rollout is None:
                print("⏳ Timeout waiting for rollout; continuing...")
                continue
            scores.append(float(getattr(rollout, "final_reward", 0.0)))
        avg = sum(scores) / len(scores) if scores else 0.0
        print(f"🔎 Prompt avg: {avg:.3f} | {sp}")
        results.append((sp, avg))

    best = max(results, key=lambda x: x[1]) if results else ("", 0)
    print("\n🏁 BEST PROMPT:", best[0], " | score:", f"{best[1]:.3f}")
    await server.stop()
We start the Agent-Lightning server and iterate through our candidate system prompts, updating the shared system_prompt before queuing each training task. We then poll for completed rollouts, compute average rewards per prompt, report the best-performing prompt, and gracefully stop the server.
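The key coupling in this loop is between the resource name published by the server and the lookup inside the agent: update_resources registers a PromptTemplate under the key "system_prompt", and every rollout reads resources["system_prompt"].template. A minimal standalone sketch of that contract (not tied to a running server) looks like this:

from agentlightning.types import PromptTemplate

# Mirrors the resource dict the server publishes and the agent consumes.
resources = {"system_prompt": PromptTemplate(template="You are a terse expert.", engine="f-string")}

sys_prompt = resources["system_prompt"].template  # exactly what QAAgent.training_rollout() reads
print(sys_prompt)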
def run_client_in_thread():
    agent = QAAgent()
    trainer = Trainer(n_workers=2)
    trainer.fit(agent, backend=f"http://{HOST}:{PORT}")

client_thr = threading.Thread(target=run_client_in_thread, daemon=True)
client_thr.start()

asyncio.run(run_server_and_search())
We launch the client in a separate thread with two parallel workers, allowing it to process tasks sent by the server. At the same time, we run the server loop, which evaluates different prompts, collects rollout results, and reports the best system prompt based on average reward.
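Because the client runs as a daemon thread, the notebook will not hang even if trainer.fit() is still blocking after the server stops. If you want an explicit tidy-up step once the search finishes, a small optional addition (not part of the original code) is:

# Optional tidy-up after the search: give the background client a moment to wind down.
client_thr.join(timeout=5)
print("Client thread alive:", client_thr.is_alive())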
In conclusion, we see how Agent-Lightning enables us to create a flexible agent pipeline with only a few lines of code. We start a server, run parallel client workers, evaluate different system prompts, and automatically measure performance, all within a single Colab environment. This demonstrates how the framework streamlines building, testing, and optimizing AI agents in a structured way.