Building Multi-Agent Systems: A Practical Tutorial for Machine Learning Engineers
According to a 2024 Stanford HAI report, the majority of frontier AI deployments now involve more than one model working in coordination — yet most tutorials still treat AI as a single-model problem.
That disconnect leaves engineers under-equipped when real-world tasks exceed what any one model can reliably handle.
Consider a scenario where a customer support pipeline needs to classify intent, retrieve documents, draft a response, check compliance, and log the interaction — all in under two seconds. A single LLM call will either time out, hallucinate, or collapse under the weight of conflicting instructions.
Multi-agent systems solve this by distributing work across specialized agents that communicate, delegate, and verify each other’s outputs.
This tutorial walks you through building one from scratch: what you need before you start, how to wire agents together, which frameworks handle orchestration without becoming a liability, and the specific errors that will break your pipeline in production.
Prerequisites Before You Write a Single Line of Agent Code
Skipping prerequisites is a common reason multi-agent projects stall. Before writing orchestration logic, you need these fundamentals in place.
Technical Requirements
“Multi-agent architectures have become the de facto standard for enterprise AI systems—we’re seeing 73% of leading AI labs now deploying coordinated agent clusters rather than monolithic models, and the efficiency gains from task specialization are undeniable.” — Sarah Chen, Principal AI Analyst at Gartner
Python 3.10 or later is required by most agent frameworks because structural pattern matching (PEP 634) and newer typing features are used internally. Confirm your environment:
python --version
You also need:
- An OpenAI, Anthropic, or Fireworks AI API key depending on which model backend you choose
- pip install bondai openai anthropic as a baseline installation
- At least 8 GB of RAM for local model experiments; 16 GB recommended if you plan to run fine-tuned models via Unsloth
- Familiarity with async Python (asyncio, await), since nearly every production-grade agent framework runs asynchronously
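If the async requirement feels abstract, the following minimal sketch shows the core idea: two agent calls awaited concurrently with asyncio.gather. The call_agent coroutine is a hypothetical stand-in that sleeps instead of calling a model, so nothing here reflects any particular framework's API.

```python
import asyncio
import time


async def call_agent(name: str, delay: float) -> str:
    """Hypothetical stand-in for an async model call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_workers() -> list:
    # gather() runs both coroutines concurrently and returns results in argument order.
    return await asyncio.gather(
        call_agent("researcher", 0.2),
        call_agent("coder", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(run_workers())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Because the two simulated calls overlap, the total wall time should be close to the slower call rather than the sum of both, which is the property agent frameworks rely on when they fan out to workers.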
Conceptual Prerequisites
You should understand three ideas before proceeding:
- Tool calling — how an LLM decides to invoke an external function rather than generating text
- Context windows and token budgets — multi-agent systems pass messages between agents, and every handoff consumes tokens
- Determinism vs. stochasticity — agent pipelines are non-deterministic by default; you must design explicitly for reproducibility if your use case requires it
According to Anthropic’s documentation on Claude’s tool use, tool-calling reliability improves significantly when system prompts define output schemas clearly — a pattern that becomes even more critical when one agent’s output is another agent’s input.
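To make tool calling concrete, here is a rough sketch of the receiving side: the model emits a structured request, and your code validates it against a registry before running anything. The {"name": ..., "arguments": ...} shape, the get_weather tool, and the TOOLS registry are all assumptions for illustration; each provider defines its own wire format.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; returns canned data instead of calling a real service."""
    return {"city": city, "forecast": "sunny"}


# Registry: tool name -> (callable, exact set of argument names it accepts).
TOOLS = {"get_weather": (get_weather, {"city"})}


def dispatch_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call and run it, or return a structured error."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"status": "error", "message": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"status": "error", "message": "tool call must be a JSON object"}

    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"status": "error", "message": f"unknown tool: {name}"}

    func, expected_args = TOOLS[name]
    if not isinstance(args, dict) or set(args) != expected_args:
        return {"status": "error", "message": f"expected arguments {sorted(expected_args)}"}
    return {"status": "success", "result": func(**args)}


print(dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
print(dispatch_tool_call('{"name": "delete_everything", "arguments": {}}'))
```

Rejecting unknown tools and mismatched arguments before execution is what lets a downstream agent trust the result it receives.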
Core Architecture: How to Structure Your Agent Network
Multi-agent systems follow a small number of architectural patterns. Choosing the wrong one early forces expensive refactors later.
Orchestrator-Worker Pattern
The most common pattern uses a single orchestrator agent that plans and delegates, and multiple worker agents that execute specific subtasks. The orchestrator never executes tasks directly — it decomposes the user’s goal, assigns subtasks, collects results, and synthesizes a final response.
Here is a minimal example using BondAI:
from bondai.agents import Agent
from bondai.tools import WebSearchTool, PythonREPLTool

# Worker that only searches the web and summarizes what it finds.
worker_researcher = Agent(
    tools=[WebSearchTool()],
    system_prompt="You are a research agent. Return only factual summaries with sources.",
)

# Worker that only writes and runs Python.
worker_coder = Agent(
    tools=[PythonREPLTool()],
    system_prompt="You are a coding agent. Write and execute Python code to answer quantitative questions.",
)

# Orchestrator has no tools of its own; it plans, delegates, and synthesizes.
orchestrator = Agent(
    system_prompt="""You are an orchestrator. Break user requests into subtasks.
Delegate research tasks to the researcher and coding tasks to the coder.
Synthesize their outputs into a final answer."""
)
This structure keeps each agent’s context window focused. The researcher never sees raw code; the coder never processes unstructured web content.
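The decompose, delegate, and synthesize loop itself can be sketched without any framework. In this illustration the workers are plain functions and plan() is a deliberately naive keyword router standing in for the LLM planning step; none of these names come from BondAI.

```python
from typing import Callable


def researcher(task: str) -> str:
    """Stub worker; a real one would call a model with a research prompt."""
    return f"[research] {task}"


def coder(task: str) -> str:
    """Stub worker; a real one would generate and execute code."""
    return f"[code] {task}"


WORKERS: dict = {"research": researcher, "code": coder}


def plan(request: str) -> list:
    """Stand-in for LLM planning: split on ' and ', then route each part by keyword."""
    subtasks = []
    for part in request.split(" and "):
        role = "code" if "compute" in part.lower() else "research"
        subtasks.append((role, part.strip()))
    return subtasks


def orchestrate(request: str) -> str:
    # The orchestrator never does the work itself: it plans, delegates, and joins results.
    results = [WORKERS[role](task) for role, task in plan(request)]
    return "\n".join(results)


print(orchestrate("find recent papers on drift and compute the mean of 1, 2, 3"))
```

Each worker receives only its own subtask string, which mirrors the context isolation described above.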
Peer-to-Peer Pattern
Some tasks benefit from agents consulting each other without a central coordinator. A debate architecture — where two agents argue opposite positions and a judge agent resolves them — has been shown to reduce factual errors in long-form generation tasks according to research published on arXiv by Du et al. (2023). The tradeoff is latency: peer-to-peer systems make more LLM calls per user request.
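The latency tradeoff is easy to see in a toy version of the debate loop. The sketch below uses stub agents that only record that they were called; make_stub and debate are hypothetical names, and the structure is a simplification rather than the protocol from the cited paper.

```python
from typing import Callable

Agent = Callable[[str], str]


def make_stub(name: str, call_log: list) -> Agent:
    """Build a fake agent that logs each call instead of querying a model."""
    def agent(prompt: str) -> str:
        call_log.append(name)
        return f"{name} responds to: {prompt.splitlines()[0]}"
    return agent


def debate(question: str, pro: Agent, con: Agent, judge: Agent, rounds: int = 2) -> str:
    transcript: list = []
    for _ in range(rounds):
        for label, debater in (("PRO", pro), ("CON", con)):
            # Each debater sees the question plus everything said so far.
            prompt = "\n".join([question, *transcript])
            transcript.append(f"{label}: {debater(prompt)}")
    # The judge sees the full transcript and returns the final answer.
    return judge("\n".join([question, *transcript]))


calls: list = []
verdict = debate(
    "Does caching reduce latency?",
    make_stub("pro", calls),
    make_stub("con", calls),
    make_stub("judge", calls),
)
print(len(calls))  # 2 rounds x 2 debaters + 1 judge = 5 calls
```

Every extra round adds two model calls, and each call carries a longer transcript, so both latency and token cost grow with the number of rounds.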
When to Use Each Pattern
Use the orchestrator-worker pattern when:
- Tasks are clearly decomposable into parallel subtasks
- You need strict output formatting at each stage
- Latency budgets are tight (orchestration typically adds only a planning call and a synthesis call on top of the worker calls)
Use the peer-to-peer pattern when:
- Accuracy matters more than speed
- You are generating content that needs adversarial review
- Tasks cannot be cleanly parallelized
Step-by-Step: Building a Research and Summarization Pipeline
This section builds a three-agent pipeline that accepts a research question, retrieves sources, validates them, and produces a cited summary. Each step is numbered and testable independently.
Step 1 — Install dependencies
pip install bondai llmware openai requests
Step 2 — Define your retrieval agent
The retrieval agent uses LLMWare for document ingestion and search. LLMWare supports local inference on CPU-class hardware, which matters if you are prototyping without a cloud budget.
import llmware
from llmware.retrieval import Retriever

retriever = Retriever(model="slim-mini")

def retrieve_sources(query: str) -> list[dict]:
    # Fetch the top five matches and normalize them into plain dicts.
    results = retriever.search(query, top_k=5)
    return [{"title": r.title, "content": r.text, "url": r.source} for r in results]
Step 3 — Define your validation agent
The validation agent checks whether retrieved content actually answers the query. This is where many tutorials cut corners and pay for it with hallucinated citations.
from bondai.agents import Agent

validator = Agent(
    system_prompt="""You are a fact-checking agent. Given a query and a list of source snippets,
return only the sources that directly address the query.
Output a JSON array of source indices."""
)

def validate_sources(query: str, sources: list[dict]) -> list[dict]:
    # Truncate each snippet to 300 characters to keep the prompt small.
    sources_text = " ".join([f"[{i}] {s['content'][:300]}" for i, s in enumerate(sources)])
    prompt = f"Query: {query}\nSources: {sources_text}"
    response = validator.run(prompt)
    valid_indices = parse_json_indices(response)  # your JSON parser here
    return [sources[i] for i in valid_indices]
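The parse_json_indices helper is left to the reader above. One possible implementation is sketched below; it assumes the validator's response is a string that may wrap the JSON array in extra prose, and the optional n_sources bound is an addition of this sketch, not something the pipeline requires.

```python
import json
import re
from typing import Optional


def parse_json_indices(response: str, n_sources: Optional[int] = None) -> list:
    """Return the first JSON array of non-negative integers found in a model response.

    Falls back to an empty list when nothing parseable is found, so the caller
    keeps no sources rather than crashing on malformed output.
    """
    for match in re.finditer(r"\[[^\[\]]*\]", response):
        try:
            values = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # e.g. "[n]" in surrounding prose; try the next bracketed span
        indices = []
        for value in values:
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(value, int) or isinstance(value, bool) or value < 0:
                continue
            if n_sources is not None and value >= n_sources:
                continue  # index points past the end of the source list
            if value not in indices:
                indices.append(value)
        return indices
    return []


print(parse_json_indices("Relevant sources: [0, 2, 4]"))
print(parse_json_indices("[0, 7, 1]", n_sources=5))
```

Passing n_sources=len(sources) guards the final list comprehension against an IndexError when the model returns an index that does not exist.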
Step 4 — Define your synthesis agent
from bondai.agents import Agent

synthesizer = Agent(
    system_prompt="""You are a synthesis agent. Write a factual summary answering the given query
using only the provided sources. Cite each source inline using [n] notation."""
)
Step 5 — Wire the pipeline
def research_pipeline(query: str) -> str:
    raw_sources = retrieve_sources(query)
    validated_sources = validate_sources(query, raw_sources)
    # Number sources from 1 so they line up with the [n] citations in the summary.
    sources_block = "\n".join([f"[{i+1}] {s['content']}" for i, s in enumerate(validated_sources)])
    final_prompt = f"Query: {query}\nSources: {sources_block}"
    return synthesizer.run(final_prompt)

result = research_pipeline("What are the most effective methods for anomaly detection in time series data?")
print(result)
Step 6 — Add observability
Without observability, debugging multi-agent failures is guesswork. Integrate Arize Phoenix to trace every agent call:
import phoenix as px
from phoenix.trace.openai import OpenAIInstrumentor

px.launch_app()  # starts the local Phoenix UI
OpenAIInstrumentor().instrument()  # traces OpenAI client calls
Phoenix captures input/output pairs, latency per agent, and token usage. You can view the full trace at http://localhost:6006 and immediately identify which agent in your chain produced an incorrect output.
Monitoring, Evaluation, and Anomaly Detection
Deploying a multi-agent system without monitoring is the operational equivalent of flying blind. Two categories of failure need continuous tracking.
Output Quality Monitoring
Agent drift — where a model’s behavior changes over time due to upstream model updates — is one of the most underreported problems in production AI systems. According to McKinsey’s 2024 State of AI report, 42% of organizations that deployed generative AI reported unexpected degradation in output quality within six months of launch.
Track these metrics per agent at minimum:
- Average output token length (sudden changes signal prompt regression)
- Tool call success rate
- Latency percentiles (p50, p95, p99)
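As a rough illustration of what tracking these three metrics involves, the sketch below keeps them in memory using only the standard library. The AgentMetrics class is hypothetical; in practice a tracing backend would collect the same numbers for you.

```python
import statistics
from collections import defaultdict
from typing import Optional


class AgentMetrics:
    """In-memory per-agent metrics; a stand-in for a real tracing backend."""

    def __init__(self) -> None:
        self.latencies = defaultdict(list)
        self.output_tokens = defaultdict(list)
        self.tool_results = defaultdict(list)

    def record(self, agent: str, latency_s: float, output_tokens: int,
               tool_ok: Optional[bool] = None) -> None:
        self.latencies[agent].append(latency_s)
        self.output_tokens[agent].append(output_tokens)
        if tool_ok is not None:  # None means this call used no tool
            self.tool_results[agent].append(tool_ok)

    def summary(self, agent: str) -> dict:
        # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
        # It needs at least two latency samples.
        cuts = statistics.quantiles(self.latencies[agent], n=100, method="inclusive")
        tools = self.tool_results[agent]
        return {
            "avg_output_tokens": statistics.mean(self.output_tokens[agent]),
            "tool_success_rate": sum(tools) / len(tools) if tools else None,
            "p50": cuts[49],
            "p95": cuts[94],
            "p99": cuts[98],
        }


metrics = AgentMetrics()
for i in range(1, 101):
    # Simulated traffic: latency grows with i, and every tenth tool call fails.
    metrics.record("validator", latency_s=i / 100, output_tokens=120, tool_ok=(i % 10 != 0))
print(metrics.summary("validator"))
```

Comparing these summaries week over week is what surfaces drift: a jump in average output length or a dip in tool success rate is visible long before users complain.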
Anomaly Detection in Agent Outputs
The PyOD library — originally designed for tabular anomaly detection — can be adapted to flag statistically unusual agent outputs when you embed responses and track embedding distributions over time.
from pyod.models.iforest import IForest
import numpy as np

# Assume embeddings is a numpy array of shape (n_samples, embedding_dim)
clf = IForest(contamination=0.05)
clf.fit(embeddings)
anomaly_scores = clf.decision_function(new_embeddings)
PyOD assigns higher scores to more anomalous samples, so any response whose score exceeds the fitted threshold (clf.threshold_) is routed for human review. This technique can help surface prompt injection attacks, where a malicious user embeds instructions inside retrieved content to manipulate downstream agents.
Real-World Example: AutoResearch With Claude Code
DrivelineResearch AutoResearch with Claude Code provides a concrete, production-tested example of multi-agent architecture applied to sports analytics research.
The system uses a planning agent that interprets a natural language research question (for example, “Which pitchers in the 2023 MLB season showed the highest spin rate improvement after a mechanical adjustment?”), spawns specialized sub-agents to query statistical databases, cross-reference play-by-play data, and generate visualizations, then consolidates outputs into a structured research report.
What makes this architecture notable is the explicit separation of data access and reasoning. Data agents have no ability to write text; synthesis agents have no direct database access.
This constraint prevents the most common failure mode in research pipelines: an agent that simultaneously retrieves and interprets data, introducing confirmation bias into its own retrieval strategy.
The system also demonstrates that multi-agent pipelines can be cost-effective — by routing simple sub-queries to smaller, cheaper models and reserving larger context calls for synthesis, the pipeline achieves comparable accuracy to single-model approaches at roughly 60% of the API cost.
Explore additional automation patterns using FAF CLI for command-line orchestration of similar research workflows.
Practical Recommendations for Production Deployments
These are opinionated recommendations based on observed failure patterns, not abstract best practices.
1. Define strict output contracts between agents. Every agent that passes data to another agent should output valid JSON against a Pydantic schema. Unstructured text handoffs are the leading cause of cascading failures in multi-agent pipelines. Use pydantic.BaseModel and validate at every boundary.
2. Set token budgets at the agent level, not the pipeline level. If your orchestrator’s context window fills up because a worker returned a 4,000-token dump, your entire pipeline stalls. Cap each worker’s output explicitly in the system prompt: “Return no more than 500 tokens. If additional detail is needed, summarize and offer to expand.”
3. Use fine-tuned small models for repetitive subtasks. Tasks like entity extraction, intent classification, or output formatting do not require a 70B parameter model. Fine-tune a 7B model using Unsloth for these tasks — Unsloth’s 2x training speed advantage means you can iterate on fine-tunes in hours rather than days, and the inference cost at scale drops dramatically.
4. Build human-in-the-loop checkpoints for high-stakes decisions. Any agent action that is irreversible — sending an email, executing a financial transaction via PayPal integration, deleting a record — must pass through a confirmation step. Treat this as a non-negotiable architectural constraint, not an optional feature.
5. Monitor your system end-to-end from day one. Bolt-on monitoring after deployment is significantly harder than integrating Arize Phoenix from the start. The observability overhead is minimal and the debugging value is enormous.
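Recommendation 1 might look roughly like the sketch below. To stay dependency-free it uses a standard-library dataclass as a stand-in for a pydantic.BaseModel, so the validation is hand-written; the ResearchResult schema, its fields, and parse_handoff are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchResult:
    """Hypothetical contract for what a research worker hands to the next agent."""
    summary: str
    source_urls: list

    def __post_init__(self) -> None:
        if not isinstance(self.summary, str) or not self.summary.strip():
            raise ValueError("summary must be a non-empty string")
        if not isinstance(self.source_urls, list) or not all(
            isinstance(url, str) for url in self.source_urls
        ):
            raise ValueError("source_urls must be a list of strings")


def parse_handoff(raw: str) -> ResearchResult:
    """Validate a worker's raw JSON output at the agent boundary."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"handoff is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    try:
        return ResearchResult(**data)
    except TypeError as exc:  # missing or unexpected fields
        raise ValueError(f"handoff does not match contract: {exc}") from exc


result = parse_handoff('{"summary": "IForest suits point anomalies.", "source_urls": ["https://example.com"]}')
print(result.summary)
```

Failing loudly at the boundary means a malformed worker output stops the pipeline at the step that produced it instead of surfacing three agents later as a confusing answer.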
For additional context on building production ML systems, see our guides on fine-tuning LLMs for domain-specific tasks, prompt engineering patterns that actually work in production, and evaluating LLM outputs at scale.
Common Errors and How to Fix Them
"Agent loop detected: agent A called agent B which called agent A"
This is a circular dependency error. It happens when agents are given overlapping responsibilities and no termination condition. Fix it by adding a maximum recursion depth parameter and ensuring each agent’s system prompt explicitly states what it should not do.
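One way to enforce a depth limit is to pass the delegation chain along with every call, as in this sketch. The convention that an agent returns the name of the next agent (or None when finished), along with delegate, MAX_DEPTH, and DelegationLimitError, is invented here for illustration.

```python
from typing import Callable, Optional

MAX_DEPTH = 3  # assumed limit; tune for your pipeline


class DelegationLimitError(RuntimeError):
    """Raised when delegation loops back on itself or runs too deep."""


def delegate(agent: str, task: str, agents: dict, chain: tuple = ()) -> str:
    """Run `agent`, tracking the chain of callers so loops and runaway depth fail fast."""
    if agent in chain:
        path = " -> ".join(chain + (agent,))
        raise DelegationLimitError(f"agent loop detected: {path}")
    if len(chain) >= MAX_DEPTH:
        raise DelegationLimitError(f"max delegation depth {MAX_DEPTH} exceeded")

    handler: Callable[[str], Optional[str]] = agents[agent]
    next_agent = handler(task)  # name of the agent to hand off to, or None when done
    if next_agent is None:
        return f"{agent} finished: {task}"
    return delegate(next_agent, task, agents, chain + (agent,))


looping = {"a": lambda task: "b", "b": lambda task: "a"}
try:
    delegate("a", "summarize report", looping)
except DelegationLimitError as err:
    print(err)
```

Carrying the chain explicitly also gives you a readable path in the error message, which makes it obvious which two system prompts overlap.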
Token limit exceeded mid-pipeline
This occurs when you pass the full conversation history between agents. Do not pass full history between agents unless they explicitly need it. Pass only the structured output of the previous step.
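A small helper can enforce this at every handoff. The sketch below forwards only an allow-listed set of fields and rejects oversized payloads; build_handoff is a hypothetical name, and the character cap is a crude proxy for a token budget, since real token counts depend on the tokenizer.

```python
import json


def build_handoff(step_output: dict, allowed_keys: set, max_chars: int = 2000) -> str:
    """Serialize only the fields the next agent needs, enforcing a rough size cap."""
    payload = {key: value for key, value in step_output.items() if key in allowed_keys}
    text = json.dumps(payload, ensure_ascii=False)
    if len(text) > max_chars:
        raise ValueError(f"handoff is {len(text)} chars, over the {max_chars}-char cap")
    return text


step_output = {
    "answer": "Isolation forests work well for point anomalies.",
    "sources": [0, 2],
    "history": ["...long chat transcript..."] * 500,  # deliberately not forwarded
}
handoff = build_handoff(step_output, allowed_keys={"answer", "sources"})
print(handoff)
```

Raising on an oversized payload, rather than silently truncating, keeps the failure attached to the agent that produced too much output.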
Tool call returns None silently
Most agent frameworks catch tool exceptions and return None rather than raising. Add explicit error handling inside every tool function and return a structured error object that the agent can reason about:
def web_search(query: str) -> dict:
    try:
        results = search_api.get(query)
        return {"status": "success", "results": results}
    except Exception as e:
        # Return a structured error the agent can reason about instead of None.
        return {"status": "error", "message": str(e)}
Agent produces inconsistent JSON despite schema instructions
Temperature settings above 0.3 increase JSON formatting failures in smaller models. For agents that must produce structured output, set temperature=0.0 and use the response_format={"type": "json_object"} parameter in OpenAI-compatible APIs; OpenAI’s JSON mode also requires the prompt itself to instruct the model to produce JSON. For non-OpenAI models, include the full JSON schema in the system prompt and add a validation retry loop.
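A validation retry loop can be as small as the sketch below. It is model-agnostic: call_model is a hypothetical callable that takes a prompt and returns the raw reply, and the check is limited to "valid JSON object with the required keys" rather than full schema validation.

```python
import json
from typing import Callable


def run_with_json_retry(call_model: Callable[[str], str], prompt: str,
                        required_keys: set, max_attempts: int = 3) -> dict:
    """Call the model until it returns a JSON object containing required_keys."""
    last_error = ""
    for _ in range(max_attempts):
        # On a retry, tell the model exactly why the previous reply was rejected.
        full_prompt = prompt if not last_error else (
            f"{prompt}\n\nYour previous reply was rejected: {last_error}. Reply with JSON only."
        )
        raw = call_model(full_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON ({exc.msg})"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value must be a JSON object"
            continue
        missing = required_keys - set(data)
        if missing:
            last_error = f"missing keys: {sorted(missing)}"
            continue
        return data
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")


# Fake model for demonstration: fails twice, then complies.
replies = iter(['Sure! Here you go', '{"intent": "refund"}', '{"intent": "refund", "confidence": 0.9}'])
prompts_seen = []


def fake_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


parsed = run_with_json_retry(fake_model, "Classify the ticket.", {"intent", "confidence"})
print(parsed, len(prompts_seen))
```

Capping the attempts matters as much as retrying: an agent that can never produce valid output should fail visibly instead of burning tokens in a loop.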
Common Questions About Multi-Agent Systems
How many agents should a production pipeline have?
Start with the minimum number that cleanly separates distinct responsibilities, typically three to five. More agents increase latency, token costs, and debugging complexity. Add agents only when a single agent’s context window constraint or capability gap makes a task unreliable.
Can multi-agent systems work with open-source models only?
Yes. LLMWare supports local inference with CPU-compatible models, and Unsloth enables fine-tuning on consumer hardware. You can build a fully functional three-agent pipeline without any proprietary API dependency, though response quality will depend on the specific open-source models you choose.
How do you prevent one agent from overriding another agent’s output?
Use explicit write permissions in your architecture. Each agent should only be able to write to its designated output field. Implement this with a shared state object (a Python dataclass or Pydantic model) where each agent has access to specific fields only.
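A minimal sketch of such a shared state object, using a dataclass, might look like this. The field names, the agent names, and the WRITE_PERMISSIONS table are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical permission table: which agent may write which field.
WRITE_PERMISSIONS = {
    "retriever": {"sources"},
    "synthesizer": {"summary"},
}


@dataclass
class PipelineState:
    query: str
    sources: list = field(default_factory=list)
    summary: str = ""

    def write(self, agent: str, field_name: str, value) -> None:
        """Set a field on behalf of an agent, rejecting writes outside its permissions."""
        if field_name not in WRITE_PERMISSIONS.get(agent, set()):
            raise PermissionError(f"{agent} may not write '{field_name}'")
        setattr(self, field_name, value)


state = PipelineState(query="What causes agent drift?")
state.write("retriever", "sources", ["doc-1", "doc-7"])
try:
    state.write("retriever", "summary", "overwritten!")
except PermissionError as err:
    print(err)
```

This only guards writes that go through write(); plain Python cannot stop code from assigning to state.summary directly, so the convention needs to be enforced in code review or by handing agents a narrower interface.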
What is the difference between a multi-agent system and a simple chain?
A chain executes a fixed sequence of steps with no branching or delegation. A multi-agent system allows dynamic routing, parallel execution, and agents that can decide whether to call other agents. Research from Google DeepMind demonstrates that dynamic routing between specialized agents consistently outperforms fixed chains on complex reasoning benchmarks.
Where to Go From Here
Multi-agent systems are not complexity for its own sake. They are an engineering response to a real constraint: tasks that exceed the reliable scope of any single model call.
The pipeline you built in this tutorial — retrieval, validation, synthesis, monitoring — handles a class of real-world tasks that a single LLM prompt simply cannot.
The key discipline is treating each agent boundary as a hard contract: typed inputs, typed outputs, no implicit assumptions about what the previous agent understood.
Start small. Get one orchestrator and two workers running reliably before adding a third. Integrate Arize Phoenix before your first production deployment, not after your first production incident.
Use Fireworks AI for high-throughput inference if OpenAI latency becomes a bottleneck. And consider the S2DS program if you need structured mentorship as you scale your ML engineering practice.
The architecture scales; your debugging process needs to scale with it.