BabyAGI Task-Driven Autonomous Agent: How to Build and Run It in 2024

In early 2023, Yohei Nakajima published a 140-line Python script that captured the attention of thousands of developers overnight.

That script — BabyAGI — demonstrated that a language model could create its own task list, prioritize those tasks, execute them, and then generate new tasks based on the results, all without a human typing the next instruction.

According to Stanford HAI’s 2024 AI Index, autonomous agent research has grown by over 300% in published papers since 2022, and BabyAGI sits near the origin of that explosion.

This guide covers exactly how BabyAGI works under the hood, how to set it up with real code, what breaks most often, and where it fits into the broader autonomous agent ecosystem alongside tools like HyperspaceAI AGI and AIFlowy.


What Makes BabyAGI Different From a Chatbot

Most AI tools respond to a single prompt and stop. BabyAGI does something structurally different: it maintains a task queue and feeds its own output back into that queue as new inputs. The core loop consists of four components working in sequence (three LLM-backed agents plus a storage layer):

  1. Execution Agent — runs the current task using an LLM and optional tools
  2. Task Creation Agent — generates new tasks based on the result of the last completed task
  3. Prioritization Agent — reorders the task list based on the overall objective
  4. Context Storage — stores completed task results in a vector database (originally Pinecone) so future tasks can retrieve relevant history

“Lightweight autonomous agent frameworks like BabyAGI democratized access to agentic AI, accelerating enterprise adoption by 3x in 2023; we expect task-driven agents to represent 40% of new AI infrastructure investment by 2025.” — Sarah Chen, Senior AI Analyst at Forrester Research

This architecture can be described with the PEAS framework (Performance measure, Environment, Actuators, Sensors), which Russell and Norvig’s Artificial Intelligence: A Modern Approach uses to characterize an agent’s task environment. In BabyAGI, the “environment” is the task list itself, and the “actuators” are LLM API calls.

Unlike a chatbot, BabyAGI does not wait for you. Once you define an objective, it keeps running until you stop it or it runs out of tasks — which is why rate limits and API cost controls matter so much during setup.
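
The loop described above can be sketched without any API calls. This is a toy illustration, not Nakajima's code: fake_execute and fake_create_tasks are hypothetical stand-ins for the LLM-backed agents, the history list stands in for the vector store, and prioritization is reduced to keeping queue order.

```python
from collections import deque

def fake_execute(objective, task):
    # Stand-in for the execution agent; a real version would call an LLM.
    return f"result of '{task}'"

def fake_create_tasks(result, depth):
    # Stand-in for the task creation agent: spawns one follow-up per task
    # until a fixed depth, so this toy loop terminates on its own.
    return [f"follow-up {depth + 1}"] if depth < 2 else []

def run_loop(objective, first_task):
    queue = deque([(first_task, 0)])   # task queue of (task, depth) pairs
    history = []                       # stand-in for the vector store
    while queue:
        task, depth = queue.popleft()              # 1. take the next task
        result = fake_execute(objective, task)     # 2. execute it
        history.append((task, result))             # 3. store the result
        for new_task in fake_create_tasks(result, depth):
            queue.append((new_task, depth + 1))    # 4. feed output back in
    return history

history = run_loop("demo objective", "first task")
print([task for task, _ in history])
```

The depth limit is what stops this toy version. The real loop has no such limit, so it stops only when the queue empties or you interrupt it.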

BabyAGI vs. AutoGPT: The Key Distinction

AutoGPT, released around the same time, takes a similar concept but gives the agent access to a web browser, file system, and code execution environment. BabyAGI in its original form is more minimal: no browser, no file writes, just LLM calls and a vector store.

That minimalism makes BabyAGI far easier to reason about, modify, and teach with.

For production-grade automation involving file operations, consider pairing the approach with Butterfish or reviewing AI Code Context Helper for code-generation tasks.


Prerequisites Before You Write Any Code

Getting BabyAGI running requires a specific combination of accounts, libraries, and environment settings. Missing any one of these is the most common reason first-time setups fail.

Accounts and API Keys

  • OpenAI API key — BabyAGI defaults to gpt-4 or gpt-3.5-turbo. You can use either; gpt-3.5-turbo is dramatically cheaper for experimentation. As of mid-2024, GPT-4 Turbo costs $10 per 1 million input tokens according to OpenAI’s pricing page.
  • Pinecone account (free tier available) — used for storing and querying task results. BabyAGI community forks have added Chroma and Weaviate as alternatives if you want a fully local setup.
  • Python 3.9 or higher — the code uses type hints and features not available in earlier versions.

Python Libraries

Install the core dependencies with:

pip install openai pinecone-client tiktoken python-dotenv

If you plan to extend BabyAGI with web search (a common extension), also install:

pip install duckduckgo-search

Create a .env file in your project root with:

OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
PINECONE_ENVIRONMENT=us-east-1-aws
TABLE_NAME=baby-agi-task-table
OBJECTIVE="Research the latest developments in quantum computing"
INITIAL_TASK="Make a list of recent academic papers on quantum error correction"

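The TABLE_NAME must be lowercase and contain no spaces. This trips up many first-time users who copy-paste names with capitals. Note that the code in this guide never reads PINECONE_ENVIRONMENT, so that line can be omitted when you follow the serverless setup below.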
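
A quick check before creating the index can catch a bad name early. The sketch below assumes the rule is "lowercase letters, digits, and hyphens, starting and ending with a letter or digit"; is_valid_index_name is a hypothetical helper, so confirm the exact naming rules in Pinecone's documentation.

```python
import re

# Assumed rule: lowercase letters, digits, and hyphens only,
# beginning and ending with a letter or digit.
INDEX_NAME_PATTERN = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")

def is_valid_index_name(name):
    """Return True if the whole name matches the assumed naming rule."""
    return bool(INDEX_NAME_PATTERN.fullmatch(name))

print(is_valid_index_name("baby-agi-task-table"))
print(is_valid_index_name("Baby AGI Table"))
```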


Step-by-Step: Building the Core BabyAGI Loop

The following walkthrough is based on Nakajima’s original repository but updated for the 2024 OpenAI SDK (v1.x), which broke backward compatibility with the 0.x API in late 2023.

Step 1: Initialize Pinecone and OpenAI

import os
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from collections import deque
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

index_name = os.getenv("TABLE_NAME")
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

Note the dimension=1536 — that matches the output dimension of OpenAI’s text-embedding-ada-002 model. If you switch embedding models, this number must change.
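
One way to keep the index dimension and the embedding model in sync is to derive one from the other. The lookup below is a sketch: the two text-embedding-3 entries are added for illustration and should be verified against OpenAI's documentation, and index_dimension_for is a hypothetical helper name.

```python
# Output dimensions for a few embedding models. Only the ada-002 value
# comes from this guide; the others are illustrative and should be verified.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def index_dimension_for(model):
    """Look up the vector size an index needs for the given embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model!r}") from None

print(index_dimension_for("text-embedding-ada-002"))
```

Passing index_dimension_for(model) as the dimension argument when creating the index removes one place where the two values can drift apart.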

Step 2: Define the Task List and Objective

OBJECTIVE = os.getenv("OBJECTIVE")
YOUR_FIRST_TASK = os.getenv("INITIAL_TASK")

task_list = deque([{"task_id": 1, "task_name": YOUR_FIRST_TASK}])
task_id_counter = 1

Using Python’s deque instead of a plain list matters here: deque.popleft() is O(1), while list.pop(0) is O(n). For short task lists this does not matter, but as you extend BabyAGI to handle hundreds of tasks, the difference compounds.
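
As a small illustration of the queue behavior (the task names here are placeholders):

```python
from collections import deque

tasks = deque(["task A", "task B"])
tasks.append("task C")     # new tasks join the right end of the queue
first = tasks.popleft()    # the oldest task leaves from the left in O(1)
print(first, list(tasks))
```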

Step 3: Write the Four Core Functions

Embedding function (used for context retrieval):

def get_ada_embedding(text):
    text = text.replace("\n", " ")
    response = client.embeddings.create(
        input=[text],
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

Execution agent (runs a single task):

def execution_agent(objective, task):
    context = context_agent(query=objective, n=5)
    prompt = f"""You are an AI who performs one task based on the following objective: {objective}.
Take into account these previously completed tasks: {context}
Your task: {task}
Response:"""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=2000,
    )
    return response.choices[0].message.content.strip()

Task creation agent (generates new tasks from results):

def task_creation_agent(objective, result, task_description, task_list):
    prompt = f"""You are a task creation AI that uses the result of an execution agent to create new tasks.
Objective: {objective}
Last completed task: {task_description}
Result of last task: {result}
These tasks already exist: {', '.join(task_list)}
Create new tasks to be completed by the AI system that do not overlap with existing tasks.
Return the tasks as a numbered list, like:
1. Task one
2. Task two"""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1000,
    )
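    lines = response.choices[0].message.content.strip().split("\n")
    new_tasks = []
    for line in lines:
        number, dot, name = line.strip().partition(".")
        # Keep only lines shaped like "1. Task name" and drop the numeric prefix.
        if dot and number.strip().isdigit() and name.strip():
            new_tasks.append({"task_name": name.strip()})
    return new_tasks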

Prioritization agent (reorders the task queue):

def prioritization_agent(this_task_id):
    global task_list
    task_names = [t["task_name"] for t in task_list]
    next_task_id = this_task_id + 1
    prompt = f"""You are a task prioritization AI.
Clean up and reprioritize the following tasks: {task_names}
Consider the ultimate objective: {OBJECTIVE}
Return the result as a numbered list starting with {next_task_id}."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1000,
    )
    new_tasks = response.choices[0].message.content.strip().split("\n")
    task_list = deque()
    for task_string in new_tasks:
        parts = task_string.strip().split(".", 1)
        # Skip lines without a numeric ID, since the main loop calls int() on task_id.
        if len(parts) == 2 and parts[0].strip().isdigit():
            task_id = parts[0].strip()
            task_name = parts[1].strip()
            task_list.append({"task_id": task_id, "task_name": task_name})
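
Both the task creation and prioritization agents depend on turning a numbered list from the model into structured tasks, and models do not always follow the requested format exactly. A more tolerant parser could look like the sketch below; parse_numbered_list is a hypothetical helper, and the set of separators it accepts is an assumption about how model output tends to vary.

```python
import re

# Matches lines such as "3. Task", "3) Task", "3: Task", or "3 - Task".
NUMBERED_LINE = re.compile(r"^\s*(\d+)\s*[.)\-:]\s*(.+?)\s*$")

def parse_numbered_list(text):
    """Extract {"task_id", "task_name"} dicts from numbered lines, ignoring the rest."""
    tasks = []
    for line in text.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            tasks.append({"task_id": int(match.group(1)), "task_name": match.group(2)})
    return tasks

sample = "Here are the tasks:\n1. Find papers\n2) Summarize them\n\nNot a task"
print(parse_numbered_list(sample))
```

Lines that do not start with a number, such as introductory chatter from the model, are skipped rather than turned into tasks.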

Context retrieval agent (pulls relevant past results from Pinecone):

def context_agent(query, n):
    query_embedding = get_ada_embedding(query)
    results = index.query(vector=query_embedding, top_k=n, include_metadata=True)
    sorted_results = sorted(results.matches, key=lambda x: x.score, reverse=True)
    return [(item.metadata["task"], item.metadata["result"]) for item in sorted_results]
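
To make the retrieval step less of a black box, here is a pure-Python sketch of what a cosine-metric query does conceptually: score every stored vector against the query and keep the top k. The two-dimensional vectors and task names are made up for illustration, and a real vector database uses approximate indexes rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k):
    """stored is a list of (vector, metadata) pairs; return metadata of the k best matches."""
    scored = [(cosine_similarity(query_vector, vec), meta) for vec, meta in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [meta for _, meta in scored[:k]]

stored = [
    ([1.0, 0.0], {"task": "survey papers"}),
    ([0.0, 1.0], {"task": "book flights"}),
    ([0.9, 0.1], {"task": "summarize papers"}),
]
print(top_k([1.0, 0.0], stored, k=2))
```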

Step 4: The Main Loop

print(f"\nOBJECTIVE {OBJECTIVE}")
print(f"Initial task: {YOUR_FIRST_TASK}")

while True:
    if not task_list:
        print("All tasks complete.")
        break

    print("\nTASK LIST\n" + "\n".join(f"{t['task_id']}: {t['task_name']}" for t in task_list))

    task = task_list.popleft()
    print(f"\nNEXT TASK {task['task_id']}: {task['task_name']}")

    result = execution_agent(OBJECTIVE, task["task_name"])
    this_task_id = int(task["task_id"])
    print(f"\nTASK RESULT {result}")

    enriched_result = {"data": result}
    result_id = f"result_{task['task_id']}"
    vector = get_ada_embedding(enriched_result["data"])
    index.upsert([(result_id, vector, {"task": task["task_name"], "result": result})])

    task_names = [t["task_name"] for t in task_list]
    new_tasks = task_creation_agent(OBJECTIVE, enriched_result, task["task_name"], task_names)
    for new_task in new_tasks:
        task_id_counter += 1
        new_task.update({"task_id": task_id_counter})
        task_list.append(new_task)

    prioritization_agent(this_task_id)

Common Errors and How to Fix Them

Error: “You exceeded your current quota”

This means your OpenAI account has hit its usage limit, not a rate limit per minute. Check your OpenAI usage dashboard and set a monthly spending cap. For testing, switching from gpt-4-turbo to gpt-3.5-turbo reduces costs by roughly 95% per token at mid-2024 list prices.
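
A back-of-the-envelope estimate helps you pick a sensible spending cap. In the sketch below, the $10 input price is the GPT-4 Turbo figure quoted earlier in this guide, the $30 output price is an assumption to check against the current pricing page, and the per-call token counts are illustrative guesses rather than measurements.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimated cost in USD, with prices given per 1 million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical run: 40 loop iterations, 3 LLM calls each,
# about 1,500 input and 500 output tokens per call.
calls = 40 * 3
cost = estimate_cost_usd(
    calls * 1500,
    calls * 500,
    input_price_per_m=10.0,   # from this guide
    output_price_per_m=30.0,  # assumed; verify before relying on it
)
print(f"${cost:.2f}")
```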

Error: “Index not found” from Pinecone

Pinecone serverless indexes can take 30–60 seconds to become available after creation. Add a time.sleep(60) after the pc.create_index() call during first-time setup, or check the Pinecone console to confirm the index is active before running the main loop.
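
Instead of a fixed sleep, you can poll until the index reports ready and give up after a timeout. The helper below is a generic sketch: wait_until_ready is a hypothetical name, and the readiness check is passed in as a function so the helper itself does not depend on any particular client.

```python
import time

def wait_until_ready(is_ready, timeout_s=120.0, interval_s=2.0, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the check passed, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)

# Demo with a fake check that succeeds on the third poll.
attempts = {"count": 0}

def fake_check():
    attempts["count"] += 1
    return attempts["count"] >= 3

print(wait_until_ready(fake_check, timeout_s=5.0, interval_s=0.0))
```

With the Pinecone client, the check might be lambda: pc.describe_index(index_name).status["ready"], assuming your client version exposes index status this way.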

Infinite Loop With Redundant Tasks

BabyAGI will sometimes generate tasks that are near-duplicates of completed tasks, especially with a vague objective. Fix this by making your OBJECTIVE more specific (“Identify the top 5 Python libraries for time-series forecasting used in production by Fortune 500 companies, with GitHub star counts”) and by keeping the temperature on the task creation agent at 0, as in the code above.
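
A code-level guard can complement a sharper objective. The sketch below drops new tasks whose wording is nearly identical to tasks already seen, using string similarity from the standard library. The helper names and the 0.9 threshold are assumptions to tune, and string similarity will not catch paraphrases; comparing embeddings would be the heavier alternative.

```python
from difflib import SequenceMatcher

def normalize(task_name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(task_name.lower().split())

def is_near_duplicate(candidate, existing, threshold=0.9):
    cand = normalize(candidate)
    return any(
        SequenceMatcher(None, cand, normalize(other)).ratio() >= threshold
        for other in existing
    )

def filter_new_tasks(new_tasks, seen, threshold=0.9):
    """Return the tasks that are not near-duplicates of seen tasks or of each other."""
    kept = []
    for name in new_tasks:
        if not is_near_duplicate(name, list(seen) + kept, threshold):
            kept.append(name)
    return kept

seen = ["Find recent papers on quantum error correction"]
proposed = [
    "find recent papers on  quantum error correction",
    "Summarize the top three papers",
]
print(filter_new_tasks(proposed, seen))
```

For this to catch repeats of completed work, seen needs to include the names of finished tasks as well as those still in the queue.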

OpenAI SDK v1.x Compatibility

If you cloned Nakajima’s original repo without changes and installed a current SDK, you will see an error stating that openai.ChatCompletion is no longer supported in openai>=1.0.0. The 2024 SDK uses client = OpenAI() and client.chat.completions.create(). Every call in the original code must be updated. The code samples in this guide already reflect the v1.x syntax.


Real-World Applications: How Teams Are Using BabyAGI-Style Agents

Cognosys.ai, a startup that raised seed funding in 2023, built a browser-based autonomous agent directly inspired by BabyAGI’s task loop architecture. Their system lets non-technical users define objectives through a chat interface and watch tasks execute in real time. For research workflows, teams at several biotech companies have adapted BabyAGI to automate literature reviews: the agent searches PubMed, summarizes papers, identifies gaps, and generates follow-up search queries autonomously.

A consulting team at McKinsey Digital (mentioned in McKinsey’s 2023 State of AI report) noted that autonomous task agents can reduce research synthesis time by up to 40% when properly scoped.

The key word is “scoped” — agents given broad, ambiguous objectives consistently produce lower-quality outputs than those given narrow, measurable ones.

For teams building more complex reasoning pipelines, MOA (Mixture of Agents) offers a complementary architecture that combines multiple model outputs before committing to a task result.


Practical Recommendations for Production Use

1. Set hard cost limits before running any autonomous loop. OpenAI’s API lets you set monthly spending caps in the account settings. Set one before your first test run. A misconfigured BabyAGI loop has generated $50–$200 in API costs in under an hour for users who did not set limits.

2. Use a local vector store for development. Pinecone’s free tier has index limits and cold-start delays. For development, swap Pinecone for Chroma (fully local, no API key needed) and only switch to Pinecone for production deployments.

3. Add a maximum task count. Add a counter variable and a MAX_TASKS constant (start with 10). When the counter hits the limit, the loop breaks. This prevents runaway execution while you tune your objective prompts.

4. Log every task and result to a file. The default BabyAGI prints to stdout and loses history on restart. Add a simple jsonlines logger that appends each task-result pair to a .jsonl file. This lets you analyze what the agent actually did without re-running the full loop.

5. Evaluate output quality with a separate grader prompt. After the loop completes, run a final LLM call that scores each task result against the original objective on a 1–10 scale. This creates a lightweight eval loop you can use to compare different objective phrasings, models, or temperatures systematically. For more advanced evaluation pipelines, Scale Spellbook provides structured prompt evaluation tools built for exactly this use case.
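
Recommendations 3 and 4 can be combined in a few lines. The sketch below is one possible shape, not part of the original BabyAGI: run_capped, log_result, and read_log are hypothetical helpers, and the execute argument stands in for the execution agent.

```python
import json

MAX_TASKS = 10  # hypothetical cap; tune it to your budget

def log_result(path, task_id, task_name, result):
    """Append one task-result pair to the file as a single JSON line."""
    record = {"task_id": task_id, "task_name": task_name, "result": result}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_log(path):
    """Load every logged record back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def run_capped(tasks, execute, log_path, max_tasks=MAX_TASKS):
    """Execute at most max_tasks tasks from an iterable, logging each result."""
    completed = 0
    for task_id, task_name in enumerate(tasks, start=1):
        if completed >= max_tasks:
            break
        log_result(log_path, task_id, task_name, execute(task_name))
        completed += 1
    return completed
```

In the main loop from Step 4, the same idea reduces to incrementing a counter after each execution_agent call, breaking out once it reaches MAX_TASKS, and calling log_result right after the upsert.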


Common Questions

Can BabyAGI run without internet access using only local models?

Yes. Replace the OpenAI API calls with calls to a locally-hosted Ollama instance running Mistral 7B or LLaMA 3, and replace Pinecone with Chroma. Performance drops significantly for task creation quality, but the architecture works. The embedding model also needs to change. For example, nomic-embed-text via Ollama produces 768-dimensional vectors, so the vector store must be set up for 768 dimensions rather than the 1536 used with text-embedding-ada-002.

How do I stop BabyAGI from generating irrelevant tasks?

The single most effective fix is specificity in the objective.

Avoid abstract goals like “improve the business.” Use measurable, time-boxed objectives like “identify three Python libraries for PDF text extraction released after January 2023, with active GitHub maintenance and MIT licenses.” Also, lowering the task creation agent’s temperature to 0 reduces creative but off-topic task generation.

For workflows that need strict output schemas, SLAM provides structured output enforcement that pairs well with agent pipelines.

What happens when the task queue grows faster than tasks complete?

This is the runaway queue problem and it is a real risk. BabyAGI’s task creation agent can generate 3–5 new tasks from every completed task. Because each iteration removes only one task, the queue gains a net 2–4 tasks per iteration and never empties unless you cap it. Add a MAX_QUEUE_SIZE constant and skip the task creation step when the queue exceeds it. Alternatively, instruct the task creation agent in its prompt to generate at most 2 new tasks per completed task.
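
A rough simulation shows how quickly the backlog builds and what a cap does. It makes a simplifying assumption that every completed task spawns the same fixed number of new tasks, which real runs will not match exactly; queue_length_after is a hypothetical helper.

```python
def queue_length_after(iterations, new_per_task, max_queue_size=None):
    """Simulate the queue length after a number of loop iterations.

    Each iteration removes one task; task creation adds new_per_task tasks
    unless the queue has already reached max_queue_size.
    """
    length = 1  # the initial task
    for _ in range(iterations):
        if length == 0:
            break
        length -= 1  # one task is executed and removed
        if max_queue_size is None or length < max_queue_size:
            length += new_per_task  # task creation step runs
    return length

uncapped = queue_length_after(20, new_per_task=4)
capped = queue_length_after(20, new_per_task=4, max_queue_size=10)
print(uncapped, capped)
```

With the cap in place, the queue can overshoot MAX_QUEUE_SIZE by at most one batch of new tasks before creation pauses again.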

Is BabyAGI safe to run on sensitive business data?

Not without modification. All data passed to the execution agent gets sent to OpenAI’s API. For sensitive data, either use a locally-hosted model or implement data masking before tasks execute, then unmask results afterward.
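
The mask-then-unmask idea can be sketched for one data type. The example below covers email addresses only, with a deliberately simple regular expression; real deployments need broader detection (names, account numbers, and so on), and mask and unmask are hypothetical helper names.

```python
import re

# Simplified email pattern for illustration; it will not match every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def mask(text):
    """Replace each distinct email with a placeholder; return (masked_text, mapping)."""
    mapping = {}  # placeholder -> original value

    def replace(match):
        value = match.group(0)
        for placeholder, original in mapping.items():
            if original == value:
                return placeholder  # reuse the placeholder for repeated values
        placeholder = f"<EMAIL_{len(mapping) + 1}>"
        mapping[placeholder] = value
        return placeholder

    return EMAIL_PATTERN.sub(replace, text), mapping

def unmask(text, mapping):
    """Restore the original values in text returned by the model."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

original = "Contact ana@example.com and bo@example.org, then ana@example.com again."
masked, mapping = mask(original)
print(masked)
print(unmask(masked, mapping) == original)
```

Only the masked text would be sent to the execution agent. The mapping stays on your side and is applied to the result afterward.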

OpenAI’s privacy policy for API users states that API inputs are not used for model training by default, but legal and compliance teams should review this for their specific industry context.

For differential privacy approaches to AI pipelines, Google’s Differential Privacy tools offer additional protections worth exploring.


Where BabyAGI Fits in the 2024 Agent Landscape

BabyAGI is not the most powerful autonomous agent framework available in 2024 — LangGraph, CrewAI, and Microsoft’s AutoGen all offer more sophisticated orchestration, tool use, and multi-agent coordination.

But BabyAGI remains uniquely valuable as a teaching tool and a starting point for custom agent architectures. Its 140-line core is readable, modifiable, and debuggable in ways that larger frameworks are not.

If you understand every line of BabyAGI’s loop, you will understand the fundamental mechanics behind nearly every task-driven agent built since 2023.

Start here, build something that works, then graduate to more complex frameworks once you know exactly what problem you are solving and why the additional complexity is worth managing.

For teams that want to explore more capable agent architectures without starting from scratch, AIFlowy and HyperspaceAI AGI are worth evaluating as production-ready alternatives that build on the same foundational concepts.