Your LangChain agent just ran.
It called GPT-4 three times, searched a vector database, invoked two tools, made a decision — and returned a response.
But something's wrong. The output doesn't match expectations. A customer is confused. Your LLM bill spiked.
What happened?
Without tracing, you have no idea. You're debugging a black box — an AI system that makes decisions you can't see, in an order you can't predict, using reasoning you can't inspect.
This is the reality for most LangChain deployments. A single user request can trigger 15+ LLM calls across multiple chains, models, and tools. Each call branches, loops, or fails independently. And without proper tracing, you're essentially reverse-engineering your own stack every time there's an issue.
This guide shows you how to trace every step of your LangChain agent execution — what happened, in what order, how long it took, how much it cost, and why it made each decision.
Why LangChain Agents Are Hard to Debug
LangChain agents aren't like traditional software. They're non-deterministic, multi-step systems where:
- The same input produces different outputs. Run the same prompt twice and you'll get different reasoning paths, different tool calls, different results.
- Execution branches unpredictably. An agent might call a search API, then a calculator, then another LLM — or it might skip all three. You don't know until it runs.
- Costs are invisible. Each LLM call costs money. A retry loop or recursive chain can turn a $0.02 request into a $12,000 bill — and you won't know until the invoice arrives.
- Errors cascade silently. A retriever returns irrelevant documents. The LLM hallucinates based on bad context. The tool gets wrong input. The final output looks plausible but is completely wrong.
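The cost bullet above is easy to sanity-check with back-of-the-envelope arithmetic. The prices and request volumes below are illustrative assumptions for the sketch, not real rates:

```python
# Illustrative cost arithmetic for a runaway retry loop.
# Prices are assumptions for this sketch, not live rates.
PRICE_PER_1K_IN = 0.03    # assumed $/1K input tokens, GPT-4-class model
PRICE_PER_1K_OUT = 0.06   # assumed $/1K output tokens

def call_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of a single LLM call."""
    return tokens_in / 1000 * PRICE_PER_1K_IN + tokens_out / 1000 * PRICE_PER_1K_OUT

normal = call_cost(500, 100)       # one well-behaved request: ~$0.02
runaway = normal * 50 * 12_000     # 50 retries/request, 12,000 requests/month

print(f"normal request: ${normal:.2f}")
print(f"runaway month:  ${runaway:,.0f}")
```

The point isn't the exact numbers; it's that a retry multiplier compounds silently unless something is counting tokens per call.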
With 130 million+ downloads and 1,300+ companies using LangChain in production, these aren't edge cases — they're daily reality.
What You Need to Trace
For any LangChain agent in production, you need visibility into:
| Component | What to capture |
|---|---|
| LLM calls | Model, input messages, output, tokens used, cost, latency |
| Tool calls | Tool name, input arguments, output, duration, errors |
| Chain steps | Chain name, input/output, execution order |
| Retriever queries | Query text, documents returned, source metadata |
| Agent decisions | Why path A was chosen over path B |
| Parent-child relationships | Which LLM call belongs to which chain step |
The key insight: tracing individual components isn't enough. You need the full execution tree — the parent-child relationships that show how each step connects to the next.
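Concretely, each row of the table maps to a field on a span record, and the parent pointer is what links spans into a tree. This is a hypothetical schema for illustration, not AgentShield's actual data model:

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class Span:
    """One traced operation (hypothetical schema for illustration)."""
    span_type: str                       # "llm_call", "tool_call", "chain", "retriever"
    name: str                            # model, tool, or chain name
    input: str
    output: str = ""
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    duration_ms: float = 0.0
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    parent_run_id: Optional[str] = None  # links this span into the execution tree

root = Span(span_type="chain", name="qa_chain", input="What is our refund policy?")
child = Span(span_type="llm_call", name="gpt-4", input="<prompt>",
             parent_run_id=root.run_id)
```

Walking the `parent_run_id` links from any span recovers the full path from root chain to leaf LLM call.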
Method 1: LangChain Callbacks (The Foundation)
LangChain's callback system is the hook for all tracing. Every component — LLMs, chains, tools, retrievers — fires events at key moments:
- on_llm_start / on_llm_end — when an LLM call begins and finishes
- on_chain_start / on_chain_end — when a chain starts and completes
- on_tool_start / on_tool_end — when a tool is invoked
- on_retriever_start / on_retriever_end — when a retriever fetches documents
Each callback receives a run_id (unique to that operation) and a parent_run_id (linking it to its parent). This is what makes execution trees possible.
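To see how those two IDs yield a tree, here is a toy sketch that groups runs under their parents; the event stream is simulated rather than coming from a real LangChain run:

```python
from collections import defaultdict

class TreeBuilder:
    """Toy sketch: assemble callback events into an execution tree."""
    def __init__(self):
        self.children = defaultdict(list)  # parent_run_id -> [run_id, ...]
        self.names = {}

    def on_start(self, name, run_id, parent_run_id=None):
        """Record one *_start event and attach it under its parent."""
        self.names[run_id] = name
        self.children[parent_run_id].append(run_id)

    def render(self, run_id=None, depth=0):
        """Return an indented text tree (run_id=None is the top level)."""
        lines = []
        for child in self.children.get(run_id, []):
            lines.append("  " * depth + self.names[child])
            lines.extend(self.render(child, depth + 1))
        return lines

# Simulated event stream: an executor containing one LLM call and one tool call
tree = TreeBuilder()
tree.on_start("AgentExecutor", run_id="r1")
tree.on_start("gpt-4", run_id="r2", parent_run_id="r1")
tree.on_start("search", run_id="r3", parent_run_id="r1")
print("\n".join(tree.render()))
```

A real handler would do the same bookkeeping inside `on_*_start`/`on_*_end` methods on a `BaseCallbackHandler` subclass.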
Basic verbose logging
The simplest approach:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4", verbose=True)
response = llm.invoke("What is prompt injection?")
This prints raw logs to stdout. Useful for local debugging, but useless in production — no structured data, no search, no alerts, no cost tracking.
Custom callback handler
You can build your own:
from langchain_core.callbacks import BaseCallbackHandler
class MyTracer(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        print(f"LLM started: {run_id}")

    def on_llm_end(self, response, *, run_id, **kwargs):
        # llm_output may be None for some providers, so guard before .get()
        tokens = (response.llm_output or {}).get("token_usage", {})
        print(f"LLM finished: {tokens}")

    def on_tool_start(self, serialized, input_str, *, run_id, **kwargs):
        print(f"Tool called: {serialized.get('name')}")
llm = ChatOpenAI(model="gpt-4", callbacks=[MyTracer()])
This gives you control, but you're building logging infrastructure from scratch — storage, querying, visualization, alerting, cost calculation. That's weeks of work.
Method 2: Automatic Tracing with AgentShield (2 Lines of Code)
AgentShield provides a LangChain callback handler that automatically captures everything — LLM calls, tool calls, chain steps, retrievers, agent decisions — with full parent-child span trees, token counts, cost estimation, and risk analysis.
Installation
pip install agentshield-ai[langchain]
Basic setup
from agentshield import AgentShield
from agentshield.langchain_callback import AgentShieldCallbackHandler
from langchain_openai import ChatOpenAI
# Initialize
shield = AgentShield(api_key="ask_your_key_here")
handler = AgentShieldCallbackHandler(shield, agent_name="my-agent")
# Use with any LangChain component
llm = ChatOpenAI(model="gpt-4", callbacks=[handler])
response = llm.invoke("What are the risks of AI agents in production?")
That's it. Every LLM call is now traced with model name, input/output, tokens, duration, and cost.
With chains
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_template("Explain {topic} in 3 sentences.")
chain = prompt | llm | StrOutputParser()
# The handler captures the entire chain execution
result = chain.invoke(
    {"topic": "prompt injection"},
    config={"callbacks": [handler]}
)
The trace shows:
1. Chain start (input: "prompt injection")
2. LLM call (model: gpt-4, tokens: 150 in / 89 out, cost: $0.008)
3. Parser execution
4. Chain end (output: the 3 sentences)
With agents and tools
from langchain import hub
from langchain.agents import create_react_agent, AgentExecutor
from langchain_community.tools import DuckDuckGoSearchRun
search = DuckDuckGoSearchRun()
tools = [search]
# A ReAct agent needs a prompt with the agent_scratchpad variable;
# the standard one is available on the LangChain Hub
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    callbacks=[handler],  # Traces everything automatically
)
result = executor.invoke({"input": "What happened with AI agents this week?"})
The trace captures the full agent loop:
1. Agent executor starts
2. LLM call #1 — agent "thinks" and decides to use search
3. Tool call: DuckDuckGoSearch (input: "AI agents news March 2026")
4. LLM call #2 — agent processes search results
5. Agent finishes with final answer
Each step shows duration, tokens, cost, and parent-child relationships in a visual span tree.
With RAG (Retrieval-Augmented Generation)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
embeddings = OpenAIEmbeddings()
retriever = Chroma(embedding_function=embeddings).as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
)
result = qa_chain.invoke("What is our refund policy?")
The trace captures:
1. Chain start
2. Retriever query: "What is our refund policy?"
3. Documents returned (count, sources, content preview)
4. LLM call with retrieved context
5. Final answer
This is critical for debugging RAG failures — when the answer is wrong, you can see exactly which documents were retrieved and whether the LLM used them correctly.
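For a quick local look at what the retriever returned, a callback can record the documents as they come back. The sketch below is stdlib-only for illustration: `Doc` stands in for `langchain_core.documents.Document`, and in real code the handler would subclass `BaseCallbackHandler` (whose `on_retriever_end` receives the retrieved documents) and be passed via `callbacks=[...]`:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document in this sketch."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class RetrieverLogger:
    """Sketch: record source and a content preview for each retrieval."""
    def __init__(self):
        self.retrievals = []

    def on_retriever_end(self, documents, *, run_id=None, **kwargs):
        self.retrievals.append([
            (d.metadata.get("source", "unknown"), d.page_content[:80])
            for d in documents
        ])

logger = RetrieverLogger()
logger.on_retriever_end([
    Doc("Refunds are issued within 30 days of purchase.", {"source": "policy.md"}),
    Doc("Shipping times vary by region.", {"source": "shipping.md"}),
])
# Inspect which documents backed the answer
for source, preview in logger.retrievals[0]:
    print(source, "->", preview)
```

If the sources logged here have nothing to do with refunds, you've found your RAG bug before the LLM ever gets blamed.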
What a Trace Looks Like
When you open the AgentShield dashboard, you see a span tree — a visual breakdown of every step your agent took:
[trace] my-agent 2.3s total
├── [function] AgentExecutor 2.3s
│ ├── [llm_call] gpt-4 0.8s (150 in / 89 out) $0.008
│ ├── [tool_call] DuckDuckGoSearch 0.6s
│ ├── [llm_call] gpt-4 0.7s (340 in / 120 out) $0.015
│ └── [llm_call] gpt-4 0.2s (95 in / 45 out) $0.004
└── Total: 3 LLM calls, $0.027, 2.3s
Each span shows:
- Type (LLM call, tool call, retrieval, chain step)
- Duration (how long it took)
- Tokens (input/output count)
- Cost (estimated based on model pricing)
- Input/Output (expandable to see full content)
- Risk level (if the output triggered any alerts)
Tracing Best Practices for Production
1. Name your agents
handler = AgentShieldCallbackHandler(
    shield,
    agent_name="support-bot-v2",  # Not just "my-agent"
)
When you have 5 agents in production, you need to know which one caused the alert.
2. Use session IDs to group conversations
handler = AgentShieldCallbackHandler(
    shield,
    agent_name="support-bot",
    session_id=f"user-{user_id}-{conversation_id}",
)
This groups all traces from a single user conversation, so you can replay the entire interaction.
3. Add metadata for filtering
handler = AgentShieldCallbackHandler(
    shield,
    agent_name="support-bot",
    trace_metadata={
        "environment": "production",
        "version": "2.1.0",
        "user_tier": "enterprise",
    },
)
Filter traces by environment, version, or customer tier in the dashboard.
4. One handler per execution
Create a new handler for each request to keep traces isolated:
@app.post("/chat")
async def chat(request: ChatRequest):
    handler = AgentShieldCallbackHandler(
        shield,
        agent_name="support-bot",
        session_id=request.session_id,
    )
    result = chain.invoke(
        {"input": request.message},
        config={"callbacks": [handler]}
    )
    return {"response": result}
5. Monitor costs across models
Different models have different costs. Tracing lets you see exactly which model was used for each call:
gpt-4: 3 calls → $0.027 (67% of cost)
gpt-3.5: 12 calls → $0.004 (10% of cost)
claude-3: 2 calls → $0.009 (23% of cost)
This data helps you optimize — maybe the GPT-4 call for formatting can be replaced with GPT-3.5.
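A breakdown like the one above is straightforward to compute yourself from a list of traced spans; the span shape here is a hypothetical example, not a real export format:

```python
from collections import defaultdict

# Hypothetical traced LLM spans: (model name, cost in USD)
spans = [
    ("gpt-4", 0.009), ("gpt-4", 0.010), ("gpt-4", 0.008),
    ("gpt-3.5", 0.004), ("claude-3", 0.009),
]

cost_by_model = defaultdict(float)
for model, cost in spans:
    cost_by_model[model] += cost

total = sum(cost_by_model.values())
for model, cost in sorted(cost_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{model}: ${cost:.3f} ({cost / total:.0%} of cost)")
```

The same grouping works for tokens or latency; swap the cost field for whichever metric you're optimizing.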
Common Issues Tracing Helps You Catch
1. Infinite loops
A recursive chain with no exit condition. Tracing shows the repeating pattern immediately — 50 identical LLM calls in a row, each costing $0.03.
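Tracing reveals the loop; you can also guard against it in code. `AgentExecutor` has a built-in `max_iterations` parameter for exactly this, and a simple repeat detector (a hypothetical helper, not part of LangChain) shows the underlying idea:

```python
class LoopGuard:
    """Hypothetical sketch: flag an agent loop that repeats the same prompt.

    AgentExecutor's max_iterations parameter is the built-in way to cap
    the loop; this just illustrates the detection logic.
    """
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.last = None
        self.count = 0

    def check(self, prompt: str) -> bool:
        """Return False once the same prompt has repeated too many times."""
        if prompt == self.last:
            self.count += 1
        else:
            self.last, self.count = prompt, 1
        return self.count <= self.max_repeats

guard = LoopGuard(max_repeats=3)
results = [guard.check("same prompt") for _ in range(5)]
print(results)
```

Either way, the cap turns a $12,000 surprise into a handful of wasted calls.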
2. Retriever returning irrelevant documents
The LLM generates a confident but wrong answer because the retriever pulled documents about a different topic. Tracing shows the query, the documents, and the gap.
3. Token explosion
A single prompt includes the entire conversation history (8,000 tokens) when it only needed the last message. Tracing shows token counts per call so you can spot inefficiencies.
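The usual fix is trimming history before the call. A minimal sketch, approximating token counts with word counts (real code would use a proper tokenizer such as tiktoken):

```python
def trim_history(messages, max_tokens=1000):
    """Keep the most recent messages that fit the budget (newest kept first).

    Token counts are approximated as word counts for this sketch.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = len(msg.split())
        if used + tokens > max_tokens:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

history = ["hello " * 600, "old question " * 100, "latest message"]
trimmed = trim_history(history, max_tokens=300)
print(len(trimmed), "of", len(history), "messages kept")
```

With per-call token counts in your traces, you can set the budget from data instead of guessing.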
4. Tool failures
A tool throws an error, the agent retries 5 times, each retry costs tokens. Tracing shows the error, the retries, and the total cost of the failure.
5. Model mismatch
You thought you were using GPT-3.5 for the summarization step, but it's actually hitting GPT-4. Tracing shows the exact model for every call.
Getting Started
- Sign up at useagentshield.com (free tier, no credit card)
- Install the SDK:
pip install agentshield-ai[langchain]
- Add 2 lines to your existing agent:
from agentshield import AgentShield
from agentshield.langchain_callback import AgentShieldCallbackHandler
shield = AgentShield(api_key="ask_your_key_here")
handler = AgentShieldCallbackHandler(shield, agent_name="my-agent")
# Pass handler to any LangChain component
llm = ChatOpenAI(model="gpt-4", callbacks=[handler])
Every LLM call, tool use, retrieval, and chain step — traced automatically. Fail-silent. Never breaks your agent.
Building with LangChain? Follow AgentShield on Twitter/X for more guides on AI agent observability.
Start monitoring your AI agents
3 lines of code. Real-time risk analysis. Automatic tracing for LangChain and CrewAI.