Your LangChain agent just ran.
It called GPT-4 three times, searched a vector database, invoked two tools, made a decision — and returned a response.
But something's wrong. The output doesn't match expectations. A customer is confused. Your LLM bill spiked.
What happened?
Without tracing, you have no idea. You're debugging a black box — an AI system that makes decisions you can't see, in an order you can't predict, using reasoning you can't inspect.
This is the reality for most LangChain deployments. A single user request can trigger 15+ LLM calls across multiple chains, models, and tools. Each call branches, loops, or fails independently. And without proper tracing, you're essentially reverse-engineering your own stack every time there's an issue.
This guide shows you how to trace every step of your LangChain agent execution — what happened, in what order, how long it took, how much it cost, and why it made each decision.
Why LangChain Agents Are Hard to Debug
LangChain agents aren't like traditional software. They're non-deterministic, multi-step systems where:
- The same input produces different outputs. Run the same prompt twice and you'll get different reasoning paths, different tool calls, different results.
- Execution branches unpredictably. An agent might call a search API, then a calculator, then another LLM — or it might skip all three. You don't know until it runs.
- Costs are invisible. Each LLM call costs money. A retry loop or recursive chain can turn a $0.02 request into a $12,000 bill — and you won't know until the invoice arrives.
- Errors cascade silently. A retriever returns irrelevant documents. The LLM hallucinates based on bad context. The tool gets wrong input. The final output looks plausible but is completely wrong.
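The cost bullet above is easy to sanity-check with back-of-the-envelope arithmetic. The prices and request volumes below are illustrative assumptions for the sketch, not real rates:

```python
# Illustrative cost arithmetic for a runaway retry loop.
# Prices are assumptions for this sketch, not live rates.
PRICE_PER_1K_IN = 0.03    # assumed $/1K input tokens, GPT-4-class model
PRICE_PER_1K_OUT = 0.06   # assumed $/1K output tokens

def call_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of a single LLM call."""
    return tokens_in / 1000 * PRICE_PER_1K_IN + tokens_out / 1000 * PRICE_PER_1K_OUT

normal = call_cost(500, 100)       # one well-behaved request: ~$0.02
runaway = normal * 50 * 12_000     # 50 retries/request, 12,000 requests/month

print(f"normal request: ${normal:.2f}")
print(f"runaway month:  ${runaway:,.0f}")
```

The point isn't the exact numbers; it's that a retry multiplier compounds silently unless something is counting tokens per call.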
With 130 million+ downloads and 1,300+ companies using LangChain in production, these aren't edge cases — they're daily reality.
What You Need to Trace
For any LangChain agent in production, you need visibility into:
| Component | What to capture |
|---|---|
| LLM calls | Model, input messages, output, tokens used, cost, latency |
| Tool calls | Tool name, input arguments, output, duration, errors |
| Chain steps | Chain name, input/output, execution order |
| Retriever queries | Query text, documents returned, source metadata |
| Agent decisions | Why path A was chosen over path B |
| Parent-child relationships | Which LLM call belongs to which chain step |
The key insight: tracing individual components isn't enough. You need the full execution tree — the parent-child relationships that show how each step connects to the next.
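Concretely, each row of the table maps to a field on a span record, and the parent pointer is what links spans into a tree. This is a hypothetical schema for illustration, not AgentShield's actual data model:

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class Span:
    """One traced operation (hypothetical schema for illustration)."""
    span_type: str                       # "llm_call", "tool_call", "chain", "retriever"
    name: str                            # model, tool, or chain name
    input: str
    output: str = ""
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    duration_ms: float = 0.0
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    parent_run_id: Optional[str] = None  # links this span into the execution tree

root = Span(span_type="chain", name="qa_chain", input="What is our refund policy?")
child = Span(span_type="llm_call", name="gpt-4", input="<prompt>",
             parent_run_id=root.run_id)
```

Walking the `parent_run_id` links from any span recovers the full path from root chain to leaf LLM call.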
Method 1: LangChain Callbacks (The Foundation)
LangChain's callback system is the hook for all tracing. Every component — LLMs, chains, tools, retrievers — fires events at key moments:
- on_llm_start / on_llm_end — when an LLM call begins and finishes
- on_chain_start / on_chain_end — when a chain starts and completes
- on_tool_start / on_tool_end — when a tool is invoked
- on_retriever_start / on_retriever_end — when a retriever fetches documents
Each callback receives a run_id (unique to that operation) and a parent_run_id (linking it to its parent). This is what makes execution trees possible.
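To see how those two IDs yield a tree, here is a toy sketch that groups runs under their parents; the event stream is simulated rather than coming from a real LangChain run:

```python
from collections import defaultdict

class TreeBuilder:
    """Toy sketch: assemble callback events into an execution tree."""
    def __init__(self):
        self.children = defaultdict(list)  # parent_run_id -> [run_id, ...]
        self.names = {}

    def on_start(self, name, run_id, parent_run_id=None):
        """Record one *_start event and attach it under its parent."""
        self.names[run_id] = name
        self.children[parent_run_id].append(run_id)

    def render(self, run_id=None, depth=0):
        """Return an indented text tree (run_id=None is the top level)."""
        lines = []
        for child in self.children.get(run_id, []):
            lines.append("  " * depth + self.names[child])
            lines.extend(self.render(child, depth + 1))
        return lines

# Simulated event stream: an executor containing one LLM call and one tool call
tree = TreeBuilder()
tree.on_start("AgentExecutor", run_id="r1")
tree.on_start("gpt-4", run_id="r2", parent_run_id="r1")
tree.on_start("search", run_id="r3", parent_run_id="r1")
print("\n".join(tree.render()))
```

A real handler would do the same bookkeeping inside `on_*_start`/`on_*_end` methods on a `BaseCallbackHandler` subclass.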
Basic verbose logging
The simplest approach:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4", verbose=True)
response = llm.invoke("What is prompt injection?")
This prints raw logs to stdout. Useful for local debugging, but useless in production — no structured data, no search, no alerts, no cost tracking.
Custom callback handler
You can build your own:
from langchain_core.callbacks import BaseCallbackHandler
class MyTracer(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        print(f"LLM started: {run_id}")

    def on_llm_end(self, response, *, run_id, **kwargs):
        # llm_output may be None for some providers, so guard before .get()
        tokens = (response.llm_output or {}).get("token_usage", {})
        print(f"LLM finished: {tokens}")

    def on_tool_start(self, serialized, input_str, *, run_id, **kwargs):
        print(f"Tool called: {serialized.get('name')}")
llm = ChatOpenAI(model="gpt-4", callbacks=[MyTracer()])
This gives you control, but you're building logging infrastructure from scratch — storage, querying, visualization, alerting, cost calculation. That's weeks of work.
Method 2: Automatic Tracing with AgentShield (2 Lines of Code)
AgentShield provides a LangChain callback handler that automatically captures everything — LLM calls, tool calls, chain steps, retrievers, agent decisions — with full parent-child span trees, token counts, cost estimation, and risk analysis.
Installation
pip install agentshield-ai[langchain]
Basic setup
from agentshield import AgentShield
from agentshield.langchain_callback import AgentShieldCallbackHandler
from langchain_openai import ChatOpenAI
# Initialize
shield = AgentShield(api_key="ask_your_key_here")
handler = AgentShieldCallbackHandler(shield, agent_name="my-agent")
# Use with any LangChain component
llm = ChatOpenAI(model="gpt-4", callbacks=[handler])
response = llm.invoke("What are the risks of AI agents in production?")
That's it. Every LLM call is now traced with model name, input/output, tokens, duration, and cost.
With chains
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_template("Explain {topic} in 3 sentences.")
chain = prompt | llm | StrOutputParser()
# The handler captures the entire chain execution
result = chain.invoke(
    {"topic": "prompt injection"},
    config={"callbacks": [handler]}
)
The trace shows:
1. Chain start (input: "prompt injection")
2. LLM call (model: gpt-4, tokens: 150 in / 89 out, cost: $0.008)
3. Parser execution
4. Chain end (output: the 3 sentences)
With agents and tools
from langchain import hub
from langchain.agents import create_react_agent, AgentExecutor
from langchain_community.tools import DuckDuckGoSearchRun
search = DuckDuckGoSearchRun()
tools = [search]
# A ReAct agent needs a prompt with the agent_scratchpad variable;
# the standard one is available on the LangChain Hub
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    callbacks=[handler],  # Traces everything automatically
)
result = executor.invoke({"input": "What happened with AI agents this week?"})
The trace captures the full agent loop:
1. Agent executor starts
2. LLM call #1 — agent "thinks" and decides to use search
3. Tool call: DuckDuckGoSearch (input: "AI agents news March 2026")
4. LLM call #2 — agent processes search results
5. Agent finishes with final answer
Each step shows duration, tokens, cost, and parent-child relationships in a visual span tree.
With RAG (Retrieval-Augmented Generation)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
embeddings = OpenAIEmbeddings()
retriever = Chroma(embedding_function=embeddings).as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
)
result = qa_chain.invoke("What is our refund policy?")
The trace captures:
1. Chain start
2. Retriever query: "What is our refund policy?"
3. Documents returned (count, sources, content preview)
4. LLM call with retrieved context
5. Final answer
This is critical for debugging RAG failures — when the answer is wrong, you can see exactly which documents were retrieved and whether the LLM used them correctly.
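For a quick local look at what the retriever returned, a callback can record the documents as they come back. The sketch below is stdlib-only for illustration: `Doc` stands in for `langchain_core.documents.Document`, and in real code the handler would subclass `BaseCallbackHandler` (whose `on_retriever_end` receives the retrieved documents) and be passed via `callbacks=[...]`:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document in this sketch."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class RetrieverLogger:
    """Sketch: record source and a content preview for each retrieval."""
    def __init__(self):
        self.retrievals = []

    def on_retriever_end(self, documents, *, run_id=None, **kwargs):
        self.retrievals.append([
            (d.metadata.get("source", "unknown"), d.page_content[:80])
            for d in documents
        ])

logger = RetrieverLogger()
logger.on_retriever_end([
    Doc("Refunds are issued within 30 days of purchase.", {"source": "policy.md"}),
    Doc("Shipping times vary by region.", {"source": "shipping.md"}),
])
# Inspect which documents backed the answer
for source, preview in logger.retrievals[0]:
    print(source, "->", preview)
```

If the sources logged here have nothing to do with refunds, you've found your RAG bug before the LLM ever gets blamed.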
What a Trace Looks Like
When you open the AgentShield dashboard, you see a span tree — a visual breakdown of every step your agent took:
[trace] my-agent 2.3s total
├── [function] AgentExecutor 2.3s
│ ├── [llm_call] gpt-4 0.8s (150 in / 89 out) $0.008
│ ├── [tool_call] DuckDuckGoSearch 0.6s
│ ├── [llm_call] gpt-4 0.7s (340 in / 120 out) $0.015
│ └── [llm_call] gpt-4 0.2s (95 in / 45 out) $0.004
└── Total: 3 LLM calls, $0.027, 2.3s
Each span shows:
- Type (LLM call, tool call, retrieval, chain step)
- Duration (how long it took)
- Tokens (input/output count)
- Cost (estimated based on model pricing)
- Input/Output (expandable to see full content)
- Risk level (if the output triggered any alerts)
Tracing Best Practices for Production
1. Name your agents
handler = AgentShieldCallbackHandler(
    shield,
    agent_name="support-bot-v2",  # Not just "my-agent"
)
When you have 5 agents in production, you need to know which one caused the alert.
2. Use session IDs to group conversations
handler = AgentShieldCallbackHandler(
    shield,
    agent_name="support-bot",
    session_id=f"user-{user_id}-{conversation_id}",
)
This groups all traces from a single user conversation, so you can replay the entire interaction.
3. Add metadata for filtering
handler = AgentShieldCallbackHandler(
    shield,
    agent_name="support-bot",
    trace_metadata={
        "environment": "production",
        "version": "2.1.0",
        "user_tier": "enterprise",
    },
)
Filter traces by environment, version, or customer tier in the dashboard.
4. One handler per execution
Create a new handler for each request to keep traces isolated:
@app.post("/chat")
async def chat(request: ChatRequest):
    handler = AgentShieldCallbackHandler(
        shield,
        agent_name="support-bot",
        session_id=request.session_id,
    )
    result = chain.invoke(
        {"input": request.message},
        config={"callbacks": [handler]}
    )
    return {"response": result}
5. Monitor costs across models
Different models have different costs. Tracing lets you see exactly which model was used for each call:
gpt-4: 3 calls → $0.027 (67% of cost)
gpt-3.5: 12 calls → $0.004 (10% of cost)
claude-3: 2 calls → $0.009 (23% of cost)
This data helps you optimize — maybe the GPT-4 call for formatting can be replaced with GPT-3.5.
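A breakdown like the one above is straightforward to compute yourself from a list of traced spans; the span shape here is a hypothetical example, not a real export format:

```python
from collections import defaultdict

# Hypothetical traced LLM spans: (model name, cost in USD)
spans = [
    ("gpt-4", 0.009), ("gpt-4", 0.010), ("gpt-4", 0.008),
    ("gpt-3.5", 0.004), ("claude-3", 0.009),
]

cost_by_model = defaultdict(float)
for model, cost in spans:
    cost_by_model[model] += cost

total = sum(cost_by_model.values())
for model, cost in sorted(cost_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{model}: ${cost:.3f} ({cost / total:.0%} of cost)")
```

The same grouping works for tokens or latency; swap the cost field for whichever metric you're optimizing.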
Common Issues Tracing Helps You Catch
1. Infinite loops
A recursive chain with no exit condition. Tracing shows the repeating pattern immediately — 50 identical LLM calls in a row, each costing $0.03.
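Tracing reveals the loop; you can also guard against it in code. `AgentExecutor` has a built-in `max_iterations` parameter for exactly this, and a simple repeat detector (a hypothetical helper, not part of LangChain) shows the underlying idea:

```python
class LoopGuard:
    """Hypothetical sketch: flag an agent loop that repeats the same prompt.

    AgentExecutor's max_iterations parameter is the built-in way to cap
    the loop; this just illustrates the detection logic.
    """
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.last = None
        self.count = 0

    def check(self, prompt: str) -> bool:
        """Return False once the same prompt has repeated too many times."""
        if prompt == self.last:
            self.count += 1
        else:
            self.last, self.count = prompt, 1
        return self.count <= self.max_repeats

guard = LoopGuard(max_repeats=3)
results = [guard.check("same prompt") for _ in range(5)]
print(results)
```

Either way, the cap turns a $12,000 surprise into a handful of wasted calls.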
2. Retriever returning irrelevant documents
The LLM generates a confident but wrong answer because the retriever pulled documents about a different topic. Tracing shows the query, the documents, and the gap.
3. Token explosion
A single prompt includes the entire conversation history (8,000 tokens) when it only needed the last message. Tracing shows token counts per call so you can spot inefficiencies.
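The usual fix is trimming history before the call. A minimal sketch, approximating token counts with word counts (real code would use a proper tokenizer such as tiktoken):

```python
def trim_history(messages, max_tokens=1000):
    """Keep the most recent messages that fit the budget (newest kept first).

    Token counts are approximated as word counts for this sketch.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = len(msg.split())
        if used + tokens > max_tokens:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

history = ["hello " * 600, "old question " * 100, "latest message"]
trimmed = trim_history(history, max_tokens=300)
print(len(trimmed), "of", len(history), "messages kept")
```

With per-call token counts in your traces, you can set the budget from data instead of guessing.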
4. Tool failures
A tool throws an error, the agent retries 5 times, each retry costs tokens. Tracing shows the error, the retries, and the total cost of the failure.
5. Model mismatch
You thought you were using GPT-3.5 for the summarization step, but it's actually hitting GPT-4. Tracing shows the exact model for every call.
Getting Started
- Sign up at useagentshield.com (free tier, no credit card)
- Install the SDK:
pip install agentshield-ai[langchain]
- Add 2 lines to your existing agent:
from agentshield import AgentShield
from agentshield.langchain_callback import AgentShieldCallbackHandler
shield = AgentShield(api_key="ask_your_key_here")
handler = AgentShieldCallbackHandler(shield, agent_name="my-agent")
# Pass handler to any LangChain component
llm = ChatOpenAI(model="gpt-4", callbacks=[handler])
Every LLM call, tool use, retrieval, and chain step — traced automatically. Fail-silent. Never breaks your agent.
Building with LangChain? Follow AgentShield on Twitter/X for more guides on AI agent observability.
Start monitoring your AI agents
3 lines of code. Real-time risk analysis. Automatic tracing for LangChain and CrewAI.