
# Introduction
Traditional debugging with print() or logging works, but it is slow and clunky with LLM applications. Phoenix by Arize AI is a powerful open-source observability and tracing tool designed specifically for LLM applications. It helps you monitor, debug, and trace everything happening in your LLM pipelines visually: a timeline view of every step, prompt and response inspection, error and retry detection, visibility into latency and costs, and a complete visual picture of your app. In this article, we'll walk through what Phoenix does and why it matters, how to integrate Phoenix with LangChain step by step, and how to visualize traces in the Phoenix UI.
# What is Phoenix?
Phoenix is an open-source observability and debugging tool made for large language model applications. It captures detailed telemetry data from your LLM workflows, including prompts, responses, latency, errors, and tool usage, and presents this information in an intuitive, interactive dashboard. Phoenix allows developers to deeply understand how their LLM pipelines behave, identify and debug issues with prompt outputs, analyze performance bottlenecks, monitor token usage and associated costs, and trace errors and retry logic during execution. It offers ready-made integrations with popular frameworks like LangChain and LlamaIndex, and also supports OpenTelemetry for more customized setups.
# Step-by-Step Setup
// 1. Installing Required Libraries
Make sure you have Python 3.8+ and install the dependencies:
pip install arize-phoenix langchain langchain-together openinference-instrumentation-langchain langchain-community
// 2. Launching Phoenix
Add these lines to launch the Phoenix dashboard:
import phoenix as px
px.launch_app()
This starts a local dashboard at http://localhost:6006.
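If you prefer to capture the dashboard address programmatically, launch_app() returns a session object. A minimal sketch, assuming your arize-phoenix version exposes a url attribute on that session:
import phoenix as px

# Launch Phoenix and print the dashboard URL from the returned session
session = px.launch_app()
print(session.url)  # typically http://localhost:6006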
// 3. Building the LangChain Pipeline with Phoenix Callback
Let's understand Phoenix through a use case: a simple LangChain-powered chatbot. We want to:
- Debug whether the prompt is working
- Monitor how long the model takes to respond
- Track prompt structure, model usage, and outputs
- See all this visually instead of logging everything manually
// Step 1: Launch the Phoenix Dashboard in the Background
import threading
import phoenix as px
# Launch Phoenix app locally (access at http://localhost:6006)
def run_phoenix():
    px.launch_app()

threading.Thread(target=run_phoenix, daemon=True).start()
// Step 2: Register Phoenix with OpenTelemetry & Instrument LangChain
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor
# Register OpenTelemetry tracer
tracer_provider = register()
# Instrument LangChain with Phoenix
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
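If you run several applications against the same Phoenix instance, it helps to group traces by project. Recent phoenix.otel releases let register() take a project_name argument; a hedged sketch (the name "langchain-chatbot" is illustrative, not from the original setup):
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Group all traces from this app under a named project in the Phoenix UI
tracer_provider = register(project_name="langchain-chatbot")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)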
// Step 3: Initialize the LLM (Together API)
from langchain_together import Together
llm = Together(
    model="meta-llama/Llama-3-8b-chat-hf",
    temperature=0.7,
    max_tokens=256,
    together_api_key="your-api-key",  # Replace with your actual API key
)
Please don't forget to replace "your-api-key" with your actual Together AI API key, which you can generate from your Together AI account.
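Hardcoding keys in source files makes them easy to leak. A safer sketch reads the key from an environment variable instead (TOGETHER_API_KEY is just the variable name chosen here; export it in your shell before running):
import os
from langchain_together import Together

# Read the API key from the environment instead of embedding it in code
llm = Together(
    model="meta-llama/Llama-3-8b-chat-hf",
    temperature=0.7,
    max_tokens=256,
    together_api_key=os.environ["TOGETHER_API_KEY"],
)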
// Step 4: Define the Prompt Template
from langchain.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}"),
])
// Step 5: Combine Prompt and Model into a Chain
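The snippet for this step isn't shown above; with recent LangChain versions, the idiomatic way to combine the prompt and model is the LCEL pipe operator, which produces the chain object invoked in the next step:
# Pipe the formatted prompt into the LLM to form a runnable chain
chain = prompt | llm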
// Step 6: Ask Multiple Questions and Print Responses
questions = [
    "What is the capital of France?",
    "Who discovered gravity?",
    "Give me a motivational quote about perseverance.",
    "Explain photosynthesis in one sentence.",
    "What is the speed of light?",
]
print("Phoenix running at http://localhost:6006n")
for q in questions:
print(f" Question: {q}")
response = chain.invoke({"question": q})
print(" Answer:", response, "n")
// Step 7: Keep the App Alive for Monitoring
import time

try:
    while True:
        time.sleep(1)  # Idle loop keeps the process (and the Phoenix dashboard) running
except KeyboardInterrupt:
    print("Exiting.")
# Understanding Phoenix Traces & Metrics
Before looking at the output, let's understand the metrics Phoenix reports, starting with what traces and spans are:
Trace: Each trace represents one full run of your LLM pipeline. For example, each question like “What is the capital of France?” generates a new trace.
Spans: Each trace is made up of multiple spans, each representing a stage in your chain:
- ChatPromptTemplate.format: Prompt formatting
- TogetherLLM.invoke: LLM call
- Any custom components you add
Metrics Shown per Trace
| Metric | Meaning & Importance |
|---|---|
| Latency (ms) | Measures total time for the full LLM chain execution, including prompt formatting, LLM response, and post-processing. Helps identify performance bottlenecks and debug slow responses. |
| Input Tokens | Number of tokens sent to the model. Important for monitoring input size and controlling API costs, since most usage is token-based. |
| Output Tokens | Number of tokens generated by the model. Useful for understanding verbosity, response quality, and cost impact. |
| Prompt Template | Displays the full prompt with inserted variables. Helps confirm whether prompts are structured and filled in correctly. |
| Input / Output Text | Shows both the user input and the model's response. Useful for checking interaction quality and spotting hallucinations or incorrect answers. |
| Span Durations | Breaks down the time taken by each step (like prompt creation or model invocation). Helps identify performance bottlenecks within the chain. |
| Chain Name | Specifies which part of the pipeline a span belongs to (e.g., prompt.format, TogetherLLM.invoke). Helps isolate where issues are occurring. |
| Tags / Metadata | Extra information like model name, temperature, etc. Useful for filtering runs, comparing results, and analyzing parameter impact. |
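Beyond the dashboard, you can also pull this trace data into Python for offline analysis. A minimal sketch, assuming your arize-phoenix version exposes px.Client().get_spans_dataframe():
import phoenix as px

# Export all recorded spans (names, timings, token counts, inputs/outputs) as a pandas DataFrame
client = px.Client()
spans_df = client.get_spans_dataframe()
print(spans_df.head())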
Now visit http://localhost:6006 to view the Phoenix dashboard. You will see a list of traces, one for each question the chatbot answered. Open the first trace to view its details.
# Wrapping Up
To wrap it up, Arize Phoenix makes it incredibly easy to debug, trace, and monitor your LLM applications. You don’t have to guess what went wrong or dig through logs. Everything’s right there: prompts, responses, timings, and more. It helps you spot issues, understand performance, and just build better AI experiences with way less stress.
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.