You have built a complex LLM application that responds to user queries about a specific domain. You have spent days setting up the complete pipeline, from refining your prompts to adding context retrieval, chains and tools, and finally presenting the output. However, after deployment, you realize that the application's responses are missing the mark: either you are not satisfied with their quality, or the application takes an exorbitant amount of time to respond. Whether the problem is rooted in your prompts, your retrieval, your API calls, or somewhere else, monitoring and observability can help you sort it out.
In this tutorial, we will start by learning the basics of LLM monitoring and observability. Then, we will explore the open-source ecosystem, culminating in a discussion of Langfuse. Finally, we will implement monitoring and observability for a Python-based LLM application using Langfuse.
What is Monitoring and Observability?
Monitoring and observability are crucial concepts in maintaining the health of any IT system. While the terms 'monitoring' and 'observability' are often used interchangeably, they represent slightly different concepts.
According to IBM's definition, monitoring is the process of collecting and analyzing system data to track performance over time. It relies on predefined metrics to detect anomalies or potential failures. Common examples include tracking a system's CPU and memory usage and alerting when certain thresholds are breached.
Observability provides a deeper understanding of the system’s internal state based on external outputs. It allows you to diagnose and understand why something is happening, not just that something is wrong. For example, observability allows you to trace inputs and outputs through various parts of the system to spot where a bottleneck is occurring.
The above definitions are also valid in the realm of LLM applications. It's through monitoring and observability that we can trace the internal states of an LLM application, such as how a user query is processed through various modules (e.g., retrieval, generation) and what the associated latencies and costs are.
Here are some key terms used in monitoring and observability:
Telemetry: Telemetry is a broad term which encompasses collecting data from your application while it’s running and processing it to understand the behavior of the application.
Instrumentation: Instrumentation is the process of adding code to your application to collect telemetry data. For LLM applications, this means adding hooks at various key points to capture internal states, such as API calls to the LLM or the retriever’s outputs.
Trace: A trace, a direct consequence of instrumentation, captures the detailed execution journey of a request through the entire application. This encompasses the input/output at each key point and the corresponding time taken at each point.
Observation: Each trace is made up of one or more observations, which can be of type Span, Event or Generation.
Span: A span is a unit of work or operation, describing the processing carried out at each key point.
Generation: A generation is a special kind of span which tracks the request sent to the LLM and its output response.
Logs: Logs are time-stamped records of events and interactions within the LLM application.
Metrics: Metrics are numerical measurements that provide aggregate insights into the LLM's behavior and performance, such as hallucination rate or answer relevancy.
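To make these terms concrete, here is a minimal sketch of how a trace, a span and a generation map onto Langfuse's low-level Python SDK (introduced later in this tutorial). It assumes a running Langfuse server at localhost:3000 and placeholder API keys; the inputs and outputs are illustrative only.

from langfuse import Langfuse

# Placeholder keys and host; see the Langfuse setup later in this tutorial
langfuse = Langfuse(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000",
)

# A trace represents one request's journey through the application
trace = langfuse.trace(name="user-query", input="What is the capital of Pakistan?")

# A span records a unit of work, e.g. the retrieval step
span = trace.span(name="retrieval", input="capital of Pakistan")
span.end(output="Top-3 documents about Pakistan")

# A generation is a special span for the LLM call itself
trace.generation(
    name="llm-call",
    model="gpt-4o-mini",
    input="Name the capital of Pakistan in one phrase only",
    output="Islamabad.",
)

# Send buffered events to the server before the script exits
langfuse.flush()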

Why is LLM Monitoring and Observability Necessary?
As LLM applications are becoming increasingly complex, LLM monitoring and observability can play a crucial role in optimizing the application performance. Here are some reasons why it is important:
Reliability: LLM applications are often critical to organizations; performance degradation can directly impact their businesses. Monitoring ensures that the application is performing within acceptable limits in terms of quality, latency and uptime.
Debugging: A complex LLM application can be unpredictable; it can produce erroneous responses or encounter errors. Monitoring and Observability can help identify problems in the application by sifting through the complete lifecycle of each request and pinpointing the root cause.
User Experience: Monitoring user experience and feedback is vital for LLM applications which directly interact with the customer base. This allows organizations to enhance user experience by tracking the user conversations and making informed decisions. Most importantly, it allows collection of users’ feedback to improve the model and downstream processes.
Bias and Fairness: LLMs are trained on publicly available data and can therefore internalize biases present in that data. This might cause them to produce offensive or harmful content. Observability can help mitigate such responses through proper corrective measures.
Cost Management: Monitoring can help you track and optimize costs incurred during regular operations, such as per-token LLM API costs. You can also set up alerts in case of overuse.
Tools for Monitoring and Observability
There are many excellent tools and libraries available for enabling monitoring and observability of LLM applications. Plenty of these tools are open source, offering free self-hosting on local infrastructure as well as enterprise-level deployment on their respective cloud servers. Each of these tools offers common features such as tracing, token counts, latencies, total requests and time-based filtering. Apart from this, each solution has its own set of distinct features and strengths.
Here, we are going to name only a few open-source tools which offer free self-hosting solutions.
Langfuse: A popular open-source LLM monitoring tool, which is both model and framework agnostic. It offers a wide range of monitoring options using client SDKs purpose-built for Python and JavaScript/TypeScript.
Arize Phoenix: Another popular tool which offers both self-hosting and Phoenix Cloud deployment. Phoenix offers SDKs for Python and JavaScript/TypeScript.
AgentOps: AgentOps is a well-known solution which tracks LLM outputs and retrievers, allows benchmarking, and helps ensure compliance. It offers integration with several LLM providers.
Grafana: A classic and widely used monitoring tool which can be combined with OpenTelemetry to provide detailed LLM tracing and monitoring.
Weave: Weights & Biases’ Weave is another LLM tracking and experimentation tool for LLM based applications, which offers both self-managed and dedicated cloud environments. The Client SDKs are available in Python and TypeScript.
Introducing Langfuse
Note: Langfuse should not be confused with LangSmith, which is a proprietary monitoring and observability tool, developed and maintained by the LangChain team. You can learn more about the differences here.
Langfuse offers a wide variety of features such as LLM observability, tracing, LLM token and cost monitoring, prompt management, datasets and LLM security. Additionally, Langfuse offers evaluation of LLM responses using various techniques such as LLM-as-a-Judge and user feedback. Moreover, Langfuse offers an LLM playground to its premium users, which allows you to tweak your prompts and model parameters on the spot and watch how the LLM responds to those changes. We will discuss more details later in this tutorial.
Langfuse’s solution to LLM monitoring and observability consists of two parts:
- Langfuse SDKs
- Langfuse Server
The Langfuse SDKs are the code side of Langfuse, available for various platforms, which allow you to enable instrumentation in your application. In practice, they amount to a few lines of code placed appropriately in your application's codebase.
The Langfuse server, on the other hand, is the UI-based dashboard, along with the underlying services, used to log, view and persist all traces and metrics. The Langfuse dashboard is accessible through any modern web browser.
Before setting up the dashboard, it’s important to note that Langfuse offers three different ways of hosting dashboards, which are:
- Self-hosting (local)
- Managed hosting (using Langfuse’s cloud infrastructure)
- On-premises deployment
Managed and on-premises deployments are beyond the scope of this tutorial. You can visit Langfuse's official documentation for all the relevant information.
A self-hosted solution, as the name implies, enables you to simply run an instance of Langfuse on your own machine (e.g., PC, laptop, virtual machine or web service). However, there is a catch in this simplicity. The Langfuse server requires a persistent Postgres database to maintain its state and data. This means that along with the Langfuse server, we also need to set up a Postgres server. But don't worry, we have things under control. You can either use a Postgres server hosted on a cloud service (such as Azure or AWS), or you can easily self-host it, just like the Langfuse service. Capiche?
How is Langfuse's self-hosting accomplished? Langfuse offers several ways to do this, such as using docker/docker-compose or Kubernetes, or deploying on cloud servers. For the time being, let's stick to docker commands.
Setting Up a Langfuse Server
Now, it’s time to get hands-on experience with setting up a Langfuse dashboard for an LLM application and logging traces and metrics onto it. When we say Langfuse server, we mean the Langfuse’s dashboard and other services which allow the traces to be logged, viewed and persisted. This requires a fundamental understanding of docker and its associated concepts. You can go through this tutorial, if you are not already familiar with docker.
Using docker-compose
The most convenient and the fastest way to set up Langfuse on your own machine is to use a docker-compose file. This is just a two-step process, which involves cloning Langfuse on your local machine and simply invoking docker-compose.
Step 1: Clone the Langfuse’s repository:
$ git clone https://github.com/langfuse/langfuse.git
$ cd langfuse
Step 2: Start all services
$ docker compose up
And that's it! Go to your web browser and open http://localhost:3000 to see the Langfuse UI in action. Note that docker-compose takes care of the Postgres server for you automatically.
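If you want to confirm that everything came up correctly, you can list the running services and tail their logs (the exact service names may vary between Langfuse versions):

$ docker compose ps
$ docker compose logs -f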
From this point, we can safely move on to the section of setting up Python SDK and enabling instrumentation in our code.
Using docker
The docker setup of the Langfuse server is similar to the docker-compose implementation, with an obvious difference: we set up the two containers (Langfuse and Postgres) separately and connect them using an internal network. This can be helpful in scenarios where docker-compose is not the most suitable choice, for example because you already have a Postgres server running, or because you want to run the two services separately for more control, such as hosting them as separate Azure Web App Services due to resource limitations.
Step 1: Create a custom network
First, we need to set up a custom bridge network, which will allow both the containers to communicate with each other privately.
$ docker network create langfuse-network
This command creates a network named langfuse-network. Feel free to change the name according to your preferences.
Step 2: Set up a Postgres service
We will start by running the Postgres container, since the Langfuse service depends on it, using the following command:
$ docker run -d \
  --name postgres-db \
  --restart always \
  -p 5432:5432 \
  --network langfuse-network \
  -v database_data:/var/lib/postgresql/data \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=postgres \
  postgres:latest
Explanation:
This command runs the postgres:latest docker image as a container named postgres-db, attaches it to the langfuse-network network, and exposes the service on port 5432 of your local machine. For persistence (i.e., to keep data intact for future use), it creates a named docker volume called database_data and mounts it at /var/lib/postgresql/data, the Postgres data directory inside the container. Furthermore, it sets three crucial environment variables for the Postgres superuser: POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB.
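Before moving on, you can optionally check that Postgres is accepting connections by listing its databases from inside the container; something along these lines should work with the credentials used above:

$ docker exec -it postgres-db psql -U postgres -c "\l"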
Step 3: Set up the Langfuse service
$ docker run -d \
  --name langfuse-server \
  --network langfuse-network \
  -p 3000:3000 \
  -e DATABASE_URL=postgresql://postgres:postgres@postgres-db:5432/postgres \
  -e NEXTAUTH_SECRET=mysecret \
  -e SALT=mysalt \
  -e ENCRYPTION_KEY=0000000000000000000000000000000000000000000000000000000000000000 \
  -e NEXTAUTH_URL=http://localhost:3000 \
  langfuse/langfuse:2
Explanation:
Likewise, this command runs the langfuse/langfuse:2 docker image in detached mode (-d), as a container named langfuse-server, attaches it to the same langfuse-network network, and exposes the service on port 3000. It also assigns values to the mandatory environment variables. NEXTAUTH_URL must point to the URL where the langfuse-server will be reachable. ENCRYPTION_KEY must be 256 bits, i.e., 64 characters in hex format. You can generate one on Linux via:
$ openssl rand -hex 32
DATABASE_URL is the environment variable which defines the complete database path and credentials. The general format of a Postgres URL is:
postgresql://[POSTGRES_USER[:POSTGRES_PASSWORD]@]host[:port]/POSTGRES_DB
Here, host is the hostname (i.e., the container name) of our PostgreSQL server, or its IP address.
Finally, go to your web browser and open http://localhost:3000 to verify that the Langfuse server is available.
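Langfuse also exposes a public health endpoint, which is handy for scripted checks; at the time of writing, it responds with a small status payload when the server is up:

$ curl http://localhost:3000/api/public/health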
Configuring Langfuse Dashboard
Once you have successfully set up the Langfuse server, it’s time to configure the Langfuse dashboard before you can start tracing application data.
Go to http://localhost:3000 in your web browser, as explained in the previous section. You must create a new organization, add members, and create a project under which you will trace and log all your metrics. The dashboard walks you through each of these steps.
For example, here we have set up an organization named datamonitor, added a member named data-user1 with the "Owner" role, and created a project named data-demo. This leads us to the following screen:

This screen displays both the public and secret API keys, which will be used when setting up tracing with the SDKs; keep them saved for future use. With this step, we are finally done configuring the Langfuse server. The only task left is to enable instrumentation on the code side of our application.
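A convenient way to keep these keys out of your code is to store them in a .env file (the same file we use below for the OpenAI key). The variable names shown here follow Langfuse's conventional environment variables; the values are placeholders for the keys generated on your own dashboard:

LANGFUSE_PUBLIC_KEY="pk-lf-..."
LANGFUSE_SECRET_KEY="sk-lf-..."
LANGFUSE_HOST="http://localhost:3000"
OPENAI_KEY="sk-..."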
Enabling Langfuse Tracing using SDKs
Langfuse offers a straightforward way to enable tracing of LLM applications with minimal lines of code. As mentioned earlier, Langfuse offers tracing solutions for various languages, frameworks and LLM models, such as Langchain, LlamaIndex, OpenAI and others. You can even enable Langfuse tracing in serverless functions such as AWS Lambda.
But before we trace our application, let's create a sample application using OpenAI's Python SDK. We will create a very simple chat completion application using OpenAI's gpt-4o-mini for demonstration purposes only.
First, install the required packages:
$ pip install openai python-dotenv
import os
import openai
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv('OPENAI_KEY', '')

client = openai.OpenAI(api_key=api_key)

country = 'Pakistan'
query = f"Name the capital of {country} in one phrase only"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": query},
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
Output:
Islamabad.
Let's now enable Langfuse tracing in the given code. You only have to make minor adjustments, beginning with installing the langfuse package.
Install all the required packages once again:
$ pip install langfuse openai --upgrade
The code with langfuse enabled looks like this:
import os
#import openai
from langfuse.openai import openai
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv('OPENAI_KEY', '')

LANGFUSE_SECRET_KEY = "sk-lf-..."
LANGFUSE_PUBLIC_KEY = "pk-lf-..."
LANGFUSE_HOST = "http://localhost:3000"

os.environ['LANGFUSE_SECRET_KEY'] = LANGFUSE_SECRET_KEY
os.environ['LANGFUSE_PUBLIC_KEY'] = LANGFUSE_PUBLIC_KEY
os.environ['LANGFUSE_HOST'] = LANGFUSE_HOST

client = openai.OpenAI(api_key=api_key)

country = 'Pakistan'
query = f"Name the capital of {country} in one phrase only"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": query},
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
As you can see, we have simply replaced import openai with from langfuse.openai import openai (and provided the Langfuse keys via environment variables) to enable tracing.
If you now go to your Langfuse dashboard, you will observe traces of the OpenAI application.
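The drop-in OpenAI wrapper is not the only way to instrument plain Python code. The Langfuse v2 Python SDK also provides an @observe decorator that groups everything executed inside a function into a single trace. Here is a minimal sketch, assuming the same environment variables as above are set; the function name is ours, purely for illustration:

import os
from dotenv import load_dotenv
from langfuse.decorators import observe
from langfuse.openai import openai  # drop-in replacement, as above

load_dotenv()
client = openai.OpenAI(api_key=os.getenv('OPENAI_KEY', ''))

@observe()  # each call to ask_capital() becomes one trace
def ask_capital(country: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Name the capital of {country} in one phrase only"}],
        max_tokens=100,
    )
    return response.choices[0].message.content

print(ask_capital("Pakistan"))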
A Complete End-to-End Example
Now let’s dive into enabling monitoring and observability on a complete LLM application. We will implement a RAG pipeline, which fetches relevant context from the vector database. We are going to use ChromaDB as a vector database.
We will use the Langchain framework to build our RAG based application (refer to ‘basic LLM-RAG application’ figure above). You can learn Langchain by pursuing this tutorial on how to build LLM applications with Langchain.
If you want to learn the basics of RAG, this tutorial can be a good starting point. As for the vector database, refer to this tutorial on setting up ChromaDB.
This section assumes that you have already set up and configured the Langfuse server on the localhost, as done in the previous section.
Step 1: Installation and Setup
Install all the required packages, including langchain, chromadb and langfuse:
$ pip install -U langchain-community langchain-openai chromadb langfuse
Next, we import all the required packages and libraries:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langfuse.callback import CallbackHandler
from dotenv import load_dotenv
The load_dotenv function (from the python-dotenv package) loads all environment variables saved in a .env file. Make sure that your OpenAI secret key is saved as OPENAI_API_KEY in the .env file.
Finally, we integrate Langfuse’s Langchain callback system to enable tracing in our application.
langfuse_handler = CallbackHandler(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000"
)
Step 2: Set up Knowledge Base
To mimic a RAG system, we will:
- Scrape some insightful articles from the Confiz blog section using WebBaseLoader
- Break them into smaller chunks using RecursiveCharacterTextSplitter
- Convert them into vector embeddings using OpenAI's embeddings
- Ingest them into our Chroma vector database. This will serve as the knowledge base our LLM looks up to answer user queries.
urls = [
    "https://www.confiz.com/blog/a-cios-guide-6-essential-insights-for-a-successful-generative-ai-launch/",
    "https://www.confiz.com/blog/ai-at-work-how-microsoft-365-copilot-chat-is-driving-transformation-at-scale/",
    "https://www.confiz.com/blog/setting-up-an-in-house-llm-platform-best-practices-for-optimal-performance/",
]

loader = WebBaseLoader(urls)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
    length_function=len,
)
chunks = text_splitter.split_documents(docs)

# Create the vector store
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="chroma_db",
    collection_name="confiz_blog"
)

retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3})
We have used a chunk size of 500 characters with an overlap of 20 characters (with length_function=len, the splitter measures length in characters, not tokens); the RecursiveCharacterTextSplitter considers various separators before chunking to the given size. The vectordb object of ChromaDB is then converted into a retriever object, allowing us to use it conveniently in the Langchain retrieval pipeline.
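Before wiring the retriever into a chain, it can be worth sanity-checking the split, for example by printing the chunk count and a preview of the first chunk:

print(f"Loaded {len(docs)} documents, split into {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # preview the first chunk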
Step 3: Set up RAG pipeline
The next step is to set up the RAG chain, combining the power of the LLM with the knowledge base in the vector database to answer user queries. As before, we will use OpenAI's gpt-4o-mini as our base model.
model = ChatOpenAI(
    model_name="gpt-4o-mini",
)

template = """
You are an AI assistant providing helpful information based on the given context.
Answer the question using only the provided context.
Context:
{context}
Question:
{question}
Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_chain_type(
    llm=model,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)
We have used RetrievalQA, which implements an end-to-end pipeline comprising document retrieval and the LLM's question-answering capability.
Step 4: Run RAG pipeline
It's time to run our RAG pipeline. Let's concoct a few queries related to the articles ingested into ChromaDB and observe the LLM's responses in the Langfuse dashboard.
queries = [
    "What are the ways to deal with compliance and security issues in generative AI?",
    "What are the key considerations for a successful generative AI launch?",
    "What are the key benefits of Microsoft 365 Copilot Chat?",
    "What are the best practices for setting up an in-house LLM platform?",
]

for query in queries:
    response = qa_chain.invoke({"query": query}, config={"callbacks": [langfuse_handler]})
    print(response)
    print('-' * 60)
As you might have noticed, the callbacks entry passed to qa_chain.invoke is what gives Langfuse the ability to capture traces of the complete RAG pipeline. Langfuse supports various frameworks and LLM libraries, which can be found here.
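One practical note: traces are sent to the Langfuse server asynchronously in the background. In short-lived scripts, it is therefore a good idea to flush the handler before the process exits (the v2 CallbackHandler exposes a flush method for this):

# Make sure all buffered events reach the Langfuse server before exiting
langfuse_handler.flush()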
Step 5: Observing the traces
Finally, it's time to open the Langfuse dashboard in the web browser and reap the fruits of our hard work. If you have followed this tutorial from the beginning, we created a project named data-demo under the organization named datamonitor. On the landing page of your Langfuse dashboard, you will find this project. Click on 'Go to project' and you will see a dashboard with various panels, such as traces and model costs.

As visible, you can adjust the time window and add filters according to your needs. The cool part is that you don't need to manually add the LLM's description and input/output token costs to enable cost tracking; Langfuse does it for you automatically. But this is not all: in the left bar, select Tracing > Traces to look at the individual traces. Since we asked four queries, we will observe four different traces, each representing the complete pipeline for one query.

Each trace is distinguished by an ID and timestamp, and shows the corresponding latency and total cost. The usage column shows the total input and output token usage for each trace.
If you click on any of those traces, Langfuse will depict the complete picture of the underlying processes, such as the inputs and outputs of each stage, covering everything from retrieval to the LLM call and generation. Insightful, isn't it?

Evaluation Metrics
As a bonus, let's also add our own custom metrics for the LLM's responses to the same dashboard. On a self-hosted setup like ours, this can be done by fetching the traces from the dashboard, applying customized evaluation to those traces, and publishing the scores back to the dashboard.
The evaluation can be performed by simply employing another LLM with suitable prompts. Alternatively, we can use evaluation frameworks such as DeepEval or promptfoo, which also use LLMs under the hood. We shall go with DeepEval, an open-source framework developed to evaluate the responses of LLMs.
Let’s do this process in the following steps:
Step 1: Installation and Setup
First, we install the deepeval framework:
$ pip install deepeval
Next, we make necessary imports:
from langfuse import Langfuse
from datetime import datetime, timedelta
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from dotenv import load_dotenv
load_dotenv()
Step 2: Fetching the traces from the dashboard
The first step is to fetch all the traces, within the given time window, from the running Langfuse server into our Python code.
langfuse_handler = Langfuse(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000"
)

now = datetime.now()
five_am_today = datetime(now.year, now.month, now.day, 5, 0)
five_am_yesterday = five_am_today - timedelta(days=1)

traces_batch = langfuse_handler.fetch_traces(
    limit=5,
    from_timestamp=five_am_yesterday,
    to_timestamp=datetime.now()
).data

print(f"Traces in first batch: {len(traces_batch)}")
Note that we are using the same secret and public keys as before, since we are fetching the traces from our data-demo project. Also note that we are fetching traces from 5 am yesterday until the current time.
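If your application produces more traces than fit in a single batch, fetch_traces is paginated. A rough sketch of walking through the pages follows; the meta attribute names assume the v2 SDK's response objects and may differ between versions:

all_traces = []
page = 1
while True:
    resp = langfuse_handler.fetch_traces(
        page=page,
        limit=50,
        from_timestamp=five_am_yesterday,
        to_timestamp=datetime.now(),
    )
    all_traces.extend(resp.data)
    if page >= resp.meta.total_pages:  # meta carries pagination info in the v2 SDK
        break
    page += 1
print(f"Fetched {len(all_traces)} traces in total")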
Step 3: Applying Evaluation
Once we have the traces, we can apply various evaluation metrics such as bias, toxicity, hallucination and relevance. For simplicity, let's stick to the AnswerRelevancyMetric metric.
def calculate_relevance(trace):
    relevance_model = 'gpt-4o-mini'
    relevancy_metric = AnswerRelevancyMetric(
        threshold=0.7,
        model=relevance_model,
        include_reason=True
    )
    test_case = LLMTestCase(
        input=trace.input['query'],
        actual_output=trace.output['result']
    )
    relevancy_metric.measure(test_case)
    return {"score": relevancy_metric.score, "reason": relevancy_metric.reason}

# Do this for each trace
for trace in traces_batch:
    try:
        relevance_measure = calculate_relevance(trace)
        langfuse_handler.score(
            trace_id=trace.id,
            name="relevance",
            value=relevance_measure['score'],
            comment=relevance_measure['reason']
        )
    except Exception as e:
        print(e)
        continue
In the above code snippet, we have defined the calculate_relevance function to calculate the relevance of a given trace using DeepEval's standard metric. Then we loop over all the traces and calculate each trace's relevance score individually. The langfuse_handler object takes care of logging each score back to the dashboard against its trace ID.
Step 4: Observing the metrics
Now, if you look at the same dashboard as before, you will see that the 'Scores' panel has been populated as well.

You will notice that the relevance score has been added to the individual traces as well.

You can also view the feedback provided by DeepEval for each trace individually.

This example showcases a simple way of logging evaluation metrics to the dashboard. Of course, there is more to metrics calculation and handling, but let's leave that for the future. You might also wonder what the most appropriate way is to log evaluation metrics for a running application. For a self-hosted setup, a straightforward answer is to run the evaluation script as a cron job at specific times. For the enterprise version, Langfuse offers live evaluation of LLM responses as they are populated on the dashboard.
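For the cron-based approach mentioned above, a crontab entry along these lines would run the evaluation every morning at 6 am (the script path and log file are hypothetical placeholders):

0 6 * * * /usr/bin/python3 /path/to/evaluate_traces.py >> /var/log/langfuse-eval.log 2>&1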
Advanced Features
Langfuse offers many advanced features, such as:
Prompt Management
This allows management and versioning of prompts in the Langfuse dashboard UI, enabling users to keep an eye on evolving prompts and to record metrics against each version of a prompt. Additionally, it supports a prompt playground to tweak prompts and model parameters and observe their effects on the overall LLM response, directly in the Langfuse UI.
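As a rough sketch of how this looks from code with the v2 Python SDK, a prompt created and versioned in the UI can be pulled and compiled at runtime; the prompt name "capital-question" and its variable are hypothetical:

from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

# Fetch the latest production version of a prompt managed in the UI
prompt = langfuse.get_prompt("capital-question")

# Fill in the prompt's template variables before sending it to the LLM
compiled = prompt.compile(country="Pakistan")
print(compiled)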
Datasets
The Datasets feature allows users to create benchmark datasets to measure the performance of the LLM application against different model parameters and tweaked prompts. As new edge cases are reported, they can be fed directly into the existing datasets.
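As a hedged illustration with the v2 Python SDK, a dataset and its items can also be created programmatically; the dataset name and the example item below are made up for this tutorial:

from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

# Create a benchmark dataset (the name is illustrative)
langfuse.create_dataset(name="capital-questions")

# Add an edge case reported by users
langfuse.create_dataset_item(
    dataset_name="capital-questions",
    input={"query": "What is the capital of Pakistan?"},
    expected_output="Islamabad",
)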
User Management
This feature allows organizations to track the costs and metrics associated with each user. This also means that organizations can trace the activity of each user, encouraging fair use of the LLM application.
Conclusion
In this tutorial, we explored LLM monitoring and observability and its related concepts. We implemented monitoring and observability using Langfuse, an open-source framework offering both free and enterprise solutions. Opting for self-hosting, we set up the Langfuse dashboard using docker, along with a PostgreSQL server for persistence. We then enabled instrumentation in our sample LLM application using the Langfuse Python SDK. Finally, we observed all the traces on the dashboard and evaluated those traces using the DeepEval framework.
In a future tutorial, we may explore advanced features of the Langfuse framework or other open-source frameworks such as Arize Phoenix. We may also work on deploying the Langfuse dashboard to a cloud service such as Azure, AWS or GCP.