The Observability Gap in LLM Apps
Traditional APM tools (Datadog, New Relic) capture HTTP latency and error rates. They tell you a request took 4 seconds, but not that 3.8 of those seconds were spent in an LLM call, that the prompt had 1,200 tokens, that the model was gpt-4o, or that the user-visible answer was a hallucination. Langfuse fills this gap with LLM-native observability.
Core Hierarchy
Trace → Span → Generation
- A Trace represents one user interaction (e.g., one chat message)
- Spans are logical steps within a trace (retrieve documents, format prompt, call LLM)
- A Generation is a specific LLM call with input/output tokens, model name, latency, and cost
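The hierarchy above can be pictured as a small data model. The sketch below is only an illustration in plain Python: the class and field names are assumptions, not Langfuse's actual schema, and the numbers are made up to echo the 4-second request from the introduction.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One LLM call (illustrative fields, not Langfuse's actual schema)."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float


@dataclass
class Span:
    """One logical step; it may wrap zero or more LLM calls."""
    name: str
    latency_s: float
    generations: list[Generation] = field(default_factory=list)


@dataclass
class Trace:
    """One user interaction, made up of spans."""
    name: str
    spans: list[Span] = field(default_factory=list)

    def llm_latency_s(self) -> float:
        return sum(g.latency_s for s in self.spans for g in s.generations)

    def total_cost_usd(self) -> float:
        return sum(g.cost_usd for s in self.spans for g in s.generations)


# Made-up numbers: three steps, only the last one calls an LLM.
trace = Trace(
    name="chat-message",
    spans=[
        Span("retrieve-documents", latency_s=0.15),
        Span("format-prompt", latency_s=0.05),
        Span(
            "call-llm",
            latency_s=3.8,
            generations=[Generation("gpt-4o", 1200, 300, 3.8, 0.006)],
        ),
    ],
)

total_latency = sum(s.latency_s for s in trace.spans)
print(f"LLM share of latency: {trace.llm_latency_s() / total_latency:.0%}")
```

Keeping token counts, model name, and cost on the Generation rather than on the Span is what lets a dashboard answer "where did the time and money go" per LLM call.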
Python SDK Integration
pip install "langfuse<3" openai  # the examples below use the v2 SDK's langfuse.decorators module
Decorator-Based Tracing
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@observe()
def retrieve_docs(query: str) -> str:
    # Simulate vector retrieval
    return f"Documents for: {query}"


# Use @observe(as_type="generation") to record this step as a Generation.
@observe()
def generate_answer(docs: str, question: str) -> str:
    # Optional: override the automatically captured input
    langfuse_context.update_current_observation(
        input={"docs": docs, "question": question}
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Context: {docs}"},
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content
    # Optional: override the automatically captured output
    langfuse_context.update_current_observation(output=answer)
    return answer


@observe()
def answer_question(question: str) -> str:
    # Outermost decorated call: this becomes the trace
    docs = retrieve_docs(question)
    return generate_answer(docs, question)


result = answer_question("What is PagedAttention?")
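The outermost @observe() call creates the trace, and every nested @observe() call becomes a span inside it. Langfuse captures function arguments as input and return values as output, so explicit update_current_observation calls are only needed to override those defaults.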
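To make the capture mechanism concrete, here is a toy decorator that records arguments, return values, and nesting. It is a simplified stand-in written for this article, not Langfuse's implementation; `toy_observe`, `recorded`, and the dictionary layout are all hypothetical.

```python
import contextvars
import functools

# Names of the observations currently open in this execution context.
_stack = contextvars.ContextVar("stack", default=())
recorded = []  # finished observations, in completion order


def toy_observe(func):
    """Toy stand-in for @observe(); NOT Langfuse's implementation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _stack.get()
        obs = {
            "name": func.__name__,
            # The outermost call acts as the trace, nested calls as spans.
            "kind": "span" if parent else "trace",
            "parent": parent[-1] if parent else None,
            "input": {"args": args, "kwargs": kwargs},
        }
        token = _stack.set(parent + (func.__name__,))
        try:
            obs["output"] = func(*args, **kwargs)
            return obs["output"]
        finally:
            _stack.reset(token)
            recorded.append(obs)
    return wrapper


@toy_observe
def retrieve(query):
    return f"Documents for: {query}"


@toy_observe
def answer(question):
    return retrieve(question).upper()


answer("What is PagedAttention?")
for obs in recorded:
    print(obs["kind"], obs["name"], "parent:", obs["parent"])
```

A context variable is used for the stack so that the parent lookup stays correct per thread or async task; a real SDK additionally has to batch the records and send them to a server.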
Prompt Management With Versioning
Store, version, and A/B test prompts in the Langfuse UI:
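from langfuse import Langfuse
lf = Langfuse()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
prompt = lf.get_prompt("answer-question", version=3)  # pins version 3
compiled = prompt.compile(context="...", question="...")  # fills the template variables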
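If you fetch the prompt by label rather than by a pinned version (without a version argument, get_prompt returns the version labeled production), changing a prompt in production no longer requires a code deploy.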
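The difference between pinning a version and following a label can be sketched with an in-memory registry. Everything here is hypothetical (the `PROMPTS` and `LABELS` stores, the helper names, the template texts), and the sketch assumes double-brace `{{variable}}` placeholders; it only illustrates why moving a label changes behavior without a deploy.

```python
import re

# Hypothetical in-memory store: prompt name -> {version: template}.
PROMPTS = {
    "answer-question": {
        2: "Answer {{question}} using {{context}}.",
        3: "Use only this context: {{context}}\nQuestion: {{question}}",
    }
}
# Labels point at a version and can be moved without touching code.
LABELS = {"answer-question": {"production": 3}}


def get_template(name, version=None, label="production"):
    """Resolve a pinned version if given, otherwise follow the label."""
    if version is None:
        version = LABELS[name][label]
    return PROMPTS[name][version]


def compile_template(template, **variables):
    """Replace each {{variable}} placeholder with its value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables[m.group(1)]),
        template,
    )


before = compile_template(get_template("answer-question"), context="C", question="Q")

# Simulate someone moving the production label back to version 2 in a UI.
LABELS["answer-question"]["production"] = 2
after = compile_template(get_template("answer-question"), context="C", question="Q")

# A pinned version is unaffected by the label change.
pinned = compile_template(
    get_template("answer-question", version=3), context="C", question="Q"
)
```

Pinning is the safer choice for reproducible evaluations; following a label is what enables rollbacks and prompt changes without shipping code.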
Dataset Creation From Production Traces
Mark any trace as a dataset item directly from the UI. Build ground-truth datasets from real user interactions, then run batch evaluations to compare prompt versions or models.
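A batch evaluation over such a dataset boils down to running each candidate on every item and averaging a score. The sketch below uses made-up items, an exact-match metric, and two stand-in functions in place of real prompt versions; none of these names come from Langfuse.

```python
from typing import Callable

# Hypothetical dataset items: an input plus the expected (ground-truth) output.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the strings match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(app: Callable[[str], str], dataset: list[dict]) -> float:
    """Run `app` on every item and return the mean exact-match score."""
    scores = [exact_match(app(item["input"]), item["expected"]) for item in dataset]
    return sum(scores) / len(scores)


# Two stand-in "prompt versions"; a real app would call an LLM here.
def app_v1(question: str) -> str:
    return {"2 + 2": "4", "capital of France": "paris"}.get(question, "unknown")


def app_v2(question: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}
    return answers.get(question, "unknown")


results = {
    "v1": run_experiment(app_v1, DATASET),
    "v2": run_experiment(app_v2, DATASET),
}
print(results)
```

Exact match is only a placeholder metric; for free-form answers a similarity measure or an LLM judge is usually a better fit.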
LLM-as-Judge Scoring
from langfuse import Langfuse
lf = Langfuse()
lf.score(
    trace_id="trace-xyz",
    name="faithfulness",
    value=0.92,
    comment="Answer matches source documents",
)
Automate this with an evaluator function that runs after each generation.
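One possible shape for such an evaluator is sketched below. The judge is replaced by a crude word-overlap heuristic so the example runs offline; a real setup would call an LLM there. The function names and the `record_score` parameter are hypothetical, and the score sink is injected so the same function can write to a list in tests or to a scoring client in production.

```python
def faithfulness(answer: str, docs: str) -> float:
    """Crude stand-in for an LLM judge: share of answer words found in docs."""
    answer_words = [w.lower().strip(".,!?") for w in answer.split()]
    doc_words = {w.lower().strip(".,!?") for w in docs.split()}
    if not answer_words:
        return 0.0
    return sum(w in doc_words for w in answer_words) / len(answer_words)


def evaluate_generation(trace_id, answer, docs, record_score):
    """Score one generation and hand the result to `record_score`."""
    value = faithfulness(answer, docs)
    record_score(
        trace_id=trace_id,
        name="faithfulness",
        value=value,
        comment="word-overlap heuristic",
    )
    return value


# In-memory sink for the demo. With the client from the snippet above,
# `lf.score` could be passed instead, since it takes the same keywords.
recorded_scores = []
value = evaluate_generation(
    "trace-xyz",
    answer="PagedAttention manages the cache in pages",
    docs="PagedAttention manages the KV cache in fixed-size pages.",
    record_score=lambda **kw: recorded_scores.append(kw),
)
print(value, recorded_scores)
```

Running the evaluator outside the request path (for example from a queue or a scheduled job) keeps judge latency out of the user-facing response time.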
Self-Hosting With Docker
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
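The stack includes PostgreSQL, Redis, and the Langfuse web server; recent versions of the compose file also start ClickHouse, an S3-compatible blob store (MinIO), and a worker container. Full instructions are in the self-host guide.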
User-Level Cost Attribution
Pass user_id from inside an @observe()-decorated function to attribute token costs to individual users, which is essential for multi-tenant SaaS billing:
langfuse_context.update_current_trace(user_id="user-456", session_id="session-789")
The dashboard then shows cost-per-user histograms and per-session token usage.
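The aggregation behind such a view is a group-by over generations. The sketch below is illustrative only: the price table and generation records are hypothetical, and real prices vary by model and over time.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; check current pricing before use.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

# Hypothetical generation records, each tagged with a user and a session.
GENERATIONS = [
    {"user_id": "user-456", "session_id": "session-789", "model": "gpt-4o-mini",
     "input_tokens": 1200, "output_tokens": 300},
    {"user_id": "user-456", "session_id": "session-790", "model": "gpt-4o-mini",
     "input_tokens": 800, "output_tokens": 200},
    {"user_id": "user-123", "session_id": "session-111", "model": "gpt-4o-mini",
     "input_tokens": 2000, "output_tokens": 500},
]


def generation_cost(gen: dict) -> float:
    """Cost of one generation in USD, from token counts and the price table."""
    price = PRICES[gen["model"]]
    return (
        gen["input_tokens"] * price["input"]
        + gen["output_tokens"] * price["output"]
    ) / 1_000_000


def cost_by(key: str, generations: list[dict]) -> dict[str, float]:
    """Sum generation costs grouped by `key` (e.g. user_id or session_id)."""
    totals = defaultdict(float)
    for gen in generations:
        totals[gen[key]] += generation_cost(gen)
    return dict(totals)


per_user = cost_by("user_id", GENERATIONS)
per_session = cost_by("session_id", GENERATIONS)
print(per_user)
print(per_session)
```

Because the grouping key is just a field on each record, the same function covers per-user and per-session views, which is why setting both identifiers on the trace is worthwhile.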