Building AI agents can feel like a bit of a black box – you feed in a prompt and get an answer, but what happens in between? In this post, we’ll show how to give your AI agents “x-ray vision” using MLflow to trace their execution and evaluate their performance. We’ll be using LangGraph, a framework for constructing AI agent workflows, as our example. Whether you’re just starting out or already deploying complex agents, this guide will help you instrument, track, and assess your agents in a fun and effective way.
LangGraph is a library for building stateful, multi-actor LLM applications – essentially, it lets you design an agent’s logic as a graph of nodes (actions) and edges (transitions). Unlike simpler pipelines, LangGraph supports cycles and branching (your agent can loop or revisit steps), persistence (state is saved across steps), and even human-in-the-loop interventions.
This makes it ideal for creating more advanced AI agents that can remember context or handle multi-step tasks. MLflow, on the other hand, is an open-source platform to manage the ML lifecycle. Recent MLflow updates introduced MLflow Tracing, which provides end-to-end observability for generative AI applications. In plain terms, MLflow can automatically capture fine-grained traces of your agent’s execution – each step, input, output, and timing – with minimal code changes. These traces let you debug your agent’s logic step by step, verify the prompts and responses at each node, and spot latency bottlenecks.
In addition, MLflow offers an evaluation toolkit for ML models (including LLMs). This lets us benchmark our agent’s answers using both heuristic metrics (like ROUGE or BLEU for text overlap) and AI-based graders (where another model, such as GPT-4o, judges the answers’ quality).
In this article, we’ll walk through:
• Defining a simple LangGraph Q&A agent to have something to trace.
• Enabling MLflow auto-tracing locally and inspecting the trace UI.
• Logging traces to a Databricks-hosted MLflow tracking server.
• Evaluating the agent’s answers with heuristic and LLM-judged metrics.
• Logging the agent as an MLflow model for reproducible inference.
First, let’s define a simple LangGraph workflow to have something to trace. Imagine a Q&A chatbot that answers a user’s question using an LLM and then stores the interaction in history (a minimal conversational agent). We can define two LangGraph nodes: one to get the LLM’s answer, and one to update the chat history. Here’s a high-level view:
• Node 1: get_llm_response – Takes the user’s question (and any prior conversation history) and calls an LLM to generate an answer. We’re using Azure OpenAI in this example to get an answer to the question.
• Node 2: update_history – Appends the question and answer to the conversation history, so the agent remembers it going forward.
We connect Node 1 → Node 2 in the graph and set Node 1 as the start and Node 2 as the end of the workflow. After building this graph, we compile it into a workflow object that can be run. At this point, we essentially have an agent (the compiled graph) that, given a question, will produce an answer and update its state (see the sketch after the note below).
(LangGraph allows far more complex graphs – loops, multiple agents, conditional branches, etc. – but we’ll keep it simple for clarity.)
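Here’s a minimal sketch of what that two-node graph could look like in code. The state schema, node implementations, and the Azure OpenAI deployment name are illustrative assumptions, not the exact code from this post:

```python
from typing import List, TypedDict

from langchain_openai import AzureChatOpenAI
from langgraph.graph import StateGraph


class AgentState(TypedDict):
    question: str
    answer: str
    history: List[str]


# Deployment name and API version are placeholders; the endpoint and key are read
# from the AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY environment variables.
llm = AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-02-15-preview")


def get_llm_response(state: AgentState) -> dict:
    # Node 1: ask the LLM to answer the user's question.
    response = llm.invoke(state["question"])
    return {"answer": response.content}


def update_history(state: AgentState) -> dict:
    # Node 2: append the Q&A pair so the agent remembers it going forward.
    new_history = state["history"] + [f"Q: {state['question']}", f"A: {state['answer']}"]
    return {"history": new_history}


graph = StateGraph(AgentState)
graph.add_node("get_llm_response", get_llm_response)
graph.add_node("update_history", update_history)
graph.add_edge("get_llm_response", "update_history")
graph.set_entry_point("get_llm_response")
graph.set_finish_point("update_history")

workflow = graph.compile()
```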
With our agent ready, the next step is to instrument it with MLflow. Amazingly, MLflow provides auto logging for LangGraph out-of-the-box – all we need is one line to enable tracing! MLflow supports LangGraph tracing through mlflow.langchain.autolog() (MLflow v2.13+ with LangGraph versions 0.1.1 and above). This will automatically capture each node execution as a span in a trace, record input/output data, and track runtime metrics like latency for us. MLflow can also display traces locally, especially when you run your code in a notebook. First, install MLflow, then start the MLflow server locally in your terminal with the command mlflow server.
Let’s set up MLflow for local use and run our agent with tracing enabled:
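A minimal sketch follows; the tracking URI assumes the default local mlflow server address, and the experiment name is made up for illustration:

```python
import mlflow

# Point MLflow at the local tracking server started with `mlflow server`
# (its default address) and give the runs an experiment to live in.
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("langgraph-tracing-demo")

# One line: capture every LangGraph node execution as a span in a trace.
mlflow.langchain.autolog()

result = workflow.invoke({"question": "What is the capital of France?", "history": []})
print(result["answer"])
```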
When you run this code, the agent will query the LLM and return the answer. In our case, it should output something like:
'The capital of France is Paris.'
More interestingly, thanks to auto logging, MLflow is quietly recording a detailed trace of this execution.
How do we inspect the trace? If you’re running this in a Jupyter Notebook (with the MLflow server running), MLflow’s UI can render the trace inline. After the code executes, you might notice an interactive element in the output – typically a button that says “Collapse MLflow Trace” and an embedded view. This is the MLflow Trace UI, showing each step (span) of your LangGraph workflow.
MLflow’s inline Trace UI in a notebook, showing each task in the agent’s run. You can expand nodes to see sub-steps, inputs/outputs, and timings for each.
In the screenshot above, you can see how the agent’s trace might look: every step is listed in a tree on the left (with nested sub-steps), along with the duration of each, and on the right, you have a “single pane of glass” view of inputs, outputs, attributes, and any intermediate artifacts. For our simple two-node agent, the trace just shows two spans: one for get_llm_response and one for update_history, each with their input and output data (e.g. the question and the model’s answer).
Even with this trivial example, tracing is incredibly useful. You can verify that the LLM was called with the expected prompt and see the answer it returned, and you can confirm that the history was updated correctly. As you build more complex agents, having this step-by-step visibility is a lifesaver for debugging logic and ensuring the agent behaves as intended.
Pro Tip: If you’re not in a notebook, you can still view traces via the MLflow UI. Open the MLflow tracking server (at the URI used above) in a browser. In the run for your agent execution, there will be a Traces tab or artifact, where the trace data is shown as a timeline of spans, just like the inline view.
Now that we’ve seen local tracing, let’s take it to the cloud for a more collaborative setup.
Local runs are fine for early development, but often you’ll want to log runs to a central MLflow server so that your whole team (or future self) can examine them. Databricks comes with a hosted MLflow tracking server, and we can easily send our LangGraph agent’s traces there.
In fact, if you’re running your agent on Databricks notebooks, MLflow auto logging is likely enabled by default for LangGraph on recent runtime versions. But you can also use MLflow’s tracking API to log from anywhere to Databricks.
Setup: To connect, you’ll need your Databricks workspace URL and an access token. You can set these as environment variables and then call mlflow.set_tracking_uri("databricks").
Note: You should also create a location to store the experiments; for this demo, I created a folder called qa_test in my Databricks Shared folder.
Here’s how our code changes:
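Something along these lines; the host and token values are placeholders, while the /Shared/qa_test/mlflow_aicrew experiment path matches the folder described above:

```python
import os

import mlflow

# Placeholders: your workspace URL and a personal access token.
os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "<your-access-token>"

# Send runs and traces to the Databricks-hosted tracking server.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/qa_test/mlflow_aicrew")

mlflow.langchain.autolog()

result = workflow.invoke({"question": "What is the capital of France?", "history": []})
```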
The code above connects MLflow to your Databricks workspace and ensures an experiment named mlflow_aicrew (under the /Shared/qa_test folder) is available for logging runs. We then call mlflow.langchain.autolog() and execute the workflow in the same way as we did locally.
One nice addition is to explicitly start and name an MLflow run, which we can do with a context manager. For example:
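For example (the run name here is just an illustration):

```python
# Naming the run makes it easy to find later in the Experiments UI.
with mlflow.start_run(run_name="qa-agent-baseline"):
    result = workflow.invoke({"question": "What is the capital of France?", "history": []})
```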
When running on Databricks, you’ll see links in the notebook output that let you quickly open the run and the experiment in the UI.
This is MLflow politely telling us where to find our results. Clicking these links (or navigating in Databricks to the Experiments page) will show the logged run. You can explore parameters, artifacts, and – importantly – the trace. On Databricks, the trace UI is integrated as well: open the run, go to the Traces section, and you should see a visual trace view.
If you run your graph without mlflow.start_run, you’ll still find your trace under the Traces section in Databricks. It’s best to use run names to keep track of specific runs, since they all appear under Traces anyway.
Now your agent’s execution details are centrally tracked: you can compare runs, share them with colleagues, and keep a history of improvements. This centralisation is crucial as you iterate on the agent’s design.
Tracing is great for understanding how the agent works. Now let’s evaluate how well it works. Does our agent give correct answers? How accurate or fluent are its responses? These are questions we can answer with MLflow’s evaluation tools.
MLflow provides a function mlflow.evaluate() that can take a model (or in our case, an agent function) and a dataset of examples, and compute a suite of metrics. There are some built-in evaluators for common task types: for example, if you specify model_type="question-answering", MLflow will by default compute metrics like exact match (whether the answer exactly matches the ground truth), answer toxicity, and reading level metrics (ARI and Flesch-Kincaid grade levels). If you use model_type="text-summarization", it will compute ROUGE, etc. These heuristic metrics can give a rough sense of the quality and safety of the outputs.
However, generative AI often needs more nuanced evaluation. That’s where MLflow’s LLM-as-a-judge metrics come in. MLflow can leverage a powerful LLM (like GPT-4o) to grade the outputs. Metrics like answer_similarity and answer_correctness use an LLM to compare the model’s answer with the ground truth: How similar are they? Is the answer factually correct according to the ground truth? This approach (using another AI to judge the model’s answers) can capture correctness even if wording differs, and can penalise hallucinations or inaccuracies.
Let’s see this in action by evaluating our Q&A agent on a small test set of questions. Suppose we have a JSON file of test Q&As (each with a "question" key and an "answer" key for the expected answer). We’ll run the agent on each question and then use mlflow.evaluate to get metrics; a full sketch follows the setup notes below.
Note: You need to set up your OpenAI environment variables to use LLM-as-a-judge metrics. If using Azure OpenAI, set OPENAI_API_TYPE to azure like this:
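For example (only OPENAI_API_TYPE comes from the note above; the remaining variable names and values are typical Azure OpenAI settings and should be adapted to your own deployment):

```python
import os

os.environ["OPENAI_API_TYPE"] = "azure"
# Typical Azure OpenAI settings; replace the placeholders with your own values.
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-key>"
os.environ["OPENAI_API_BASE"] = "https://<your-resource>.openai.azure.com/"
os.environ["OPENAI_API_VERSION"] = "2024-02-15-preview"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "gpt-4o"
```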
You will also need to pip install some libraries depending on the metrics or model_type you are using. For the question-answering model type, this typically means evaluate, torch, and transformers (for the toxicity metric) and textstat (for the reading-level metrics).
Now that your environment is set up, let’s go ahead and evaluate the LangGraph model.
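Here is a sketch of what that evaluation could look like. The JSON file name, run name, and column names are assumptions; the model type and the LLM-judged metrics follow the description above:

```python
import json

import pandas as pd
import mlflow

# Hypothetical test set: a JSON list of {"question": ..., "answer": ...} objects.
with open("test_questions.json") as f:
    test_cases = json.load(f)

# Run the agent on every question and collect predictions next to the ground truth.
rows = []
for case in test_cases:
    output = workflow.invoke({"question": case["question"], "history": []})
    rows.append(
        {
            "inputs": case["question"],
            "ground_truth": case["answer"],
            "predictions": output["answer"],
        }
    )
eval_df = pd.DataFrame(rows)

with mlflow.start_run(run_name="qa-agent-evaluation"):
    results = mlflow.evaluate(
        data=eval_df,
        targets="ground_truth",
        predictions="predictions",
        model_type="question-answering",  # exact match, toxicity, reading-level metrics
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),   # LLM-judged similarity to ground truth
            mlflow.metrics.genai.answer_correctness(),  # LLM-judged factual correctness
        ],
    )
    print(results.metrics)
```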
A few things are happening here, so let’s break it down:
• We run the agent on each test question and collect its answers alongside the expected ones.
• mlflow.evaluate scores the predictions with the built-in question-answering metrics (exact match, toxicity, and reading-level scores).
• The extra LLM-judged metrics (answer_similarity and answer_correctness) ask a judge model to compare each answer against the ground truth.
• The metrics, along with a per-row results table, are logged to the MLflow run.
This process lets you see at a glance how the model is performing, and with the MLflow UI you can compare runs to monitor improvements.
Beyond tracing and evaluating individual runs, MLflow can also log your entire LangGraph model as an MLflow artifact that you can later load and use for inference. This means that your model – with all its stateful logic and complex graph structure – becomes a first-class MLflow model. You can deploy it, version it, and run predictions using the familiar predict method.
Here’s how you can do it:
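One way to do this is MLflow’s models-from-code approach: the script that builds and compiles the graph (assumed here to be called agent.py) registers the compiled graph with mlflow.models.set_model(), and that script itself is logged with the langchain flavor. This is a sketch under those assumptions, not the post’s exact code:

```python
# agent.py (the graph-building script shown earlier) would end with:
#
#   import mlflow
#   mlflow.models.set_model(workflow)
#
# We then log that script itself as the model.
import mlflow

with mlflow.start_run(run_name="log-langgraph-agent"):
    model_info = mlflow.langchain.log_model(
        lc_model="agent.py",            # path to the graph-building script
        artifact_path="langgraph_agent",
    )
```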
Once logged, the model can be loaded later for inference:
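For instance (the exact input format depends on your graph’s state schema; here we reuse the same dictionary we passed to invoke earlier):

```python
# Load the logged model back as a generic pyfunc and run a prediction.
loaded_agent = mlflow.pyfunc.load_model(model_info.model_uri)

prediction = loaded_agent.predict(
    {"question": "What is the capital of France?", "history": []}
)
print(prediction)
```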
In this snippet, the predict method is used to run the model on new input data. This enables you to integrate the model into production applications or further testing pipelines seamlessly. The process of logging, loading, and inferencing ensures that your model’s behaviour is reproducible and version-controlled.
In this journey, we saw how just a few lines of MLflow integration can dramatically enhance the development of AI agents built with LangGraph:
• One line of autologging gave us a full trace of every node’s inputs, outputs, and latency.
• Logging to a Databricks-hosted tracking server centralised those traces for the whole team.
• mlflow.evaluate quantified answer quality with both heuristic and LLM-judged metrics.
• Logging the compiled graph as an MLflow model made inference reproducible via the familiar predict method.
This approach appeals to a wide range of AI developers. Beginners get an easy-to-use framework for understanding their agent’s behaviour. Intermediate developers gain robust tools to measure and tune performance. Advanced users can integrate these tools into a full MLOps pipeline, ensuring their AI agents are reliable and continually improving even as they tackle complex, dynamic tasks.
In an era where AI agents are becoming increasingly complex (and sometimes unpredictable), having this level of observability, evaluation, and inferencing capability is like having a trusty compass and map. It turns developing AI agents from a game of chance into a systematic, enjoyable process – you can see what’s happening under the hood and steer accordingly.
So go ahead and give it a try! Instrument your LangGraph (or LangChain, etc.) agents with MLflow, trace a run and inspect the timeline, evaluate performance with quantitative metrics, and log your entire model for reproducible inferencing. With these insights, you’ll be well on your way to building agents that are not only creative but also consistent, correct, and transparent in how they operate.
Need help implementing MLflow tracing for your AI agents, or want to discuss your specific use case? We'd love to hear from you - reach out to our team and let's chat!
Happy tracing and testing!