How to run an evaluation

In this guide we'll go over how to evaluate an application using the evaluate() method in the LangSmith SDK.

Tip: For larger evaluation jobs in Python, we recommend aevaluate(), the asynchronous version of evaluate(). The two have nearly identical interfaces, so read this guide first and then the how-to guide on running an evaluation asynchronously.

Define an application

First we need an application to evaluate. Let's create a simple toxicity classifier for this example.

from langsmith import traceable, wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
def toxicity_classifier(inputs: dict) -> str:
    instructions = (
        "Please review the user query below and determine if it contains any form of toxic behavior, "
        "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
        "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return result.choices[0].message.content

We've optionally enabled tracing to capture the inputs and outputs of each step in the pipeline. To understand how to annotate your code for tracing, please refer to this guide.

Create or select a dataset

We need a Dataset to evaluate our application on. Our dataset will contain labeled examples of toxic and non-toxic text.

from langsmith import Client

ls_client = Client()

labeled_texts = [
    ("Shut up, idiot", "Toxic"),
    ("You're a wonderful person", "Not toxic"),
    ("This is the worst thing ever", "Toxic"),
    ("I had a great day today", "Not toxic"),
    ("Nobody likes you", "Toxic"),
    ("This is unacceptable. I want to speak to the manager.", "Not toxic"),
]

dataset_name = "Toxic Queries"
dataset = ls_client.create_dataset(dataset_name=dataset_name)
# Split the (text, label) pairs into parallel sequences of input and output dicts.
inputs, outputs = zip(
    *[({"text": text}, {"label": label}) for text, label in labeled_texts]
)
ls_client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
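
The zip(*...) call above transposes a list of (inputs, outputs) pairs into two parallel tuples, which is the shape passed to create_examples(). A minimal local sketch, using two of the labeled texts from above and making no LangSmith calls, shows the resulting structure:

```python
# Two of the labeled examples from the guide, as (text, label) pairs.
labeled_texts = [
    ("Shut up, idiot", "Toxic"),
    ("You're a wonderful person", "Not toxic"),
]

# Build one (inputs_dict, outputs_dict) pair per example, then transpose
# the list of pairs into two parallel tuples with zip(*...).
pairs = [({"text": text}, {"label": label}) for text, label in labeled_texts]
inputs, outputs = zip(*pairs)

print(inputs)   # tuple of {"text": ...} dicts, one per example
print(outputs)  # tuple of {"label": ...} dicts, in the same order
```

Because the two tuples stay aligned by position, the i-th input is stored together with the i-th reference output.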

See here for more on dataset management.

Define an evaluator

Evaluators are functions for scoring your application's outputs. They take in the example inputs, actual outputs, and, when present, the reference outputs. Since we have labels for this task, our evaluator can directly check if the actual outputs match the reference outputs.

Requires langsmith>=0.1.145

def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["output"] == reference_outputs["label"]
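
Because an evaluator is a plain function, it can be checked locally before running a full experiment. The sketch below calls correct() on hand-written dictionaries; it assumes, as the evaluator above does, that the classifier's string return value is exposed under the "output" key. The correct_lenient() variant and its _normalize() helper are hypothetical additions, not part of the LangSmith SDK, that ignore case, surrounding whitespace, and a trailing period in case the model's reply is not formatted exactly like the label.

```python
def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # Exact match between the predicted label and the reference label.
    return outputs["output"] == reference_outputs["label"]


def _normalize(label: str) -> str:
    # Hypothetical helper: trim whitespace and a trailing period, then lowercase.
    return label.strip().rstrip(".").strip().lower()


def correct_lenient(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # Hypothetical variant of `correct` that tolerates formatting differences.
    return _normalize(outputs["output"]) == _normalize(reference_outputs["label"])


sample_inputs = {"text": "Shut up, idiot"}
reference = {"label": "Toxic"}

print(correct(sample_inputs, {"output": "Toxic"}, reference))           # True
print(correct(sample_inputs, {"output": "toxic."}, reference))          # False
print(correct_lenient(sample_inputs, {"output": "toxic."}, reference))  # True
```

Whether the strict or the lenient comparison is appropriate depends on how tightly the prompt constrains the model's reply.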

See here for more on how to define evaluators.

Run the evaluation

We'll use the evaluate() / aevaluate() methods to run the evaluation.

The key arguments are:

  • target - a function that takes an input dictionary and returns an output dictionary or object (passed as the first positional argument)
  • data - the name OR UUID of the LangSmith dataset to evaluate on, or an iterator of examples
  • evaluators - a list of evaluators to score the outputs of the function

from langsmith import evaluate

results = evaluate(
    toxicity_classifier,
    data=dataset_name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, baseline",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
)

See here for other ways to kick off evaluations and here for how to configure evaluation jobs.
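
As a rough mental model, and not the SDK's actual implementation, an evaluation run applies the target function to each example's inputs and then passes the inputs, outputs, and reference outputs to every evaluator. The sketch below reproduces that loop locally. run_local_evaluation() and keyword_classifier() are hypothetical names, the keyword rule stands in for the OpenAI-backed classifier so the code runs offline, and wrapping a non-dictionary return value under an "output" key is an assumption inferred from the evaluator defined earlier.

```python
from typing import Callable


def keyword_classifier(inputs: dict) -> str:
    # Hypothetical offline stand-in for the model-backed toxicity classifier.
    toxic_phrases = ("idiot", "nobody likes you")
    text = inputs["text"].lower()
    return "Toxic" if any(phrase in text for phrase in toxic_phrases) else "Not toxic"


def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["output"] == reference_outputs["label"]


def run_local_evaluation(
    target: Callable[[dict], object],
    examples: list[dict],
    evaluators: list[Callable[..., bool]],
) -> list[dict]:
    rows = []
    for example in examples:
        raw = target(example["inputs"])
        # Assumption: non-dict return values are wrapped under "output".
        outputs = raw if isinstance(raw, dict) else {"output": raw}
        # Score this example with every evaluator, keyed by function name.
        scores = {
            evaluator.__name__: evaluator(
                example["inputs"], outputs, example["outputs"]
            )
            for evaluator in evaluators
        }
        rows.append({"inputs": example["inputs"], "outputs": outputs, "scores": scores})
    return rows


examples = [
    {"inputs": {"text": "Shut up, idiot"}, "outputs": {"label": "Toxic"}},
    {"inputs": {"text": "I had a great day today"}, "outputs": {"label": "Not toxic"}},
    {"inputs": {"text": "This is the worst thing ever"}, "outputs": {"label": "Toxic"}},
]

rows = run_local_evaluation(keyword_classifier, examples, [correct])
accuracy = sum(row["scores"]["correct"] for row in rows) / len(rows)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.67
```

The stand-in misses the third example, which is exactly the kind of per-row result an evaluator score is meant to surface.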

Explore the results

Each invocation of evaluate() creates an Experiment, which can be viewed in the LangSmith UI or queried via the SDK. Evaluation scores are stored as feedback against each actual output.

If you've annotated your code for tracing, you can open the trace of each row in a side panel view.

Reference code

Consolidated code snippet:

Requires langsmith>=0.1.145

from langsmith import Client, evaluate, traceable, wrappers
from openai import OpenAI

# Step 1. Define an application
oai_client = wrappers.wrap_openai(OpenAI())

@traceable
def toxicity_classifier(inputs: dict) -> str:
    system = (
        "Please review the user query below and determine if it contains any form of toxic behavior, "
        "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
        "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return result.choices[0].message.content

# Step 2. Create a dataset
ls_client = Client()

labeled_texts = [
    ("Shut up, idiot", "Toxic"),
    ("You're a wonderful person", "Not toxic"),
    ("This is the worst thing ever", "Toxic"),
    ("I had a great day today", "Not toxic"),
    ("Nobody likes you", "Toxic"),
    ("This is unacceptable. I want to speak to the manager.", "Not toxic"),
]

dataset_name = "Toxic Queries"
dataset = ls_client.create_dataset(dataset_name=dataset_name)
inputs, outputs = zip(
    *[({"text": text}, {"label": label}) for text, label in labeled_texts]
)
ls_client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)

# Step 3. Define an evaluator
def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["output"] == reference_outputs["label"]

# Step 4. Run the evaluation
results = evaluate(
    toxicity_classifier,
    data=dataset_name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, baseline",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
)
