How to Use Promptfoo for LLM Testing
"Untested software is broken software."
As developers who ship code to production, we take this principle to heart, and it applies just as much when working with large language models (LLMs). Building robust applications requires a way to evaluate LLM outputs systematically. Trial and error is inefficient and often produces subpar results.
Enter Promptfoo, a CLI and library that brings a test-driven approach to LLM development. In this tutorial, I'll explore Promptfoo and show how to test JSON model responses, model costs, and adherence to instructions by walking you through a sample project focused on creative storytelling.
All the code is available in the companion GitHub repository for this blog post.
What is Promptfoo?
Promptfoo is a tool for systematically evaluating the quality of LLM outputs. It lets developers test prompts, models, and Retrieval-Augmented Generation (RAG) setups against predefined test cases to find the best-performing combination for a given application. With Promptfoo, developers can:
Perform side-by-side comparisons of LLM outputs to detect quality variances and regressions.
Utilize caching and concurrent testing to expedite evaluations.
Automatically score outputs based on predefined expectations.
Integrate Promptfoo into existing workflows either as a CLI or a library.
Work with a wide range of LLM APIs, including OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, and even custom API providers.
The philosophy behind Promptfoo is simple: embrace test-driven development for LLM applications to move beyond the inefficiencies of trial and error. This approach saves time and helps ensure that your applications meet the desired quality standards before deployment.
Demo Project: Creative Storytelling with Promptfoo
To illustrate the capabilities of Promptfoo, let's go over our demo project centered on creative storytelling. This project uses a configuration file (promptfooconfig.yaml) that defines the evaluation setup for generating diary entries set in various contexts, such as a mysterious island, a futuristic city, and an ancient Egyptian civilization.
Project Setup
Writing the Prompt
The core of our evaluation is the prompt defined in prompt1.txt, which instructs the LLM to generate a diary entry from someone living in a specified context (e.g., a mysterious island). The output must be a JSON object containing metadata (person's name, location, date) and the diary entry itself. Here's the entire prompt1.txt for our project:
Write a diary entry from someone living in {{topic}}.
Return a JSON object with metadata and the diary entry.
The metadata should include the person's name, location, and the date.
The date should be the current date.
The diary entry key should be named "diary_entry" and as a raw string.
An example of the expected output is:
{
  "metadata": {
    "name": "John Doe",
    "location": "New York",
    "date": "2020-01-01"
  },
  "diary_entry": "Today was a good day."
}
It's a pretty simple prompt asking the LLM for JSON output. Promptfoo uses Nunjucks templates (the {{topic}} placeholder in prompt1.txt) to pull in variables defined in promptfooconfig.yaml.
More information can be found on Promptfoo's Input and output files doc.
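To make the variable substitution concrete, here is a minimal Python sketch of what happens to the {{topic}} placeholder. It is a simplified stand-in that I wrote for illustration, not Nunjucks itself: the render function is hypothetical and only handles plain {{ name }} placeholders, with none of the filters or control flow that real Nunjucks templates support.

```python
import re

# First line of prompt1.txt, used here as the template.
PROMPT_TEMPLATE = "Write a diary entry from someone living in {{topic}}."


def render(template: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with its value from `variables`.

    Hypothetical helper: a tiny subset of what a real template engine does.
    """

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value provided for template variable {name!r}")
        return str(variables[name])

    # Matches {{topic}} as well as {{ topic }}.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


prompt = render(PROMPT_TEMPLATE, {"topic": "a mysterious island"})
print(prompt)
# prints: Write a diary entry from someone living in a mysterious island.
```

Each test case supplies its own value for topic, so the same prompt file produces a different prompt per test.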
The promptfooconfig.yaml
The promptfooconfig.yaml file outlines the structure of our evaluation. It includes a description of the project, specifies the prompts, lists the LLM providers (with their configurations), and defines the tests with associated assertions to evaluate the output quality based on cost, content relevance, and specific JSON structure requirements. The example promptfooconfig.yaml isn't too long, so here is the whole file:
description: "Creative Storytelling"
prompts: [prompt1.txt]
providers:
  - id: "mistral:mistral-medium"
    config:
      temperature: 0
      max_tokens: 1000
      safe_prompt: true
  - id: "openai:gpt-3.5-turbo-0613"
    config:
      temperature: 0
      max_tokens: 1000
  - id: "openai:gpt-4-0125-preview"
    config:
      temperature: 0
      max_tokens: 1000
tests:
  - vars:
      topic: "a mysterious island"
    assert:
      - type: cost
        threshold: 0.002
      - type: "contains-json"
        value:
          {
            "required": ["metadata", "diary_entry"],
            "type": "object",
            "properties":
              {
                "metadata":
                  {
                    "type": "object",
                    "required": ["name", "location", "date"],
                    "properties":
                      {
                        "name": { "type": "string" },
                        "location": { "type": "string" },
                        "date": { "type": "string", "format": "date" },
                      },
                  },
                "diary_entry": { "type": "string" },
              },
          }
  - vars:
      topic: "a futuristic city"
    assert:
      - type: answer-relevance
        value: "Ensure that the output contains content about a futuristic city"
      - type: "llm-rubric"
        value: "ensure that the output showcases innovation and detailed world-building"
  - vars:
      topic: "an ancient Egyptian civilization"
    assert:
      - type: "model-graded-closedqa"
        value: "References Egypt in some way"
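It helps to picture how this configuration fans out into individual runs. As I understand it, every prompt is run against every provider for every test case, so one prompt, three providers, and three topics give nine outputs to grade. The short Python sketch below only enumerates those combinations using the names from the config above; it does not call Promptfoo or any model.

```python
from itertools import product

# Values copied from promptfooconfig.yaml.
prompts = ["prompt1.txt"]
providers = [
    "mistral:mistral-medium",
    "openai:gpt-3.5-turbo-0613",
    "openai:gpt-4-0125-preview",
]
topics = [
    "a mysterious island",
    "a futuristic city",
    "an ancient Egyptian civilization",
]

# One evaluation per (prompt, provider, test case) combination.
matrix = list(product(prompts, providers, topics))

for prompt_file, provider, topic in matrix:
    print(f"{provider:28} | {prompt_file} | {topic}")

print(len(matrix), "evaluations")
# prints: 9 evaluations
```

Keep this multiplication in mind when adding providers or test cases, since each new entry multiplies the number of model calls.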
The Assertions Explained
Promptfoo offers a suite of assertions to evaluate LLM outputs against predefined conditions or expectations. These assertions fall into two categories: deterministic eval metrics (here, cost and contains-json) and model-assisted eval metrics (here, answer-relevance, llm-rubric, and model-graded-closedqa). Here's a closer look at each assertion used in the preceding example promptfooconfig.yaml for our creative storytelling project.
Cost Assertion
The cost assertion verifies that the inference cost of generating an output is below a predefined threshold. It's useful for keeping API spend under control, especially when scaling LLM applications. In our example, the assertion ensures that generating a diary entry for "a mysterious island" remains cost-effective, with a threshold set at 0.002 (Promptfoo reports cost in US dollars).
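To get a feel for what a 0.002 threshold means, here is a small Python sketch of the arithmetic behind a cost check. The function names are hypothetical and the per-1K-token prices are made-up numbers chosen for illustration, not the real pricing of any provider. Treating the threshold as inclusive is also my assumption.

```python
def inference_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one call: tokens are billed per 1,000, input and output separately."""
    return (
        prompt_tokens / 1000 * input_price_per_1k
        + completion_tokens / 1000 * output_price_per_1k
    )


def passes_cost_assertion(cost: float, threshold: float = 0.002) -> bool:
    """Mirror of the idea behind `type: cost`; inclusive comparison is an assumption."""
    return cost <= threshold


# Hypothetical prices for a cheaper and a pricier model.
cheap = inference_cost(120, 400, input_price_per_1k=0.0005, output_price_per_1k=0.0015)
pricey = inference_cost(120, 400, input_price_per_1k=0.01, output_price_per_1k=0.03)

print(f"cheap model:  ${cheap:.5f} -> pass={passes_cost_assertion(cheap)}")
print(f"pricey model: ${pricey:.5f} -> pass={passes_cost_assertion(pricey)}")
```

With these illustrative numbers, the same 520-token exchange passes on the cheaper model and fails on the pricier one, which is exactly the kind of difference a cost assertion is meant to surface.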
Contains-JSON Assertion
This assertion (contains-json) checks whether the output contains valid JSON that matches a specific schema. It's particularly useful for structured data outputs, ensuring they adhere to the expected format. In the creative storytelling example, this assertion validates the JSON structure of the diary entry, including required fields like metadata (with subfields name, location, and date) and diary_entry.
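The idea behind this check can be sketched in plain Python. The code below is my own rough approximation using only the standard library, not Promptfoo's implementation: it scans the text for embedded JSON objects and hand-checks the fields from the schema above instead of running a real JSON Schema validator. All function names are hypothetical.

```python
import json
from datetime import date


def extract_json_objects(text: str):
    """Yield every top-level JSON object embedded anywhere in `text`."""
    decoder = json.JSONDecoder()
    index = text.find("{")
    while index != -1:
        try:
            value, end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            index = text.find("{", index + 1)
            continue
        if isinstance(value, dict):
            yield value
        index = text.find("{", end)


def is_valid_diary(obj: dict) -> bool:
    """Hand-written equivalent of the schema: required keys and string types."""
    metadata = obj.get("metadata")
    if not isinstance(metadata, dict) or not isinstance(obj.get("diary_entry"), str):
        return False
    if not all(isinstance(metadata.get(key), str) for key in ("name", "location", "date")):
        return False
    try:
        date.fromisoformat(metadata["date"])  # stands in for "format": "date"
    except ValueError:
        return False
    return True


def contains_valid_diary(text: str) -> bool:
    return any(is_valid_diary(obj) for obj in extract_json_objects(text))


reply = (
    "Sure! Here is the entry:\n"
    '{"metadata": {"name": "John Doe", "location": "New York", "date": "2020-01-01"}, '
    '"diary_entry": "Today was a good day."}\n'
    "Hope you like it."
)
print(contains_valid_diary(reply))
# prints: True
```

Note that the sample reply wraps the JSON in extra chatter and still passes. That tolerance is what separates a "contains" style check from one that requires the whole output to be JSON.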
Answer-Relevance Assertion
The answer-relevance assertion evaluates whether the LLM output is relevant to the original query or topic. This ensures that the model's responses are on-topic and meet the user's intent. For the futuristic city prompt, this assertion confirms that the content indeed revolves around a futuristic city, aligning with the user's request for thematic accuracy.
LLM-Rubric Assertion
An llm-rubric assertion uses a language model to grade the output against a specific rubric. This method is effective for qualitative assessments of outputs, such as creativity, detail, or adherence to a theme. For our futuristic city scenario, this assertion evaluates whether the output demonstrates innovation and detailed world-building, as expected for a narrative set in a futuristic environment.
Model-Graded-ClosedQA Assertion
The model-graded-closedqa assertion uses closed QA grading (based on OpenAI Evals) to check that the output adheres to specific criteria. It's useful for checking factual correctness and thematic relevance. In the case of "an ancient Egyptian civilization," this assertion verifies that the output references Egypt in some manner.
Running the Evaluation
With Promptfoo, executing this evaluation is straightforward. Developers can run tests from the command line, allowing Promptfoo to compare outputs from different LLMs based on the specified criteria. This process helps identify which LLM performs best for creative storytelling within the defined parameters. I've provided a simple test script (leveraging npx) in the project's package.json, which you can run from the root of the repository as follows:
npm run test
Analyzing the Results
Promptfoo produces matrix views that enable quick evaluation of outputs across multiple prompts and inputs in the terminal, as well as a web UI for more in-depth exploration of the test results. These features are invaluable for spotting trends, understanding model strengths and weaknesses, and making informed decisions about which LLM to use for your specific application.
For more information on viewing Promptfoo's test results, check out Promptfoo's Usage docs.
Why Choose Promptfoo?
Promptfoo stands out for several reasons:
Battle-tested: Designed for LLM applications serving millions of users, Promptfoo is both robust and adaptable.
Simple and Declarative: Define evaluations without extensive coding or the use of cumbersome notebooks.
Language Agnostic: Work in Python, JavaScript, or your preferred language.
Collaboration-Friendly: Share evaluations and collaborate with teammates effortlessly.
Open-Source and Private: Promptfoo is fully open-source and runs locally, ensuring your evaluations remain private.
Conclusion
Promptfoo may very well become the Jest of LLM application testing.
By integrating Promptfoo into your development workflow (and CI/CD process), you can significantly enhance the efficiency, quality, and reliability of your LLM applications.
Whether you're developing creative storytelling applications or any other LLM-powered project, Promptfoo offers the features and flexibility needed to add confidence to your LLM integrations through a robust set of testing utilities.