DeepEval is a package for unit testing LLMs. With Confident, everyone can build robust language models through faster iterations, using both unit testing and integration testing. We provide support for each step of the iteration, from synthetic data creation to testing.

In this guide we demonstrate how to test and measure LLM performance. We show how to use our callback to measure performance and how to define your own metrics and log them to our dashboard. DeepEval also offers:
- Synthetic data generation
- Performance measurement
- A dashboard to monitor and review results over time
Installation and Setup
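Install the DeepEval package from PyPI (a sketch; pin a version if you need reproducible builds):

```bash
pip install -U deepeval
```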
Getting API Credentials
To get the DeepEval API credentials, follow these steps:
- Go to app.confident-ai.com
- Click on “Organization”
- Copy the API key.
You will also be asked to set an implementation name. The implementation name is required to describe the type of implementation. Think of it as what you want to call your project; we recommend making it descriptive.
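With the API key in hand, you can authenticate the deepeval CLI. This is a sketch assuming the deepeval login command, which prompts for your credentials; the exact prompts and flags may vary by version:

```bash
deepeval login
```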
Setup DeepEval
You can, by default, use the DeepEvalCallbackHandler to set up the metrics you want to track. However, metric support is limited at the moment (more metrics will be added soon).
Get Started
To use the DeepEvalCallbackHandler, we need the implementation_name.
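A minimal sketch of wiring up the callback is shown below. The import paths, the AnswerRelevancy metric, and the minimum_score parameter are assumptions that may differ across langchain/deepeval versions, so check the documentation for your installed versions.

```python
# Sketch: create a metric and a DeepEvalCallbackHandler that tracks it.
# Import paths and the metric's constructor arguments are assumptions
# and may differ between deepeval/langchain versions.
from deepeval.metrics.answer_relevancy import AnswerRelevancy
from langchain.callbacks.confident_callback import DeepEvalCallbackHandler

# The metric(s) the callback should track for every LLM call.
answer_relevancy_metric = AnswerRelevancy(minimum_score=0.5)

# implementation_name identifies this project in the Confident dashboard.
deepeval_callback = DeepEvalCallbackHandler(
    implementation_name="langchainQuickstart",
    metrics=[answer_relevancy_metric],
)
```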
Scenario 1: Feeding into LLM
You can then feed the callback into your LLM (here, OpenAI) and, once the LLM has generated output, check whether the metric passed by calling its is_successful() method.
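A sketch, assuming the OpenAI integration from langchain_openai (older versions import OpenAI from langchain.llms) and the answer_relevancy_metric defined above; the prompt is illustrative only:

```python
# Sketch: pass the callback when constructing the LLM so every
# generation is scored by the tracked metric(s).
from langchain_openai import OpenAI  # assumed import path; may vary by version

llm = OpenAI(
    temperature=0,
    callbacks=[deepeval_callback],
    verbose=True,
)

output = llm.generate(["What is the capital of France?"])  # illustrative prompt

# Check whether the tracked metric passed for this generation.
print(answer_relevancy_metric.is_successful())
```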
Scenario 2: Tracking an LLM in a chain without callbacks
To track an LLM in a chain without callbacks, you can plug in at the end. We can start by defining a simple chain, as shown below.
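A sketch under the assumption that the chain is a simple prompt-to-LLM pipeline and that the metric exposes measure() and is_successful() as in the earlier example; the exact metric API depends on your deepeval version:

```python
# Sketch: run a simple chain, then score its output with the metric
# afterwards ("plugging in at the end") instead of using a callback.
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI  # assumed import paths; may vary by version

prompt = PromptTemplate.from_template("Answer the question: {question}")
llm = OpenAI(temperature=0)
chain = prompt | llm  # a simple prompt -> LLM chain

query = "What is the capital of France?"  # illustrative query
result = chain.invoke({"question": query})

# Score the chain's output directly; measure()'s signature is assumed here.
answer_relevancy_metric.measure(result, query)
print(answer_relevancy_metric.is_successful())
```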
What’s next?
You can create your own custom metrics here. DeepEval also offers other features, such as automatically creating unit tests and tests for hallucination. If you are interested, check out our GitHub repository: https://github.com/confident-ai/deepeval. We welcome any PRs and discussions on how to improve LLM performance.