The pytest integration requires `langsmith>=0.3.4`. For extra features like rich terminal outputs and test caching, install:
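```bash
pip install -U "langsmith[pytest]"
```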
Define your tests as you normally would with pytest, and add the `@pytest.mark.langsmith` decorator. Every decorated test case will be synced to a dataset example. When you run the test suite, the dataset will be updated and a new experiment will be created with one result for each test case. Each test gets a `pass` boolean feedback key based on the test case passing / failing. It will also track any inputs, outputs, and reference (expected) outputs that you log.
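For example, a decorated test might look something like the following sketch, which assumes a hypothetical `generate_sql` function in a `my_app` module:

```python
import pytest

from my_app import generate_sql  # hypothetical function under test


@pytest.mark.langsmith  # sync this test case to LangSmith
def test_sql_generation_selects_customers() -> None:
    user_query = "Get all users from the customers table"
    sql = generate_sql(user_query)
    assert "customers" in sql
```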
Use `pytest` as you normally would to run the tests:
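```bash
pytest tests  # assumes your tests live under tests/
```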
The results will be synced to LangSmith, recording a `pass` feedback key for each test case.

You can log inputs, outputs, and reference (expected) outputs for each example with the `log_inputs`, `log_outputs`, and `log_reference_outputs` methods. You can run these any time in a test to update the example and run for that test:
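A minimal sketch using all three methods:

```python
import pytest
from langsmith import testing as t


@pytest.mark.langsmith
def test_foo() -> None:
    # Attach inputs, reference (expected) outputs, and actual outputs
    # to the example and run for this test case.
    t.log_inputs({"a": 1, "b": 2})
    t.log_reference_outputs({"foo": "bar"})
    t.log_outputs({"foo": "baz"})
```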
Running the above test will create an example with inputs `{"a": 1, "b": 2}`, reference outputs `{"foo": "bar"}`, and trace a run with outputs `{"foo": "baz"}`.
NOTE: If you run `log_inputs`, `log_outputs`, or `log_reference_outputs` twice, the previous values will be overwritten.
Another way to define example inputs and reference outputs is via pytest fixtures/parametrizations. By default any arguments to your test function will be logged as inputs on the corresponding example. If certain arguments are meant to represent reference outputs, you can specify that they should be logged as such using `@pytest.mark.langsmith(output_keys=["name_of_ref_output_arg"])`:
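Here is a sketch using fixtures, where `c` is logged as an input and `d` as a reference output:

```python
import pytest
from langsmith import testing as t


@pytest.fixture
def c() -> int:
    return 5


@pytest.fixture
def d() -> int:
    return 6


@pytest.mark.langsmith(output_keys=["d"])  # log fixture `d` as a reference output
def test_cd(c: int, d: int) -> None:
    result = 2 * c  # stand-in for whatever your system actually produces
    t.log_outputs({"d": result})
```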
This test will create an example with inputs `{"c": 5}` and reference outputs `{"d": 6}`, and a run with output `{"d": 10}`.
By default the only feedback collected is the `pass` feedback key for each test case. You can add additional feedback with `log_feedback`.
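Here is a sketch of a test that grades an output with an LLM-as-judge; `my_chatbot` is a hypothetical function under test and the judge model and prompt are only illustrative:

```python
import openai
import pytest
from langsmith import testing as t
from langsmith import wrappers

from my_app import my_chatbot  # hypothetical function under test

# Wrap the client so the judge call is traced by LangSmith.
judge_client = wrappers.wrap_openai(openai.OpenAI())


@pytest.mark.langsmith
def test_chatbot_stays_on_topic() -> None:
    user_query = "What's the weather like today?"
    answer = my_chatbot(user_query)
    t.log_inputs({"user_query": user_query})
    t.log_outputs({"answer": answer})

    # Trace the LLM-as-judge call separately from the application logic.
    with t.trace_feedback():
        grade = judge_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Reply with 1 if the answer addresses the question, 0 otherwise.",
                },
                {"role": "user", "content": f"Question: {user_query}\nAnswer: {answer}"},
            ],
        )
        score = float(grade.choices[0].message.content)
        # Log the judge's score; keep this call inside the trace_feedback block.
        t.log_feedback(key="correctness", score=score)

    assert score == 1.0
```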
Notice that the LLM-as-judge call above is wrapped in a `trace_feedback()` context manager. This makes it so that the LLM-as-judge call is traced separately from the rest of the test case. Instead of showing up in the main test case run it will instead show up in the trace for the `correctness` feedback key.
NOTE: Make sure that the `log_feedback` call associated with the feedback trace occurs inside the `trace_feedback` context. This way we’ll be able to associate the feedback with the trace, and when seeing the feedback in the UI you’ll be able to click on it to see the trace that generated it.
You can pass the `test_suite_name` parameter to `@pytest.mark.langsmith` to group tests into test suites on a case-by-case basis, or you can set the `LANGSMITH_TEST_SUITE` env var to group all tests from an execution into a single test suite:
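```bash
LANGSMITH_TEST_SUITE="My app tests" pytest tests  # suite name is just a placeholder
```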
We recommend setting `LANGSMITH_TEST_SUITE` to get a consolidated view of all of your results.
You can name the experiment for a test run by setting the `LANGSMITH_EXPERIMENT` env var:
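```bash
LANGSMITH_TEST_SUITE="My app tests" LANGSMITH_EXPERIMENT="baseline" pytest tests  # names are placeholders
```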
To cache HTTP requests to LLM providers, install `langsmith[pytest]` and set the `LANGSMITH_TEST_CACHE` env var to your cache path (e.g. `LANGSMITH_TEST_CACHE=/my/cache/path`):
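```bash
LANGSMITH_TEST_CACHE=tests/cassettes pytest tests
```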
All requests will be cached to `tests/cassettes` and loaded from there on subsequent runs. If you check this in to your repository, your CI will be able to use the cache as well.
In `langsmith>=0.4.10`, you can selectively enable caching for requests to individual URLs or hostnames.
`@pytest.mark.langsmith` is designed to stay out of your way and works well with familiar `pytest` features.
`pytest.mark.parametrize`
You can continue to use the `parametrize` decorator as before. This will create a new test case for each parametrized instance of the test.
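For example, a sketch that parametrizes a hypothetical `generate_sql` function and marks the expected SQL as the reference output:

```python
import pytest
from langsmith import testing as t

from my_app import generate_sql  # hypothetical function under test


@pytest.mark.langsmith(output_keys=["expected_sql"])
@pytest.mark.parametrize(
    "user_query, expected_sql",
    [
        ("Get all users from the customers table", "SELECT * FROM customers;"),
        ("Count the orders", "SELECT COUNT(*) FROM orders;"),
    ],
)
def test_sql_generation(user_query: str, expected_sql: str) -> None:
    sql = generate_sql(user_query)
    t.log_outputs({"sql": sql})
    assert sql == expected_sql
```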
If you are parametrizing over a large number of test cases, consider using `evaluate()` instead. This parallelizes the evaluation and makes it easier to control individual experiments and the corresponding dataset.
`pytest-xdist`
You can use `pytest-xdist` as you normally would to parallelize test execution across multiple workers.
`pytest-asyncio`
`@pytest.mark.langsmith` works with sync or async tests, so you can run async tests exactly as before.
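For example, with `pytest-asyncio` installed, an async test might look like this sketch (`my_async_chatbot` is hypothetical):

```python
import pytest
from langsmith import testing as t

from my_app import my_async_chatbot  # hypothetical async function under test


@pytest.mark.langsmith
@pytest.mark.asyncio
async def test_async_chatbot() -> None:
    answer = await my_async_chatbot("Say hello!")
    t.log_outputs({"answer": answer})
    assert answer
```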
`pytest-watch`
You can also keep using `pytest-watch` to automatically re-run tests when files change.
To see a rich display of the LangSmith results of your test run, pass the `--langsmith-output` flag:
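```bash
pytest --langsmith-output tests
```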
NOTE: This flag was previously `--output=langsmith` in `langsmith<=0.3.3` but was updated to avoid collisions with other pytest plugins.
You’ll get a nice table per test suite that updates live as the results are uploaded to LangSmith. A few things to note:
- Make sure you’ve installed the extras: `pip install -U "langsmith[pytest]"`.
- Rich outputs do not currently work with `pytest-xdist`.
If you want to run the tests without syncing any results to LangSmith, set `LANGSMITH_TEST_TRACKING=false` in your environment.
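```bash
LANGSMITH_TEST_TRACKING=false pytest tests
```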
LangSmith also provides an `expect` utility for grading outputs. An expectation logs a score to the experiment, along with `assert`ing that the expectation is met, possibly triggering a test failure. `expect` also provides “fuzzy match” methods. For example:
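The following sketch compares the output of a hypothetical `my_chatbot` function against a reference string; note that `expect.embedding_distance` computes an embedding under the hood, so it assumes an embedding provider (e.g. OpenAI credentials) is available:

```python
import pytest
from langsmith import expect

from my_app import my_chatbot  # hypothetical function under test


@pytest.mark.langsmith
def test_output_semantically_close() -> None:
    prediction = my_chatbot("Say hello!")
    reference = "Hello there!"

    # Log the embedding (cosine) distance and assert it is below 0.5.
    expect.embedding_distance(
        prediction=prediction,
        reference=reference,
    ).to_be_less_than(0.5)

    # Log the character edit distance between the two strings.
    expect.edit_distance(prediction=prediction, reference=reference)
```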
In this example, the test will log:
- the `embedding_distance` between the prediction and the expectation
- the `expectation` score (1 if cosine distance is less than 0.5, 0 if not)
- the `edit_distance` between the prediction and the expectation

The `expect` utility is modeled off of Jest’s expect API, with some off-the-shelf functionality to make it easier to grade your LLMs.
`@test` / `@unit` decorator
The legacy way to mark test cases is with the `@test` or `@unit` decorators:
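A minimal sketch of the older style:

```python
from langsmith import unit  # `test` is an equivalent alias: from langsmith import test


@unit
def test_addition() -> None:
    assert 1 + 1 == 2
```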