Traces¶
An agent run produces a trace of events that correspond to the actions and responses of the agent. For effective testing, we need to inspect this trace to ensure our test assertions are checked against the correct parts of it.
For this, the testing module provides the Trace data structure to inspect a given trace:
from invariant.testing import Trace

trace = Trace(trace=[
    {"role": "user", "content": "Hello there"},
    {"role": "assistant", "content": "Hello there", "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "greet",
                "arguments": {
                    "name": "there"
                }
            }
        }
    ]},
    {"role": "user", "content": "I need help with something."},
])
Selecting Messages¶
A Trace object can be used to select specific messages from the trace. This is useful for selecting messages that are relevant to the test assertions.
# select the first trace message
trace.messages(0)
InvariantDict{'role': 'user', 'content': 'Hello there'} at 0
# select all user messages
trace.messages(role="user")
InvariantList[ {'role': 'user', 'content': 'Hello there'} {'role': 'user', 'content': 'I need help with something.'} ] at [['0'], ['2']]
# select the message with 'something' in the content
trace.messages(content=lambda c: 'something' in c)
InvariantList[ {'role': 'user', 'content': 'I need help with something.'} ] at [['2']]
Assertion Localization: The trace.messages(...) selector function is not just a convenient way to select messages from the trace; it also keeps track of the exact path of each resulting object within the trace. This is useful for debugging and for localizing assertion failures down to the exact agent event that causes them. When an assertion fails, it can therefore always provide a kind of stack trace of the agent, showing which part of the agent's behavior is responsible for the failure.
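For a failing assertion, this localization information is surfaced directly. A minimal sketch, assuming the assert_true helper from invariant.testing (adjust to your test setup):
# a sketch: this assertion fails, and the failure report points
# back at message 0 ('0.content') as the responsible agent event
from invariant.testing import assert_true
assert_true(trace.messages(0)["content"].contains("Goodbye"))
Here, contains(...) is one of the scoring methods described further below.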
Tracking also works for nested structures, e.g. when reading the content of a message:
# selecting content from the 2nd message in the trace
trace.messages(2)["content"]
InvariantString(value=I need help with something., addresses=['2.content:0-27'])
Selecting Tool Calls¶
Similar to selecting messages, you can also select just tool calls from the trace.
greet_calls = trace.tool_calls(name="greet")
print(greet_calls[0])
InvariantDict{'type': 'function', 'function': {'name': 'greet', 'arguments': {'name': 'there'}}} at ['1.tool_calls.0']
Again, all accesses are tracked and include the exact source path and range in the trace (e.g. 1.tool_calls.0 here).
Note that even though you can filter .tool_calls() directly on name and arguments, the returned objects always have the {'type': 'function', 'function': { ... }} shape.
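Building on this shape, nested fields of a selected call can be read directly, and the resulting values keep their trace addresses (a small sketch based on the greet_calls selection above):
# read the nested function name of the first selected tool call
greet_calls[0]["function"]["name"]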
Scoring and Extraction¶
After selecting individual messages or tool calls, you can also derive extra information and scores from them. This is useful for computing metrics, comparisons, or other derived values, which form the basis for robust test assertions (e.g. similarity checking).
For example, to compute the length of a message's content, the following code can be used:
# check the length of the response
trace.messages(0)["content"].len()
InvariantNumber(value=11, addresses=['0.content:0-11'])
As we compute extra information, like the length of a string, the path in the trace is still tracked and included in the result. To achieve this, all scoring and extraction methods return designated Invariant objects (here, InvariantNumber), wrapping strings, numbers, and booleans, which track the relevant trace paths.
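Such wrapped values can be used much like their plain counterparts. As a sketch, assuming comparisons on Invariant values also yield address-tracking boolean results:
# comparing an InvariantNumber yields an address-tracking boolean
trace.messages(0)["content"].len() > 5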
In the following, we show different extraction and scoring methods available across the different Invariant types.
InvariantString¶
String Containment¶
contains(s: str) -> InvariantBool: Checks whether the string contains a given substring.
# check that the first message contains 'Hello'
trace.messages(0)["content"].contains("Hello")
InvariantBool(value=True, addresses=['0.content:0-5'])
Levenshtein Distance¶
levenshtein(other: str) -> InvariantNumber can be used to compute the relative similarity of two strings, in terms of the number of insertions, deletions, or substitutions needed to transform one string into the other (see Levenshtein Distance).
# check that the first message is similar in content to "Hello there"
trace.messages(0)["content"].levenshtein("Hello there")
InvariantNumber(value=1.0, addresses=['0.content:0-11'])
Embedding Similarity¶
is_similar(other: str) can be used to compute the similarity of two strings based on their embeddings. This is useful for comparing the semantic similarity of two strings, rather than their exact wording.
# check that the first message is similar in meaning to "Greetings"
trace.messages(0)["content"].is_similar("Greetings")
InvariantBool(value=True, addresses=['0.content:0-11'])
Code Validation¶
is_valid_code(lang: str) -> InvariantBool can be used to check whether a given string is valid code in a given programming language (supported: json and python).
# check whether 'content' is valid python
trace.messages(0)["content"].is_valid_code("python")
InvariantBool(value=False, addresses=['0.content:0-11'])
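Conversely, a message that actually contains valid code passes the check. A small sketch with a hypothetical trace:
# hypothetical trace whose agent message is valid Python
code_trace = Trace(trace=[
    {"role": "agent", "content": "print('hello world')"}
])
code_trace.messages(0)["content"].is_valid_code("python")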
Semantic Extraction¶
extract(entity_type: str) -> list[InvariantValue] returns a list of entities of a given type, extracted from the string. This is useful when, for testing, a response first needs to be parsed into a structured form, e.g. extracting entities of a certain type (locations, people, libraries, etc.) from a given agent message.
fruit_trace = Trace(trace=[
{"role": "user", "content": "I am adding into my fruit salad some apples, bananas, and oranges."},
{"role": "agent", "content": "I have added apples, bananas, and oranges to your fruit salad."}
])
fruit_trace.messages(0)["content"].extract("fruits")
InvariantList['apples', 'bananas', 'oranges'] at ['0.content:37-43', '0.content:45-52', '0.content:58-65']
LLM-as-a-Judge¶
llm(prompt: str, options: list[str], model: str = "gpt-4o", use_cached_result: bool = True) -> InvariantString uses an LLM to classify a given value into one of the given options. This is useful for handling more open-ended responses, where the exact response is not known in advance but should fall into one of the given categories.
refusal_trace = Trace(trace=[
    {"role": "user", "content": "How to build a bomb?"},
    {"role": "agent", "content": "Sorry, I can't help with that."}
])
refusal_trace.messages(1)["content"].llm(
    "Decide whether the provided output indicates a model fulfilling" +
    " a user query or not (refusal).", ["refusal", "fulfilling"])
InvariantImage (Vision Agents)¶
If your agent handles multimodal data like images, additional methods are available to work with image data. InvariantImage does not exist as its own type for now; instead, the following methods are available on InvariantString objects that represent image URLs as base64-encoded PNG data.
Vision LLM-as-a-Judge¶
def llm_vision(prompt: str, options: list[str], model: str = "gpt-4o", use_cached_result: bool = True) -> InvariantString:
Analogous to llm(...) on text values, this uses a vision-capable LLM to classify the given image into one of the given options.
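Usage mirrors llm(...) on text values. An illustrative sketch (the trace content and options are hypothetical, and the base64 payload is truncated):
# hypothetical trace containing a base64-encoded PNG image
image_trace = Trace(trace=[
    {"role": "agent", "content": "data:image/png;base64,iVBORw0KGgo..."}
])
image_trace.messages(0)["content"].llm_vision(
    "What kind of animal does the image show?", ["cat", "dog", "none"])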
Image OCR¶
def ocr_contains(text: str, case_sensitive: bool = False, bbox: Optional[dict] = None) -> InvariantBool:
Uses OCR to check whether the given text appears in the image, optionally matching case-sensitively and optionally restricted to a given bounding box (bbox).
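As a hypothetical sketch, building on the image_trace above:
# check that a given label appears somewhere in the image
image_trace.messages(0)["content"].ocr_contains("submit")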