
Function Calling Agents

Test an OpenAI function calling agent using testing

OpenAI's function calling can be used to build agents that integrate with external tools and APIs, allowing the agent to call custom functions and deliver enhanced, context-aware responses. More details can be found here: OpenAI Function Calling Guide

In this chapter, we test a simple OpenAI function calling agent, as implemented in this file.

Agent Overview

The agent generates and executes Python code in response to user requests and returns the computed results. It operates under a strict system prompt and uses the run_python tool to execute the code it writes, which keeps it within its intended functionality.
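For orientation, the run_python tool might be declared and implemented roughly as follows. This is a minimal sketch, not the code from the agent file: only the tool name and its code argument come from this chapter, while the description strings and the exec-based body of run_python are assumptions.

```python
import contextlib
import io

# Tool declaration in the shape accepted by the `tools` parameter of the
# OpenAI Chat Completions API. The description texts are assumptions.
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Run the given Python code and return what it prints.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python source code to execute.",
                    }
                },
                "required": ["code"],
            },
        },
    }
]


def run_python(code: str) -> str:
    """Hypothetical implementation: execute the code and capture stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # unsandboxed, so suitable for a demo only
    except Exception as exc:
        # Return the error as text so the model can react to it.
        return f"Error: {exc}"
    return buffer.getvalue()
```

Returning errors as text rather than raising lets the tool result always be sent back to the model as a message, which is one possible design choice for such a tool.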

A loop runs the client until the chat completes without further tool calls. During this process, all chat interactions are stored in messages. A simplified implementation is shown below:

while True:

    # call the client to get a response
    response = self.client.chat.completions.create(
        messages=messages,
        model="gpt-4o",
        tools=tools,
    )
    response_message = response.choices[0].message

    # check whether the response calls a tool; if not, the chat is completed
    tool_calls = response_message.tool_calls
    if tool_calls:

        # append the current response message to messages
        messages.append(response_message.to_dict())

        # in this demo there is only one tool call in the response
        tool_call = tool_calls[0]
        if tool_call.function.name == "run_python":

            # get the arguments generated by the agent for the function
            function_args = json.loads(tool_call.function.arguments)

            # run the function with the "code" argument
            function_response = run_python(function_args["code"])

            # append the function's response to messages for the next round
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": "run_python",
                    "content": str(function_response),
                }
            )
    else:
        break

Run the Example

Run the example with the following command from the root of the repository:

poetry run invariant test sample_tests/openai/test_python_agent.py --push --dataset_name test_python_agent

Note: If you want to run the example without sending the results to the Explorer UI, omit the --push flag. The parts of the trace that fail are still highlighted in the terminal.

Unit Tests

Here, we design three unit tests to cover different scenarios.

In these tests, we vary the input to reflect different situations. Within each test, we create an instance of the agent named python_agent and retrieve its response by calling python_agent.get_response(input).

The agent's response is then transformed into a Trace object using TraceFactory.from_openai(response) for further validation.

Test 1: Valid Python Code Execution:

In the first test, we ask the agent to calculate the Fibonacci series for the first 10 elements using Python.

The implementation of the first test is shown below:

def test_python_question():
    input = "Calculate fibonacci series for the first 10 elements in python"

    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert trace
    trace = TraceFactory.from_openai(response)

    # test the agent behavior
    with trace.as_context():
        run_python_tool_call = trace.tool_calls(name="run_python")

        # assert the agent calls "run_python" exactly once
        assert_true(F.len(run_python_tool_call) == 1)

        # assert the argument passed to the tool_call is valid Python code.
        assert_true(
            run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
                "python"
            )
        )

        # assert that 34 is included in the agent's final response.
        assert_true("34" in trace.messages(-1)["content"])

Our primary objective is to verify that the agent correctly calls the run_python tool and provides valid Python code as its parameter. To achieve this, we first select the tool calls whose name is "run_python" and assert that exactly one tool call meets this condition. We then confirm that the argument passed to that tool call is valid Python code.
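As an illustration of what a validity check for Python code can look like, the sketch below uses the standard library's ast module. It is an approximation for explanatory purposes: how is_valid_code("python") is implemented is not shown in this chapter, and the helper name is_valid_python is hypothetical.

```python
import ast


def is_valid_python(code: str) -> bool:
    """Return True if the string parses as Python source code."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # SyntaxError: not parseable; ValueError: e.g. null bytes on some versions
        return False
    return True


print(is_valid_python("print(sum(range(10)))"))       # True
print(is_valid_python("public class Fibonacci { }"))  # False
```

Note that parsing only checks syntax. Code that parses can still fail at run time, which is why the test also inspects the agent's final answer.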

Finally, we validate that the Python code executed correctly by checking that one of the calculated results, "34", is included in the agent's final response.
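The choice of 34 can be checked with a short reference computation. The sketch below assumes the series starts at 0; the function name fibonacci is illustrative and is not taken from the agent's generated code.

```python
def fibonacci(n: int) -> list[int]:
    """Return the first n Fibonacci numbers, starting from 0."""
    series = []
    a, b = 0, 1
    for _ in range(n):
        series.append(a)
        a, b = b, a + b
    return series


print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

If the agent instead starts the series at 1 (1, 1, 2, 3, 5, 8, 13, 21, 34, 55), the first 10 elements still contain 34, so the check holds under either convention.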

Test 2: Invalid Response:

In this test, we use unittest.mock.MagicMock to simulate a scenario where the agent incorrectly responds with Java code instead of Python, and we check that this behavior is detected. The actual response from python_agent.get_response(input) is replaced with custom content stored in mock_invalid_response.

The implementation of the second test is shown below:

def test_python_question_invalid():
    input = "Calculate fibonacci series for the first 10 elements in python"
    python_agent = PythonAgent()

    # set custom response that contains Java code instead of Python code
    mock_invalid_response = [
        {
            "role": "system",
            "content": '\n                    You are an assistant that strictly responds with Python code only. \n                    The code should print the result.\n                    You always use tool run_python to execute the code that you write to present the results.\n                    If the user specifies other programming language in the question, you should respond with "I can only help with Python code."\n                    ',
        },
        {"role": "user", "content": "Calculate fibonacci series for 10"},
        {
            "content": "None",
            "refusal": "None",
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_GMx1WYM7sN0BGY1ISCk05zez",
                    "function": {
                        "arguments": '{"code":"public class Fibonacci { public static void main(String[] args) { for (int n = 10, a = 0, b = 1, i = 0; i < n; i++, b = a + (a = b)) System.out.print(a + '
                        '); } }"}',
                        "name": "run_python",
                    },
                    "type": "function",
                }
            ],
        },
    ]

    # the response will be replaced by our mock_invalid_response
    python_agent.get_response = MagicMock(return_value=mock_invalid_response)
    response = python_agent.get_response(input)

    # convert trace
    trace = TraceFactory.from_openai(response)

    # test the agent behavior
    with trace.as_context():
        run_python_tool_call = trace.tool_calls(name="run_python")

        assert_true(F.len(run_python_tool_call) == 1)
        assert_true(
            not run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
                "python"
            )
        )

In this test, we still verify that the agent calls the run_python tool exactly once, but this time it provides invalid Python code as the parameter. We therefore assert that the parameter passed to this call is not valid Python code.
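The mocking technique used in this test can be seen in isolation in the sketch below. The Agent class is a hypothetical stand-in for PythonAgent so that the example runs without an API key; only MagicMock and its return_value argument are taken from the test above.

```python
from unittest.mock import MagicMock


class Agent:
    """Hypothetical stand-in for PythonAgent."""

    def get_response(self, user_input: str) -> list:
        raise RuntimeError("a real agent would call the OpenAI API here")


agent = Agent()
canned = [{"role": "assistant", "content": "canned reply"}]

# Replace the method on this instance only; the class itself is untouched.
agent.get_response = MagicMock(return_value=canned)

response = agent.get_response("any input")
print(response is canned)  # True

# The mock also records how it was called.
agent.get_response.assert_called_once_with("any input")
```

Because the mock is assigned on the instance, other instances of the class keep the real method, which keeps the substitution local to the test.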

Test 3: Non-Python Language Request:

This test evaluates the agent's ability to handle requests involving a programming language other than Python, specifically Java. The agent is expected to respond appropriately by clarifying its limitation to Python code, as outlined in the prompt.

The implementation of the third test is shown below:

def test_java_question():
    input = "How to calculate fibonacci series in Java?"
    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert trace
    trace = TraceFactory.from_openai(response)

    # set expected response as clarified in prompt
    expected_response = "I can only help with Python code."

    # test the agent behavior
    with trace.as_context():

        # assert that the agent does not call the `run_python` tool
        run_python_tool_call = trace.tool_calls(name="run_python")
        assert_true(F.len(run_python_tool_call) == 0)

        # assert that the real response is close enough to the expected response
        expect_equals(expected_response, trace.messages(-1)["content"])
        assert_true(trace.messages(-1)["content"].levenshtein(expected_response) < 5)

The first validation confirms that the agent does not call the run_python tool.

The agent's response should closely match expected_response, which is "I can only help with Python code." We check this with the expect_equals assertion, which is less strict than assert_equal.

To further confirm similarity, we use our levenshtein() function to compute the Levenshtein distance between the agent's response and the expected response, and assert that it is smaller than 5.
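For readers unfamiliar with the metric, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version for illustration; it is not the library's levenshtein() implementation, and the sample actual response is made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


expected = "I can only help with Python code."
actual = "I can only help with python code!"  # hypothetical agent reply
print(levenshtein(expected, actual))  # 2
```

A threshold of 5 therefore tolerates small differences in casing or punctuation while still rejecting a substantially different answer.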

Conclusion

We have seen how to build an OpenAI function calling agent and how to use testing to write unit tests that verify the agent functions correctly.
