
Moderated and Toxic Content

Defining and Enforcing Content Moderation in Agentic Systems

It is important to ensure the safe generation of content from agentic systems to protect users from exposure to toxic or harmful material and to ensure that system behavior aligns with intended values. Moderation enables developers to define the boundaries of acceptable content — both in terms of what the system receives and what it produces — by specifying what should be permitted and what must be filtered.

By implementing moderation guardrails, you can shape the behavior of agentic systems in a way that is predictable, value-aligned, and resilient to misuse.

Moderated and Toxic Content Risks
Without moderation safeguards, agents may:

  • Generate or amplify hate speech, harassment, or explicit content.

  • Act on inappropriate user inputs, causing unintended behavior.

  • Spread misinformation or reinforce harmful stereotypes.

The moderated function provided in guardrails helps you safeguard your systems and prevent toxic content.

moderated

def moderated(
    data: str | list[str],
    model: str | None = None,
    default_threshold: float | None = 0.5,
    cat_thresholds: dict[str, float] | None = None
) -> bool
Detector which evaluates to true if the given data should be moderated.

Parameters

Name Type Description
data str | list[str] A single message or a list of messages.
model str | None The model to use for moderation detection (KoalaAI/Text-Moderation or openai for the OpenAI Moderations API).
default_threshold float | None The threshold for the model score above which text is considered to require moderation.
cat_thresholds dict[str, float] | None A dictionary of category-specific thresholds that override default_threshold.

Returns

Type Description
bool TRUE if the given data should be moderated, FALSE otherwise

Detecting Harmful Messages

To detect content that you want to moderate in messages, you can directly apply the moderated function to messages.

Example: Harmful Message Detection

from invariant.detectors import moderated

raise "Detected a harmful message" if:
    (msg: Message)
    moderated(msg.content)

[
  {
    "role": "assistant",
    "content": "Hello, how can I assist you?"
  },
  {
    "role": "user",
    "content": "This is hatefully hateful hate!"
  }
]

Default moderation detection.
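Choosing a Moderation Model

You can also select which moderation backend to use via the model parameter. The rule below is a minimal sketch that assumes passing model="openai" routes detection through the OpenAI Moderations API, as described in the parameter table above; the KoalaAI/Text-Moderation model can be selected in the same way.

Example: Choosing a Moderation Model

from invariant.detectors import moderated

raise "Detected a harmful message" if:
    (msg: Message)
    # illustrative: use the OpenAI Moderations API as the moderation backend
    moderated(msg.content, model="openai")

Moderation detection with an explicitly selected model.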

Thresholding

The threshold above which content is classified as requiring moderation can be adjusted: the default_threshold parameter sets a global threshold (0.5 by default), while cat_thresholds sets category-specific thresholds. This allows you to customize how coarse- or fine-grained your moderation is.

Example: Thresholding Detection

from invariant.detectors import moderated

raise "Detected a harmful message" if:
    (msg: Message)
    moderated(
        msg.content,
        cat_thresholds={"hate/threatening": 0.15}
    )

[
  {
    "role": "assistant",
    "content": "Hello, how can I assist you?"
  },
  {
    "role": "user",
    "content": "This is hatefully hateful hate!"
  }
]

Thresholding for a specific category.
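In addition to per-category thresholds, the global threshold can be raised or lowered for all categories at once. The rule below is a sketch assuming default_threshold can be overridden per call, as the signature above indicates; a lower value makes the detector more sensitive.

Example: Adjusting the Global Threshold

from invariant.detectors import moderated

raise "Detected a harmful message" if:
    (msg: Message)
    # illustrative: flag content at a lower (more sensitive) score than the 0.5 default
    moderated(msg.content, default_threshold=0.3)

Thresholding across all categories.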