Moderated and Toxic Content
It is important to ensure the safe generation of content from agentic systems, both to protect users from exposure to toxic or harmful material and to keep system behavior aligned with intended values. Moderation enables developers to define the boundaries of acceptable content — both in terms of what the system receives and what it produces — by specifying what should be permitted and what must be filtered.
By implementing moderation guardrails, you can shape the behavior of agentic systems in a way that is predictable, value-aligned, and resilient to misuse.
Moderated and Toxic Content Risks
Without moderation safeguards, agents may:
Generate or amplify hate speech, harassment, or explicit content.
Act on inappropriate user inputs, causing unintended behavior.
Spread misinformation or reinforce harmful stereotypes.
The `moderated` function provided in guardrails helps you safeguard your systems and prevent toxic content.
moderated
```
def moderated(
    data: str | list[str],
    model: str | None = None,
    default_threshold: float | None = 0.5,
    cat_thresholds: dict[str, float] | None = None
) -> bool
```
Parameters
| Name | Type | Description |
|---|---|---|
| `data` | `str \| list[str]` | A single message or a list of messages. |
| `model` | `str \| None` | The model to use for moderation detection (`KoalaAI/Text-Moderation` or `openai` for the OpenAI Moderations API). |
| `default_threshold` | `float \| None` | The threshold for the model score above which text is considered to be moderated. |
| `cat_thresholds` | `dict[str, float] \| None` | A dictionary of category-specific thresholds. |
Returns
| Type | Description |
|---|---|
| `bool` | `TRUE` if the content is flagged for moderation, `FALSE` otherwise. |
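To make the threshold semantics concrete, the sketch below is a hypothetical illustration of how `default_threshold` and `cat_thresholds` could interact. It is not the library's implementation: the function name `is_moderated` and the per-category score dictionary are assumptions, based on the idea that the underlying moderation model returns a score per category and that a category-specific threshold takes precedence over the default.

```python
# Hypothetical sketch of the thresholding semantics, not the library's code.
# Assumes the moderation model yields one score per category.

def is_moderated(scores: dict[str, float],
                 default_threshold: float = 0.5,
                 cat_thresholds: dict[str, float] | None = None) -> bool:
    """Return True if any category score reaches its applicable threshold."""
    cat_thresholds = cat_thresholds or {}
    for category, score in scores.items():
        # A category-specific threshold overrides the default, if present.
        threshold = cat_thresholds.get(category, default_threshold)
        if score >= threshold:
            return True
    return False

# A borderline "hate/threatening" score is only flagged once the
# category-specific threshold is lowered below it.
scores = {"hate": 0.4, "hate/threatening": 0.2}
print(is_moderated(scores))                                             # False
print(is_moderated(scores, cat_thresholds={"hate/threatening": 0.15}))  # True
```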
Detecting Harmful Messages
To detect content that you want to moderate, you can apply the `moderated` function directly to messages.
Example: Harmful Message Detection
```
from invariant.detectors import moderated

raise "Detected a harmful message" if:
    (msg: Message)
    moderated(msg.content)
```
```json
[
    {
        "role": "assistant",
        "content": "Hello, how can I assist you?"
    },
    {
        "role": "user",
        "content": "This is hatefully hateful hate!"
    }
]
```
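With the default threshold of `0.5`, the assistant greeting passes unflagged, while the hateful user message would be expected to score above the threshold and trigger the rule.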
Thresholding
The threshold above which content is classified as requiring moderation can also be modified using the `cat_thresholds` parameter. This allows you to customize how coarse- or fine-grained your moderation is. The default threshold is `0.5`.
Example: Thresholding Detection
```
from invariant.detectors import moderated

raise "Detected a harmful message" if:
    (msg: Message)
    moderated(
        msg.content,
        cat_thresholds={"hate/threatening": 0.15}
    )
```
```json
[
    {
        "role": "assistant",
        "content": "Hello, how can I assist you?"
    },
    {
        "role": "user",
        "content": "This is hatefully hateful hate!"
    }
]
```
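Lowering the `hate/threatening` threshold to `0.15` makes moderation more sensitive for that specific category. Categories without an entry in `cat_thresholds` would be expected to fall back to `default_threshold`, which you can also adjust if you want a stricter or looser global cutoff.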