Traceport provides a robust, real-time alerting and monitoring system designed to help you maintain the health, performance, and cost-efficiency of your AI applications. By defining custom alert rules, you can stay informed about critical events as they happen and respond quickly to degradations or outages.

Key Concepts

To effectively use Traceport’s monitoring system, it is important to understand the following core concepts:
  • Alert Rules: The configuration that defines what metric to monitor, the conditions for triggering an alert, and where notifications should be sent.
  • Incidents (Events): A specific occurrence where an Alert Rule’s conditions are met. Traceport automatically groups related triggers into a single incident based on a unique fingerprint.
  • Notification Channels: The delivery methods for alerts, such as Slack, Email, or Webhooks.
  • Metric Timeline: A granular historical record of metric values associated with an incident, visualized as a time-series chart in the Traceport dashboard.
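The fingerprint-based grouping described above can be sketched as follows. Traceport does not document which fields it hashes, so hashing the rule ID plus the scope is purely an illustrative assumption; the function and variable names are hypothetical.

```python
import hashlib

def incident_fingerprint(rule_id: str, scope: str) -> str:
    """Derive a stable fingerprint so repeated triggers of the same rule
    on the same scope collapse into one incident.
    ASSUMPTION: the fields hashed here are illustrative, not Traceport's."""
    return hashlib.sha256(f"{rule_id}:{scope}".encode()).hexdigest()[:16]

# Triggers that share a fingerprint are appended to the same open incident.
open_incidents: dict = {}

def record_trigger(rule_id: str, scope: str, value: float) -> str:
    fp = incident_fingerprint(rule_id, scope)
    open_incidents.setdefault(fp, []).append({"value": value})
    return fp
```

Two triggers from the same rule and scope produce the same fingerprint and therefore land in a single incident, while a different rule opens a new one.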

Monitor Catalog

Traceport offers a wide range of monitors categorized by focus area:

Error & Reliability

  • Error Rate Spike: Triggers when the percentage of failed requests exceeds a threshold.
  • Provider Error Count: Monitors absolute error counts from specific AI providers.
  • Model Error Rate: Detects degradations isolated to specific AI models.
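As a sketch of how an Error Rate Spike check evaluates, the condition compares the failure percentage in a window against the configured threshold. This is a minimal illustration; the function names and the exact evaluation semantics are assumptions, not Traceport's implementation.

```python
def error_rate(failed: int, total: int) -> float:
    """Percentage of failed requests in the evaluation window."""
    return 0.0 if total == 0 else 100.0 * failed / total

def error_rate_spike(failed: int, total: int, threshold_pct: float) -> bool:
    """Fires when the failure percentage exceeds the threshold.
    An empty window never fires (illustrative choice)."""
    return error_rate(failed, total) > threshold_pct
```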

Latency & Performance

  • High Latency (P95/P99): Monitors tail latency to ensure a responsive user experience.
  • TTFB Degradation: Detects increases in Time To First Byte for streaming responses.
  • Model Latency Spike: Monitors performance for specific model integrations.
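Tail-latency monitors like High Latency (P95/P99) compare a percentile of observed latencies against a threshold. The sketch below uses the nearest-rank percentile convention; Traceport's actual interpolation method is not specified here, and the names are illustrative.

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile (one common convention; an assumption here)."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank - 1, 0)]

def p95_breached(latencies_ms: list, threshold_ms: float) -> bool:
    """Fires when the 95th-percentile latency exceeds the threshold."""
    return percentile(latencies_ms, 95) > threshold_ms
```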

Usage & Traffic

  • Request Rate Spike: Detects sudden surges in traffic that might impact availability.
  • Request Rate Drop: Useful for detecting client-side outages or integration failures.
  • Token Usage Spike: Monitors for unexpected spikes in token consumption.
  • Zero Traffic: Alerts you if a previously active API key stops sending requests.
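A Zero Traffic check reduces to asking whether a previously active key has been silent for the whole evaluation window. The sketch below assumes the monitor tracks a last-seen timestamp per key; field and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def zero_traffic(last_request_at: datetime, window: timedelta,
                 now: Optional[datetime] = None) -> bool:
    """True when the key has sent nothing for at least `window`.
    ASSUMPTION: a per-key last-seen timestamp is available."""
    now = now or datetime.now(timezone.utc)
    return now - last_request_at >= window
```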

Cost & Spend

  • Spend Threshold Breach: Alerts you when your cumulative spend exceeds a configured budget.
  • Spend Rate Anomaly: Detects unusual spending patterns compared to historical averages.
  • Cost Per Request Spike: Helps identify runaway prompts or expensive context growth.
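One simple way to frame a Spend Rate Anomaly is a z-score test against the historical mean. Note this is only a sketch under that assumption; Traceport's actual anomaly model is not documented here, and the names and the default of 3 standard deviations are illustrative.

```python
import statistics

def spend_anomaly(current: float, history: list, z: float = 3.0) -> bool:
    """Flags spend deviating more than `z` standard deviations from the
    historical mean (an illustrative z-score rule, not Traceport's model)."""
    if len(history) < 2:
        return False  # not enough history to establish a baseline
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean  # perfectly flat history: any change is unusual
    return abs(current - mean) / stdev > z
```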

Managing Alert Rules

Creating a Rule

1. Choose a Metric: Navigate to the Alerts page in the Dashboard and click Create Alert Rule. Select the monitor type from the catalog.
2. General Information: Provide a name, description, and define the Scope (e.g., specific API keys or organization-wide).
3. Alert Conditions: Define the comparison operator (>, <, ==), the threshold value, and the time window (e.g., 5 minutes).
4. Delivery Settings: Set the severity level (Info, Warning, Critical) and select your notification channels.
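The four steps above map naturally onto a single rule definition. The structure below is a hypothetical payload for illustration only: the field names and values are assumptions, not Traceport's actual API schema.

```python
# Hypothetical alert-rule definition mirroring the four creation steps.
# Every field name here is illustrative, not Traceport's documented schema.
alert_rule = {
    "monitor": "error_rate_spike",         # step 1: metric from the catalog
    "name": "Prod error rate",             # step 2: general information
    "description": "Failures above 5% on production keys",
    "scope": {"api_keys": ["prod-key"]},   # step 2: scope
    "condition": {                         # step 3: alert conditions
        "operator": ">",
        "threshold": 5.0,
        "window_minutes": 5,
    },
    "delivery": {                          # step 4: delivery settings
        "severity": "critical",
        "channels": ["slack", "email"],
    },
}
```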

Editing a Rule

When editing an existing rule, you can quickly adjust the Conditions and Delivery Settings.
The monitor type and scope are fixed once a rule is created to maintain data consistency.

Handling Incidents

When an alert triggers, an Incident is created and notifications are sent to your selected channels.

Incident Lifecycle

  • Triggered: The alert’s conditions are currently being met.
  • Acknowledged: A team member has seen the alert and is investigating.
  • Resolved: The issue is fixed.
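The lifecycle above can be read as a small state machine. The transition rules below are an assumption for illustration (e.g., whether a resolved incident can re-trigger is not specified in this document).

```python
# Allowed transitions between the lifecycle states listed above.
# ASSUMPTION: these rules are illustrative, not Traceport's exact behavior.
TRANSITIONS = {
    "triggered": {"acknowledged", "resolved"},
    "acknowledged": {"resolved"},
    "resolved": set(),
}

def advance(state: str, target: str) -> str:
    """Move an incident to a new state, rejecting invalid transitions."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state} to {target}")
    return target
```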

Auto-Resolution

Traceport can automatically resolve incidents if the metric returns to a healthy state for a sustained period. This can be enabled in the Delivery Settings of any alert rule.
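A minimal sketch of the auto-resolution idea: resolve only after the metric has been healthy for some number of consecutive evaluations. How Traceport defines a "sustained period" is not specified here, so the consecutive-evaluation criterion and the names are assumptions.

```python
def should_auto_resolve(recent_healthy: list, required: int) -> bool:
    """Resolve once the metric has been healthy for `required` consecutive
    evaluations (an illustrative reading of 'sustained period')."""
    if len(recent_healthy) < required:
        return False
    return all(recent_healthy[-required:])
```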

Metric Timeline Chart

Every incident includes a Metric Timeline Chart. This allows you to visualize the exact moment the threshold was breached and see how the metric progressed over the duration of the incident, helping with root cause analysis.