Error Handling for Agent Triggers: Retries, Fallbacks, and Dead Letter Queues
Your trigger fires. Your agent starts processing. Then something breaks. Maybe the downstream API is down. Maybe your agent's context window overflows on a massive payload. Maybe the action partially completes — the email sent but the CRM update failed.
What happens next determines whether your automation is production-grade or a demo that works on sunny days.
## Why Triggers Fail
Trigger failures fall into three categories, and each requires a different strategy.
### Transient failures
The downstream service is temporarily unavailable. Slack is having an outage. Your CRM's API returns a 503 for 10 minutes during a deployment. These failures resolve themselves. The correct response is to wait and retry.
### Permanent failures
The data is invalid and no amount of retrying will fix it. A customer email address is malformed, so sending a receipt will always fail. A required field is missing from the webhook payload because the source changed its schema. The correct response is to route the event somewhere a human can investigate.
### Partial failures
The trigger executed multiple actions and some succeeded while others failed. Your agent sent the welcome email but failed to update the CRM. Retrying the entire trigger would send a duplicate email. The correct response depends on whether your actions are idempotent.
## Retry Policies
Retries are your first defense against transient failures. A naive retry (immediately re-run the trigger) creates more problems than it solves. You hammer a struggling service with more requests, you process events out of order, and you waste compute on attempts that have no chance of succeeding.
### Exponential backoff
ClawJolt uses exponential backoff by default. The first retry happens after 5 seconds, and each subsequent retry waits roughly three times longer than the last, giving the downstream service progressively more time to recover.
You can configure the backoff parameters per trigger:
- **Initial delay**: How long to wait before the first retry (default: 5 seconds)
- **Multiplier**: How much to increase the delay each time (default: 3x)
- **Max delay**: The ceiling on retry delay (default: 1 hour)
- **Max attempts**: How many times to retry before giving up (default: 5)
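As a sketch, these parameters combine into a simple delay schedule. This is a generic exponential-backoff calculation using the defaults above, not ClawJolt's actual scheduler:

```python
def backoff_delay(attempt, initial=5.0, multiplier=3.0, max_delay=3600.0):
    """Seconds to wait before retry number `attempt` (1-indexed),
    growing geometrically and capped at max_delay."""
    return min(initial * multiplier ** (attempt - 1), max_delay)

# With the defaults: 5s, 15s, 45s, 135s, 405s for attempts 1 through 5.
```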
### Jitter
When multiple triggers fail at the same time (because the downstream service went down), they all retry at the same intervals. This creates a "thundering herd" — all retries hit the recovering service simultaneously and knock it down again.
ClawJolt adds random jitter to each retry delay. Instead of retrying at exactly 30 seconds, it retries somewhere between 20 and 40 seconds. This spreads the load and gives the downstream service a smoother recovery curve.
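The 20-to-40-second example corresponds to spreading each delay uniformly across roughly a third of its nominal value. A minimal sketch (the exact spread ClawJolt applies may differ):

```python
import random

def with_jitter(delay, spread=1 / 3):
    """Spread a retry delay uniformly across +/- spread of its nominal value,
    so simultaneous failures don't all retry at the same instant."""
    return random.uniform(delay * (1 - spread), delay * (1 + spread))

# A nominal 30-second delay lands anywhere between 20 and 40 seconds.
```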
### Retry conditions
Not every error deserves a retry. A 500 error from Slack is worth retrying. A 400 error (bad request) is not — the request is malformed and sending it again won't help. ClawJolt classifies HTTP responses automatically:
- **5xx errors**: Retry with backoff
- **429 (rate limited)**: Retry after the `Retry-After` header duration
- **4xx errors**: Do not retry, route to dead letter queue
- **Timeouts**: Retry with backoff, but flag for investigation if recurring
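The classification above can be sketched as a small function. This is an illustration of the rules, not ClawJolt's actual code; `retry_after` stands in for the value of the `Retry-After` header:

```python
def classify(status_code=None, timed_out=False, retry_after=None):
    """Map an HTTP outcome to a retry decision: (action, delay_hint)."""
    if timed_out:
        return ("retry_backoff", None)  # and flag for investigation if recurring
    if status_code == 429:
        return ("retry_after", retry_after)  # honor the Retry-After header
    if status_code is not None and status_code >= 500:
        return ("retry_backoff", None)
    if status_code is not None and 400 <= status_code < 500:
        return ("dead_letter", None)  # malformed request; retrying won't help
    return ("success", None)
```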
## Fallback Actions
When retries are exhausted or the error is permanent, you need a fallback. A fallback is an alternative action that runs when the primary action fails.
### Common fallback patterns
**Notify a human.** The simplest fallback: send a Slack message or email to the person responsible. Include the original event data, the error details, and what the trigger was trying to do. This is the minimum viable fallback for any production trigger.
**Degrade gracefully.** If the CRM update fails, at least log the data to a spreadsheet or a database table. The customer still gets their welcome email, and someone can update the CRM manually from the log. Partial automation beats total failure.
**Use an alternative service.** If your primary email provider is down, fall back to a secondary one. If your Slack integration fails, send the notification via email instead. This requires more setup but eliminates single points of failure.
**Queue for later.** Accept the failure, store the event, and schedule a retry for a later time (e.g., 6 hours from now). This differs from automatic retries: it's a deliberate decision to defer, not another pass through the backoff cycle.
### Setting up fallbacks in ClawJolt
In the trigger editor, every action has an "On failure" section. You can add one or more fallback actions that execute when the primary action fails after exhausting all retries. Fallback actions have their own retry policies, usually simpler ones: a fallback that also fails means you're in real trouble and should notify a human.
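The execution order works out to "primary first, then each fallback until one succeeds." A minimal sketch with hypothetical callables (ClawJolt handles this for you; `primary` and `fallbacks` here are functions that raise on failure):

```python
def run_with_fallbacks(primary, fallbacks, event):
    """Run the primary action; if it fails, try each fallback in order.
    Re-raises the original error if every fallback also fails."""
    try:
        return primary(event)
    except Exception as primary_error:
        for fallback in fallbacks:
            try:
                return fallback(event)
            except Exception:
                continue  # this fallback failed too; try the next one
        raise primary_error  # everything failed: this event is DLQ-bound
```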
## Dead Letter Queues
A dead letter queue (DLQ) is where events go when they can't be processed and all retries and fallbacks have been exhausted. Think of it as a holding area for events that need human attention.
### Why you need a DLQ
Without a DLQ, unprocessable events disappear. Your trigger tried 5 times, failed every time, and the event is gone. You have no record of what happened, no way to investigate, and no path to recovery.
With a DLQ, every failed event is preserved with full context: the original payload, the trigger that tried to process it, every retry attempt with the error response, and the timestamp of each attempt.
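That record can be sketched as a simple data structure. Field names here are illustrative, not ClawJolt's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DeadLetterEntry:
    """One preserved failure: the original payload, the trigger that
    processed it, and every retry attempt with its error and timestamp."""
    event_id: str
    trigger_id: str
    payload: dict
    attempts: list = field(default_factory=list)  # (timestamp, error) per attempt
```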
### Working with the DLQ
ClawJolt's DLQ is a first-class feature, not a log file you grep through.
- **Browse and filter**: See all dead-lettered events; filter by connector, trigger, error type, or date range
- **Inspect**: Click any event to see the full payload, the trigger configuration at the time of processing, and the error chain
- **Reprocess**: Fix the underlying issue (update a signing secret, fix a broken API key, adjust a trigger condition) and reprocess individual events or a batch
- **Discard**: Mark events as intentionally unprocessable. Maybe a test event leaked into production, or the event is for a customer who no longer exists. Discarding removes the event from the DLQ without processing it.
- **Alert**: Set alerts on DLQ depth: "If more than 10 events are in the DLQ, notify me." A growing DLQ means something is systematically broken, not just a one-off failure.
## Idempotency
Retries mean your agent might process the same event more than once. If your agent sends a welcome email on every retry attempt, the customer gets five emails. That's worse than sending none.
### Making actions idempotent
An action is idempotent if running it multiple times produces the same result as running it once. Some actions are naturally idempotent: updating a CRM field to "VIP" is the same whether you do it once or five times. Others are not: sending an email or posting a Slack message creates a new message every time.
For non-idempotent actions, use these strategies:
- **Deduplication keys**: ClawJolt tracks processed event IDs. If your agent has already successfully processed event `evt_abc123`, a retry of the same event skips the action. This is automatic for all ClawJolt triggers.
- **Check before acting**: Have your agent check if the action was already performed. Before sending a welcome email, check if one was sent to this customer in the last hour. Before creating a CRM record, check if one already exists. This adds latency but prevents duplicates.
- **Idempotency tokens**: For API calls that support them (Stripe, many payment providers), pass an idempotency key derived from the event ID. The API guarantees the action only executes once per key, regardless of how many times you call it.
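Deduplication by event ID reduces to a guard around the action. A minimal sketch, with an in-memory set standing in for ClawJolt's durable store of processed IDs:

```python
processed = set()  # stands in for a durable store of processed event IDs

def handle_once(event_id, action):
    """Run the action only if this event ID hasn't been processed yet."""
    if event_id in processed:
        return "skipped"
    result = action()
    processed.add(event_id)  # mark done only after the action succeeds
    return result
```

Note that the ID is recorded only after the action succeeds, so a failed attempt can still be retried.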
## Monitoring Failed Triggers
Error handling is only as good as your visibility into what's failing.
- **Error rate dashboard**: Track the percentage of triggers that succeed on first attempt, succeed after retry, and end up in the DLQ. A healthy system has under 1% DLQ rate.
- **Error pattern alerts**: ClawJolt groups similar errors together. "17 triggers failed in the last hour with 'Slack API rate limit exceeded'" is more useful than 17 individual alerts.
- **Weekly DLQ review**: Schedule a recurring review of dead-lettered events. Most are one-off issues, but patterns in the DLQ reveal systemic problems you should fix.
Build your error handling before you need it. The first time a downstream service goes down for 2 hours and your triggers handle it gracefully — retrying, falling back, preserving events for reprocessing — is the moment trigger automation earns its keep.