
Errors and Retry

The webhook system includes a robust retry mechanism to ensure reliable delivery of notifications. This section describes how error and retry handling works, along with best practices to optimize your webhook endpoint for consistent delivery.

Error Responses and Retry Mechanism

When the platform sends a webhook notification, it expects an HTTP response from your endpoint. The response status code determines whether the notification counts as delivered or is scheduled for a retry:

  • 2xx Status Code: A 2xx response (e.g., 200 OK or 204 No Content) indicates successful delivery. No further retries will be attempted for that notification.
  • Non-2xx Status Code: Any other status code (e.g., 4xx or 5xx) will trigger the retry mechanism.
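The rule above can be written as a one-line check. This is only an illustrative sketch of the rule as described here, not the platform's actual implementation; the function name needs_retry is hypothetical.

```python
def needs_retry(status_code: int) -> bool:
    """Return True when a response status would trigger a retry.

    Follows the rule described above: any 2xx status counts as delivered,
    every other status leads to a retry.
    """
    return not (200 <= status_code <= 299)


if __name__ == "__main__":
    for code in (200, 204, 404, 500):
        print(code, "retry" if needs_retry(code) else "delivered")
```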

Retry Schedule

The retry mechanism employs an exponential backoff strategy, gradually increasing the interval between retries. Below is the approximate retry schedule:

  1. First Retry: 1 minute after the initial delivery attempt.
  2. Subsequent Retries: Each later retry waits twice as long as the previous one (e.g., 2 minutes, 4 minutes, 8 minutes, and so on), up to a maximum of 13 retries.

The retry process stops as soon as a 2xx response is received or the maximum retry limit is reached.

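Note: Because the delay doubles each time, later retries can arrive hours or even days after the original event. If you set an expiration threshold for event processing, make sure it is long enough to cover the retry schedule, or adjust your expiration policy to account for these delays.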
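To judge whether an expiration threshold covers the retry schedule, it can help to compute the schedule explicitly. The sketch below assumes the first retry counts as one of the 13 retries and that each delay is measured from the previous attempt; the schedule is described as approximate, so treat the result as an estimate. The function names are hypothetical.

```python
from datetime import timedelta

FIRST_RETRY_MINUTES = 1  # from the schedule above
MAX_RETRIES = 13         # from the schedule above


def retry_delays(max_retries: int = MAX_RETRIES) -> list[timedelta]:
    """Delay before each retry: 1, 2, 4, ... minutes, doubling each time."""
    return [
        timedelta(minutes=FIRST_RETRY_MINUTES * 2 ** i)
        for i in range(max_retries)
    ]


def total_retry_window(max_retries: int = MAX_RETRIES) -> timedelta:
    """Time from the initial attempt to the final retry (assumed model)."""
    return sum(retry_delays(max_retries), timedelta())


def expiration_covers_retries(expiration: timedelta) -> bool:
    """True if an expiration threshold outlasts the whole retry window."""
    return expiration >= total_retry_window()


if __name__ == "__main__":
    elapsed = timedelta()
    for n, delay in enumerate(retry_delays(), start=1):
        elapsed += delay
        print(f"retry {n:2d}: wait {delay}, elapsed {elapsed}")
```

Under these assumptions the delays add up to 8191 minutes, so the final retry would land roughly 5.7 days after the initial attempt.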

Common Causes of Errors and Solutions

Handling errors effectively can minimize failed deliveries and ensure reliable webhook processing. Below are common causes of errors and recommended solutions:

  1. Endpoint Unavailability: Ensure that your endpoint is consistently available and can handle incoming traffic. Using a load balancer or deploying across multiple servers can help reduce downtime.

  2. Network Issues: Occasionally, network interruptions can lead to failed deliveries. Consider monitoring your network infrastructure for latency or connectivity issues, and implement retry logic for outbound connections if you are forwarding notifications to other services.

  3. Timeouts: The platform allows a 60-second response window. If your endpoint takes longer than that to respond, the connection may time out.

    • Solution: Keep webhook processing lightweight. Offload heavy tasks to background jobs or asynchronous processing queues, and respond with a 2xx status code as soon as the payload is received.

  4. Client Errors (4xx Codes): These usually mean your endpoint rejected the request, for example because it expected parameters that were missing or malformed.

    • Solution: Validate incoming payloads and ensure your endpoint can parse all expected fields. For unexpected or new fields, use lenient JSON parsing that ignores unknown fields instead of rejecting the payload.

  5. Server Errors (5xx Codes): These errors often arise from server misconfiguration or temporary issues on your end.

    • Solution: Monitor logs and set up alerts to detect and quickly resolve 5xx errors in your application.
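The timeout and client-error advice above can be combined in one small pattern: parse the body leniently, hand the payload to a queue, and answer right away. The following is a minimal sketch using only the Python standard library. The function names and the choice of 204 and 400 are assumptions for illustration, not part of the platform's API. Note that under the rule described earlier, a 400 response would itself be retried.

```python
import json
import queue
from typing import Callable

# Payloads waiting for background processing (hypothetical in-memory queue;
# a production system would likely use a durable queue instead).
jobs: "queue.Queue[dict]" = queue.Queue()


def handle_webhook(body: bytes, job_queue: queue.Queue = jobs) -> int:
    """Return the HTTP status code to send back for one webhook request.

    Only cheap work happens here: parse, enqueue, acknowledge.
    """
    try:
        payload = json.loads(body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return 400  # body is not valid JSON
    if not isinstance(payload, dict):
        return 400  # this sketch assumes a JSON object at the top level
    # Unknown or new fields are kept as-is rather than rejected.
    job_queue.put(payload)
    return 204  # acknowledge immediately; heavy work happens later


def worker(job_queue: queue.Queue, process: Callable[[dict], None]) -> None:
    """Drain the queue forever; run this in a background thread."""
    while True:
        payload = job_queue.get()
        try:
            process(payload)
        finally:
            job_queue.task_done()
```

A web framework handler would call handle_webhook with the raw request body and return the resulting status code, while worker runs in a separate thread, for example threading.Thread(target=worker, args=(jobs, my_processor), daemon=True), where my_processor is your own function.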

Best Practices for Reliable Error Handling

To ensure smooth operation and minimize disruptions, follow these best practices:

  • Log All Webhook Requests: Maintain a log of incoming webhook requests, including response statuses and any error messages. This can help with troubleshooting and identifying patterns in delivery failures.

  • Implement Idempotent Processing: Retries can deliver the same notification more than once. Design your endpoint to handle duplicates gracefully by using the idempotency_key to detect and ignore notifications it has already processed.

  • Send Accurate Status Codes: Always return appropriate HTTP status codes to communicate the state of processing. Use:

    • 2xx for successful processing
    • 5xx if there is a server-side issue on your end

  • Set Up Monitoring and Alerts: Use monitoring tools to track your webhook endpoint's availability, error rates, and response times. Alerts for high failure rates or prolonged downtime allow you to address issues quickly.
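As an illustration of idempotent processing, the sketch below skips any payload whose idempotency_key has already been handled. It assumes the key is a top-level field of the parsed payload, which this section does not confirm, and it keeps seen keys in an in-memory set purely for demonstration; a real deployment would more likely use a database or cache with a uniqueness guarantee. The function name process_once is hypothetical.

```python
from typing import Callable


def process_once(
    payload: dict,
    seen: set,
    handler: Callable[[dict], None],
) -> bool:
    """Run handler for a payload unless its key was already processed.

    Returns True if the handler ran, False if the payload was a duplicate.
    Not thread-safe: guard with a lock or a database constraint if requests
    can be handled concurrently.
    """
    key = payload["idempotency_key"]  # assumed top-level field
    if key in seen:
        return False  # duplicate delivery caused by a retry; ignore it
    handler(payload)
    # Record the key only after the handler succeeds, so that a failed
    # attempt is processed again when the platform retries.
    seen.add(key)
    return True


if __name__ == "__main__":
    seen_keys = set()
    event = {"idempotency_key": "evt-1", "type": "demo"}
    print(process_once(event, seen_keys, print))  # handler runs
    print(process_once(event, seen_keys, print))  # duplicate is skipped
```

Whether a duplicate was skipped or processed, the endpoint can still answer with a 2xx status so that the platform stops retrying.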

Handling Retries with Exponential Backoff

The exponential backoff strategy helps reduce load during high-error periods, spacing out retry attempts to allow time for issue resolution. However, consider the following adjustments if your system experiences consistent failures:

  • Adjust Expiration Thresholds: Ensure your system's expiration policy is compatible with the maximum retry duration.
  • Optimize for Scalability: Design your endpoint to handle surges in retries during backoff periods, such as by using serverless functions or auto-scaling.

By implementing these error handling strategies and optimizing your endpoint for reliability, you can reduce missed notifications and maintain seamless integration with the Hub platform's webhook system.