Error Handling & Retries
Production integrations fail. Networks drop, services restart, rate limits are hit, and payloads occasionally have bad data. This guide covers every error code you'll encounter, when to retry vs. when to give up, and the retry infrastructure that makes your integration resilient.
Error response structure
All API errors use the same envelope:
```json
{
  "success": false,
  "error": {
    "code": "CDP_ETL.VALIDATION.REQUEST_INVALID",
    "message": "records[2].email: invalid email format",
    "details": {
      "path": "records[2].email",
      "value": "not-an-email"
    }
  }
}
```

The code field is the machine-readable identifier. message is human-readable. details varies by error type — always include both in your logs.
Every response also includes an x-correlation-id header. Log it. If you need to contact support, this ID lets them trace the request end-to-end.
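Taken together, a small helper can pull all three error fields plus the correlation ID into one log-ready dict. This is just a sketch (parse_error and its return keys are not part of any client library); it takes the parsed JSON body and the response headers:

```python
def parse_error(body: dict, headers: dict) -> dict:
    """Extract the fields worth logging from an error response.

    body is the parsed JSON envelope; headers are the response headers.
    """
    err = body.get("error", {})
    return {
        "code": err.get("code"),                              # machine-readable
        "message": err.get("message"),                        # human-readable
        "details": err.get("details", {}),                    # varies by error type
        "correlation_id": headers.get("x-correlation-id"),    # for support tickets
    }
```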
Error code reference
4xx — Client errors
| HTTP | Code | Meaning | Retryable? |
|---|---|---|---|
| 400 | CDP_ETL.VALIDATION.REQUEST_INVALID | Malformed JSON, missing required field, or field value failed type/format check | No — fix the data |
| 401 | CDP_ETL.AUTH.UNAUTHORIZED | Token missing, expired, or revoked | No — fix the token |
| 401 | CDP_ETL.AUTH.TOKEN_EXPIRED | Token has expired | No — refresh the token |
| 401 | CDP_ETL.AUTH.TOKEN_INVALID | Token is malformed or unrecognized | No — check the token |
| 403 | CDP_ETL.AUTH.FORBIDDEN | Token valid but lacks the required scope | No — add the scope |
| 404 | CDP_ETL.NOT_FOUND | Resource (list, import job, audience) doesn't exist | No |
| 409 | CDP_ETL.* (conflict) | Concurrent write to the same key raced | Yes — retry with backoff |
| 413 | CDP_ETL.VALIDATION.REQUEST_INVALID | Request body > 10 MB | No — reduce payload |
| 422 | CDP_ETL.VALIDATION.REQUEST_SCHEMA | Field name doesn't exist in the object schema | No — fix field names |
| 429 | CDP_ETL.* (rate limited) | Token quota exceeded | Yes — respect Retry-After |
5xx — Server errors
| HTTP | Code | Meaning | Retryable? |
|---|---|---|---|
| 500 | CDP_ETL.INTERNAL.UNHANDLED_EXCEPTION | Unexpected server error | Yes |
| 502 | CDP_ETL.* | Upstream service unavailable | Yes |
| 503 | CDP_ETL.* | Planned maintenance or overload | Yes |
| 504 | CDP_ETL.* | Request took > 30 s | Yes |
Do not retry 4xx errors except 429 and 409. The data is bad or the credentials are wrong — retrying won't fix it. Log the error and alert.
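The retryability column of the two tables collapses to a single predicate. A minimal sketch (the same RETRYABLE_STATUS set is reused by the retry helper in the next section):

```python
# Transient statuses from the tables above: conflict, rate limit, and all 5xx.
RETRYABLE_STATUS = {409, 429, 500, 502, 503, 504}

def is_retryable(status_code: int) -> bool:
    """True only for errors where a retry can plausibly succeed."""
    return status_code in RETRYABLE_STATUS
```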
Retry strategy
Use exponential backoff with jitter. The jitter prevents thundering-herd retries when many clients fail simultaneously.
```python
import logging
import os
import random
import time

import requests

logger = logging.getLogger(__name__)

API_KEY = os.environ["EXPERITURE_API_KEY"]
BASE_URL = "https://api.experiture.ai/public/v1"
RETRYABLE_STATUS = {409, 429, 500, 502, 503, 504}

def with_retry(
    fn,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    for attempt in range(max_attempts):
        resp = fn()
        if resp.status_code < 400:
            return resp.json()
        if resp.status_code not in RETRYABLE_STATUS:
            resp.raise_for_status()  # Non-retryable — propagate immediately
        if attempt == max_attempts - 1:
            resp.raise_for_status()  # Exhausted attempts
        # Respect Retry-After for rate limits
        retry_after = float(resp.headers.get("Retry-After", 0))
        backoff = min(max_delay, base_delay * (2 ** attempt) + random.uniform(0, 1))
        sleep_for = max(retry_after, backoff)
        logger.warning(
            "API error %s (attempt %d/%d), retrying in %.1fs",
            resp.status_code, attempt + 1, max_attempts, sleep_for,
        )
        time.sleep(sleep_for)
```

Usage:
```python
def do_upsert():
    # record and idempotency_key come from your pipeline; see the next section
    return requests.post(
        f"{BASE_URL}/records/profiles/upsert",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
        json={"record": record, "matchKey": "email"},
    )

result = with_retry(do_upsert)
```

Idempotency: the foundation of safe retries
Without idempotency keys, a retry on an append operation creates a duplicate row. With them, re-sending the same request returns the cached original response — no side effects.
Always pass an Idempotency-Key on writes. Generate a key tied to the logical event, not the HTTP attempt:
```python
# CORRECT — same key on every retry of this event
key = event["event_id"]  # from Stripe, Shopify, Segment, etc.

# WRONG — generates a new key on each attempt, defeating idempotency
key = str(uuid.uuid4())  # called inside the retry loop
```

For events where you don't have a natural upstream ID, derive a stable key from the content:
```python
import hashlib, json, uuid

def stable_key(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=True)
    return str(uuid.UUID(hashlib.md5(canonical.encode()).hexdigest()))
```

The same Idempotency-Key + body combination returns the cached response for 24 hours.
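One property worth sanity-checking: because json.dumps sorts keys before hashing, the derived key must not depend on dict ordering. A quick self-contained check (repeating the helper so it runs on its own):

```python
import hashlib, json, uuid

def stable_key(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=True)
    return str(uuid.UUID(hashlib.md5(canonical.encode()).hexdigest()))

# Key order in the source dict must not change the derived key
a = stable_key({"email": "a@b.com", "plan": "pro"})
b = stable_key({"plan": "pro", "email": "a@b.com"})
assert a == b
```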
Rate limit handling
Rate limit responses include a Retry-After header telling you exactly how long to wait:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 2.5
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1745255432
```

Always check Retry-After before calculating your own backoff:
```python
def handle_rate_limit(response: requests.Response, attempt: int) -> None:
    retry_after = float(response.headers.get("Retry-After", 0))
    backoff = min(60, (2 ** attempt) + random.random())
    sleep_for = max(retry_after, backoff)
    time.sleep(sleep_for)
```

Also monitor X-RateLimit-Remaining on every response — don't wait for a 429 to discover you're near the limit:
```python
remaining = int(response.headers.get("X-RateLimit-Remaining", 9999))
if remaining < 10:
    logger.warning("Rate limit nearly exhausted: %d remaining", remaining)
    time.sleep(0.5)  # Voluntary backpressure
```

Handling CDP_ETL.VALIDATION.REQUEST_SCHEMA in production
422 CDP_ETL.VALIDATION.REQUEST_SCHEMA means you sent a field that doesn't exist in the object schema. This is a configuration error, not a data error. Handle it differently from other failures:
```python
resp = requests.post(
    f"{BASE_URL}/records/profiles/upsert",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={"record": record, "matchKey": "email"},
)
if resp.status_code == 422:
    error = resp.json().get("error", {})
    if error.get("code") == "CDP_ETL.VALIDATION.REQUEST_SCHEMA":
        # Don't retry — alert the team and drop to DLQ
        # (alerting and dlq are your application's own modules)
        alerting.fire(
            title="CDP schema mismatch",
            message=f"Field not found: {error.get('message')}",
            severity="warning",
        )
        dlq.publish({
            "record": record,
            "error": error.get("message"),
            "type": "CDP_ETL.VALIDATION.REQUEST_SCHEMA",
        })
    else:
        resp.raise_for_status()
elif not resp.ok:
    resp.raise_for_status()
```

The schema mismatch DLQ lets you replay records after you add the missing field to the schema, without losing data.
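A replay job for that DLQ can be sketched as below. The dlq.consume / dlq.ack interface and the injected send_to_cdp callable are hypothetical stand-ins for whatever queue and client you actually use:

```python
def replay_schema_failures(dlq, send_to_cdp):
    """Re-send DLQ'd records once the missing field has been added to the schema.

    dlq and send_to_cdp are injected so this works with any queue/client.
    Returns (replayed, still_failing) counts.
    """
    replayed, still_failing = 0, 0
    for msg in dlq.consume():
        resp = send_to_cdp(msg["record"])
        if resp.ok:
            dlq.ack(msg)        # remove from the queue on success
            replayed += 1
        else:
            still_failing += 1  # leave on the queue for the next pass
    return replayed, still_failing
```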
Dead-letter queue pattern
Not every failure should crash your service. Route non-retryable errors to a DLQ for later investigation and replay:
```python
from datetime import datetime, timezone
from enum import Enum

class FailureType(Enum):
    TRANSIENT = "transient"  # Retry
    SCHEMA = "schema"        # Fix schema, then replay
    BAD_DATA = "bad_data"    # Fix source data
    AUTH = "auth"            # Fix credentials

def classify_error(status_code: int, error_code: str) -> FailureType:
    if status_code in RETRYABLE_STATUS:
        return FailureType.TRANSIENT
    if error_code in ("CDP_ETL.VALIDATION.REQUEST_SCHEMA",):
        return FailureType.SCHEMA
    if error_code in ("CDP_ETL.VALIDATION.REQUEST_INVALID",):
        return FailureType.BAD_DATA
    if status_code in (401, 403):
        return FailureType.AUTH
    return FailureType.BAD_DATA

def write_with_dlq(record: dict, idempotency_key: str):
    def do_upsert():
        return requests.post(
            f"{BASE_URL}/records/profiles/upsert",
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
                "Idempotency-Key": idempotency_key,
            },
            json={"record": record, "matchKey": "email"},
        )

    try:
        return with_retry(do_upsert)
    except requests.HTTPError as exc:
        # with_retry raises on non-retryable errors and on exhausted retries;
        # raise_for_status() attaches the response to the exception
        resp = exc.response
        error = resp.json().get("error", {})
        failure_type = classify_error(resp.status_code, error.get("code", ""))
        if failure_type == FailureType.TRANSIENT:
            raise  # Retries exhausted; let the caller decide
        dlq.publish({
            "record": record,
            "idempotency_key": idempotency_key,
            "error_code": error.get("code"),
            "error_message": error.get("message"),
            "failure_type": failure_type.value,
            "failed_at": datetime.now(timezone.utc).isoformat(),
        })
```

Webhook integration: returning the right HTTP status
When your webhook handler catches CDP errors, what you return to the webhook provider controls whether it retries:
```python
from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()

@app.post("/webhooks/stripe")
async def handle_stripe(request: Request, stripe_signature: str = Header(None)):
    # ... verify signature, parse event, build record ...
    resp = requests.post(
        f"{BASE_URL}/records/profiles/upsert",
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json={"record": record, "matchKey": "email"},
    )
    error = resp.json().get("error", {}) if not resp.ok else {}
    failure_type = classify_error(resp.status_code, error.get("code", "")) if not resp.ok else None
    if failure_type == FailureType.TRANSIENT:
        # Tell Stripe to retry
        raise HTTPException(status_code=503, detail="upstream_unavailable")
    if failure_type == FailureType.SCHEMA:
        # Bad data — don't retry, but log loudly
        logger.error("Schema mismatch on event %s: %s", event["id"], error.get("message"))
        dlq.publish({"event": event, "error": error.get("message")})
        return {"ok": True}  # Return 200 to avoid infinite retry
    if failure_type == FailureType.BAD_DATA:
        # Also return 200 — bad data won't improve on retry
        logger.warning("Validation error on event %s: %s", event["id"], error.get("message"))
        return {"ok": True}
    if failure_type == FailureType.AUTH:
        # This is your problem, not Stripe's — alert, return 200
        alerting.fire("CDP auth error", severity="critical")
        return {"ok": True}
    return {"ok": True}
```

Logging and observability
Always log enough context to debug a failure without retrying the original request:
```python
import structlog

log = structlog.get_logger()

def write_profile(record: dict, event_id: str):
    resp = requests.post(
        f"{BASE_URL}/records/profiles/upsert",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            "Idempotency-Key": event_id,
        },
        json={"record": record, "matchKey": "email"},
    )
    if resp.ok:
        result = resp.json()
        log.info("profile.written",
            email=record.get("email"),
            event_id=event_id,
            operation=result.get("data", {}).get("operation"),
        )
        return result
    else:
        error = resp.json().get("error", {})
        log.error("profile.write_failed",
            email=record.get("email"),
            event_id=event_id,
            status_code=resp.status_code,
            error_code=error.get("code"),
            error_message=error.get("message"),
            correlation_id=resp.headers.get("x-correlation-id"),  # Include this in support tickets
        )
        resp.raise_for_status()
```

Quick reference: what to do with each error
| Error | Action |
|---|---|
| 429 (rate limited) | Sleep for Retry-After, then retry with same idempotency key |
| 409 (conflict) | Retry with exponential backoff; idempotency ensures correct result |
| 5xx | Retry with exponential backoff |
| 422 CDP_ETL.VALIDATION.REQUEST_SCHEMA | Log + alert + DLQ; fix schema and replay |
| 400 CDP_ETL.VALIDATION.REQUEST_INVALID | Log + DLQ; fix source data and replay |
| 401 CDP_ETL.AUTH.UNAUTHORIZED / CDP_ETL.AUTH.TOKEN_EXPIRED | Alert immediately; rotate/fix credentials |
| 403 CDP_ETL.AUTH.FORBIDDEN | Add required scope to token; do not retry |
| 413 | Reduce batch size; re-send |
| 404 CDP_ETL.NOT_FOUND | Verify resource ID; do not retry |
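If you want one place in code to consult instead of the table, it collapses to a small dispatcher. The action names here are illustrative labels, not part of the API:

```python
# Hypothetical action labels encoding the quick-reference table above
ACTIONS = {
    429: "retry_after_header",
    409: "retry_backoff",
    413: "reduce_batch",
    404: "verify_resource",
    403: "fix_scope",
    401: "fix_credentials",
}

def action_for(status_code: int, error_code: str = "") -> str:
    if status_code >= 500:
        return "retry_backoff"
    if error_code == "CDP_ETL.VALIDATION.REQUEST_SCHEMA":
        return "dlq_and_fix_schema"
    if error_code == "CDP_ETL.VALIDATION.REQUEST_INVALID":
        return "dlq_and_fix_data"
    return ACTIONS.get(status_code, "dlq_and_fix_data")
```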
See Also
- Real-time Record Writes — idempotency and retry in the single-write context
- Batch Record Writes — per-record error handling in batch responses
- Webhook Handler Integration — HTTP status codes for webhook providers
- Rate Limits API reference