Identity Resolution
Identity resolution is the process of deciding which profile row a new write should merge into. Getting this right is the difference between a clean CDP and a proliferating mess of duplicates.
How matchKey works
Every upsert operation requires (or defaults to) a matchKey — the field the API uses to find an existing row.
```text
Incoming record: { email: "jane@example.com", tier: "gold" }
matchKey: "email"

→ Query: SELECT id FROM profiles WHERE email = 'jane@example.com'
→ Found: merge tier = "gold" into row #1042
→ Not found: INSERT new row
```

The matchKey field must exist in the object schema and must be indexed as an identity key. You can confirm which fields are identity keys:
```bash
curl https://api.experiture.ai/public/v1/metadata/objects/profiles \
  -H "Authorization: Bearer <your_access_token>" \
  | jq '.data.identityKeys'
```
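As a defensive measure, an integration can assert at startup that its chosen matchKey really is an identity key. This is a sketch, not part of any SDK — the response shape is assumed to match the jq path above:

```python
import os
import requests

API_KEY = os.environ["EXPERITURE_API_KEY"]
BASE_URL = "https://api.experiture.ai/public/v1"

def assert_identity_key(object_name: str, match_key: str) -> None:
    # Fetch the object's metadata and check its identityKeys list
    resp = requests.get(
        f"{BASE_URL}/metadata/objects/{object_name}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    identity_keys = resp.json()["data"]["identityKeys"]
    if match_key not in identity_keys:
        raise ValueError(
            f"{match_key!r} is not an identity key on {object_name!r}; "
            f"valid keys: {identity_keys}"
        )

assert_identity_key("profiles", "email")  # fail fast, before any writes
```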
Choosing the right matchKey
Decision table
| Key | Best for | Watch out for |
|---|---|---|
| `email` | B2C signups, newsletter lists, most webhook sources | Customers change emails; format inconsistency creates dupes |
| `customer_id` | Any authenticated flow with a stable internal ID | Must be present on every write — missing it causes inserts instead of merges |
| `phone` | SMS flows, phone-first signup | Format inconsistency is rampant; normalize to E.164 before writing |
| `external_id` | CRM/ERP sync where you control the key space | Best option when you own the source-of-truth ID |
| `anonymous_id` | Pre-authentication event attribution | Short-lived; must be linked to a durable key at login |
Default key
If you omit matchKey on an upsert, the API uses the object's configured primary key. For profiles, this is typically email unless your workspace is configured differently. Explicit is safer — always pass matchKey.
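To make the difference concrete, here are the two request-body shapes (the record/matchKey field names follow the upsert calls shown later in this guide):

```python
# Explicit matchKey (recommended): the resolution key is visible at the call site
payload = {
    "record": {"email": "jane@example.com", "tier": "gold"},
    "matchKey": "email",
}

# Implicit: the API falls back to the object's configured primary key,
# which can differ between workspaces — harder to audit after the fact
payload_implicit = {
    "record": {"email": "jane@example.com", "tier": "gold"},
}
```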
Progressive identity: linking identifiers over time
Real users don't arrive with a full profile. They accumulate identifiers over multiple touchpoints:
```text
Session 1 (anonymous):  anonymous_id = "anon_abc"        → row A
Session 2 (signup):     email = "jane@co.com"            → row B (new)
Post-login link:        email + anonymous_id on same row → merge A into B
Purchase:               email + customer_id = "cust_99"  → row B enriched
SMS opt-in:             email + phone = "+12125551234"   → row B enriched
```

To link identifiers, write a single record containing both the old and new identifier, using the identifier you're confident about as the matchKey:
```python
import os
import requests

API_KEY = os.environ["EXPERITURE_API_KEY"]
BASE_URL = "https://api.experiture.ai/public/v1"

def upsert_profile(record: dict, match_key: str, idempotency_key: str):
    """Upsert a profile record, resolving the target row via match_key."""
    requests.post(
        f"{BASE_URL}/records/profiles/upsert",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
        json={"record": record, "matchKey": match_key},
    ).raise_for_status()

# At login — link the anonymous session to the authenticated user
upsert_profile(
    record={
        "email": session["authenticated_email"],
        "anonymous_id": session["anonymous_id"],
        "last_login_at": session["timestamp"],
    },
    match_key="email",
    idempotency_key=f"login:{session['id']}",
)
```

After this write, `anonymous_id = "anon_abc"` is stored on the same row as `email = "jane@co.com"`. Future writes using either identifier resolve to the same profile.
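For instance, a later pre-login event can now reach the same row through the anonymous identifier — assuming `anonymous_id` is configured as an identity key in your workspace (the `last_seen_at` field is illustrative):

```python
# A subsequent anonymous pageview still enriches Jane's unified profile
upsert_profile(
    record={"anonymous_id": "anon_abc", "last_seen_at": "2024-05-01T12:00:00Z"},
    match_key="anonymous_id",
    idempotency_key="pageview:evt_456",
)
```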
Identity collision: when two rows share an identifier
Collisions happen when a write tries to associate an identifier that's already owned by a different row. Example:
```text
Row #100: email = "jane@co.com", customer_id = "cust_99"
Row #200: email = "bob@co.com"

Write:    { email: "bob@co.com", customer_id: "cust_99" }
matchKey: "email"
```

Row #200 would now hold `customer_id = "cust_99"` — but that ID is already on row #100. The CDP records both, creating an ambiguous identity graph.
Prevention strategies:

- Write `customer_id` before email — if `customer_id` is the more authoritative key, use it as `matchKey` to let the CDP find (or create) the right row first.
- Validate upstream — before writing, check that the `customer_id` you're about to write isn't already associated with a different email in your own system (see the sketch after this list).
- Use `external_id` for CRM sync — when your CRM is the system of record, use the CRM's record ID as `external_id`. This is a key you control and can guarantee is unique per person.
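A minimal sketch of that upstream check — `lookup_email_by_customer_id` and `log_identity_conflict` are hypothetical stand-ins for queries against your own system of record, not Experiture APIs:

```python
def safe_to_write(customer_id: str, email: str) -> bool:
    # Check our own database: is this customer_id already bound to a different email?
    existing_email = lookup_email_by_customer_id(customer_id)  # hypothetical helper
    return existing_email is None or existing_email.strip().lower() == email.strip().lower()

if safe_to_write(event["customer_id"], event["email"]):
    upsert_profile(
        record={"email": event["email"], "customer_id": event["customer_id"]},
        match_key="customer_id",  # the more authoritative key resolves the row
        idempotency_key=f"order:{event['id']}",
    )
else:
    # Don't write an ambiguous identity — route it to a review queue instead
    log_identity_conflict(event)  # hypothetical
```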
Handling email changes
When a customer changes their email, you have two options:
Option 1: Write the new email using customer_id as matchKey
```python
upsert_profile(
    record={
        "customer_id": event["customer_id"],
        "email": event["new_email"],
    },
    match_key="customer_id",
    idempotency_key=f"email_change:{event['id']}",
)
```

This finds the existing row by the stable `customer_id` and updates the email field — no duplicate created.
Option 2: Null out the old email explicitly
If you're using email as matchKey everywhere and don't have customer_id, you'd need to write both the old email (to find the row) and the new email (to update it) — but a record has only one email field:
```python
upsert_profile(
    record={
        "email": event["old_email"],  # matchKey resolves the row by the old address
        # "email" can only appear once in the record, so the new address can't be
        # written in the same call — this forces a two-step (or custom-field) approach
    },
    match_key="email",
    idempotency_key=f"email_change:{event['id']}",
)
```

This is awkward with email as both matchKey and the field being changed — a strong argument for keeping a stable customer_id in your schema from day one.
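One way out of that corner, sketched below, is to bootstrap a stable key first and then pivot to it. This builds only on the upsert mechanics above, not a dedicated API feature; it assumes `external_id` is configured as an identity key, and `event["crm_id"]` is illustrative:

```python
# Step 1 — resolve the row by its old email and stamp a stable key onto it
upsert_profile(
    record={"email": event["old_email"], "external_id": event["crm_id"]},
    match_key="email",
    idempotency_key=f"email_change_step1:{event['id']}",
)

# Step 2 — resolve by the stable key, so the email field itself is free to change
upsert_profile(
    record={"external_id": event["crm_id"], "email": event["new_email"]},
    match_key="external_id",
    idempotency_key=f"email_change_step2:{event['id']}",
)
```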
Normalizing identifiers before writing
Inconsistent formatting is the most common cause of duplicate profiles. Apply normalization in your integration layer, before any write to the CDP.
```python
import phonenumbers

def normalize_email(raw: str) -> str:
    return raw.strip().lower()

def normalize_phone(raw: str, default_region: str = "US") -> str | None:
    try:
        parsed = phonenumbers.parse(raw, default_region)
        if not phonenumbers.is_valid_number(parsed):
            return None
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        return None

def build_profile_record(event: dict) -> dict:
    record = {}
    if email := event.get("email"):
        record["email"] = normalize_email(email)
    if phone := event.get("phone"):
        normalized = normalize_phone(phone)
        if normalized:
            record["phone"] = normalized
        # If normalization fails, omit the field — don't write malformed data
    for field in ("first_name", "last_name", "customer_id", "external_id"):
        if value := event.get(field):
            record[field] = value
    return record
```
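For example, a raw webhook payload (values illustrative) passes through the builder before the upsert:

```python
raw_event = {
    "email": "  Jane@Example.COM ",
    "phone": "(212) 555-1234",
    "first_name": "Jane",
}

record = build_profile_record(raw_event)
# → {"email": "jane@example.com", "phone": "+12125551234", "first_name": "Jane"}

upsert_profile(record, match_key="email", idempotency_key="webhook:evt_abc123")
```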
Key normalization rules:

| Identifier | Rule |
|---|---|
| Email | `strip().lower()` — always |
| Phone | Normalize to E.164 (`+12125551234`) — reject on failure |
| Names | `strip()` only — don't lowercase; "McDonald" matters |
| External IDs | `strip()` — preserve casing if your source system is case-sensitive |
Checking for duplicates
If you suspect duplicate profiles have accumulated, use the Audience API to size segments that should be mutually exclusive:
```bash
# Count profiles with both email AND customer_id set
# (if your dedup is working, this should be nearly everyone)
curl -X POST https://api.experiture.ai/public/v1/audiences/aud_dedup_check/preview \
  -H "Authorization: Bearer <your_access_token>" \
  -H "Content-Type: application/json" \
  -d '{ "limit": 0 }'
```

The more direct path is a SQL query against your warehouse, if the CDP's Parquet export is available.
Merge semantics at the field level
Within a single profile row, each upsert applies field-level last-write-wins:
- A field written at T=10:01 and again at T=10:03 takes the T=10:03 value.
- A field written at T=10:01 and not included in the T=10:03 write keeps the T=10:01 value.
- A field written as `null` at T=10:03 is cleared, regardless of what it held before.
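A worked sketch of these three rules, reusing the upsert_profile helper from above (values illustrative):

```python
# T=10:01 — first write establishes the row
upsert_profile(
    record={"email": "jane@co.com", "tier": "gold",
            "phone": "+12125551234", "first_name": "Jane"},
    match_key="email",
    idempotency_key="demo:write-1",
)

# T=10:03 — second write: re-sends tier, omits phone, explicitly nulls first_name
upsert_profile(
    record={"email": "jane@co.com", "tier": "platinum", "first_name": None},
    match_key="email",
    idempotency_key="demo:write-2",
)

# Resulting row:
#   tier       = "platinum"      — last write wins
#   phone      = "+12125551234"  — absent fields keep their prior value
#   first_name = cleared         — an explicit null always clears
```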
There is no automatic conflict resolution based on timestamps — the CDP accepts writes in the order they arrive. For out-of-order scenarios (like replaying historical events), maintain your own *_updated_at timestamp per logical field group and validate in your application layer before writing.
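A sketch of that application-layer guard — the `kv` store and the field-group name are hypothetical; any durable store your integration already has will do:

```python
def write_if_newer(customer_id: str, record: dict, event_ts: str, event_id: str):
    # Track the newest timestamp we've written for this field group ourselves,
    # since the CDP applies writes strictly in arrival order.
    # event_ts is assumed to be an ISO-8601 UTC string, so string comparison
    # is chronological.
    key = f"profile:{customer_id}:contact_updated_at"
    last_sent = kv.get(key)  # hypothetical key-value store
    if last_sent is not None and event_ts <= last_sent:
        return  # stale replay — skip rather than clobber newer data
    upsert_profile(
        record=record,
        match_key="customer_id",
        idempotency_key=f"replay:{event_id}",
    )
    kv.set(key, event_ts)
```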
See Also
- Building a Unified Profile — how to accumulate profile data across sources
- Real-time Record Writes — upsert mechanics, null vs. absent semantics
- Bulk File Import — for importing large historical sets with matchKey-based dedup
- Metadata API — confirm identity keys for your workspace