
Identity Resolution

Identity resolution is the process of deciding which profile row a new write should merge into. Getting this right is the difference between a clean CDP and a proliferating mess of duplicates.


How matchKey works

Every upsert operation requires (or defaults to) a matchKey — the field the API uses to find an existing row.

Incoming record: { email: "jane@example.com", tier: "gold" }
matchKey: "email"

→ Query: SELECT id FROM profiles WHERE email = 'jane@example.com'
→ Found:  merge tier = "gold" into row #1042
→ Not found: INSERT new row
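The resolve-or-insert step above can be sketched in a few lines of Python. This is illustrative only — the real API does this lookup server-side; `profiles` is a stand-in for the object's table:

```python
# In-memory stand-in for the profiles object; the API does this internally.
profiles = {1042: {"email": "jane@example.com", "tier": "silver"}}

def upsert(record: dict, match_key: str) -> int:
    """Find a row whose match_key field equals the incoming value, else insert."""
    value = record[match_key]
    for row_id, row in profiles.items():
        if row.get(match_key) == value:
            row.update(record)          # found: merge, incoming fields win
            return row_id
    new_id = max(profiles, default=0) + 1
    profiles[new_id] = dict(record)     # not found: insert a new row
    return new_id

row_id = upsert({"email": "jane@example.com", "tier": "gold"}, "email")
# → merges into row 1042 rather than creating a duplicate
```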

The matchKey field must exist in the object schema and must be indexed as an identity key. You can confirm which fields are identity keys:

curl https://api.experiture.ai/public/v1/metadata/objects/profiles \
  -H "Authorization: Bearer <your_access_token>" \
  | jq '.data.identityKeys'

Choosing the right matchKey

Decision table

| Key | Best for | Watch out for |
| --- | --- | --- |
| email | B2C signups, newsletter lists, most webhook sources | Customers change emails; format inconsistency creates dupes |
| customer_id | Any authenticated flow with a stable internal ID | Must be present on every write — missing it causes inserts instead of merges |
| phone | SMS flows, phone-first signup | Format inconsistency is rampant; normalize to E.164 before writing |
| external_id | CRM/ERP sync where you control the key space | Best option when you own the source-of-truth ID |
| anonymous_id | Pre-authentication event attribution | Short-lived; must be linked to a durable key at login |
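One way to apply the table in code is a small precedence helper that picks the strongest key present on each record. The ordering below is an assumption drawn from the table — adjust it for your workspace:

```python
# Hypothetical helper: choose the most authoritative matchKey available.
# Precedence is an assumption (owned IDs first, ephemeral IDs last).
KEY_PRECEDENCE = ["external_id", "customer_id", "email", "phone", "anonymous_id"]

def choose_match_key(record: dict) -> str:
    for key in KEY_PRECEDENCE:
        if record.get(key):
            return key
    raise ValueError("record has no known identity key")

choose_match_key({"email": "jane@co.com", "customer_id": "cust_99"})
# → "customer_id" outranks "email"
```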

Default key

If you omit matchKey on an upsert, the API uses the object's configured primary key. For profiles, this is typically email unless your workspace is configured differently. Explicit is safer — always pass matchKey.
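For reference, an explicit upsert payload looks like this — matchKey spelled out rather than left to the workspace default (field names as used elsewhere in this guide):

```python
import json

# Request body for POST /records/profiles/upsert with an explicit matchKey.
payload = {
    "record": {"email": "jane@example.com", "tier": "gold"},
    "matchKey": "email",
}
body = json.dumps(payload)
```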


Progressive identity: linking identifiers over time

Real users don't arrive with a full profile. They accumulate identifiers over multiple touchpoints:

Session 1 (anonymous):  anonymous_id = "anon_abc"       → row A
Session 2 (signup):     email = "jane@co.com"           → row B (new)
Post-login link:        email + anonymous_id on same row → merge A into B
Purchase:               email + customer_id = "cust_99"  → row B enriched
SMS opt-in:             email + phone = "+12125551234"   → row B enriched

To link identifiers, write a single record containing both the old and new identifier, using the identifier you're confident about as the matchKey:

import os, requests
 
API_KEY = os.environ["EXPERITURE_API_KEY"]
BASE_URL = "https://api.experiture.ai/public/v1"
 
def upsert_profile(record: dict, match_key: str, idempotency_key: str):
    requests.post(
        f"{BASE_URL}/records/profiles/upsert",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
        json={"record": record, "matchKey": match_key},
    ).raise_for_status()
 
# At login — link anonymous session to authenticated user
upsert_profile(
    record={
        "email": session["authenticated_email"],
        "anonymous_id": session["anonymous_id"],
        "last_login_at": session["timestamp"],
    },
    match_key="email",
    idempotency_key=f"login:{session['id']}",
)

After this write, anonymous_id = "anon_abc" is stored on the same row as email = "jane@co.com". Future writes using either identifier resolve to the same profile.


Identity collision: when two rows share an identifier

Collisions happen when a write tries to associate an identifier that's already owned by a different row. Example:

Row #100: email = "jane@co.com", customer_id = "cust_99"
Row #200: email = "bob@co.com"

Write:    { email: "bob@co.com", customer_id: "cust_99" }
matchKey: "email"

Row #200 would now hold customer_id = "cust_99" — but that ID is already on row #100. The CDP records both, creating an ambiguous identity graph.

Prevention strategies:

  1. Write customer_id before email — if customer_id is the more authoritative key, use it as matchKey to let the CDP find (or create) the right row first.

  2. Validate upstream — before writing, check that the customer_id you're about to write isn't already associated with a different email in your own system.

  3. Use external_id for CRM sync — when your CRM is the system of record, use the CRM's record ID as external_id. This is a key you control and can guarantee is unique per person.
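Strategy 2 can be sketched with a local index of identifier links you already know about. The dictionary below is a stand-in for a lookup in your own system of record, not an Experiture API call:

```python
# Simulated upstream index: customer_id -> email links your system knows about.
known_links = {"cust_99": "jane@co.com"}

def would_collide(record: dict) -> bool:
    """True if this write would attach a customer_id that is already
    linked to a different email."""
    cid, email = record.get("customer_id"), record.get("email")
    if not (cid and email):
        return False
    existing = known_links.get(cid)
    return existing is not None and existing != email

would_collide({"email": "bob@co.com", "customer_id": "cust_99"})   # → True, block this write
would_collide({"email": "jane@co.com", "customer_id": "cust_99"})  # → False, safe
```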


Handling email changes

When a customer changes their email, you have two options:

Option 1: Write the new email using customer_id as matchKey

upsert_profile(
    record={
        "customer_id": event["customer_id"],
        "email": event["new_email"],
    },
    match_key="customer_id",
    idempotency_key=f"email_change:{event['id']}",
)

This finds the existing row by the stable customer_id and updates the email field — no duplicate created.

Option 2: Match on the old email (limited)

If email is your only matchKey and you have no stable customer_id, there is no clean single-call update: the record's email field is both the lookup value and the new value, so it can only carry one of them.

upsert_profile(
    record={
        "email": event["old_email"],   # resolves the existing row...
        # ...but there is nowhere to put the new email — "email" can only
        # appear once in the record. In practice this forces a two-step
        # write or a custom field to stage the new address.
    },
    match_key="email",
    idempotency_key=f"email_change:{event['id']}",
)

This dead end is a strong argument for keeping a stable customer_id in your schema from day one.


Normalizing identifiers before writing

Inconsistent formatting is the most common cause of duplicate profiles. Apply normalization in your integration layer, before any write to the CDP.

import re, phonenumbers
 
def normalize_email(raw: str) -> str:
    return raw.strip().lower()
 
def normalize_phone(raw: str, default_region: str = "US") -> str | None:
    try:
        parsed = phonenumbers.parse(raw, default_region)
        if not phonenumbers.is_valid_number(parsed):
            return None
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        return None
 
def build_profile_record(event: dict) -> dict:
    record = {}
 
    if email := event.get("email"):
        record["email"] = normalize_email(email)
 
    if phone := event.get("phone"):
        normalized = normalize_phone(phone)
        if normalized:
            record["phone"] = normalized
        # If normalization fails, omit the field — don't write malformed data
 
    for field in ("first_name", "last_name", "customer_id", "external_id"):
        if value := event.get(field):
            record[field] = value
 
    return record

Key normalization rules:

| Identifier | Rule |
| --- | --- |
| Email | strip().lower() — always |
| Phone | Normalize to E.164 (+12125551234) — reject on failure |
| Names | strip() only — don't lowercase; "McDonald" matters |
| External IDs | strip() — preserve casing if your source system is case-sensitive |

Checking for duplicates

If you suspect duplicate profiles have accumulated, use the Audience API to preview counts for audiences that expose identity gaps:

# Count profiles with both email AND customer_id set
# (if your dedup is working, this should be nearly everyone)
curl -X POST https://api.experiture.ai/public/v1/audiences/aud_dedup_check/preview \
  -H "Authorization: Bearer <your_access_token>" \
  -H "Content-Type: application/json" \
  -d '{ "limit": 0 }'

The more direct path is a SQL query against your warehouse if the CDP's Parquet export is available.
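As a sketch of that warehouse query — assuming the export lands as one row per profile — a duplicate scan groups on the normalized identifier. sqlite3 is used here purely to keep the example runnable; the same SQL applies to any warehouse reading the Parquet export:

```python
import sqlite3

# Stand-in for the warehouse table: one row per profile.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE profiles (id INTEGER, email TEXT, customer_id TEXT)")
con.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?)",
    [(1, "jane@co.com", "cust_99"),
     (2, "Jane@co.com ", None),       # un-normalized duplicate of row 1
     (3, "bob@co.com", "cust_7")],
)

# Group on the normalized email to surface duplicate clusters.
dupes = con.execute("""
    SELECT lower(trim(email)) AS norm_email, COUNT(*) AS n
    FROM profiles
    GROUP BY norm_email
    HAVING COUNT(*) > 1
""").fetchall()
# → [("jane@co.com", 2)]
```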


Merge semantics at the field level

Within a single profile row, each upsert applies field-level last-write-wins:

  • A field written at T=10:01 and again at T=10:03 takes the T=10:03 value.
  • A field written at T=10:01 and not included in the T=10:03 write keeps the T=10:01 value.
  • A field written as null at T=10:03 is cleared, regardless of what it held before.

There is no automatic conflict resolution based on timestamps — the CDP accepts writes in the order they arrive. For out-of-order scenarios (like replaying historical events), maintain your own *_updated_at timestamp per logical field group and validate in your application layer before writing.
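The three rules above can be expressed as a small merge function — a sketch of the semantics, not the CDP's actual implementation:

```python
def merge_fields(row: dict, incoming: dict) -> dict:
    """Field-level last-write-wins: present fields overwrite, absent
    fields are kept, explicit None clears."""
    merged = dict(row)
    for field, value in incoming.items():
        if value is None:
            merged.pop(field, None)   # null clears the field
        else:
            merged[field] = value     # latest write wins
    return merged

row = {"tier": "silver", "phone": "+12125551234"}
row = merge_fields(row, {"tier": "gold"})   # phone kept, tier updated
row = merge_fields(row, {"phone": None})    # phone cleared
# → {"tier": "gold"}
```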


See Also