Matching Algorithms

The SanctionsWise API uses a five-layer matching engine to achieve high accuracy while minimizing false positives.


Overview

The matching engine processes each query through five distinct layers, combining scores to produce a final confidence value:

Query Input
│
├─► Layer 1: Exact Match (normalized)
│
├─► Layer 2: Fuzzy Match (Jaro-Winkler + SequenceMatcher)
│
├─► Layer 3: Phonetic Match (Soundex + Double Metaphone)
│
├─► Layer 4: Semantic Match (Bedrock Titan + S3 Vectors)
│
└─► Layer 5: Identifier Match (passport, tax ID, etc.)

↓
Combined Score → Confidence Threshold → Match Decision

Layer 1: Exact Match

What it does: Compares normalized versions of names character-by-character.

Normalization process:

  1. Convert to uppercase
  2. Remove titles (Mr., Dr., Sheikh, etc.)
  3. Remove special characters
  4. Collapse multiple spaces

Example:

Query:    "Dr. Vladimir V. Putin"
Normalized: "VLADIMIR V PUTIN"

Entity: "VLADIMIR V PUTIN"
Result: Exact match (score: 1.0)
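
A minimal normalization sketch follows; the title list and regular expressions are illustrative assumptions, not the exact production rules:

import re

# Illustrative title set; the production list is larger
TITLES = {"MR", "MRS", "MS", "DR", "SHEIKH"}

def normalize_name(name: str) -> str:
    upper = name.upper()                                # 1. uppercase
    tokens = [t.rstrip(".") for t in re.split(r"\s+", upper)]
    tokens = [t for t in tokens if t not in TITLES]     # 2. remove titles
    joined = re.sub(r"[^A-Z0-9 ]", "", " ".join(tokens))  # 3. remove special characters
    return re.sub(r"\s+", " ", joined).strip()          # 4. collapse spaces

print(normalize_name("Dr. Vladimir V. Putin"))  # "VLADIMIR V PUTIN"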

Best for: Known exact name matches, repeat screenings


Layer 2: Fuzzy Match

What it does: Detects near-matches using string similarity algorithms.

Jaro-Winkler Similarity (40% weight)

Measures similarity from the number of matching characters and transpositions between two strings, giving extra weight to a shared prefix.

Example:

Query:  "MARTHA"
Entity: "MARHTA" (transposition of 'H' and 'T')
Score: 0.96 (Jaro 0.944, boosted by the shared "MAR" prefix)
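
For reference, the same pair can be scored with the jellyfish package (an assumption here; any Jaro-Winkler implementation gives comparable numbers):

import jellyfish

# Jaro-Winkler rewards the shared "MAR" prefix on top of the base Jaro score
score = jellyfish.jaro_winkler_similarity("MARTHA", "MARHTA")
print(round(score, 3))  # ~0.961 (plain Jaro is ~0.944)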

SequenceMatcher (30% weight)

Computes a similarity ratio from the longest contiguous matching blocks (the Ratcliff/Obershelp approach used by Python's difflib.SequenceMatcher), which works well for partial name matches.

Example:

Query:  "JOHN SMITH"
Entity: "JOHN ALEXANDER SMITH"
Score: 0.67 (shares the "JOHN" and "SMITH" matching blocks)
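
Assuming the standard-library difflib.SequenceMatcher that the section name suggests, the score can be reproduced directly:

from difflib import SequenceMatcher

# ratio = 2 * (matched characters) / (total characters in both strings)
ratio = SequenceMatcher(None, "JOHN SMITH", "JOHN ALEXANDER SMITH").ratio()
print(round(ratio, 2))  # ~0.67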

Combined Fuzzy Score

fuzzy_score = (0.40 * jaro_winkler) + (0.30 * sequence_ratio) + (0.30 * phonetic)

Best for: Typos, transpositions, partial names, data entry errors


Layer 3: Phonetic Match (30% weight)

What it does: Matches names that sound alike but are spelled differently.

Soundex Algorithm

Classic phonetic algorithm encoding names by sound. Returns a 4-character code: first letter + 3 digits.

How it works:

  1. Keep first letter
  2. Map consonants to digits (B,F,P,V → 1; C,G,J,K,Q,S,X,Z → 2; etc.)
  3. Remove adjacent duplicates
  4. Pad/truncate to 4 characters

Example:

Name   | Soundex
-------|--------
ROBERT | R163
RUPERT | R163
SMITH  | S530
SMYTHE | S530
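
A quick check with the jellyfish package (assumed here purely for illustration):

import jellyfish

print(jellyfish.soundex("ROBERT"))  # R163
print(jellyfish.soundex("RUPERT"))  # R163
print(jellyfish.soundex("SMYTHE"))  # S530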

Double Metaphone Algorithm

Advanced phonetic algorithm returning two codes:

  • Primary code: Most common pronunciation
  • Alternate code: Non-English variants (Germanic, Slavic patterns)

Example:

Name        | Primary | Alternate
------------|---------|----------
SCHMIDT     | XMT     | SMT
TCHAIKOVSKY | XKFSK   | TKFSK
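
The metaphone package (an assumption; other Double Metaphone libraries behave similarly) returns both codes as a tuple:

from metaphone import doublemetaphone

# (primary, alternate) phonetic codes
print(doublemetaphone("SCHMIDT"))  # ('XMT', 'SMT')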

Phonetic Scoring

phonetic_score = (
    0.4 * soundex_match
    + 0.4 * metaphone_primary_match
    + 0.2 * metaphone_alternate_match
)

Real Example:

Query:  "Vladimer Pootin"  (common misspellings)
Entity: "VLADIMIR PUTIN"

Soundex: V435 vs V435 → Match!
Metaphone: FLTMR vs FLTMR → Match!
Phonetic Score: 0.60

Best for: Transliteration variants, accent variations, spelling errors based on pronunciation


Layer 4: Semantic Match (S3 Vectors + Bedrock Titan)

What it does: Uses AI embeddings to find contextually similar names, even with significant textual differences.

How It Works

  1. Embedding Generation: Query name is converted to a 1024-dimensional vector using Amazon Bedrock Titan Text Embeddings v2
  2. Vector Search: S3 Vectors performs approximate nearest neighbor search against indexed entity embeddings
  3. Similarity Scoring: Cosine similarity measures contextual alignment

Architecture:

Query: "Russian President Putin"
│
├─► Bedrock Titan Embed → [0.12, -0.45, 0.78, ...] (1024 dims)
│
└─► S3 Vectors Query
│
├─► "VLADIMIR VLADIMIROVICH PUTIN" → similarity: 0.89
├─► "PUTIN, VLADIMIR VLADIMIROVICH" → similarity: 0.87
└─► "VLADIMIR POTANIN" → similarity: 0.45

Semantic vs Traditional Matching

Query                         | Traditional Match                                 | Semantic Match
------------------------------|---------------------------------------------------|--------------------------------------
"Bank of Russia"              | May miss "CENTRAL BANK OF THE RUSSIAN FEDERATION" | Finds it (similar context)
"Russian oligarch Abramovich" | Requires exact name                               | Finds "ROMAN ARKADYEVICH ABRAMOVICH"
"Kim Jong-un's regime"        | No match                                          | Finds related DPRK entities

Best for: Contextual queries, alias discovery, cross-language matches, descriptive queries


Layer 5: Identifier Match

What it does: Matches on document numbers (passport, tax ID, etc.) with normalization and fuzzy matching.

Supported Identifier Types

Type         | Aliases                               | Example
-------------|---------------------------------------|------------
passport     | passport_number, travel_document      | AB1234567
national_id  | ssn, id_number                        | 123-45-6789
tax_id       | tin, ein, vat_number                  | 98-7654321
registration | company_number, business_registration | 12345678

Matching Modes

Exact Match:

Query:    {"passport": "AB123456"}
Entity: {"passport": "AB-123-456"}
Normalized: Both → "AB123456"
Result: Exact match (confidence: 1.0)

Partial Match: (for long identifiers ≥8 chars)

Query:    {"passport": "AB123456789"}
Entity: {"passport": "AB12345678901234"}
Result: Partial match (confidence: 0.9)
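
A minimal sketch of the normalization and partial-match logic described above (the helper names are assumptions):

import re

def normalize_identifier(value: str) -> str:
    # Strip separators and casing so "AB-123-456" and "ab 123 456" compare equal
    return re.sub(r"[^A-Z0-9]", "", value.upper())

def match_identifier(query: str, entity: str) -> float:
    q, e = normalize_identifier(query), normalize_identifier(entity)
    if q == e:
        return 1.0                              # exact match
    if len(q) >= 8 and (q in e or e in q):
        return 0.9                              # partial match for long identifiers
    return 0.0

print(match_identifier("AB123456", "AB-123-456"))            # 1.0
print(match_identifier("AB123456789", "AB12345678901234"))   # 0.9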

Identifier Bonus

When an identifier matches, the overall confidence score receives a +0.15 bonus (capped at 1.0):

if identifier_matched:
    final_score = min(base_score + 0.15, 1.0)
    match_type = "identifier_confirmed"

Best for: KYC verification, document-based screening, high-confidence matches


Combined Scoring Formula

The final confidence score combines all matching layers:

# Base score from fuzzy matching
base_score = (
    0.40 * jaro_winkler_similarity
    + 0.30 * sequence_matcher_ratio
    + 0.30 * phonetic_similarity
)

# Entity type bonus (if types match)
if query_type == entity_type:
    base_score *= 1.05  # +5%

# Identifier match bonus
if identifier_matched:
    base_score += 0.15  # +0.15

# Semantic enhancement
if semantic_match:
    final_score = (base_score * 0.80) + (semantic_score * 0.20)
else:
    final_score = base_score

Configuration Reference

All matching weights are configurable:

Parameter               | Default | Description
------------------------|---------|----------------------------
MatchingWeightJaro      | 0.40    | Jaro-Winkler weight
MatchingWeightSequence  | 0.30    | SequenceMatcher weight
MatchingWeightPhonetic  | 0.30    | Phonetic similarity weight
MatchingTypeBonus       | 0.05    | Entity type match bonus
MatchingIdentifierBonus | 0.15    | Identifier match bonus
SemanticWeight          | 0.20    | Semantic score weight

Performance Characteristics

Metric               | Value   | Notes
---------------------|---------|----------------------
Single entity (warm) | ~10ms   | With entity cache
Single entity (cold) | ~1000ms | First request
Batch (100 entities) | ~500ms  | Amortized loading
Semantic search      | +200ms  | Bedrock + S3 Vectors
Entities per second  | 100+    | Batch mode

Best Practices

  1. Set appropriate thresholds:

    • 0.95+ for automated pass-through
    • 0.85 for standard screening
    • 0.70 for enhanced due diligence
  2. Use entity types when known:

    {"name": "Acme Corp", "entity_type": "organization"}
  3. Include identifiers for KYC:

    {
      "name": "John Smith",
      "identifiers": {"passport": "AB123456"}
    }
  4. Use batch endpoint for bulk screening

  5. Monitor match types in responses:

    • exact: Direct match
    • fuzzy: Near match
    • fuzzy+semantic: AI-enhanced
    • identifier_confirmed: Document verified
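
As an illustration only, a screening workflow might route results by confidence and match type; the confidence and match_type field names below are assumptions about the response shape, not the documented schema:

def route(result: dict) -> str:
    score, match_type = result["confidence"], result["match_type"]
    if match_type == "identifier_confirmed" or score >= 0.95:
        return "confirmed_match"          # document-verified or near-certain hit
    if score >= 0.85:
        return "analyst_review"           # standard screening threshold
    if score >= 0.70:
        return "enhanced_due_diligence"
    return "clear"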

For API integration details, see the API Reference.