Matching Algorithms

SanctionsWise API uses a sophisticated multi-layer matching engine to achieve high accuracy while minimizing false positives.


Overview

The matching engine processes each query through multiple distinct layers, combining scores to produce a final confidence value:

Query Input
│
├─► Alias Pre-filter (alias lookup with boost)
│
├─► Layer 1: Exact Match (normalized)
│
├─► Layer 2: Fuzzy Match (Jaro-Winkler + SequenceMatcher)
│
├─► Name Component Matching (family name comparison)
│
├─► Layer 3: Phonetic Match (Soundex + Double Metaphone)
│
├─► Layer 4: Semantic Match (Bedrock Titan + S3 Vectors)
│
└─► Layer 5: Identifier Match (passport, tax ID, etc.)

↓
Enhanced Score → Adaptive Threshold → Match Decision

Alias Pre-filter

What it does: Before scoring, checks if the query name matches any known alias of a sanctioned entity. Entities with alias matches receive a confidence boost.

3-Tier Alias Lookup

The alias pre-filter uses a tiered approach for maximum recall:

Tier   | Method            | Example
-------|-------------------|--------
Tier 1 | Exact alias match | "Viktor Bout" matches alias "Viktor Bout"
Tier 2 | Family name match | "Bout" matches entity family name "BOUT"
Tier 3 | Soundex match     | "Baut" (soundex B300) matches "BOUT" (soundex B300)

Boost: Entities matched via alias pre-filter receive a +15% confidence boost on their final score.

Example:

Query:  "Viktor Bout"
Alias Index: "Viktor Bout" → entity sw_ofac_sdn_8279

Result: Entity flagged for +0.15 alias boost
Base fuzzy score 0.7378 → Enhanced score 0.8878+

Best for: Matching against known aliases, transliteration variants, and common name forms
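The tiered lookup and boost can be sketched as below. This is a minimal illustration assuming dict-based indexes; the entity ID, alias list, and index layout are illustrative, and Tier 3 (Soundex) would follow the same pattern with phonetic codes as keys.

```python
def build_alias_index(entities):
    """Build Tier 1 (exact alias) and Tier 2 (family name) lookup tables."""
    exact, family = {}, {}
    for entity_id, aliases in entities.items():
        for alias in aliases:
            exact[alias.upper()] = entity_id
            family[alias.upper().split()[-1]] = entity_id  # last token as family name
    return exact, family

def alias_prefilter(query, exact, family, base_score, boost=0.15):
    """Return (matched_entity_id, boosted_score); score is capped at 1.0."""
    q = query.upper()
    entity_id = exact.get(q) or family.get(q.split()[-1])
    if entity_id:
        return entity_id, min(base_score + boost, 1.0)
    return None, base_score

entities = {"sw_ofac_sdn_8279": ["Viktor Bout", "Viktor Anatolyevich Bout"]}
exact_idx, family_idx = build_alias_index(entities)
entity_id, score = alias_prefilter("Viktor Bout", exact_idx, family_idx, base_score=0.7378)
print(entity_id, round(score, 4))  # sw_ofac_sdn_8279 0.8878
```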


Layer 1: Exact Match

What it does: Compares normalized versions of names character-by-character.

Normalization process:

  1. Convert to uppercase
  2. Remove titles (Mr., Dr., Sheikh, etc.)
  3. Remove special characters
  4. Collapse multiple spaces

Example:

Query:    "Dr. Vladimir V. Putin"
Normalized: "VLADIMIR V PUTIN"

Entity: "VLADIMIR V PUTIN"
Result: Exact match (score: 1.0)

Best for: Known exact name matches, repeat screenings
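The normalization steps can be sketched as follows; the title list here is an assumed subset of whatever the engine actually strips.

```python
import re

TITLES = {"MR", "MRS", "MS", "DR", "SHEIKH", "GEN"}  # assumed subset of stripped titles

def normalize_name(name: str) -> str:
    s = name.upper()                                    # 1. convert to uppercase
    s = re.sub(r"[^A-Z0-9 ]", " ", s)                   # 3. remove special characters
    tokens = [t for t in s.split() if t not in TITLES]  # 2. remove titles
    return " ".join(tokens)                             # 4. collapse multiple spaces

print(normalize_name("Dr. Vladimir V. Putin"))  # VLADIMIR V PUTIN
```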


Layer 2: Fuzzy Match

What it does: Detects near-matches using string similarity algorithms.

Jaro-Winkler Similarity (40% weight)

Measures string similarity based on the number of matching characters and the transpositions among them, giving extra weight to strings that share a common prefix.

Example:

Query:  "MARTHA"
Entity: "MARHTA" (transposition of 'H' and 'T')
Score: 0.944 (Jaro component; the Winkler bonus for the shared "MAR" prefix raises this to 0.961)

SequenceMatcher (30% weight)

Finds the longest contiguous matching subsequences between the two strings; well suited to partial name matches.

Example:

Query:  "JOHN SMITH"
Entity: "JOHN ALEXANDER SMITH"
Score: 0.67 (matches the "JOHN " and "SMITH" blocks)
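SequenceMatcher is the class of that name in Python's stdlib difflib module, so this example can be reproduced directly; note the stdlib ratio for this pair comes out near 0.67, and the engine's preprocessing may shift the value slightly.

```python
from difflib import SequenceMatcher

# ratio() = 2 * matched_chars / total_chars across both strings
ratio = SequenceMatcher(None, "JOHN SMITH", "JOHN ALEXANDER SMITH").ratio()
print(round(ratio, 2))  # 0.67
```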

Combined Fuzzy Score

fuzzy_score = (0.40 * jaro_winkler) + (0.30 * sequence_ratio) + (0.30 * phonetic)

Best for: Typos, transpositions, partial names, data entry errors


Name Component Matching

What it does: Decomposes names into components (given name, family name, middle name) and compares them individually for a more targeted match.

How It Works

  1. Name Decomposition: Both query and entity names are parsed into components

    • Detects inverted format: "BOUT, Viktor" → family="BOUT", given="Viktor"
    • Standard format: "Viktor Bout" → given="Viktor", family="Bout"
  2. Family Name Comparison: Compares family names using Jaro-Winkler similarity

  3. Bonus Application: If family name similarity >= 0.90, applies a +10% confidence bonus

Example:

Query:    "Viktor Bout"
Entity: "BOUT, Viktor Anatolijevitch"

Decomposed:
Query family: "Bout"
Entity family: "BOUT"
JW similarity: 1.0 (exact after normalization)

Result: +0.10 name component bonus applied

Best for: Names with shared family names but different given names, inverted name formats, names with middle names or patronymics
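The decomposition and bonus steps can be sketched as below; exact string equality stands in here for the Jaro-Winkler >= 0.90 check, and the field names are illustrative.

```python
def decompose(name: str) -> dict:
    """Split a name into given/middle/family, handling the inverted format."""
    if "," in name:                       # inverted: "BOUT, Viktor Anatolijevitch"
        family, rest = [p.strip() for p in name.split(",", 1)]
        given_parts = rest.split()
    else:                                 # standard: "Viktor Bout"
        parts = name.split()
        family, given_parts = parts[-1], parts[:-1]
    return {
        "given": given_parts[0] if given_parts else "",
        "middle": " ".join(given_parts[1:]),
        "family": family.upper(),
    }

query = decompose("Viktor Bout")
entity = decompose("BOUT, Viktor Anatolijevitch")
bonus = 0.10 if query["family"] == entity["family"] else 0.0
print(query["family"], entity["family"], bonus)  # BOUT BOUT 0.1
```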


Layer 3: Phonetic Match (30% weight)

What it does: Matches names that sound alike but are spelled differently.

Soundex Algorithm

Classic phonetic algorithm encoding names by sound. Returns a 4-character code: first letter + 3 digits.

How it works:

  1. Keep first letter
  2. Map consonants to digits (B,F,P,V → 1; C,G,J,K,Q,S,X,Z → 2; etc.)
  3. Remove adjacent duplicates
  4. Pad/truncate to 4 characters

Example:

Name   | Soundex
-------|--------
ROBERT | R163
RUPERT | R163
SMITH  | S530
SMYTHE | S530
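The four steps above can be sketched in a compact Soundex implementation. This is a simplified version (the special H/W separator rule of full American Soundex is ignored, which does not affect the examples in this section).

```python
# Consonant-to-digit mapping from step 2
CODES = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
         **dict.fromkeys("DT", "3"), "L": "4",
         **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(name: str) -> str:
    name = name.upper()
    out, prev = name[0], CODES.get(name[0], "")  # step 1: keep first letter
    for ch in name[1:]:
        code = CODES.get(ch, "")                 # vowels, H, W, Y map to ""
        if code and code != prev:                # step 3: skip adjacent duplicates
            out += code
        prev = code
    return (out + "000")[:4]                     # step 4: pad/truncate to 4 chars

print(soundex("Robert"), soundex("Rupert"), soundex("Smith"), soundex("Smythe"))
# R163 R163 S530 S530
```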

Double Metaphone Algorithm

Advanced phonetic algorithm returning two codes:

  • Primary code: Most common pronunciation
  • Alternate code: Non-English variants (Germanic, Slavic patterns)

Example:

Name        | Primary | Alternate
------------|---------|----------
SCHMIDT     | XMT     | SMT
TCHAIKOVSKY | XKFSK   | TKFSK

Phonetic Scoring

phonetic_score = (
    0.4 * soundex_match +
    0.4 * metaphone_primary_match +
    0.2 * metaphone_alternate_match
)

Real Example:

Query:  "Vladimer Pootin"  (common misspellings)
Entity: "VLADIMIR PUTIN"

Soundex: V435 vs V435 → Match!
Metaphone: FLTMR vs FLTMR → Match!
Phonetic Score: 0.60

Best for: Transliteration variants, accent variations, spelling errors based on pronunciation


Layer 4: Semantic Match (S3 Vectors + Bedrock Titan)

What it does: Uses AI embeddings to find contextually similar names, even with significant textual differences.

How It Works

  1. Embedding Generation: Query name is converted to a 1024-dimensional vector using Amazon Bedrock Titan Text Embeddings v2
  2. Vector Search: S3 Vectors performs approximate nearest neighbor search against indexed entity embeddings
  3. Similarity Scoring: Cosine similarity measures contextual alignment

Architecture:

Query: "Russian President Putin"
│
├─► Bedrock Titan Embed → [0.12, -0.45, 0.78, ...] (1024 dims)
│
└─► S3 Vectors Query
│
├─► "VLADIMIR VLADIMIROVICH PUTIN" → similarity: 0.89
├─► "PUTIN, VLADIMIR VLADIMIROVICH" → similarity: 0.87
└─► "VLADIMIR POTANIN" → similarity: 0.45
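A hedged sketch of the embedding step: the model ID below matches Titan Text Embeddings v2, but the call is not executed here (it needs AWS credentials and a boto3 bedrock-runtime client), and a local cosine function stands in for the S3 Vectors index lookup.

```python
import json
import math

def embed(text: str, client) -> list:
    """Generate a 1024-dim embedding via Bedrock; shown for illustration,
    requires a boto3 bedrock-runtime client with AWS credentials."""
    resp = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine(a, b):
    """Cosine similarity, as used to score vector search hits."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# With real embeddings, near-identical names score close to 1.0:
print(round(cosine([0.12, -0.45, 0.78], [0.10, -0.40, 0.80]), 2))  # 1.0
```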

Semantic vs Traditional Matching

Query | Traditional Match | Semantic Match
------|-------------------|---------------
"Bank of Russia" | May miss "CENTRAL BANK OF THE RUSSIAN FEDERATION" | Finds it (similar context)
"Russian oligarch Abramovich" | Requires exact name | Finds "ROMAN ARKADYEVICH ABRAMOVICH"
"Kim Jong-un's regime" | No match | Finds related DPRK entities

Best for: Contextual queries, alias discovery, cross-language matches, descriptive queries


Layer 5: Identifier Match

What it does: Matches on document numbers (passport, tax ID, etc.) with normalization and fuzzy matching.

Supported Identifier Types

Type | Aliases | Example
-----|---------|--------
passport | passport_number, travel_document | AB1234567
national_id | ssn, id_number | 123-45-6789
tax_id | tin, ein, vat_number | 98-7654321
registration | company_number, business_registration | 12345678

Matching Modes

Exact Match:

Query:    {"passport": "AB123456"}
Entity: {"passport": "AB-123-456"}
Normalized: Both → "AB123456"
Result: Exact match (confidence: 1.0)

Partial Match: (for long identifiers ≥8 chars)

Query:    {"passport": "AB123456789"}
Entity: {"passport": "AB12345678901234"}
Result: Partial match (confidence: 0.9)
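The two modes above can be sketched as follows; the ≥8-character rule mirrors the note on partial matches, and the confidence values are taken from the examples.

```python
import re

def normalize_id(value: str) -> str:
    """Strip separators and uppercase, e.g. "AB-123-456" -> "AB123456"."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())

def match_identifier(query: str, entity: str):
    q, e = normalize_id(query), normalize_id(entity)
    if q == e:
        return "exact", 1.0
    if len(q) >= 8 and (q in e or e in q):  # partial match for long identifiers
        return "partial", 0.9
    return None, 0.0

print(match_identifier("AB123456", "AB-123-456"))           # ('exact', 1.0)
print(match_identifier("AB123456789", "AB12345678901234"))  # ('partial', 0.9)
```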

Identifier Bonus

When an identifier matches, the overall confidence score receives a +15% bonus:

if identifier_matched:
    final_score = min(base_score + 0.15, 1.0)
    match_type = "identifier_confirmed"

Best for: KYC verification, document-based screening, high-confidence matches


Combined Scoring Formula

The final confidence score combines all matching layers:

# Base score from fuzzy matching
base_score = (
    0.40 * jaro_winkler_similarity +
    0.30 * sequence_matcher_ratio +
    0.30 * phonetic_similarity
)

# Alias pre-filter boost (+15% if alias match found)
if alias_match:
    base_score += 0.15

# Name component bonus (+10% if family name matches)
if family_name_similarity >= 0.90:
    base_score += 0.10

# Cap at 1.0
base_score = min(base_score, 1.0)

# Entity type bonus (if types match)
if query_type == entity_type:
    base_score *= 1.05  # +5%

# Identifier match bonus
if identifier_matched:
    base_score += 0.15  # +15%

# Semantic enhancement
if semantic_match:
    final_score = (base_score * 0.80) + (semantic_score * 0.20)
else:
    final_score = base_score

Configuration Reference

All matching weights are configurable:

Parameter | Default | Description
----------|---------|------------
MatchingWeightJaro | 0.40 | Jaro-Winkler weight
MatchingWeightSequence | 0.30 | SequenceMatcher weight
MatchingWeightPhonetic | 0.30 | Phonetic similarity weight
AliasBoost | 0.15 | Alias pre-filter match bonus
NameComponentBonus | 0.10 | Family name match bonus
MatchingTypeBonus | 0.05 | Entity type match bonus
MatchingIdentifierBonus | 0.15 | Identifier match bonus
SemanticWeight | 0.20 | Semantic score weight

Performance Characteristics

Metric | Value | Notes
-------|-------|------
Single entity (warm) | ~350ms | With entity cache and enhanced matching
Single entity (cold) | ~2000ms | First request (entity + alias index loading)
P95 latency | ~475ms | 95th percentile across screening queries
Batch (100 entities) | ~500ms | Amortized loading
Semantic search | +200ms | Bedrock + S3 Vectors

Best Practices

  1. Set appropriate thresholds:

    • 0.95+ for automated pass-through
    • 0.85 for standard screening
    • 0.70 for enhanced due diligence
  2. Use entity types when known:

    {"name": "Acme Corp", "entity_type": "organization"}
  3. Include identifiers for KYC:

    {
      "name": "John Smith",
      "identifiers": {"passport": "AB123456"}
    }
  4. Use batch endpoint for bulk screening

  5. Monitor match types in responses:

    • exact: Direct match
    • fuzzy: Near match (base fuzzy only)
    • enhanced_fuzzy: Near match with alias and/or name component boost
    • fuzzy+semantic: AI-enhanced
    • identifier_confirmed: Document verified
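One way to act on these match types is a small triage routine; the response field names (match_type, confidence) and routing thresholds below are assumptions to be adapted to your compliance workflow.

```python
REVIEW_TYPES = {"fuzzy", "enhanced_fuzzy", "fuzzy+semantic"}

def triage(result: dict) -> str:
    """Route a screening hit based on match type and confidence."""
    match_type, score = result["match_type"], result["confidence"]
    if match_type in ("exact", "identifier_confirmed") or score >= 0.95:
        return "auto_block"       # high confidence: block and escalate
    if match_type in REVIEW_TYPES and score >= 0.70:
        return "manual_review"    # enhanced due diligence queue
    return "pass"

print(triage({"match_type": "identifier_confirmed", "confidence": 0.92}))  # auto_block
print(triage({"match_type": "fuzzy", "confidence": 0.78}))                 # manual_review
```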

For API integration details, see the API Reference.