Matching Algorithms
The SanctionsWise API uses a 5-layer matching engine to achieve high accuracy while minimizing false positives.
Overview
The matching engine processes each query through five distinct layers, combining scores to produce a final confidence value:
Query Input
│
├─► Layer 1: Exact Match (normalized)
│
├─► Layer 2: Fuzzy Match (Jaro-Winkler + SequenceMatcher)
│
├─► Layer 3: Phonetic Match (Soundex + Double Metaphone)
│
├─► Layer 4: Semantic Match (Bedrock Titan + S3 Vectors)
│
└─► Layer 5: Identifier Match (passport, tax ID, etc.)
↓
Combined Score → Confidence Threshold → Match Decision
Layer 1: Exact Match
What it does: Compares normalized versions of names character-by-character.
Normalization process:
- Convert to uppercase
- Remove titles (Mr., Dr., Sheikh, etc.)
- Remove special characters
- Collapse multiple spaces
Example:
Query: "Dr. Vladimir V. Putin"
Normalized: "VLADIMIR V PUTIN"
Entity: "VLADIMIR V PUTIN"
Result: Exact match (score: 1.0)
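In code, this normalization step might look like the following sketch (the title list and regex here are illustrative, not the engine's actual rules):

```python
import re

# Illustrative subset of stripped titles; the real list is more extensive.
TITLES = {"MR", "MRS", "MS", "DR", "PROF", "SHEIKH", "GEN", "COL"}

def normalize_name(name: str) -> str:
    upper = name.upper()
    # Remove special characters (keep letters, digits, and spaces)
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", upper)
    # Drop title tokens and collapse multiple spaces
    tokens = [t for t in cleaned.split() if t not in TITLES]
    return " ".join(tokens)

print(normalize_name("Dr. Vladimir V. Putin"))  # VLADIMIR V PUTIN
```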
Best for: Known exact name matches, repeat screenings
Layer 2: Fuzzy Match
What it does: Detects near-matches using string similarity algorithms.
Jaro-Winkler Similarity (40% weight)
Measures string similarity based on the number and positions of characters the two names share, accounting for transpositions, and gives extra weight to a matching prefix.
Example:
Query: "MARTHA"
Entity: "MARHTA" (transposition of 'H' and 'T')
Score: 0.944 (Jaro); with the shared "MAR" prefix bonus, the Jaro-Winkler score is ≈0.961
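A quick way to reproduce this locally is the third-party jellyfish library (an assumption about your tooling; recent releases name the function jaro_winkler_similarity, older ones jaro_winkler):

```python
import jellyfish  # pip install jellyfish (not a dependency of the API itself)

# The shared "MAR" prefix pushes the score above the plain Jaro value of 0.944.
score = jellyfish.jaro_winkler_similarity("MARTHA", "MARHTA")
print(round(score, 3))  # ~0.961
```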
SequenceMatcher (30% weight)
Computes a similarity ratio from the longest contiguous matching blocks shared by the two strings; good for partial name matches.
Example:
Query: "JOHN SMITH"
Entity: "JOHN ALEXANDER SMITH"
Score: ≈0.67 (shares the "JOHN" and "SMITH" blocks)
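SequenceMatcher ships with Python's standard library, so the ratio above is easy to reproduce (illustrative only; the engine may normalize names before comparing):

```python
from difflib import SequenceMatcher

query = "JOHN SMITH"
entity = "JOHN ALEXANDER SMITH"

# ratio() = 2 * M / T, where M is the number of characters in matching blocks
# and T is the combined length of both strings.
print(round(SequenceMatcher(None, query, entity).ratio(), 2))  # 0.67
```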
Combined Fuzzy Score
fuzzy_score = (0.40 * jaro_winkler) + (0.30 * sequence_ratio) + (0.30 * phonetic)
Best for: Typos, transpositions, partial names, data entry errors
Layer 3: Phonetic Match (30% weight)
What it does: Matches names that sound alike but are spelled differently.
Soundex Algorithm
Classic phonetic algorithm encoding names by sound. Returns a 4-character code: first letter + 3 digits.
How it works:
- Keep first letter
- Map consonants to digits (B,F,P,V → 1; C,G,J,K,Q,S,X,Z → 2; etc.)
- Remove adjacent duplicates
- Pad/truncate to 4 characters
Example:
| Name | Soundex |
|---|---|
| ROBERT | R163 |
| RUPERT | R163 |
| SMITH | S530 |
| SMYTHE | S530 |
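These codes are easy to reproduce with any standard Soundex implementation; the jellyfish library is used here purely as an example:

```python
import jellyfish  # assumed third-party dependency, not required by the API

for name in ("ROBERT", "RUPERT", "SMITH", "SMYTHE"):
    print(name, jellyfish.soundex(name))
# ROBERT R163 / RUPERT R163 / SMITH S530 / SMYTHE S530
```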
Double Metaphone Algorithm
Advanced phonetic algorithm returning two codes:
- Primary code: Most common pronunciation
- Alternate code: Non-English variants (Germanic, Slavic patterns)
Example:
| Name | Primary | Alternate |
|---|---|---|
| SCHMIDT | XMT | SMT |
| TCHAIKOVSKY | XKFSK | TKFSK |
Phonetic Scoring
phonetic_score = (
    0.4 * soundex_match +
    0.4 * metaphone_primary_match +
    0.2 * metaphone_alternate_match
)
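A minimal sketch of this scoring for a single name token, assuming jellyfish for Soundex and the third-party metaphone package for Double Metaphone (both library choices are assumptions; the engine's internals may differ):

```python
import jellyfish                       # assumed: provides soundex()
from metaphone import doublemetaphone  # assumed: returns (primary, alternate) codes

def phonetic_score(query: str, entity: str) -> float:
    q_primary, q_alternate = doublemetaphone(query)
    e_primary, e_alternate = doublemetaphone(entity)

    score = 0.0
    if jellyfish.soundex(query) == jellyfish.soundex(entity):
        score += 0.4
    if q_primary and q_primary == e_primary:
        score += 0.4
    if q_alternate and q_alternate == e_alternate:
        score += 0.2
    return score

print(phonetic_score("VLADIMER", "VLADIMIR"))  # Soundex and primary codes agree
```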
Real Example:
Query: "Vladimer Pootin" (common misspellings)
Entity: "VLADIMIR PUTIN"
Soundex: V435 vs V435 → Match!
Metaphone: FLTMR vs FLTMR → Match!
Phonetic Score: 0.60
Best for: Transliteration variants, accent variations, spelling errors based on pronunciation
Layer 4: Semantic Match (S3 Vectors + Bedrock Titan)
What it does: Uses AI embeddings to find contextually similar names, even with significant textual differences.
How It Works
- Embedding Generation: Query name is converted to a 1024-dimensional vector using Amazon Bedrock Titan Text Embeddings v2
- Vector Search: S3 Vectors performs approximate nearest neighbor search against indexed entity embeddings
- Similarity Scoring: Cosine similarity measures contextual alignment
Architecture:
Query: "Russian President Putin"
│
├─► Bedrock Titan Embed → [0.12, -0.45, 0.78, ...] (1024 dims)
│
└─► S3 Vectors Query
│
├─► "VLADIMIR VLADIMIROVICH PUTIN" → similarity: 0.89
├─► "PUTIN, VLADIMIR VLADIMIROVICH" → similarity: 0.87
└─► "VLADIMIR POTANIN" → similarity: 0.45
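For illustration, generating a Titan embedding with boto3 and scoring it against entity vectors could look like the sketch below. In production the engine queries S3 Vectors for approximate nearest neighbors; here a plain cosine similarity over locally held vectors stands in for that search:

```python
import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    # Titan Text Embeddings v2 returns a 1024-dimensional vector by default
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embed("Russian President Putin")
# entity_vectors: {name: vector}, assumed to be loaded from your own index
# ranked = sorted(entity_vectors.items(),
#                 key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
```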
Semantic vs Traditional Matching
| Query | Traditional Match | Semantic Match |
|---|---|---|
| "Bank of Russia" | May miss "CENTRAL BANK OF THE RUSSIAN FEDERATION" | Finds it (similar context) |
| "Russian oligarch Abramovich" | Requires exact name | Finds "ROMAN ARKADYEVICH ABRAMOVICH" |
| "Kim Jong-un's regime" | No match | Finds related DPRK entities |
Best for: Contextual queries, alias discovery, cross-language matches, descriptive queries
Layer 5: Identifier Match
What it does: Matches on document numbers (passport, tax ID, etc.) with normalization and fuzzy matching.
Supported Identifier Types
| Type | Aliases | Example |
|---|---|---|
| passport | passport_number, travel_document | AB1234567 |
| national_id | ssn, id_number | 123-45-6789 |
| tax_id | tin, ein, vat_number | 98-7654321 |
| registration | company_number, business_registration | 12345678 |
Matching Modes
Exact Match:
Query: {"passport": "AB123456"}
Entity: {"passport": "AB-123-456"}
Normalized: Both → "AB123456"
Result: Exact match (confidence: 1.0)
Partial Match (for long identifiers, ≥8 characters):
Query: {"passport": "AB123456789"}
Entity: {"passport": "AB12345678901234"}
Result: Partial match (confidence: 0.9)
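A sketch of the identifier comparison described above; the normalization rules and the 0.9 partial-match confidence mirror these examples, but the engine's exact behavior may differ:

```python
import re

def normalize_id(value: str) -> str:
    # Uppercase and strip separators such as dashes, dots, and spaces
    return re.sub(r"[^A-Z0-9]", "", value.upper())

def match_identifier(query_id: str, entity_id: str) -> float:
    q, e = normalize_id(query_id), normalize_id(entity_id)
    if q == e:
        return 1.0   # exact match
    if len(q) >= 8 and (q in e or e in q):
        return 0.9   # partial match for long identifiers
    return 0.0

print(match_identifier("AB123456", "AB-123-456"))            # 1.0
print(match_identifier("AB123456789", "AB12345678901234"))   # 0.9
```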
Identifier Bonus
When an identifier matches, 0.15 is added to the overall confidence score (capped at 1.0):
if identifier_matched:
    final_score = min(base_score + 0.15, 1.0)
    match_type = "identifier_confirmed"
Best for: KYC verification, document-based screening, high-confidence matches
Combined Scoring Formula
The final confidence score combines all matching layers:
# Base score from fuzzy matching
base_score = (
    0.40 * jaro_winkler_similarity +
    0.30 * sequence_matcher_ratio +
    0.30 * phonetic_similarity
)

# Entity type bonus (if types match)
if query_type == entity_type:
    base_score *= 1.05  # +5%

# Identifier match bonus
if identifier_matched:
    base_score += 0.15  # +15%

# Semantic enhancement
if semantic_match:
    final_score = (base_score * 0.80) + (semantic_score * 0.20)
else:
    final_score = base_score
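As a worked example under the default weights (all numbers are illustrative inputs, not output from a real screening):

```python
# Illustrative layer scores for a hypothetical query/entity pair
jaro_winkler_similarity = 0.90
sequence_matcher_ratio = 0.70
phonetic_similarity = 0.60
semantic_score = 0.89

base_score = (
    0.40 * jaro_winkler_similarity +   # 0.360
    0.30 * sequence_matcher_ratio +    # 0.210
    0.30 * phonetic_similarity         # 0.180
)                                      # base_score = 0.75

base_score *= 1.05                     # entity type bonus -> 0.7875
base_score += 0.15                     # identifier bonus  -> 0.9375

final_score = (base_score * 0.80) + (semantic_score * 0.20)
print(round(final_score, 2))           # 0.93
```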
Configuration Reference
All matching weights are configurable:
| Parameter | Default | Description |
|---|---|---|
| MatchingWeightJaro | 0.40 | Jaro-Winkler weight |
| MatchingWeightSequence | 0.30 | SequenceMatcher weight |
| MatchingWeightPhonetic | 0.30 | Phonetic similarity weight |
| MatchingTypeBonus | 0.05 | Entity type match bonus |
| MatchingIdentifierBonus | 0.15 | Identifier match bonus |
| SemanticWeight | 0.20 | Semantic score weight |
Performance Characteristics
| Metric | Value | Notes |
|---|---|---|
| Single entity (warm) | ~10ms | With entity cache |
| Single entity (cold) | ~1000ms | First request |
| Batch (100 entities) | ~500ms | Amortized loading |
| Semantic search | +200ms | Bedrock + S3 Vectors |
| Entities per second | 100+ | Batch mode |
Best Practices
- Set appropriate thresholds:
  - 0.95+ for automated pass-through
  - 0.85 for standard screening
  - 0.70 for enhanced due diligence
- Use entity types when known:
  {"name": "Acme Corp", "entity_type": "organization"}
- Include identifiers for KYC:
  {
    "name": "John Smith",
    "identifiers": {"passport": "AB123456"}
  }
- Use batch endpoint for bulk screening
- Monitor match types in responses:
  - exact: Direct match
  - fuzzy: Near match
  - fuzzy+semantic: AI-enhanced
  - identifier_confirmed: Document verified
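A small decision helper built on the thresholds above might look like this (the confidence and match_type field names are assumptions for illustration, not a documented response schema):

```python
def triage(result: dict) -> str:
    # Illustrative routing only; response field names are assumed.
    confidence = result.get("confidence", 0.0)
    match_type = result.get("match_type", "")

    if match_type == "identifier_confirmed" or confidence >= 0.95:
        return "automated"              # automated pass-through
    if confidence >= 0.85:
        return "standard_screening"
    if confidence >= 0.70:
        return "enhanced_due_diligence"
    return "no_match"

print(triage({"confidence": 0.91, "match_type": "fuzzy+semantic"}))  # standard_screening
```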
For API integration details, see the API Reference.