
Cross-Script Name Matching via Byte-Level Transformers

Towards Data Science
A new approach tackles one of the most frustrating problems in identity matching: finding "Владимир Путин" when your database only contains "Vladimir Putin." Traditional methods like edit distance and phonetic codes fail completely when scripts don't share characters. The solution? Train a model on raw UTF-8 bytes instead of characters. This isn't an obscure edge case—immigration databases, hospital record systems, and financial compliance pipelines deal with this daily.
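To see why character-level methods break down, consider a minimal sketch of classic Levenshtein edit distance (not the article's code): between a name and its cross-script counterpart the distance is maximal, carrying no signal, while within one script it cleanly flags a typo.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Same person, different scripts: no characters match, so the
# distance is maximal (8 edits for two 8-character strings).
print(levenshtein("Vladimir", "Владимир"))  # → 8

# Same script, one typo: distance is small and meaningful.
print(levenshtein("Vladimir", "Vladimyr"))  # → 1
```

Phonetic codes such as Soundex fail for the same reason: they are defined over the Latin alphabet and simply cannot index Cyrillic, Arabic, or CJK input.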

Researchers built a compact transformer encoder with just 4 million parameters that processes names as byte sequences. Using a 4-stage LLM pipeline with Llama-3.1-8B and Qwen3-30B, they generated 4.67 million cross-script name pairs from Wikidata. The model achieved 0.775 MRR and 0.897 R@10 across Arabic, Russian, Chinese, Japanese, Hebrew, Hindi, Greek, and Korean, narrowing the Latin/non-Latin performance gap by 10x relative to classical baselines.
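For readers unfamiliar with the metrics: MRR averages the reciprocal rank of the correct match across queries, and R@10 is the fraction of queries whose correct match appears in the top 10. A minimal sketch (the rank values here are hypothetical, not from the paper):

```python
import numpy as np

def mrr_and_recall_at_k(ranks, k=10):
    """ranks: 1-based rank of the correct match for each query."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))          # mean reciprocal rank
    recall_at_k = float(np.mean(ranks <= k))   # fraction ranked in top k
    return mrr, recall_at_k

# Five hypothetical queries: correct match found at these ranks.
mrr, r10 = mrr_and_recall_at_k([1, 2, 1, 15, 3])
print(mrr, r10)  # → 0.58 0.8
```

An MRR of 0.775 thus means the correct match sits, on average, between rank 1 and rank 2.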

The key insight is that every Unicode character decomposes deterministically into 1-4 bytes from a fixed 256-symbol alphabet. By training contrastively on enough phonetic pairs, the model learns to map "Владимир" and "Vladimir" to nearby vectors even though they are completely different byte sequences. No tokenizer, no pretrained backbone, no script detection required: just raw bytes and a small transformer. The full code is on GitHub.
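The byte decomposition is standard Python, so the model's input representation can be sketched in a few lines: each byte is an integer in 0-255 and serves directly as a token id, with no vocabulary or tokenizer needed.

```python
# Every Unicode string decomposes deterministically into UTF-8 bytes.
latin = "Vladimir".encode("utf-8")
cyrillic = "Владимир".encode("utf-8")

# ASCII letters take 1 byte each; Cyrillic letters take 2 bytes each,
# so the same 8-letter name yields sequences of different lengths.
print(len(latin), len(cyrillic))  # → 8 16

# The byte values themselves are the model's token ids (0-255),
# so the "vocabulary" is fixed at 256 symbols for every script.
print(list(latin))
print(list(cyrillic))
```

Contrastive training then only has to pull the embeddings of paired sequences together; the 256-symbol alphabet stays constant no matter which script the input uses.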