HeadlinesBriefing.com

Finnish dictionary shrinks from 3 GB to 10 MB with Rust FST

Hacker News

When the Finnish-English dictionary Taskusanakirja (tsk) grew into a 3 GB SQLite file, its creator faced a memory nightmare on modest laptops. The original Go-based trie handled a few hundred thousand entries in about 60 MB, but Finnish's aggressive agglutination pushed the data set to tens of millions of word forms, which broke the single-static-binary model for offline use on low-end hardware.

A stop-gap SQLite solution delivered instant search but forced users to download the bulky database. The developer then rewrote the extractor in Rust, inspired by Andrew Gallant's fst crate. Converting the ordered string map into a finite-state transducer collapsed the 3 GB payload to roughly 10 MB, a 300× reduction, while preserving the prefix, fuzzy and suffix queries that language learners rely on, even on older CPUs.
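To illustrate why a finite-state representation shrinks an agglutinative word list so sharply, here is a minimal sketch using only the Rust standard library. It is not the tsk code and does not use the fst crate's API: it builds a plain trie and then merges identical subtrees, which is the suffix-sharing idea behind a minimal automaton. The fst crate instead builds its automaton incrementally from sorted input and also stores output values, which this sketch omits. All names here (`TrieNode`, `state_counts`, and so on) are hypothetical.

```rust
use std::collections::{BTreeMap, HashMap};

/// Plain trie node: one node per distinct prefix.
#[derive(Default)]
struct TrieNode {
    terminal: bool,
    children: BTreeMap<char, TrieNode>,
}

/// Insert one word, creating nodes along its path as needed.
fn insert(root: &mut TrieNode, word: &str) {
    let mut node = root;
    for ch in word.chars() {
        node = node.children.entry(ch).or_default();
    }
    node.terminal = true;
}

/// Count every node in the trie, including the root.
fn count_nodes(node: &TrieNode) -> usize {
    1 + node.children.values().map(count_nodes).sum::<usize>()
}

/// A state's signature: whether it ends a word, plus its outgoing
/// (label, canonical child id) edges in sorted label order.
type Signature = (bool, Vec<(char, usize)>);

/// Assign a canonical id to each subtree, bottom-up.
/// Subtrees with the same signature receive the same id, so shared
/// endings such as "ssa" and "sta" are stored only once.
fn minimize(node: &TrieNode, registry: &mut HashMap<Signature, usize>) -> usize {
    let edges: Vec<(char, usize)> = node
        .children
        .iter()
        .map(|(&ch, child)| (ch, minimize(child, registry)))
        .collect();
    let next_id = registry.len();
    *registry.entry((node.terminal, edges)).or_insert(next_id)
}

/// Return (trie node count, minimized state count) for a word list.
fn state_counts(words: &[&str]) -> (usize, usize) {
    let mut root = TrieNode::default();
    for word in words {
        insert(&mut root, word);
    }
    let mut registry = HashMap::new();
    minimize(&root, &mut registry);
    (count_nodes(&root), registry.len())
}
```

For the six forms `talo`, `talossa`, `talosta`, `auto`, `autossa` and `autosta`, `state_counts` returns `(19, 10)`: the two stems share a single copy of the case endings. With tens of millions of inflected forms built from a much smaller set of stems and endings, the same sharing is what makes a large reduction plausible.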

The resulting Pro build sits under 20 MB, roughly a third the size of the free version's original footprint, and runs as a single static executable on any platform. Because the dictionary is immutable at runtime, the fst's main weakness, its lack of dynamic updates, doesn't matter. The build delivers instant lookups without the overhead of B-trees or full-text indexes, which makes it suitable for schools with limited bandwidth.