HeadlinesBriefing.com

Building POI Pipeline for Running App Reveals LLM Limitations

Hacker News

I'm developing In the Long Run, a running app that maps Strava mileage onto famous routes worldwide. To enrich the experience with points of interest (POIs), I built a data pipeline that processes GeoNames dumps and reduces 13 million locations to roughly 725,000 relevant sites. The stack uses Python, Apache Parquet for storage, and DuckDB for queries, with Shapely handling the geographic calculations.
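As a rough illustration of the first step, reading a GeoNames dump into typed records might look like the sketch below. It relies on the published GeoNames layout (tab-separated, 19 columns, with latitude, longitude, feature class, feature code, population and elevation at the positions shown). The `Place` class, the `read_places` helper and the sample rows are hypothetical, and the sketch uses only the standard library rather than the Parquet and DuckDB stack described above.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO

# Column positions in the tab-separated GeoNames dump (19 columns per row).
COL_ID, COL_NAME, COL_LAT, COL_LON = 0, 1, 4, 5
COL_FCLASS, COL_FCODE, COL_POP, COL_ELEV = 6, 7, 14, 15


@dataclass(frozen=True)
class Place:
    """Hypothetical record type holding the fields the later steps need."""
    geoname_id: int
    name: str
    lat: float
    lon: float
    feature_class: str
    feature_code: str
    population: int
    elevation_m: Optional[int]  # blank in the dump for many rows


def read_places(handle: TextIO) -> Iterator[Place]:
    """Yield one Place per row, skipping rows with too few columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= COL_ELEV:
            continue
        yield Place(
            geoname_id=int(row[COL_ID]),
            name=row[COL_NAME],
            lat=float(row[COL_LAT]),
            lon=float(row[COL_LON]),
            feature_class=row[COL_FCLASS],
            feature_code=row[COL_FCODE],
            population=int(row[COL_POP] or 0),
            elevation_m=int(row[COL_ELEV]) if row[COL_ELEV] else None,
        )


# Two made-up rows in the dump layout: a mountain and an administrative division.
SAMPLE = (
    "1\tExample Peak\tExample Peak\t\t35.0\t-106.0\tT\tMT\tUS\t\tNM\t\t\t\t0\t3200\t3195\tAmerica/Denver\t2020-01-01\n"
    "2\tExample County\tExample County\t\t35.1\t-106.1\tA\tADM2\tUS\t\tNM\t\t\t\t50000\t\t1600\tAmerica/Denver\t2020-01-01\n"
)
places = list(read_places(io.StringIO(SAMPLE)))
```

In a real run the same function would be given an open file handle on the dump instead of the in-memory sample.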

The pipeline filters out administrative divisions and applies population and elevation thresholds, then matches candidates to routes within 50 km buffers. For Route 66 (3,787 km), this yielded 14,181 POIs. Early results revealed a cultural bias: the Wikipedia link signal favored English-speaking regions and missed significant landmarks elsewhere. The bias emerged because GeoNames stores canonical names in local languages, so matching them requires careful cross-referencing.
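The filter and the buffer match can be sketched in plain Python as follows. This is a stand-in for the Shapely-based geometry, not the author's code. It assumes that administrative divisions are dropped (GeoNames feature class "A"), and the threshold values, function names and sample route are all hypothetical. Distances use a flat-plane approximation centred on each POI, which is reasonable at a 50 km scale but ignores longitude wraparound at the antimeridian.

```python
import math
from typing import Sequence, Tuple

LatLon = Tuple[float, float]
EARTH_RADIUS_KM = 6371.0


def keep_candidate(feature_class: str, population: int, elevation_m: float,
                   min_population: int = 5000,
                   min_elevation_m: float = 1500.0) -> bool:
    """Drop administrative divisions, then apply hypothetical thresholds."""
    if feature_class == "A":  # countries, states, regions
        return False
    if feature_class == "P":  # populated places
        return population >= min_population
    if feature_class == "T":  # mountains, hills
        return elevation_m >= min_elevation_m
    return True


def _to_local_km(origin: LatLon, point: LatLon) -> Tuple[float, float]:
    """Project point onto a flat plane centred on origin (equirectangular)."""
    lat0, lon0 = map(math.radians, origin)
    lat, lon = map(math.radians, point)
    x = (lon - lon0) * math.cos(lat0) * EARTH_RADIUS_KM
    y = (lat - lat0) * EARTH_RADIUS_KM
    return x, y


def distance_to_route_km(poi: LatLon, route: Sequence[LatLon]) -> float:
    """Approximate shortest distance from poi to a route of two or more points."""
    best = math.inf
    for start, end in zip(route, route[1:]):
        ax, ay = _to_local_km(poi, start)
        bx, by = _to_local_km(poi, end)
        dx, dy = bx - ax, by - ay
        seg_len_sq = dx * dx + dy * dy
        # Fraction along the segment of the point closest to the POI,
        # which sits at the origin of the local plane.
        if seg_len_sq == 0:
            t = 0.0
        else:
            t = max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len_sq))
        best = min(best, math.hypot(ax + t * dx, ay + t * dy))
    return best


def within_buffer(poi: LatLon, route: Sequence[LatLon],
                  buffer_km: float = 50.0) -> bool:
    """True when the POI lies inside the buffer around the route."""
    return distance_to_route_km(poi, route) <= buffer_km


# Hypothetical one-segment route running due east along latitude 35.
route = [(35.0, -106.0), (35.0, -105.0)]
near = within_buffer((35.3, -105.5), route)  # about 0.3 degrees north of the route
far = within_buffer((36.0, -105.5), route)   # about 1 degree north of the route
```

With Shapely, the equivalent check would normally be done on projected coordinates with a buffered line geometry; the pure-Python version is shown only so the logic is visible end to end.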

For enrichment, I used Anthropic's Haiku model to rate POIs by significance, fetching Wikipedia summaries and checking multilingual Wikidata coverage. The LLM approach had two problems: inconsistent output formats and costly processing (around $10 for major routes). Initial prompts also caused hallucinations, such as misidentifying Central Park locations, so later iterations grounded the prompt in known metadata.
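One way to handle both problems, sketched under assumptions: put the known metadata into the prompt so the model does not have to guess which place is meant, and reject any reply that does not parse into the expected shape. The 1 to 5 scale, the JSON reply format and both function names are hypothetical, and the model call itself is left out so the example stays self-contained.

```python
import json
import re
from typing import Optional


def build_prompt(name: str, country: str, lat: float, lon: float,
                 summary: str) -> str:
    """Ground the request in known metadata about the POI."""
    return (
        "Rate the significance of this point of interest from 1 to 5.\n"
        f"Name: {name}\n"
        f"Country: {country}\n"
        f"Coordinates: {lat:.4f}, {lon:.4f}\n"
        f"Wikipedia summary: {summary}\n"
        'Reply with JSON only, for example {"rating": 3, "reason": "..."}.'
    )


def parse_rating(reply: str) -> Optional[int]:
    """Return the rating if the reply holds a JSON object with an
    integer "rating" between 1 and 5, otherwise None."""
    # Models sometimes wrap the JSON in extra prose, so take the span
    # from the first "{" to the last "}".
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    rating = payload.get("rating") if isinstance(payload, dict) else None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        return None
    return rating if 1 <= rating <= 5 else None


prompt = build_prompt("Central Park", "US", 40.7829, -73.9654,
                      "Urban park in New York City.")
ok = parse_rating('Sure! Here it is:\n{"rating": 4, "reason": "well known"}')
bad = parse_rating("I think it is a 4")
```

A reply that yields None can be retried or sent for manual review instead of being written to the dataset, which keeps format drift from silently corrupting the ratings.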

The project demonstrates that AI works best as support infrastructure rather than primary logic. Working with familiar tools like SQL while incrementally adopting new technologies (Parquet and DuckDB) proved more effective than wholesale stack changes. LLMs excel at specific tasks but demand careful prompting and validation to avoid costly errors.