HeadlinesBriefing.com

Classical NLP Stands Up in Kaggle Author Attribution Test

Towards Data Science

The Spooky Author Identification challenge on Kaggle asks a model to label a single gothic sentence as written by Edgar Allan Poe, Mary Shelley or H. P. Lovecraft. Because the three authors share themes, surface keywords provide little signal; stylistic cues such as function words, punctuation and rhythm become decisive. The experiment tests how far classical NLP pipelines can push performance when representation choices are tuned.

The author built a stepwise suite of models, beginning with a fast Vowpal Wabbit baseline that hashed lower‑cased words and bigrams. Adding separate namespaces for punctuation and character n‑grams yielded a richer VW model that improved accuracy and macro‑F1 on a stratified 70/30 holdout. This increment demonstrated that lightweight style features can meaningfully boost linear classifiers.
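To make the namespace idea concrete, here is a minimal sketch of how one sentence could be turned into a line of Vowpal Wabbit's text input format, with separate namespaces for words, punctuation and character n‑grams. It is an illustration, not the author's code: the namespace names (w, p, c), the label map, the punctuation names and the 2‑to‑3 character n‑gram range are all assumptions chosen for brevity, and the tokenizer is deliberately crude (it splits on anything that is not a letter).

```python
import re

# Assumed label map: VW's multiclass modes expect integer labels 1..K.
LABELS = {"EAP": 1, "MWS": 2, "HPL": 3}

# Assumed readable names for punctuation marks; raw ':' and '|' are
# reserved characters in VW's text input format.
PUNCT_NAMES = {",": "comma", ";": "semicolon", ":": "colon",
               ".": "period", "!": "exclam", "?": "question"}


def words_and_bigrams(text):
    """Lower-cased word tokens followed by adjacent-word bigrams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def punctuation_counts(text):
    """One 'name:count' feature per punctuation mark that occurs."""
    feats = []
    for mark, name in PUNCT_NAMES.items():
        count = text.count(mark)
        if count:
            feats.append(f"{name}:{count}")
    return feats


def char_ngrams(text, n_min=2, n_max=3):
    """Character n-grams with VW-reserved characters replaced by '_'."""
    clean = re.sub(r"[\s|:]", "_", text.lower())
    return [clean[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(clean) - n + 1)]


def to_vw_line(text, author=None):
    """Format one sentence as a VW example; omit the label at test time."""
    label = str(LABELS[author]) if author is not None else ""
    namespaces = [
        "|w " + " ".join(words_and_bigrams(text)),
        "|p " + " ".join(punctuation_counts(text)),
        "|c " + " ".join(char_ngrams(text)),
    ]
    return (label + " " + " ".join(namespaces)).strip()


if __name__ == "__main__":
    print(to_vw_line("Once upon a midnight dreary, while I pondered.", "EAP"))
```

Lines in this shape are what VW's one‑against‑all mode (for example `vw --oaa 3`) reads. Keeping each feature family in its own namespace means the style features can be added, dropped or reweighted without touching the word features.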

Next, a TF‑IDF pipeline combined word unigrams and bigrams with character 2‑to‑5‑grams, producing a sparse matrix that rivaled the VW results. The author then stacked out‑of‑fold predictions from both the VW and TF‑IDF models into a final ensemble, which achieved the best multiclass log‑loss on the validation set. The study also compared BM25 weighting with Word2Vec and FastText embeddings, finding that the dense embeddings offered no clear advantage for this task.
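For readers unfamiliar with the validation metric named above, the sketch below computes multiclass log‑loss: the mean negative log of the probability a model assigns to the true author. It also shows a weighted average of two models' probabilities, which is a much simpler stand‑in for the stacked ensemble and not the author's method (a stacker is a second model trained on out‑of‑fold predictions). The function names and the toy probabilities are invented purely for illustration.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true holds class indices; probs holds one probability row per example.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0), then renormalize so the row sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in row]
        total -= math.log(clipped[label] / sum(clipped))
    return total / len(y_true)


def average_blend(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' class probabilities, row by row."""
    return [[weight_a * a + (1 - weight_a) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(probs_a, probs_b)]


if __name__ == "__main__":
    # Toy data: class indices 0, 1, 2 stand for the three authors.
    y_true = [0, 1, 2]
    vw_probs = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
    tfidf_probs = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    blended = average_blend(vw_probs, tfidf_probs)
    for name, p in [("vw", vw_probs), ("tfidf", tfidf_probs), ("blend", blended)]:
        print(name, round(multiclass_log_loss(y_true, p), 4))
```

Because the negative log is convex, the averaged predictions can never score worse than the weaker of the two inputs on this metric, which is one reason blending and stacking are common when a competition is judged on log‑loss.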

The notebook and code are publicly available on GitHub, allowing practitioners to replicate the workflow or adapt the feature engineering tricks to other authorship problems. By extracting style signals without neural networks, the study shows that classical NLP still competes with modern embeddings on niche, low‑resource classification tasks.