HeadlinesBriefing.com

Robust Variable Selection Boosts Scoring Model Accuracy

Towards Data Science

Stratified cross-validation checks that each variable's association with default holds across data subsets. On the Credit Scoring Dataset (32,581 loans), four rules eliminate unstable or redundant variables. Rule 1 drops any continuous variable that fails the Kruskal-Wallis test in any fold. Rule 2 removes any categorical variable whose Cramér’s V falls below 10% in any fold. Rule 3 resolves each pair of correlated continuous variables (Spearman ≥60%) by keeping the one with the stronger link to default. Rule 4 applies similar logic to categorical variables. This leaves 7 auditable variables: 5 continuous (person_income, person_age, etc.) and 2 categorical (person_home_ownership, loan_intent). Each passes its test on all four folds.
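As a rough illustration of Rules 1 and 2, the sketch below keeps a variable only if it passes its test on every training fold. It is not the article's code: the function names, the 5% significance level for Kruskal-Wallis, the use of scikit-learn's `StratifiedKFold`, and the synthetic data are all assumptions made for the example. Only the four folds and the 10% Cramér’s V threshold come from the text.

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal
from sklearn.model_selection import StratifiedKFold


def cramers_v(x, y):
    """Cramér's V between two categorical arrays (no bias correction)."""
    _, x_codes = np.unique(x, return_inverse=True)
    _, y_codes = np.unique(y, return_inverse=True)
    table = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(table, (x_codes, y_codes), 1)  # contingency counts
    k = min(table.shape) - 1
    if k == 0:  # a constant variable carries no association
        return 0.0
    chi2 = chi2_contingency(table, correction=False)[0]
    return float(np.sqrt(chi2 / (table.sum() * k)))


def stable_variables(continuous, categorical, y,
                     n_splits=4, alpha=0.05, v_min=0.10, seed=0):
    """Return the names that pass their test on every training fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    keep = set(continuous) | set(categorical)
    for train_idx, _ in skf.split(np.zeros(len(y)), y):
        y_tr = y[train_idx]
        # Rule 1: Kruskal-Wallis must be significant in this fold (alpha assumed).
        for name, col in continuous.items():
            x_tr = col[train_idx]
            if kruskal(x_tr[y_tr == 0], x_tr[y_tr == 1]).pvalue >= alpha:
                keep.discard(name)
        # Rule 2: Cramér's V must reach the 10% threshold in this fold.
        for name, col in categorical.items():
            if cramers_v(col[train_idx], y_tr) < v_min:
                keep.discard(name)
    return sorted(keep)


# Synthetic stand-in data (hypothetical column names, not the real dataset).
rng = np.random.default_rng(0)
n = 2000
y = rng.binomial(1, 0.2, n)
continuous = {
    "income": rng.normal(50, 10, n) - 8 * y,   # shifted for defaulters
    "noise": rng.normal(size=n),               # unrelated to default
}
categorical = {
    "home": np.where(rng.random(n) < 0.3 + 0.4 * y, "RENT", "OWN"),
    "color": rng.choice(["a", "b", "c"], n),   # unrelated to default
}
selected = stable_variables(continuous, categorical, y)
print(selected)
```

Only the training indices of each split are used, so the held-out part of each fold never influences the selection.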

The dataset’s continuous variables include loan amount and borrower income; its categorical variables track defaults and loan purposes. Because the rules are applied to training folds only, the method prevents data leakage. A variable that fails the test in any fold is excluded, which avoids the pitfall of a variable that performs well on training data but fails in production. The final selection shows stable associations with default across all data splits.

Correlation checks led to loan_grade and cb_person_cred_hist_length being dropped as redundant. The variable loan_intent survived despite a weak full-dataset correlation because it passed the fold-specific tests. The process emphasizes stability over raw performance metrics. Regulators and stakeholders can audit each decision, which keeps the selection transparent. Future work will examine the monotonicity and temporal stability of these variables.
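The redundancy rule for continuous variables (Rule 3) could look something like the sketch below. The summary does not say how "strongest default link" is measured, so the Kruskal-Wallis H statistic is used here as an assumed stand-in. The greedy ordering, the function name `drop_redundant`, and the synthetic columns are likewise illustrative rather than the article's method.

```python
import numpy as np
from scipy.stats import kruskal, spearmanr


def drop_redundant(columns, y, rho_max=0.60):
    """Greedy filter: visit variables from strongest to weakest link with
    default and keep one only if |Spearman rho| < rho_max against every
    variable already kept."""
    # Assumed strength measure: Kruskal-Wallis H between the two classes.
    strength = {
        name: kruskal(col[y == 0], col[y == 1]).statistic
        for name, col in columns.items()
    }
    kept = []
    for name in sorted(columns, key=strength.get, reverse=True):
        if all(abs(spearmanr(columns[name], columns[k])[0]) < rho_max
               for k in kept):
            kept.append(name)
    return kept


# Synthetic stand-in data (hypothetical names, not the real dataset).
rng = np.random.default_rng(1)
n = 2000
y = rng.binomial(1, 0.2, n)
age = rng.normal(35, 8, n) - 3 * y
columns = {
    "age": age,
    "history_len": age - 20 + rng.normal(0, 2, n),  # nearly a copy of age
    "income": rng.normal(50, 10, n) - 5 * y,        # independent of age
}
kept = drop_redundant(columns, y)
print(kept)
```

In this toy setup `age` and `history_len` are strongly rank-correlated, so only one of the two survives, while `income` is kept alongside it.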

Cramér’s V and Kruskal-Wallis tests form the statistical backbone. The method prioritizes variables that matter consistently across subsets, not just on the full dataset. This reduces the risk of overfitting and improves model generalizability. For credit scoring, where accuracy and fairness are critical, such rigor is essential. The approach balances statistical rigor with practical interpretability, which makes it valuable for financial institutions deploying production models.
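For reference, the two statistics can be written out as follows. These are the standard textbook definitions rather than formulas quoted from the article, and the symbols (n, r, c, G, R_g, n_g) are introduced here only for illustration.

```latex
% Cramér's V for an r x c contingency table with n observations
V = \sqrt{\frac{\chi^2}{n \, \min(r - 1,\; c - 1)}}

% Kruskal-Wallis statistic for groups g = 1..G (no tie correction),
% where R_g is the rank sum and n_g the size of group g
H = \frac{12}{n(n + 1)} \sum_{g=1}^{G} \frac{R_g^2}{n_g} - 3(n + 1)
```

With a binary default flag, c = 2 and G = 2, so V reduces to \sqrt{\chi^2 / n} and H is compared against a chi-square distribution with one degree of freedom.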