Every credit model in production has the same quiet problem: it was trained on a biased sample. The approved applicants. The ones where outcomes are known. But lending decisions extend to everyone — including the people the model was never trained on.
This project traces that problem through five phases of analysis, using 307,000 Home Credit loan applications enriched with a synthetic rejection population. We asked three questions. First: how bad is the selection bias, exactly? Second: can three established correction methods fix it? Third: does any of this change how the model treats different groups of people?
The answers surprised us in some places and confirmed what we expected in others. Selection bias turns out to be a calibration problem more than a discrimination problem — and its effects concentrate precisely where the training data was thinnest.
Understanding the Borrowers
Before fitting a single model, we spent time understanding the 307,511 loan applications in the dataset — who these borrowers are, what drives their default risk, and how those risk factors interact. That groundwork ended up shaping every modeling decision that followed.
The three things that drive default
External credit scores dominate. The composite score we built from three bureau sources separates the bottom quintile (roughly 19% default rate) from the top quintile (roughly 2%) — a spread no other feature comes close to replicating. Age and employment history add independent signal, but they're operating in the shadow of the external scores.
Age is the second most powerful variable, and it interacts with almost everything. Borrowers in their 20s default at 11.4%. By their 60s, that rate is 4.9%. What's striking is that this age gradient holds up within education groups, income brackets, and housing types — it's not a proxy for something else.
Debt burden amplifies baseline risk rather than creating its own. A high debt-to-income ratio barely moves the needle for older, higher-educated, home-owning borrowers. For young renters with limited credit history, it compounds existing vulnerability substantially.
Risk isn't driven by a single factor. It emerges from the interaction of baseline vulnerability (age, education), financial stress (debt-to-income, credit load), and stability signals (housing type, income source). External scores capture and compress all of these dynamics into a single number.
Building the synthetic rejection population
The Home Credit dataset doesn't include rejected applicants by design — outcomes are only observed for approved loans. To study selection bias, we constructed a synthetic rejection rule using the same risk factors identified in the EDA: external scores, debt burden, age, housing type, and income stability.
The result: an overall approval rate of 63.5%, with approval rates ranging from 36.4% for borrowers in their 20s to 79.7% for those over 60. The approved group has a 5.1% default rate; the rejected population's true default rate is 13.3%. That gap is the selection distortion we're trying to measure and correct.
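A rule like this can be illustrated with a small sketch. The feature names, distributions, weights, and threshold below are invented for illustration; only the 63.5% approval rate mirrors the figure above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Illustrative applicant features (names and distributions are assumptions,
# not the project's actual data): higher ext_score and age mean lower risk.
ext_score = rng.uniform(0, 1, n)   # composite external bureau score
age = rng.integers(21, 70, n)
dti = rng.uniform(0, 0.6, n)       # debt-to-income ratio

# Risk score with illustrative weights; lower is safer.
risk = 1.5 * (1 - ext_score) + 0.8 * dti + 0.5 * (1 - (age - 21) / 49)

# Approve the safest ~63.5% of applicants, mirroring the reported rate.
threshold = np.quantile(risk, 0.635)
approved = risk <= threshold
print(f"approval rate: {approved.mean():.3f}")
```

Because approval is a deterministic function of observable features, the rejected population's outcomes are knowable in the synthetic setup, which is what makes the later error measurements possible.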
Baseline Models & The Size of the Problem
The baseline exercise is deceptively simple: train on approved applicants only — because that's the only data where outcomes are observed — and then test on everyone. If selection bias matters, performance should degrade as the test population moves away from the training distribution.
It does. But the way it degrades is specific, and that specificity matters for how you'd fix it.
| Model | Test Population | AUC | Log-Loss | ECE |
|---|---|---|---|---|
| Logistic Regression | Approved | 0.7032 | 0.1857 | 0.0018 |
| Logistic Regression | Full population | 0.7479 | 0.2491 | 0.0031 |
| Logistic Regression | Rejected | 0.7114 | 0.3596 | 0.0057 |
| XGBoost | Approved | 0.7236 | 0.1823 | 0.0016 |
| XGBoost | Full population | 0.7627 | 0.2448 | 0.0034 |
| XGBoost | Rejected | 0.7286 | 0.3537 | 0.0077 |
The AUC numbers tell a somewhat counterintuitive story. Logistic regression is actually a slightly better ranker on rejected applicants (0.7114) than on approved ones (0.7032). This isn't a coincidence: rejected applicants default at 13.3% versus 5.1% for approved applicants, and that more balanced class mix makes ranking easier even when the probabilities themselves are miscalibrated. The model can tell who is riskier relative to whom; it just can't say how risky anyone is in absolute terms.
ECE tells the real story. Calibration error nearly triples for logistic regression and nearly quintuples for XGBoost on rejected applicants. For any application that relies on the probability estimate — pricing, reserve-setting, portfolio risk assessment — this is a material problem.
Selection bias in this dataset is primarily a calibration distortion, not a discrimination failure. AUC moves modestly across populations. ECE moves sharply. A model can rank borrowers correctly while being systematically wrong about the actual probability of default for anyone it wasn't trained to see.
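ECE is the workhorse metric in these tables. A minimal equal-width-bin implementation (the binning scheme is an assumption; ECE variants differ) looks like this:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: the bin-size-weighted mean of
    |observed default rate - mean predicted probability| per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return ece

# Sanity check: labels drawn from the predicted probabilities themselves
# are perfectly calibrated, so ECE should be near zero.
rng = np.random.default_rng(1)
p = rng.uniform(0.0, 1.0, 200_000)
y = (rng.uniform(size=p.size) < p).astype(int)
print(round(expected_calibration_error(y, p), 4))
```

Note that ECE is invariant to monotone-preserving rank quality: a model can score well here while ranking poorly, and vice versa, which is exactly the dissociation the table shows.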
Three Ways to Fix It
Once we'd quantified the problem, we tested three established reject inference methods — each with a different philosophy about how to handle missing outcome data for rejected applicants.
Parcelling
Score all applicants with the baseline model, sort into risk bands, and assign imputed default labels to rejected applicants based on observed rates in those same bands — with a 1.2× upward adjustment. Industry standard, single-pass, assumption-heavy.
Expectation-Maximization
Iterative. Each rejected applicant is duplicated as two weighted rows — one labeled default, one non-default — with weights equal to the current soft probability. Repeats until parameters stabilize. No stochastic noise. Converges on the true population structure.
Inverse Probability Weighting
Never guesses rejected outcomes. Instead, upweights approved applicants who look like they could have been rejected. The model sees more of its own borderline cases. Clean assumptions — but information loss is real (ESS dropped to 36.8%).
Parcelling: the industry standard
The band composition table from parcelling reveals the selection structure clearly: in the lowest-risk band, approved applicants outnumber rejected 18,456 to 1,532. In the highest-risk band, that ratio inverts to 2,208 approved versus 17,781 rejected. The 1.2× adjustment factor slightly overshoots the true rejected default rate (imputed: 14.9%, actual: 13.3%).
The sensitivity analysis is the more instructive output. AUC is almost perfectly stable across all 16 combinations of band count and adjustment factor — ranging only from 0.7430 to 0.7479. ECE is highly sensitive to the adjustment factor. At 1.0 (no inflation), ECE is competitive with the baseline. At 2.0, it's substantially worse. This illustrates parcelling's core limitation: the adjustment factor is an unverifiable assumption. You cannot calibrate it against data you don't have.
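The banding-and-imputation step can be sketched as follows. This is a toy version with synthetic scores, not the project's pipeline; band edges, data, and seed are illustrative.

```python
import numpy as np

def parcel_labels(scores_app, y_app, scores_rej, n_bands=10,
                  adjustment=1.2, seed=0):
    """Impute default labels for rejected applicants from the observed
    default rate in each score band, inflated by `adjustment` (the 1.2x
    factor discussed above). Band edges come from approved-score quantiles."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(scores_app, np.linspace(0, 1, n_bands + 1))
    band_app = np.clip(np.digitize(scores_app, edges) - 1, 0, n_bands - 1)
    band_rej = np.clip(np.digitize(scores_rej, edges) - 1, 0, n_bands - 1)
    y_rej = np.zeros(scores_rej.size, dtype=int)
    for b in range(n_bands):
        in_band = band_app == b
        rate = y_app[in_band].mean() if in_band.any() else y_app.mean()
        rate = min(1.0, adjustment * rate)   # the unverifiable assumption
        sel = band_rej == b
        y_rej[sel] = (rng.uniform(size=sel.sum()) < rate).astype(int)
    return y_rej

# Toy data: default probability rises with score; rejected skew riskier.
rng = np.random.default_rng(2)
s_app = rng.uniform(0, 1, 50_000)
y_app = (rng.uniform(size=s_app.size) < 0.02 + 0.1 * s_app).astype(int)
s_rej = rng.uniform(0.3, 1.0, 20_000)
y_imp = parcel_labels(s_app, y_app, s_rej)
print(f"imputed rejected default rate: {y_imp.mean():.3f}")
```

The `adjustment` argument is the whole story: every downstream calibration property inherits whatever value you pick for it.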
EM: closing in on the true population
After 100 iterations of the weighted-expansion approach, the EM algorithm's final aggregate parameter change was 0.014, which works out to below the 0.001 per-coefficient tolerance once spread across the full coefficient vector. The final imputed default rate for rejected applicants was 13.31%, nearly identical to the true 13.32%. This is EM's main accomplishment: without ever seeing rejected outcomes, it recovered the population-level default rate with high accuracy.
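The weighted-expansion mechanics can be sketched in a few dozen lines. This is a simplified stand-in (plain gradient-ascent logistic regression, synthetic data, invented coefficients), not the project's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logreg(X, y, w, beta0=None, n_iter=300, lr=0.5):
    """Weighted logistic regression by gradient ascent (minimal sketch)."""
    beta = np.zeros(X.shape[1]) if beta0 is None else beta0.copy()
    for _ in range(n_iter):
        beta += lr * (X.T @ (w * (y - sigmoid(X @ beta))) / w.sum())
    return beta

def em_reject_inference(X_app, y_app, X_rej, max_em=50, tol=1e-4):
    """Weighted-expansion EM: each rejected applicant appears as two rows,
    one labeled default (weight p) and one non-default (weight 1 - p)."""
    X = np.vstack([X_app, X_rej, X_rej])
    y = np.concatenate([y_app, np.ones(len(X_rej)), np.zeros(len(X_rej))])
    beta = weighted_logreg(X_app, y_app, np.ones(len(y_app)))  # warm start
    for _ in range(max_em):
        p = sigmoid(X_rej @ beta)                        # E-step: soft labels
        w = np.concatenate([np.ones(len(y_app)), p, 1.0 - p])
        new_beta = weighted_logreg(X, y, w, beta0=beta)  # M-step
        if np.abs(new_beta - beta).max() < tol:
            beta = new_beta
            break
        beta = new_beta
    return beta

# Synthetic demo (all parameters illustrative): selection depends only on an
# observable feature, so EM can extrapolate into the rejected region.
rng = np.random.default_rng(3)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.uniform(size=n) < sigmoid(X @ np.array([-2.0, 1.0, 0.5]))).astype(float)
approved = X[:, 1] < 0.5
beta_hat = em_reject_inference(X[approved], y[approved], X[~approved])
print(f"imputed rejected rate: {sigmoid(X[~approved] @ beta_hat).mean():.3f}")
print(f"true rejected rate:    {y[~approved].mean():.3f}")
```

The imputed and true rejected default rates land close together, mirroring the 13.31% versus 13.32% result above: when selection acts only through observables and the model is well specified, the soft labels converge toward the population structure.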
IPW: reweighting without imputing
The propensity model (XGBoost) achieved an AUC of 0.9399 — meaning the approval decisions are almost perfectly predictable from the feature set. This is expected, since our approval filter is built from observable features. A high propensity AUC has an important implication: IPW has limited room to add new information, because the model can already extrapolate into the rejected region by following feature-to-outcome relationships it learned from approved applicants.
The effective sample size after trimming at the 99th percentile is 46,756 — about 36.8% of nominal. The information cost of reweighting is substantial.
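The weight construction and ESS calculation (Kish's formula) are compact enough to show directly. The propensity scores below are simulated for illustration, not the project's XGBoost outputs.

```python
import numpy as np

def ipw_weights(p_approve, trim_pct=99):
    """w = 1 / P(approve) for each approved applicant, trimmed at the given
    percentile so a handful of near-zero propensities can't dominate."""
    w = 1.0 / np.clip(p_approve, 1e-6, 1.0)
    return np.minimum(w, np.percentile(w, trim_pct))

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return w.sum() ** 2 / (w ** 2).sum()

# Simulated propensity scores for approved applicants (illustrative only):
# borderline approvals (low P(approve)) receive the largest weights.
rng = np.random.default_rng(4)
p_approve = rng.beta(5, 2, 100_000)
w = ipw_weights(p_approve)
print(f"ESS fraction of nominal: {effective_sample_size(w) / w.size:.3f}")
```

ESS equals the nominal sample size only when all weights are equal; the more skewed the weights, the more the reweighted sample behaves like a smaller one, which is where the 36.8% figure comes from.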
How they compare
| Method | AUC (full) | AUC (rejected) | ECE (full) | ECE (rejected) | Mean predicted |
|---|---|---|---|---|---|
| Baseline LR | 0.7479 | 0.7114 | 0.0031 | 0.0057 | 0.0803 |
| Parcelling | 0.7470 | 0.7109 | 0.0047 | 0.0089 | 0.0854 |
| EM | 0.7479 | 0.7113 | 0.0027 | 0.0072 | 0.0802 |
| IPW | 0.7450 | 0.7125 | 0.0033 | 0.0078 | 0.0795 |
| True default rate | — | — | — | — | 0.0807 |
No correction method improved discrimination meaningfully — AUC moves by less than 0.003 across all methods. The reason is that our approval filter operates entirely on observable features the model can already access. The model partially self-corrects for selection bias just by following the feature-to-outcome relationship into the rejected region of feature space. EM improves full-population ECE to 0.0027 — better than the baseline — while IPW delivers the best rejected-population AUC at 0.7125. Parcelling is the weakest method on both dimensions due to its fixed multiplicative adjustment.
Not Just Whether — But When
Binary classification flattens time. A borrower who defaults in month 3 and one who defaults in month 30 both get a label of 1 — but they represent fundamentally different operational problems. Early defaults hit provisioning immediately. Late defaults may be partially offset by months of payments already received. A model that ignores timing misses that distinction entirely.
More importantly for this project: selection bias has a temporal dimension that binary models can't see. Because the approval filter disproportionately screens out high-risk borrowers, and high-risk borrowers default earlier, the approved-only training set is missing the fastest defaulters. The baseline model learns to underestimate early default hazard.
We built three survival models: an approved-only Cox PH baseline, an IPW-reweighted Cox model, and a DeepHitSingle neural survival model. All three use the top 25 features by XGBoost importance and are evaluated on the same test set.
| Model | C-Index | IBS | IBS vs. null (0.25) |
|---|---|---|---|
| Cox Baseline (approved only) | 0.6808 | 0.0606 | 0.1894 |
| Cox IPW-reweighted | 0.6805 | 0.0592 | 0.1908 |
| DeepHitSingle (neural) | 0.7216 | 0.0590 | 0.1910 |
DeepHitSingle outperforms both Cox models on C-index by a meaningful margin — 0.0408 over the approved-only baseline. This reflects its ability to model non-linear and non-proportional hazard relationships that Cox PH cannot capture by assumption. When the proportional hazards constraint is the binding limitation, relaxing it matters.
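For reference, the C-index behind these comparisons is Harrell's concordance: among comparable pairs, how often does the model assign the higher risk score to the borrower who fails first? A minimal O(n²) sketch:

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index under right-censoring. A pair (i, j) is comparable
    when i has an observed event strictly before time j; it is concordant
    when i also carries the higher risk score. Risk ties count half."""
    n_conc, n_comp = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue            # a censored subject can't be the earlier failure
        for j in range(n):
            if time[i] < time[j]:
                n_comp += 1
                if risk[i] > risk[j]:
                    n_conc += 1
                elif risk[i] == risk[j]:
                    n_conc += 0.5
    return n_conc / n_comp

# Toy check: risk perfectly anti-monotone with survival time.
t = np.array([1.0, 2.0, 3.0, 4.0])
e = np.array([1, 1, 0, 1])      # third subject is censored
r = np.array([4.0, 3.0, 2.0, 1.0])
print(concordance_index(t, e, r))   # prints 1.0
```

In practice an O(n log n) implementation (or a library routine) is used; the quadratic version above is just the definition made executable.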
The IBS story is where selection-bias correction shows up. IPW reduces IBS from 0.0606 to 0.0592 — a 2.3% improvement. DeepHitSingle reaches essentially the same IBS (0.0590) without an explicit correction, consistent with the Phase 2 pattern: flexible models partially self-correct for observable selection mechanisms.
Survival analysis makes visible a distortion that binary classification cannot detect: the approved-only model systematically underestimates early-period default hazard because the fastest defaulters were disproportionately rejected. IPW corrects this in the first 12 months. For a lender, that's exactly where provisioning decisions are made.
Bias, Variance, and Why Simple Models Hold Up
The bias-variance analysis produced one finding we expected and one that required explanation.
XGBoost, as expected, shows a modest positive overfitting gap — training AUC of 0.7725 versus validation AUC of 0.7560. The hyperparameter sweep confirms depth=4 as the right operating point. Going deeper widens the gap without improving validation AUC.
Logistic regression shows a negative overfitting gap of -0.0315. Its validation AUC (0.7445) is higher than its training AUC (0.7130). This needs explanation.
The reason is structural. The validation set includes rejected applicants, who default at 13.3% — far higher than the approved-only training set's 5.1%. That higher base rate makes it easier for any model to separate high-risk from low-risk borrowers. Logistic regression, because of its simplicity, also avoids memorizing approval-specific patterns that would fail on rejected applicants. Its high bias turns out to be a form of robustness to population shift.
Logistic regression's underfitting is partially self-correcting in the presence of selection bias. Its inability to memorize approval-specific patterns means it generalizes more stably when tested outside the approved population. XGBoost is the right choice for discrimination. Logistic regression — or IPW-corrected LR — is the right choice when the probability estimate itself matters.
Who Gets the Wrong Probability?
Aggregate metrics hide group-level disparities. A model can be well-calibrated on average while being systematically wrong for specific demographic segments. In consumer lending, that distinction carries legal weight — ECOA, Reg B, and CFPB disparate-impact guidance all require evaluation at the subgroup level, not just in aggregate.
We evaluated three demographic axes: gender, education, and age. For each, we measured calibration error (ECE), mean predicted probability, false positive rate, and false negative rate under both the baseline and IPW-corrected models — all on the full-population test set, including rejected applicants.
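The per-group evaluation loop can be sketched as follows. The groups, rates, and the miscalibration injected for group "B" are all synthetic illustrations.

```python
import numpy as np

def ece(y, p, n_bins=10):
    """Equal-width-bin expected calibration error (minimal sketch)."""
    idx = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            total += m.mean() * abs(y[m].mean() - p[m].mean())
    return total

def groupwise_report(y, p, groups):
    """Per-group (ECE, mean prediction); the gap is max minus min group ECE."""
    stats = {g: (ece(y[groups == g], p[groups == g]),
                 p[groups == g].mean()) for g in np.unique(groups)}
    eces = [v[0] for v in stats.values()]
    return stats, max(eces) - min(eces)

# Illustrative demo: group "B" gets systematically inflated predictions,
# so aggregate calibration hides a group-level distortion.
rng = np.random.default_rng(5)
n = 60_000
groups = np.where(rng.uniform(size=n) < 0.5, "A", "B")
p_true = rng.uniform(0.02, 0.2, n)
y = (rng.uniform(size=n) < p_true).astype(int)
p_hat = np.where(groups == "B", np.clip(p_true + 0.05, 0, 1), p_true)
stats, gap = groupwise_report(y, p_hat, groups)
print(f"ECE gap across groups: {gap:.4f}")
```

This is the shape of the measurement: the gap statistic is what the baseline-versus-IPW comparisons below report for each demographic axis.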
Where IPW helps: age
The baseline model's ECE gap across age groups was 0.0099. IPW cuts that to 0.0054 — a 45% reduction. The improvement is concentrated on borrowers under 30, where ECE falls from 0.0139 to 0.0107. This is mechanistically consistent: age was the strongest input to the approval filter, so younger applicants were most over-represented in the rejected population. IPW correctly concentrates its corrective effect on the group that selection affected most.
Where IPW is neutral: gender and education
Gender ECE gap changes by +0.0004. Education ECE gap changes by +0.0008. Both within sampling noise. Neither represents a meaningful directional shift. Gender and education weren't primary drivers of the approval filter, so there was no selection-induced calibration distortion along those axes for IPW to correct — and it didn't introduce new distortions either.
The threshold problem
At a fixed population-level decision threshold, IPW widens the FNR gap on age from 0.4193 to 0.4589. This sounds bad, but the mechanism is specific: IPW pushes predicted probabilities upward for younger, higher-risk applicants (who look most like the rejected population), which at a fixed threshold improves detection of young defaulters (under-30 FNR drops from 0.1741 to 0.1674) while worsening detection of older ones (60+ FNR rises from 0.5934 to 0.6262).
The 60+ group suffers most because their true default rate (5.1%) already sits below the population-level threshold: the modest shifts IPW produces in their predicted probabilities are not enough to lift their defaulters over the cutoff. This is a property of single-threshold decision rules, not a failure of IPW, and resolving it requires segment-specific thresholds.
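One form the segment-specific-threshold remedy can take: choose each group's cutoff so that its false negative rate hits a common target. The target FNR and the data below are illustrative, not the project's calibration.

```python
import numpy as np

def per_group_thresholds(y, p, groups, target_fnr=0.20):
    """For each group, pick the threshold whose FNR (share of true defaulters
    scored below it) equals target_fnr. Since FNR at threshold t is the
    fraction of defaulter scores below t, the threshold is simply the
    target-FNR quantile of that group's defaulter scores."""
    return {g: np.quantile(p[(groups == g) & (y == 1)], target_fnr)
            for g in np.unique(groups)}

# Toy data: the older group has a lower default rate and lower scores,
# so equalizing FNR forces a lower cutoff for them.
rng = np.random.default_rng(6)
n = 40_000
groups = np.where(rng.uniform(size=n) < 0.5, "young", "older")
base = np.where(groups == "young", 0.12, 0.05)   # illustrative default rates
y = (rng.uniform(size=n) < base).astype(int)
p = np.clip(base + rng.normal(0, 0.03, n) + 0.05 * y, 0, 1)
th = per_group_thresholds(y, p, groups)
print({g: round(t, 3) for g, t in th.items()})
```

Equalizing FNR this way trades off other error rates (per-group FPR will differ), so the choice of which disparity to equalize remains a policy decision, not a statistical one.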
IPW improves fairness precisely where selection bias was the underlying cause of disparity, and stays out of the way elsewhere. For regulatory reporting, that's the ideal behavior. For threshold-based error rate disparity, probability recalibration alone is not sufficient — segment-specific thresholds or explicit fairness constraints during training are required.