Applied Machine Learning  ·  Credit Risk  ·  Selection Bias

Who Did We Leave Out?

Credit models learn from borrowers who were approved. But the people who were turned down are the ones where the predictions matter most. This is what happens when you take that seriously.

Every credit model in production has the same quiet problem: it was trained on a biased sample. The approved applicants. The ones where outcomes are known. But lending decisions extend to everyone — including the people the model was never trained on.

This project traces that problem through five phases of analysis, using 307,000 Home Credit loan applications enriched with a synthetic rejection population. We asked three questions. First: how bad is the selection bias, exactly? Second: can three established correction methods fix it? Third: does any of this change how the model treats different groups of people?

The answers surprised us in some places and confirmed what we expected in others. Selection bias turns out to be a calibration problem more than a discrimination problem — and its effects concentrate precisely where the training data was thinnest.

01
Exploratory Analysis

Understanding the Borrowers

Before fitting a single model, we spent time understanding the 307,511 loan applications in the dataset — who these borrowers are, what drives their default risk, and how those risk factors interact. That groundwork ended up shaping every modeling decision that followed.

The three things that drive default

External credit scores dominate. The composite score we built from three bureau sources separates the bottom quintile (roughly 19% default rate) from the top quintile (roughly 2%) — a spread no other feature comes close to replicating. Age and employment history add independent signal, but they're operating in the shadow of the external scores.

Age is the second most powerful variable, and it interacts with almost everything. Borrowers in their 20s default at 11.4%. By their 60s, that rate is 4.9%. What's striking is that this age gradient holds up within education groups, income brackets, and housing types — it's not a proxy for something else.

Debt burden amplifies baseline risk rather than creating its own. A high debt-to-income ratio barely moves the needle for older, higher-educated, home-owning borrowers. For young renters with limited credit history, it compounds existing vulnerability substantially.

Default rate by education type
Default rate by education level. The gap between no high school diploma (10.9%) and higher education (5.4%) is meaningful, though both sit above academic degree holders (1.8%) — a small group where sample size limits interpretation.
Default rate by age and external score
Age × External Score. The interaction is additive — both factors push default rates in the same direction. Young borrowers with low external scores approach 19% default; older borrowers with high scores drop below 2%.
Default rate by age and credit-to-income
Age × Credit-to-Income. Debt burden amplifies baseline risk most among younger borrowers. The effect flattens considerably after age 50.
Default rate by education and external score
Education × External Score. Even within education groups, external scores create a strong gradient. Education matters, but it doesn't offset a poor external score.
Default rate by housing and CTI
Housing × Credit-to-Income. Renters remain consistently higher risk across all debt levels. Housing stability appears to capture financial resilience that income and debt ratios miss.

Risk isn't driven by a single factor. It emerges from the interaction of baseline vulnerability (age, education), financial stress (debt-to-income, credit load), and stability signals (housing type, income source). External scores capture and compress all of these dynamics into a single number.

Building the synthetic rejection population

The Home Credit dataset doesn't include rejected applicants by design — outcomes are only observed for approved loans. To study selection bias, we constructed a synthetic rejection rule using the same risk factors identified in the EDA: external scores, debt burden, age, housing type, and income stability.

The result: an overall approval rate of 63.5%, with approval rates ranging from 36.4% for borrowers in their 20s to 79.7% for those over 60. The approved group has a 5.1% default rate; the rejected population's true default rate is 13.3%. That gap is the selection distortion we're trying to measure and correct.
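The rejection rule itself isn't reproduced here; the following is a minimal sketch of the kind of score-and-threshold filter described, with illustrative feature distributions and weights (none taken from the project) and the cutoff tuned to the reported 63.5% approval rate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical applicant features (names, distributions, and weights are
# illustrative, not the project's actual rule).
ext_score = rng.beta(4, 2, n)        # composite external score, higher = safer
dti       = rng.gamma(2.0, 0.15, n)  # debt-to-income ratio
age       = rng.integers(21, 70, n)  # applicant age in years
renter    = rng.random(n) < 0.3      # housing-stability proxy

# Risk score: each factor pushes in the direction the EDA found.
risk = (
    2.0 * (1 - ext_score)            # low external score -> risky
    + 1.0 * np.clip(dti, 0, 1)       # high debt burden -> risky
    + 0.8 * (age < 30)               # young borrowers -> risky
    + 0.5 * renter                   # renting -> risky
)

# Approve the safest ~63.5% of applicants, matching the reported rate.
threshold = np.quantile(risk, 0.635)
approved = risk <= threshold
```

Because the rule is built entirely from observable features, approval rates fall out differently by subgroup for free: any group with worse feature values (here, the under-30s) is rejected more often, which is the age gradient in the figure below.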

Approval rate by age group
Approval rate by age group. The approval filter is strongly age-dependent. Fewer than 4 in 10 applicants under 30 are approved, compared to nearly 8 in 10 borrowers over 60.
Approval rate by external score
Approval rate by external score quintile. The filter operates heavily through external scores — the same variable that most strongly predicts default. Applicants in the bottom quintile face only a 16.8% approval rate.
02
Phase 1

Baseline Models & The Size of the Problem

The baseline exercise is deceptively simple: train on approved applicants only — because that's the only data where outcomes are observed — and then test on everyone. If selection bias matters, performance should degrade as the test population moves away from the training distribution.

It does. But the way it degrades is specific, and that specificity matters for how you'd fix it.

3–5× ECE increase, approved → rejected
0.7032 LR AUC on approved test
0.7114 LR AUC on rejected test
94% Log-loss increase, approved → rejected
| Model | Test Population | AUC | Log-Loss | ECE |
| --- | --- | --- | --- | --- |
| Logistic Regression | Approved | 0.7032 | 0.1857 | 0.0018 |
| Logistic Regression | Full population | 0.7479 | 0.2491 | 0.0031 |
| Logistic Regression | Rejected | 0.7114 | 0.3596 | 0.0057 |
| XGBoost | Approved | 0.7236 | 0.1823 | 0.0016 |
| XGBoost | Full population | 0.7627 | 0.2448 | 0.0034 |
| XGBoost | Rejected | 0.7286 | 0.3537 | 0.0077 |

The AUC numbers tell a somewhat counterintuitive story. Logistic regression is actually a slightly better ranker on rejected applicants (0.7114) than on approved ones (0.7032). This isn't a coincidence: rejected applicants default at 13.3% versus 5.1% for approved applicants, and the higher base rate gives the ranker far more positive cases to separate, which makes discrimination easier even when the probabilities themselves are miscalibrated. The model can tell who is riskier relative to whom; it just can't tell you how risky they are in absolute terms.

ECE tells the real story. Calibration error nearly triples for logistic regression and nearly quintuples for XGBoost on rejected applicants. For any application that relies on the probability estimate — pricing, reserve-setting, portfolio risk assessment — this is a material problem.
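Since ECE carries the headline finding, it's worth pinning down the computation. This is a standard equal-width-binned implementation; the project's bin count isn't stated, so 10 bins is an assumption:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-binned ECE: the bin-weighted mean absolute gap between
    observed default rate and mean predicted probability."""
    y_true = np.asarray(y_true, float)
    y_prob = np.asarray(y_prob, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Last bin is closed on the right so prob = 1.0 is counted.
        mask = (y_prob >= lo) & ((y_prob < hi) if i < n_bins - 1 else (y_prob <= hi))
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Well-calibrated toy example: each score group's predicted probability
# matches its observed default rate, so the per-bin gaps cancel to ~0.
probs = np.array([0.05] * 100 + [0.95] * 100)
labels = np.array([0] * 95 + [1] * 5 + [1] * 95 + [0] * 5)
ece = expected_calibration_error(labels, probs)
```

On the rejected population, the per-bin gaps between predicted and observed default rates are what drive the 0.0057 and 0.0077 figures in the table above.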

Phase 1 calibration curves
Calibration curves across all three test populations. Curves closer to the diagonal represent better-calibrated models. The visual separation between approved and rejected populations is subtle but consistent — reflecting real calibration degradation rather than dramatic model failure.

Selection bias in this dataset is primarily a calibration distortion, not a discrimination failure. AUC moves modestly across populations. ECE moves sharply. A model can rank borrowers correctly while being systematically wrong about the actual probability of default for anyone it wasn't trained to see.

03
Phase 2

Three Ways to Fix It

Once we'd quantified the problem, we tested three established reject inference methods — each with a different philosophy about how to handle missing outcome data for rejected applicants.

01

Parcelling

Score all applicants with the baseline model, sort into risk bands, and assign imputed default labels to rejected applicants based on observed rates in those same bands — with a 1.2× upward adjustment. Industry standard, single-pass, assumption-heavy.

02

Expectation-Maximization

Iterative. Each rejected applicant is duplicated as two weighted rows — one labeled default, one non-default — with weights equal to the current soft probability. Repeats until parameters stabilize. No stochastic noise. Converges on the true population structure.

03

Inverse Probability Weighting

Never guesses rejected outcomes. Instead, upweights approved applicants who look like they could have been rejected. The model sees more of its own borderline cases. Clean assumptions — but information loss is real (ESS dropped to 36.8%).

Parcelling: the industry standard

The band composition table from parcelling reveals the selection structure clearly: in the lowest-risk band, approved applicants outnumber rejected 18,456 to 1,532. In the highest-risk band, that ratio inverts to 2,208 approved versus 17,781 rejected. The 1.2× adjustment factor slightly overshoots the true rejected default rate (imputed: 14.9%, actual: 13.3%).
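A minimal sketch of the parcelling mechanics as described: band all applicants by score, then sample labels for rejected applicants at each band's observed approved default rate times the adjustment. The band count, internal random seed, and sampling details are illustrative, not the project's exact implementation:

```python
import numpy as np

def parcel_impute(scores_app, y_app, scores_rej, n_bands=10, adjustment=1.2):
    """Parcelling: assign rejected applicants imputed default labels drawn
    at the band's observed approved rate, inflated by `adjustment`."""
    rng = np.random.default_rng(42)  # illustrative seed
    all_scores = np.concatenate([scores_app, scores_rej])
    edges = np.quantile(all_scores, np.linspace(0, 1, n_bands + 1))[1:-1]
    band_app = np.digitize(scores_app, edges)
    band_rej = np.digitize(scores_rej, edges)
    y_rej = np.zeros(len(scores_rej))
    for b in range(n_bands):
        in_band = band_app == b
        rate = y_app[in_band].mean() if in_band.any() else y_app.mean()
        rate = min(1.0, adjustment * rate)  # the unverifiable knob
        sel = band_rej == b
        y_rej[sel] = rng.random(sel.sum()) < rate
    return y_rej
```

When approved and rejected score distributions match, the imputed default rate lands near adjustment times the approved rate; the overshoot reported above (14.9% imputed vs. 13.3% actual) is this multiplier at work.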

Parcelling band composition
Risk band composition and default rates. Left: approved applicants dominate low-risk bands while rejected applicants dominate high-risk bands — the selection structure in visual form. Right: observed and 1.2× adjusted imputed rates by band.

The sensitivity analysis is the more instructive output. AUC is almost perfectly stable across all 16 combinations of band count and adjustment factor — ranging only from 0.7430 to 0.7479. ECE is highly sensitive to the adjustment factor. At 1.0 (no inflation), ECE is competitive with the baseline. At 2.0, it's substantially worse. This illustrates parcelling's core limitation: the adjustment factor is an unverifiable assumption. You cannot calibrate it against data you don't have.

Parcelling sensitivity analysis
Parcelling sensitivity: AUC and ECE across band counts and adjustment factors. AUC is insensitive to these choices. ECE is not. The correct adjustment factor is unknowable without rejected outcomes — which is precisely the missing data.

EM: closing in on the true population

After 100 iterations of the weighted-expansion approach, the EM algorithm's final parameter change was 0.014, still above the strict 0.001 tolerance, so the coefficients themselves had not fully stabilized. The quantity that matters had: the final imputed default rate for rejected applicants was 13.31%, nearly identical to the true 13.32%. This is EM's main accomplishment: without ever seeing rejected outcomes, it recovered the population-level default rate with high accuracy.
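The weighted-expansion scheme described in the method card can be sketched as follows. This is a toy implementation with a hand-rolled gradient-descent logistic fit, not the project's code; learning rate, step counts, and tolerances are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_weighted_logreg(X, y, w, beta0=None, lr=0.5, n_steps=2000):
    """Weighted logistic regression via plain gradient descent."""
    beta = np.zeros(X.shape[1]) if beta0 is None else beta0.copy()
    for _ in range(n_steps):
        p = sigmoid(X @ beta)
        beta += lr * (X.T @ (w * (y - p))) / w.sum()
    return beta

def em_reject_inference(X_app, y_app, X_rej, n_iter=50, tol=1e-3):
    """EM via weighted expansion: each rejected row appears twice, once
    labeled default and once non-default, weighted by the current soft
    probability; refit on the expanded data until parameters stabilize."""
    X_all = np.vstack([X_app, X_rej, X_rej])
    y_all = np.concatenate([y_app, np.ones(len(X_rej)), np.zeros(len(X_rej))])
    beta = fit_weighted_logreg(X_app, y_app, np.ones(len(X_app)))  # warm start
    for _ in range(n_iter):
        p_rej = sigmoid(X_rej @ beta)                              # E-step
        w_all = np.concatenate([np.ones(len(X_app)), p_rej, 1.0 - p_rej])
        beta_new = fit_weighted_logreg(X_all, y_all, w_all, beta0=beta)  # M-step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On simulated data where approval depends only on observable features, the imputed default rate for the unlabeled rejected group lands close to the true one, which is the population-recovery behavior reported above.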

EM convergence diagnostics
EM convergence diagnostics across 100 iterations. Log-likelihood trends upward consistently. The imputed default rate for rejected applicants converges toward ~13.3%, matching the true population rate. Parameter change stabilizes as the model approaches convergence.

IPW: reweighting without imputing

The propensity model (XGBoost) achieved an AUC of 0.9399 — meaning the approval decisions are almost perfectly predictable from the feature set. This is expected, since our approval filter is built from observable features. A high propensity AUC has an important implication: IPW has limited room to add new information, because the model can already extrapolate into the rejected region by following feature-to-outcome relationships it learned from approved applicants.

The effective sample size after trimming at the 99th percentile is 46,756 — about 36.8% of nominal. The information cost of reweighting is substantial.
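The weight construction and the ESS figure follow standard formulas: inverse probability of approval, a quantile cap, and the Kish effective sample size (Σw)² / Σw². A sketch with illustrative inputs (the clipping floor and toy propensities are assumptions, not project values):

```python
import numpy as np

def ipw_weights_and_ess(p_approve, trim_q=0.99):
    """Inverse-probability-of-approval weights for approved applicants,
    with a quantile cap, plus the Kish effective sample size."""
    w = 1.0 / np.clip(p_approve, 1e-6, 1.0)    # unlikely approvals get big weights
    w = np.minimum(w, np.quantile(w, trim_q))  # trim the extreme right tail
    ess = w.sum() ** 2 / np.square(w).sum()    # (Σw)² / Σw²
    return w, ess

# Equal weights carry full information: ESS equals n.
_, ess_flat = ipw_weights_and_ess(np.full(1000, 0.8))

# A skewed propensity distribution pays a real information cost.
p = np.concatenate([np.full(900, 0.9), np.full(100, 0.05)])
_, ess_skew = ipw_weights_and_ess(p)
```

The more concentrated the weight mass on a few borderline approvals, the further ESS falls below nominal; the project's 36.8% figure quantifies exactly that concentration.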

IPW propensity and weight distributions
Propensity score distribution and resulting IPW weights. Most approved applicants receive weights near the median (0.668). A long right tail extends to the trimming cap of 10.890. The ESS of 46,756 quantifies the information cost of this reweighting.

How they compare

| Method | AUC (full) | AUC (rejected) | ECE (full) | ECE (rejected) | Mean predicted |
| --- | --- | --- | --- | --- | --- |
| Baseline LR | 0.7479 | 0.7114 | 0.0031 | 0.0057 | 0.0803 |
| Parcelling | 0.7470 | 0.7109 | 0.0047 | 0.0089 | 0.0854 |
| EM | 0.7479 | 0.7113 | 0.0027 | 0.0072 | 0.0802 |
| IPW | 0.7450 | 0.7125 | 0.0033 | 0.0078 | 0.0795 |
| True default rate | | | | | 0.0807 |
Phase 2 calibration comparison
Calibration comparison across all methods. EM achieves the tightest calibration on the full population. IPW is essentially tied with the baseline. Parcelling's 1.2× adjustment introduces visible overshoot at higher predicted probabilities.
Score distribution shift by method
Score distribution shift. Each method shifts the predicted probability distribution differently. EM lands closest to the true full-population rate (0.0807). Parcelling consistently overshoots.

No correction method improved discrimination meaningfully — AUC moves by less than 0.003 across all methods. The reason is that our approval filter operates entirely on observable features the model can already access. The model partially self-corrects for selection bias just by following the feature-to-outcome relationship into the rejected region of feature space. EM improves full-population ECE to 0.0027 — better than the baseline — while IPW delivers the best rejected-population AUC at 0.7125. Parcelling is the weakest method on both dimensions due to its fixed multiplicative adjustment.

04
Phase 3

Not Just Whether — But When

Binary classification flattens time. A borrower who defaults in month 3 and one who defaults in month 30 both get a label of 1 — but they represent fundamentally different operational problems. Early defaults hit provisioning immediately. Late defaults may be partially offset by months of payments already received. A model that ignores timing misses that distinction entirely.

More importantly for this project: selection bias has a temporal dimension that binary models can't see. Because the approval filter disproportionately screens out high-risk borrowers, and high-risk borrowers default earlier, the approved-only training set is missing the fastest defaulters. The baseline model learns to underestimate early default hazard.

9 mo. Median time-to-default, approved defaulters
4 mo. Median time-to-default, rejected defaulters
5 mo. Temporal gap the survival models must recover

We built three survival models: an approved-only Cox PH baseline, an IPW-reweighted Cox model, and a DeepHitSingle neural survival model. All three use the top 25 features by XGBoost importance and are evaluated on the same test set.
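The Cox and DeepHit models themselves require survival libraries, but the estimator underlying the IPW hazard correction can be sketched in plain NumPy as a weighted Nelson-Aalen cumulative hazard. This is a simplified stand-in for illustration, not the project's pipeline:

```python
import numpy as np

def weighted_nelson_aalen(time, event, weight=None):
    """Weighted Nelson-Aalen cumulative hazard: at each distinct event time,
    add (weighted events) / (weighted number still at risk)."""
    time = np.asarray(time, float)
    event = np.asarray(event, bool)
    w = np.ones_like(time) if weight is None else np.asarray(weight, float)
    event_times = np.unique(time[event])
    H = np.zeros(len(event_times))
    cum = 0.0
    for i, t in enumerate(event_times):
        at_risk = w[time >= t].sum()
        d = w[(time == t) & event].sum()
        cum += d / at_risk
        H[i] = cum
    return event_times, H
```

Upweighting fast defaulters, which is what IPW does when early defaulters look most like the rejected population, raises the front of the curve. That is the early-period correction visible in the cumulative-hazard figure below.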

| Model | C-Index | IBS | IBS vs. null (0.25) |
| --- | --- | --- | --- |
| Cox Baseline (approved only) | 0.6808 | 0.0606 | 0.1894 |
| Cox IPW-reweighted | 0.6805 | 0.0592 | 0.1908 |
| DeepHitSingle (neural) | 0.7216 | 0.0590 | 0.1910 |

DeepHitSingle outperforms both Cox models on C-index by a meaningful margin — 0.0408 over the approved-only baseline. This reflects its ability to model non-linear and non-proportional hazard relationships that Cox PH cannot capture by assumption. When the proportional hazards constraint is the binding limitation, relaxing it matters.

The IBS story is where selection-bias correction shows up. IPW reduces IBS from 0.0606 to 0.0592 — a 2.3% improvement. DeepHitSingle reaches essentially the same IBS (0.0590) without an explicit correction, consistent with the Phase 2 pattern: flexible models partially self-correct for observable selection mechanisms.

Cumulative hazard curve comparison
Cumulative hazard curves: baseline vs. IPW-corrected Cox. IPW estimates higher early-period hazard (months 1–12), then converges with the baseline. This is the temporal selection bias correction in action — the approved-only model underestimated how quickly defaults accumulate in the first year.
Survival curves for risk profiles
Survival curves for three risk profiles. Low, medium, and high-risk borrowers show distinct trajectories. High-risk applicants accumulate most of their default probability in the first 12–18 months.
Time-resolved Brier score
Time-resolved Brier Score. IPW's calibration improvement concentrates in months 1–12, where selection distortion was strongest. After the first year, all three models converge.
DeepHitSingle learning curve
DeepHitSingle training curve. Training and validation loss converge cleanly. Early stopping at epoch 11 prevented overfitting. The model reached its best validation loss of 0.1344 at lr=0.001.

Survival analysis makes visible a distortion that binary classification cannot detect: the approved-only model systematically underestimates early-period default hazard because the fastest defaulters were disproportionately rejected. IPW corrects this in the first 12 months. For a lender, that's exactly where provisioning decisions are made.

05
Phase 4

Bias, Variance, and Why Simple Models Hold Up

The bias-variance analysis produced one finding we expected and one that required explanation.

XGBoost, as expected, shows a modest positive overfitting gap — training AUC of 0.7725 versus validation AUC of 0.7560. The hyperparameter sweep confirms depth=4 as the right operating point. Going deeper widens the gap without improving validation AUC.

Logistic regression shows a negative overfitting gap of -0.0315. Its validation AUC (0.7445) is higher than its training AUC (0.7130). This needs explanation.

The reason is structural. The validation set includes rejected applicants, who default at 13.3% — far higher than the approved-only training set's 5.1%. That higher base rate makes it easier for any model to separate high-risk from low-risk borrowers. Logistic regression, because of its simplicity, also avoids memorizing approval-specific patterns that would fail on rejected applicants. Its high bias turns out to be a form of robustness to population shift.

Learning curves for LR and XGBoost
Learning curves: logistic regression and XGBoost. LR's negative gap reflects both the easier validation set and its inability to memorize training-specific patterns. XGBoost's 0.0165 gap is modest and well-controlled.
Regularization sweep for logistic regression
Regularization sensitivity. Performance plateaus quickly after C=0.1. The CV-selected value sits exactly at the inflection point. Going higher provides no benefit — the bottleneck is model form, not coefficient magnitude.
XGBoost hyperparameter sensitivity
XGBoost depth sensitivity. Depth=4 provides the best validation AUC. Depth 6 and 8 widen the overfitting gap substantially without meaningful validation improvement.
DeepHitSingle learning rate sensitivity
DeepHitSingle learning rate sweep. lr=0.001 achieves the best validation loss (0.1344) at epoch 22. Lower rates converge to nearly the same solution with more iterations. Higher rates overshoot and trigger early stopping prematurely.
Calibration across models and populations
Calibration: approved vs. full population test sets. All models calibrate well on approved applicants. On the full population, IPW-corrected LR sits closest to the true rate (0.0807). XGBoost shows the most overconfidence at higher predicted probabilities.

Logistic regression's underfitting is partially self-correcting in the presence of selection bias. Its inability to memorize approval-specific patterns means it generalizes more stably when tested outside the approved population. XGBoost is the right choice for discrimination. Logistic regression — or IPW-corrected LR — is the right choice when the probability estimate itself matters.

06
Phase 5

Who Gets the Wrong Probability?

Aggregate metrics hide group-level disparities. A model can be well-calibrated on average while being systematically wrong for specific demographic segments. In consumer lending, that distinction carries legal weight — ECOA, Reg B, and CFPB disparate-impact guidance all require evaluation at the subgroup level, not just in aggregate.

We evaluated three demographic axes: gender, education, and age. For each, we measured calibration error (ECE), mean predicted probability, false positive rate, and false negative rate under both the baseline and IPW-corrected models — all on the full-population test set, including rejected applicants.
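An audit of this kind reduces to grouped metric computation. The sketch below uses a simplified one-number calibration gap per group rather than the binned ECE reported in the figures, and the 0.5 threshold is illustrative:

```python
import numpy as np

def group_metrics(y_true, y_prob, groups, threshold=0.5):
    """Per-group calibration gap, FPR, and FNR for a fairness audit."""
    y_true = np.asarray(y_true, float)
    y_prob = np.asarray(y_prob, float)
    groups = np.asarray(groups)
    y_hat = y_prob >= threshold
    out = {}
    for g in np.unique(groups):
        m = groups == g
        neg, pos = m & (y_true == 0), m & (y_true == 1)
        out[g] = {
            # One-bin gap between mean prediction and observed rate.
            "cal_gap": abs(y_prob[m].mean() - y_true[m].mean()),
            # Share of non-defaulters flagged as defaults.
            "fpr": y_hat[neg].mean() if neg.any() else np.nan,
            # Share of defaulters the model misses.
            "fnr": (~y_hat[pos]).mean() if pos.any() else np.nan,
        }
    return out
```

The disparity along an axis is then just the spread of a metric across that axis's groups; the 0.0099 to 0.0054 age-ECE improvement quoted below is that spread, before and after IPW.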

Fairness disparity analysis
Per-group fairness diagnostics across gender, education, and age. IPW's corrective effect is concentrated on age — the axis most correlated with the approval filter. Gender and education disparities are largely unchanged.

Where IPW helps: age

The baseline model's ECE gap across age groups was 0.0099. IPW cuts that to 0.0054 — a 45% reduction. The improvement is concentrated on borrowers under 30, where ECE falls from 0.0139 to 0.0107. This is mechanistically consistent: age was the strongest input to the approval filter, so younger applicants were most over-represented in the rejected population. IPW correctly concentrates its corrective effect on the group that selection affected most.

Where IPW is neutral: gender and education

Gender ECE gap changes by +0.0004. Education ECE gap changes by +0.0008. Both within sampling noise. Neither represents a meaningful directional shift. Gender and education weren't primary drivers of the approval filter, so there was no selection-induced calibration distortion along those axes for IPW to correct — and it didn't introduce new distortions either.

The threshold problem

At a fixed population-level decision threshold, IPW widens the FNR gap on age from 0.4193 to 0.4589. This sounds bad, but the mechanism is specific: IPW pushes predicted probabilities upward for younger, higher-risk applicants (who look most like the rejected population), which at a fixed threshold improves detection of young defaulters (under-30 FNR drops from 0.1741 to 0.1674) while worsening detection of older ones (60+ FNR rises from 0.5934 to 0.6262).

The 60+ group suffers most because their true default rate (5.1%) already sits below the population-level threshold. A small upward shift in their predicted probabilities is insufficient to flip their classification. This is a property of single-threshold decision rules, not a failure of IPW — it requires segment-specific thresholds to resolve.
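Segment-specific thresholds can be set several ways; one illustrative policy (not the project's) is to flag a fixed share of each segment by predicted risk, so that a low-score group like 60+ gets its own cutoff positioned relative to its score distribution rather than the population's:

```python
import numpy as np

def segment_thresholds(y_prob, groups, flag_rate=0.2):
    """Flag the riskiest `flag_rate` share within each segment,
    instead of applying one population-wide cutoff."""
    y_prob = np.asarray(y_prob, float)
    groups = np.asarray(groups)
    return {
        g: np.quantile(y_prob[groups == g], 1 - flag_rate)
        for g in np.unique(groups)
    }
```

Equalizing flag rates is only one choice; equalizing FNR or expected loss per segment would give different cutoffs. The point is structural: once thresholds are per-segment, a uniform upward probability shift in one group can no longer degrade classification in another.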

IPW improves fairness precisely where selection bias was the underlying cause of disparity, and stays out of the way elsewhere. For regulatory reporting, that's the ideal behavior. For threshold-based error rate disparity, probability recalibration alone is not sufficient — segment-specific thresholds or explicit fairness constraints during training are required.

The Same Pattern, Five Times Over

Every phase of this project produced the same finding from a different angle. Selection bias in credit risk modeling is a calibration problem, not a discrimination problem. Its effects concentrate exactly where the training data was thinnest — and the correction methods that work best are the ones that respect that structure rather than trying to fill in the gaps with assumptions.

ECE increased threefold to nearly fivefold from approved to rejected populations. AUC barely moved. The model can rank; it can't price.

EM achieved ECE of 0.0027 — better than the baseline — by recovering the true population default rate (13.31% vs. 13.32% actual). No method improved AUC meaningfully.

IPW corrects early-period hazard underestimation. DeepHitSingle improves C-index by 0.0408 over Cox baseline. The 5-month temporal gap between approved and rejected defaulters is measurable and partially recoverable.

LR's underfitting is self-correcting under population shift. XGBoost's 0.0165 overfitting gap is well-controlled. Model simplicity is a fairness property, not just a design constraint.

IPW cut age ECE disparity by 45%. Gender and education were unchanged. Correction methods fix the disparities they're designed to fix — and no others.

The practical implication is this: a bank that backtests on historical approved loans will see good calibration and feel confident. The moment it expands its approval policy — taking on applicants who look like historically rejected ones — realized default rates will exceed predictions. Not uniformly. Most severely for the subgroups that were rejected most aggressively. Not because the model can't rank them, but because its probabilities were always conditional on approval, not on borrowing. That distinction becomes visible exactly at the boundary where new lending happens.

The fix isn't a better algorithm. It's knowing what the training data actually represents — and building that understanding into every step from feature engineering to model deployment to regulatory reporting.