Assessing an individual’s creditworthiness has always relied on a complex blend of financial, behavioral, and market-driven factors. These signals shift constantly, making manual prediction both time-consuming and inconsistent. Modern ML models offer lenders and underwriters a more scalable alternative, providing fast, explainable, and maintainable credit insights that balance fair pricing for borrowers with profitable decisioning for institutions.
To ground these concepts in a real example, this case study explores mortgage pricing using Freddie Mac’s 2024 Q1 Single-Family Loan-Level Dataset. While external economic forces also influence rate accuracy, this dataset provides a strong foundation for demonstrating how an ML-driven pricing pipeline operates in practice.
Once the champion model is selected, we integrate it into a GenAI layer designed specifically for mortgage lenders. This final step transforms raw model outputs into tailored, easy-to-understand explanations that support real-time pricing conversations and decision-making.
Before handing anything to a model, we need to get a feel for what the raw Freddie Mac data is actually telling us. This step is all about sanity-checking the dataset, understanding how rates are distributed, and seeing how the main drivers (credit score, LTV, DTI, etc.) behave.
Summary
Target variable – Interest Rate (%)
From the histogram and box plot of interest rates, we can see that most loans are clustered tightly around the ~6.7% mark, with a relatively small spread (standard deviation just over half a percent). The distribution looks roughly bell-shaped with a slight right tail, which lines up with a few higher-rate outliers pushing towards 9% on the upper end and a handful of older/legacy low-rate loans on the lower end.
The box plot confirms this story: the interquartile range (IQR) is fairly narrow, meaning most borrowers are being priced in a tight band. There are visible outliers both below and above the main cluster. These are important to flag because they could represent special programs, data entry issues, or edge-case borrowers that may distort the model if we don’t handle them carefully.
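For reference, a minimal EDA sketch along these lines, assuming the loan tape has already been loaded into a pandas DataFrame named `df` with the `original_interest_rate` column:

```python
import matplotlib.pyplot as plt

# Assumes the Freddie Mac loan tape is loaded as a pandas DataFrame `df`
rate = df["original_interest_rate"].dropna()

# Headline stats: mean, spread, and the tails discussed above
print(rate.describe())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Histogram: roughly bell-shaped, clustered around ~6.7% with a slight right tail
ax1.hist(rate, bins=50, edgecolor="black")
ax1.set_xlabel("Interest Rate (%)")
ax1.set_ylabel("Loan Count")
ax1.set_title("Distribution of Original Interest Rate")

# Box plot: narrow IQR with outliers on both sides worth flagging
ax2.boxplot(rate, vert=False)
ax2.set_xlabel("Interest Rate (%)")
ax2.set_title("Interest Rate Box Plot")

plt.tight_layout()
plt.show()
```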
With the raw Freddie Mac tape explored, the next step is to reshape it into something a model can actually learn from. The aim here is simple: keep the economic story of the loan, strip away noise, and add structure where underwriters naturally think in buckets and interactions.
We start from the original loan-level fields and separate them into two groups: original features that we keep largely as-is, and engineered features that encode underwriting logic.
| Feature Type | Count | Notes |
|---|---|---|
| Original features | 11 | Core credit, loan size, term, and high-level loan attributes |
| Engineered features | 13 | Risk score, interactions, buckets, and simplified categories |
| Total (ex-target) | 24 | Before final selection / pruning |
The feature engineering stage reduces the raw loan tape into a clean, structured view the model can learn from.
Once combined, these features form a final set of 20 predictors plus the target original_interest_rate. After removing any remaining nulls, the dataset is saved as a Delta table and becomes the foundation for model training.
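To make the pattern concrete, here is a sketch of how the engineered columns and the Delta write might look. This assumes a Databricks notebook where `spark` is available; the bucket edges and risk-score weights shown are illustrative stand-ins, not the pipeline’s actual definitions:

```python
import pandas as pd

# Illustrative engineered features -- the real pipeline's exact bucket edges
# and risk-score weights may differ; this sketches the pattern.
df["ltv_dti_interaction"] = df["original_ltv"] * df["original_dti"]
df["credit_ltv_interaction"] = df["credit_score"] * df["original_ltv"]
df["loan_per_unit"] = df["original_upb"] / df["num_units"]

# Bucketed views that mirror how underwriters think in bands
df["ltv_category"] = pd.cut(
    df["original_ltv"], bins=[0, 60, 80, 95, 200],
    labels=["low", "standard", "high", "very_high"],
).astype(str)
df["credit_score_category"] = pd.cut(
    df["credit_score"], bins=[0, 620, 680, 740, 850],
    labels=["subprime", "near_prime", "prime", "super_prime"],
).astype(str)

# A simple composite risk score (illustrative weighting)
df["risk_score"] = (
    df["original_ltv"] / 100
    + df["original_dti"] / 100
    - (df["credit_score"] - 300) / 550
)

# Persist the model-ready view as a Delta table (Databricks provides `spark`)
spark.createDataFrame(df.dropna()).write.format("delta") \
    .mode("overwrite").saveAsTable("mortgage_data_features")
```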
| Final Modeling Dataset | Value |
|---|---|
| Records | 214,929 |
| Features (predictors) | 20 |
| Target | original_interest_rate |
| Table name | mortgage_data_features |
This gives us a compact, model-ready view of each loan that still feels very close to how a human underwriter would describe the file.
Correlation Analysis – What Drives Rate?
Before throwing models at the data, it’s worth sanity-checking how these features move with the interest rate. The simple Pearson correlations with original_interest_rate look like this:
| Feature | Correlation with Rate |
|---|---|
| original_loan_term | +0.251 |
| risk_score | +0.124 |
| original_ltv | +0.114 |
| ltv_dti_interaction | +0.098 |
| num_units | +0.078 |
| credit_ltv_interaction | +0.073 |
| original_dti | +0.046 |
| original_upb | −0.037 |
| loan_per_unit | −0.046 |
| credit_score | −0.184 |
The picture is reassuring. Longer terms tend to price higher, which comes through as the strongest direct linear relationship. Credit score behaves exactly as expected: as scores improve, rates come down, giving us a clear negative correlation.
Leverage and affordability show up with positive correlations: higher original_ltv, higher original_dti and, more importantly, their interaction ltv_dti_interaction all point towards higher pricing. The interaction terms are doing what they were designed to do: highlighting the stacked-risk files where a borrower is both highly leveraged and already carrying a heavy debt load. risk_score pulls these ingredients together, and its positive correlation with rate confirms that this composite view is aligned with the way loans are priced.
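These figures fall out of a simple pandas computation; a minimal sketch, assuming the feature table has been loaded as the pandas DataFrame `df`:

```python
# Pearson correlation of every numeric feature with the target,
# sorted from most positive to most negative
numeric_cols = df.select_dtypes(include="number").columns
correlations = (
    df[numeric_cols]
    .corr()["original_interest_rate"]
    .drop("original_interest_rate")
    .sort_values(ascending=False)
)
print(correlations.round(3))
```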
Overall, the correlation analysis tells us two important things: the features all move in the direction underwriting intuition would expect, so the engineered view of the loan is economically sensible; and no single linear correlation is dominant, which suggests the real pricing signal lives in non-linear effects and interactions rather than straight-line relationships.
With feature engineering complete and the relationships to rate looking sensible, we’re in a good position to move on to the model training and MLflow tracking phase.
Prepare features and target variables, handle categorical encoding, and create train/test splits.
```python
# Define feature groups
numeric_features = [
    'credit_score', 'original_ltv', 'original_dti', 'original_upb',
    'num_units', 'original_loan_term', 'risk_score',
    'ltv_dti_interaction', 'credit_ltv_interaction', 'loan_per_unit'
]
categorical_features = [
    'credit_score_category', 'loan_size_category', 'ltv_category',
    'dti_category', 'property_type_simple', 'occupancy_simple',
    'loan_purpose_simple', 'property_state'
]
temporal_features = ['first_payment_year', 'first_payment_quarter']

target = 'original_interest_rate'

# Create feature dataframe
X = df[numeric_features + categorical_features + temporal_features].copy()
y = df[target].copy()
```
Split the data into training (80%) and testing (20%) sets, using a fixed random seed so the split is reproducible.
```python
from sklearn.model_selection import train_test_split

# Train-test split with random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Then we set up the MLflow experiment.
Initialise MLflow experiment tracking to log all model training runs, parameters, metrics and artifacts.
```python
import mlflow
from mlflow.exceptions import MlflowException

# Set MLflow experiment
experiment_name = "/Users/xxxxx/mortgage_pricing_models"

# Try to create the experiment, or fall back to the existing one
try:
    experiment_id = mlflow.create_experiment(experiment_name)
    print(f"✅ Created new experiment: {experiment_name}")
except MlflowException:
    experiment = mlflow.get_experiment_by_name(experiment_name)
    experiment_id = experiment.experiment_id
    print(f"✅ Using existing experiment: {experiment_name}")

mlflow.set_experiment(experiment_name)

print(f"   Experiment ID: {experiment_id}")
print("\n📊 All runs will be tracked in MLflow UI")
# `your_url` is a placeholder for your workspace hostname
print(f"   Access at: https://{your_url}/ml/experiments/{experiment_id}")
```
For the MLflow experiment, we compared three models for predicting mortgage interest rates: a baseline Linear Regression and two gradient boosting models, XGBoost and LightGBM. MLflow was used to track runs, metrics, and parameters so we could pick a champion model based on out-of-sample performance.
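A minimal sketch of the training-and-logging loop behind these runs, assuming the split from above with categoricals one-hot encoded; the hyperparameters shown are illustrative defaults, not the tuned values:

```python
import mlflow
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# One-hot encode categoricals so all three models see numeric inputs
X_train_enc = pd.get_dummies(X_train, columns=categorical_features)
X_test_enc = pd.get_dummies(X_test, columns=categorical_features)
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

models = {
    "linear_regression": LinearRegression(),
    "xgboost": XGBRegressor(n_estimators=300, max_depth=6, random_state=42),
    "lightgbm": LGBMRegressor(n_estimators=300, random_state=42),
}

for name, model in models.items():
    with mlflow.start_run(run_name=name):
        model.fit(X_train_enc, y_train)
        preds = model.predict(X_test_enc)

        # Out-of-sample metrics used to pick the champion
        rmse = np.sqrt(mean_squared_error(y_test, preds))
        mae = mean_absolute_error(y_test, preds)
        r2 = r2_score(y_test, preds)
        mape = np.mean(np.abs((y_test - preds) / y_test)) * 100

        mlflow.log_params(model.get_params())
        mlflow.log_metrics(
            {"test_rmse": rmse, "test_mae": mae, "test_r2": r2, "test_mape": mape}
        )
```

The tracked test metrics for the three runs came out as follows: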
| Model | Test RMSE | Test MAE | Test R² | Test MAPE (%) |
|---|---|---|---|---|
| XGBoost | 0.468013 | 0.357783 | 0.261302 | 5.417410 |
| LightGBM | 0.468177 | 0.357707 | 0.260783 | 5.416796 |
| Linear Regression | 0.503227 | 0.385939 | 0.145956 | 5.828756 |
XGBoost edges out the others with the lowest Root Mean Squared Error (RMSE) and the highest R², so it is selected as the champion model. The difference between XGBoost and LightGBM is extremely small (LightGBM actually posts marginally lower MAE and MAPE) and not practically meaningful, but both clearly outperform the linear baseline, which struggles to capture the complexity in the pricing relationships.
Practical impact of model accuracy
Limitations and what the model misses
The reason the gradient boosting models outperform Linear Regression comes down to how mortgage pricing really works. The relationship between credit score, LTV, DTI and rate is highly non-linear and full of thresholds: a small change around an 80% LTV or a particular FICO band can move the price more than a simple linear slope would suggest. Gradient boosting handles these kinks and feature interactions naturally, whereas a linear model can only fit straight lines unless we manually engineer a large number of interaction and non-linear terms.
Overall, the takeaway from this MLflow run is that further accuracy gains will require expanding the dataset beyond the current loan-level features to external sources. Even so, our champion model, XGBoost, provides a solid, business-interpretable starting point for a mortgage pricing engine: accurate enough to guide offers, transparent enough to monitor, and still complemented by human oversight for final rate setting.
To move beyond “black box” predictions, we add an explainability layer on top of the XGBoost model. This combines SHAP values for technical transparency with a GenAI chat interface that turns those numbers into human language for brokers and borrowers.
SHAP (SHapley Additive exPlanations) values quantify how each feature pushes a prediction up or down relative to the portfolio average rate. For our XGBoost model, SHAP lets us see both global patterns and the story behind a single quote.
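A minimal SHAP sketch for the champion model, assuming the fitted XGBoost model and the encoded test frame from the training step above:

```python
import shap

# TreeExplainer is exact and fast for tree ensembles like XGBoost
explainer = shap.TreeExplainer(models["xgboost"])
shap_values = explainer.shap_values(X_test_enc)

# Global view: which features move rates the most across the portfolio
shap.summary_plot(shap_values, X_test_enc)
```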
To show how this works in practice, we walk through three real loan applications. In each case, the model starts from the average rate of 6.736% and then adjusts up or down based on the borrower profile.
Sample 1 – High LTV, standard owner-occupied file
The predicted rate is 6.889%, about 0.15% above the base rate. The main upward pressure comes from a 95% LTV and a mid-tier credit score around 706, both of which increase perceived risk. This is partially offset by the loan being owner-occupied with a standard purpose and a 360-month term, which pull the rate back down a little. Overall, this is priced as a higher-risk, high-leverage loan with some positive mitigating factors.
Sample 2 – Strong borrower offsetting high LTV
Here the model predicts 6.519%, roughly 0.22% below the base rate. The borrower’s DTI of 50% and excellent credit score of 782 both have strong negative SHAP values, reducing the rate. Although the LTV is again 95% and the term is 360 months, which push the rate up, the combination of very strong credit and behaviourally acceptable DTI more than compensates, leading to a cheaper rate than average.
Together, these examples show that the model’s behaviour is consistent with underwriting intuition: riskier leverage and purposes push rates up; strong credit, equity and owner-occupancy pull them down.
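Each of these narratives is read straight off a local SHAP decomposition. A sketch of how one quote breaks down, reusing the explainer from the snippet above:

```python
import pandas as pd

# Decompose a single quote: base rate plus signed feature contributions
loan = X_test_enc.iloc[[0]]
contribs = pd.Series(explainer.shap_values(loan)[0], index=loan.columns)

base = float(explainer.expected_value)  # portfolio-average rate (~6.736% here)
predicted = base + contribs.sum()       # SHAP values are additive by design

print(f"Base rate:      {base:.3f}%")
print(f"Predicted rate: {predicted:.3f}%")
# Top drivers, largest absolute push first (positive = raises the rate)
print(contribs.sort_values(key=abs, ascending=False).head(5))
```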
SHAP gives us numbers and charts; the final step is to turn those into explanations a broker can read in a few seconds and repeat to a customer. For that, we add a GenAI layer on top of the XGBoost + SHAP pipeline.
The workflow is:
1. The broker requests a quote for a borrower profile.
2. The XGBoost model predicts the rate.
3. SHAP decomposes the prediction into per-feature contributions.
4. The GenAI layer turns that numerical breakdown into a short, plain-language explanation (sketched below).
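As an illustration of step 4, here is a sketch of the prompt construction, using an OpenAI-style chat client as a stand-in for whichever LLM endpoint the deployment actually uses; the model name and helper are illustrative:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # stand-in; any chat-completion endpoint works the same way

def explain_quote(base: float, predicted: float, contribs: pd.Series) -> str:
    """Turn a SHAP breakdown into a broker-ready explanation (illustrative helper)."""
    drivers = "\n".join(
        f"- {feat}: {val:+.3f}%"
        for feat, val in contribs.sort_values(key=abs, ascending=False).head(5).items()
    )
    prompt = (
        f"You are a mortgage pricing assistant. The portfolio base rate is "
        f"{base:.3f}% and this borrower's quoted rate is {predicted:.3f}%. "
        f"The main feature contributions (in percentage points) are:\n{drivers}\n"
        "Explain the quote to a broker in three plain-English sentences, without jargon."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(explain_quote(base, predicted, contribs))
```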
This turns the model into a conversational tool: a broker can request a quote, immediately see the numerical breakdown, and also receive a ready-made explanation that is consistent, compliant and easy to share with the borrower.
Together, the ML pricing model, SHAP explainability and the GenAI explanation layer give us a pricing system that is accurate, transparent and ready for real-world use. It allows brokers, auditors and borrowers to understand not just the rate, but the reasoning behind it, turning intelligent pricing into a clear, confident part of the lending process.