LOAN APPROVAL PREDICTION

- With a classification algorithm, building a Flask web app, and deploying it to Heroku.


Photo by Josh Appel from Unsplash

Hello, this is my final project for the second cohort mentorship program organized by the She Code Africa community. It has been an amazing journey so far.

In finance, a loan is the lending of money by one or more individuals, organizations, or other entities to other individuals, organizations, etc — Wikipedia.

Manually predicting which borrowers will pay back is very tedious, hence the need to automate the loan eligibility process based on customer information.

Before I go any further, let me walk you through the problem statement.

PROBLEM STATEMENT

There is a company called Dream Housing Finance that deals in all types of home loans. When customers apply for a loan, the company assesses them using the information they fill in on the application form.

Once that information is ascertained, the company then predicts whether the customer will be able to pay back within the stipulated time.

As I mentioned earlier, doing this manually takes a lot of time, hence the need to automate it.

The main aim of this project is to predict customers' eligibility for a loan from their customer segments. The more accurate the predictions are, the more beneficial it is for Dream Housing Finance.

TYPE OF PROBLEM

This is a supervised classification problem because we need to predict whether the “Loan Status” of a customer is “Yes” or “No”.

This can be solved with any of the algorithms listed below:

i. Logistic regression

ii. Decision tree

iii. Random Forest

The algorithms listed above are just a few of those that can be used to solve this problem; in other words, training the data is not limited to these three.

DESCRIPTION OF COLUMNS

Two data sets are given: one is the training data set and the other is the testing data set. It is better to understand the data set and all its columns before trying to solve the problem, to avoid ending up confused. I’ll explain the columns below.


Let me shed more light on the columns:

Loan_ID:- As the name implies, this is the unique ID of the loan application, used to identify the applicant.

Gender:- This is the gender of the applicant, either male or female.

Married:- The marital status of the applicant, either married or not married. If the applicant is married, it is represented with “Yes” and if not, “No”.

Dependents:- Number of persons dependent on the applicant.

Education:- The level of education of the applicant is also a requirement for the approval of the loan. The category is either Graduate or Not Graduate.

Self_Employed:- Whether the applicant works for themselves, i.e. they are their own boss. The category in the data set is either Yes (self-employed) or No (not self-employed).

ApplicantIncome:- This is how much the applicant earns. The general assumption is the higher the income, the higher the chance of the applicant paying back.

CoapplicantIncome:- This is the amount the co-applicant earns.

LoanAmount:- The loan amount the applicant applied for, in thousands. The assumption is that the higher the amount, the lower the chances of paying back.

Loan_Amount_Term:- The time taken to pay back the loan, represented in months.

Credit_History:- The record of the applicant’s previous loans and whether he/she obeyed the rules of payment or not.

Property_Area:- The area where the applicant’s property is located.

Loan_Status:- If the applicant is eligible for a loan, it’s yes represented by Y else it’s no represented by N.

DATA EXPLORATION

I’m going to walk you through the code and the steps used to make predictions on this data.

For this project, I used Jupyter Notebook as the Integrated Development Environment (IDE). First, all the packages needed to explore the data were imported, before loading the data as shown below:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("loan train.csv")  # to read the train data set
test = pd.read_csv("loan test.csv")    # to read the test data set

Then I explored the data by checking the first few rows and columns, the summary statistics, the missing values, the number of rows and columns, and the list of all the columns. The summary statistics can be checked by using:

train.describe()  # this gives the summary of the statistics
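The other checks mentioned above are one-liners too; here is a quick sketch using standard pandas calls (the notebook’s exact code may differ slightly):

train.head()     # first few rows of the data set
train.shape      # number of rows and columns
train.columns    # list of all the column names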

Image: summary statistics of the train data set

From this summary, it can be seen that there are some missing values. An explanation of each summary-statistics row is shown below:

Image: explanation of the summary-statistics rows

DATA CLEANING

After exploring the data set, I found some missing values and filled them in, using the median for numerical columns and the mode for categorical columns. The median was chosen over the mean because outliers have a significant effect on the mean; since this data set has a lot of outliers, the median (and the mode for categorical data) is the safer choice.

To check for missing values:

train.isnull().sum()


Code snippet showing how the missing values were filled:

train["LoanAmount"].fillna((train["LoanAmount"].median()), inplace=**True**)
train["Self_Employed"].fillna((train["Self_Employed"].mode()[0]), inplace=True)</span>


Now, it can be seen that the data is clean.

DATA VISUALIZATION

After cleaning the data set, I did the value counts and then visualized the data.

train["Gender"].value_counts() 
sns.barplot(x="Gender", y="ApplicantIncome", data=train)</span>


You can find more plots here.

And my conclusions after visualizing the data are (a sketch of the comparison plots follows this list):

  • The percentage of married applicants whose loans were approved is higher than that of unmarried applicants.
  • The percentage of applicants with either 0 or 2 dependents who get loan approval is higher.
  • Applicants who are graduates have a higher loan approval rate than those who are not graduates.
  • Even after the data analysis, there is still no single factor that determines loan status.
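These inferences come from comparing approval rates across categories. Here is a minimal sketch of how such a comparison can be plotted (assuming the cleaned train DataFrame, before Loan_Status is converted to numbers):

# Sketch: approval rate for each category of a column, as a stacked bar chart.
for col in ["Married", "Dependents", "Education"]:
    rates = pd.crosstab(train[col], train["Loan_Status"], normalize="index")
    rates.plot(kind="bar", stacked=True, title="Loan_Status by " + col)
    plt.show()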

FEATURE ENGINEERING

Next, columns with categorical data such as Gender, Self_Employed, etc. were converted to numerical data as shown below:

cleanup_num = {"Loan_Status": {"Y": 1, "N": 0},
               "Gender": {"Male": 1, "Female": 0},
               "Married": {"Yes": 1, "No": 0},
               "Self_Employed": {"Yes": 1, "No": 0},
               "Education": {"Graduate": 0, "Not Graduate": 1}}

train.replace(cleanup_num, inplace=True)

MODELLING

After the EDA (exploratory data analysis), it’s time to train the model. First, all the necessary packages for modelling were imported; three different algorithms (LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier) were used for this project.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=1)

Next, the model was defined using:

model = LogisticRegression()

Then save the model (note that this should be done after the model has been fitted; otherwise the pickled model is untrained):

import pickle

pickle.dump(model, open("model.pkl", "wb"))

Next, evaluate the model by checking the accuracy score.

scoring = "accuracy"
score = cross_val_score(model, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
round(np.mean(score) * 100, 2)  # mean accuracy as a percentage

After training, it was discovered that the LogisticRegression model was the best fit for the data because its accuracy was the highest (81.12%).
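The post only shows the scoring code for one model; here is a sketch of how the same 10-fold evaluation could be repeated for all three classifiers to arrive at that comparison (the loop and variable names are illustrative, not the notebook’s exact code):

# Sketch: evaluate each candidate model with the same cross-validation split.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=1),
    "RandomForestClassifier": RandomForestClassifier(random_state=1),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, train_data, target, cv=k_fold, scoring="accuracy")
    print(name, round(np.mean(scores) * 100, 2))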

It was then fitted on the training data and used to make predictions on the test data set, and the prediction came out successfully.

model.fit(train_data, target)

test_data = test.drop("Loan_ID", axis=1).copy()
prediction = model.predict(test_data)
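For that predict call to work, the test set needs the same preparation as the training set. A hedged sketch of what that could look like, mirroring the earlier cleaning and encoding (the notebook’s exact steps may differ):

# Sketch: mirror the training-set cleaning and encoding on the test set.
test["LoanAmount"].fillna(test["LoanAmount"].median(), inplace=True)
test["Self_Employed"].fillna(test["Self_Employed"].mode()[0], inplace=True)

test_cleanup = {k: v for k, v in cleanup_num.items() if k != "Loan_Status"}  # test has no Loan_Status
test.replace(test_cleanup, inplace=True)
test.replace(extra_num, inplace=True)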

DEPLOYING THE MACHINE LEARNING MODEL AS A WEB APP WITH FLASK

First, create an HTML file with a form to collect values from users; this file can be named “index.html”, or you can download a template online.

Next, create a Python file and type the code below:

from flask import Flask, request, jsonify, render_template
import pickle
import numpy as np

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))

@app.route("/")
def home():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Collect the form values in the same order as the model's feature columns.
    int_features = [int(x) for x in request.form.values()]
    final_features = [np.array(int_features)]
    prediction = model.predict(final_features)

    output = round(prediction[0], 2)

    return render_template("index.html", prediction_text="PROBABILITY THAT YOUR LOAN WILL BE APPROVED IS : {}".format(output))

@app.route("/results", methods=["POST"])
def results():
    # JSON API endpoint: expects the feature values as a JSON object.
    data = request.get_json(force=True)
    prediction = model.predict([np.array(list(data.values()))])

    output = int(prediction[0])  # cast to a plain int so jsonify can serialise it
    return jsonify(output)

if __name__ == "__main__":
    app.run(debug=True)

When you run the code, the output should look like this:


Note that your web page’s form names should correspond with the column names in the dataset that was used to build the model.

DEPLOYING WEB APP TO HEROKU

Prerequisite:

  1. Have git installed — Click here to download it.
  2. Sign up for a Heroku account

Here, you push the code to GitHub in the usual way, using the commands below:

git init
git add .
git commit -m "First Commit"
git remote add origin 'your_url_name'
git push -u origin master

Steps to deploy to Heroku:

  1. Log in to your Heroku account and create a new app.
  2. Choose GitHub as your deployment method.
  3. Enter your repo name and click search.
  4. Click on “Manually Deploy”.


After the deployment has completed successfully, you can view the web app and test it.


I hope you enjoyed this tutorial. To better understand this blog post, you can find more of the code here.

Thanks for reading.