University
Exploratory Data Analysis - EDA
Scorecard model
Your Name
Course Name and Number
Professor Name
The Due Date
Word Count: 2570
Exploratory Data Analysis - EDA
Introduction
EDA is used to investigate a dataset and summarize its main characteristics, often using data visualization methods (Camizuli and Carranza, 2018). It helps determine how best to manipulate the data source to obtain the answers needed, making it easier to identify patterns, test hypotheses, spot anomalies, and check assumptions (Jebb, Parrigon and Woo, 2017).
EDA also helps to better understand the variables in the dataset and the relationships between them. Most importantly, it helps to identify the most appropriate statistical techniques for analyzing the data. The HMEQ dataset provided reports the delinquency status and characteristics of 5,960 home equity loans (Gunnink and Burrough, 2019). The exploration of the dataset is carried out below, with an explanation provided at each step.
The first step is to import the relevant libraries as shown below:
# Import the required libraries.
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)
Load the dataset into the notebook and use the head function to see the first few rows of the dataset, as below:
# Upload the CSV file in Google Colab and read it into a pandas DataFrame
from google.colab import files
import io
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['Assignmentdataset__HMEQ.csv']))
df.head()
Checking the data types is important because some variables might have been read with the wrong data type. The data types for the HMEQ dataset are as below:
# Check the data type
df.dtypes
BAD int64
LOAN int64
MORTDUE float64
VALUE float64
REASON object
JOB object
YOJ float64
DEROG float64
DELINQ float64
CLAGE float64
NINQ float64
CLNO float64
DEBTINC float64
dtype: object
Missing Values and Duplicate Handling
To handle duplicates, first check the shape of the dataset and print out the number of rows with duplicate values (Dung and Phuong, 2019). To confirm that the duplicates have been removed, use the count method to compare the number of rows before and after dropping them.
# count the number of rows
df.count()
# Drop the duplicates
df = df.drop_duplicates()
df.head()
# Count the number of rows after the duplicates have been removed
df.count()
The next step is to identify and handle the missing values in the dataset. Missing values can be handled differently depending on how many there are (Ezzine and Benhlima, 2018). If only a few values are missing, the affected rows can be dropped, but if there are many, a better approach is to replace the missing values with the mean of the corresponding column. Use print together with the isnull method to display the missing values:
# Display the null values
print(df.isnull().sum())
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
dtype: int64
As the printed output above shows, there are many null values. Therefore, the missing values in numeric columns are replaced with the column mean, and for variables with the object data type they are replaced with the most common value (Che et al., 2018).
# Replace the null values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))
from collections import Counter
# to replace the null values in the object data type
# Create a counter object
reason_count = Counter(df['REASON'])
job_count = Counter(df['JOB'])
# Call the method of Counter
print(reason_count.most_common(2))
print(job_count.most_common(2))
# Replace the null values with the most common values
df["REASON"] = df["REASON"].fillna(value="DebtCon")
df["JOB"] = df["JOB"].fillna(value="Other")
# Check that the null values have been replaced
print(df.isnull().sum())
Outlier Detection and Handling
An outlier is a data point, or a set of points, that lies far from the rest of the observations. Outliers tend to be unusually high or low values, so it is important to detect and handle them appropriately, because they reduce the accuracy of the model. Plot box plots to see whether there are any outliers:
# Box plots to inspect outliers in DELINQ and YOJ
sns.boxplot(x=df['DELINQ'])
sns.boxplot(x=df['YOJ'])
# Compute the interquartile range for the numeric columns only
num_cols = df.select_dtypes(include=np.number).columns
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
# Remove rows that fall more than 1.5 * IQR outside the quartiles in any numeric column
df_n = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
df_n.shape
In order to run the model, it is important to handle the categorical variables, that is, REASON and JOB (Häse et al., 2021). These fields were replaced with numbers as shown below.
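A minimal sketch of this step, assuming the cleaned frame df_n from the outlier handling above is used, is to one-hot encode REASON and JOB with pandas and then separate the features x from the target y (BAD); the names df_model, x, and y are illustrative.
# One-hot encode the categorical columns so the model receives numeric inputs only
df_model = pd.get_dummies(df_n, columns=['REASON', 'JOB'], drop_first=True)
# Separate the target (BAD) from the explanatory variables
y = df_model['BAD']
x = df_model.drop(columns=['BAD'])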
from sklearn.model_selection import train_test_split
The dataset needs to be divided into training and test sets; here, 25 percent of the data was reserved for testing and 75 percent for training.
from sklearn import metrics
# Divide the data into training and testing sets (75/25 split)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
Estimating a Scorecard Using a Logistic Regression Classifier
Fit the classifier on the training data and predict the y variable using the x test set created above.
from sklearn.linear_model import LogisticRegression
# Fit the logistic regression classifier on the training data
lr = LogisticRegression()
lr.fit(x_train, y_train)
# Predict the target variable for the test set
y_pred = lr.predict(x_test)
print(y_pred)
The accuracy score and confusion matrix were obtained as shown below.
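A minimal sketch of this evaluation, using the metrics module imported above on the fitted classifier, is:
# Overall accuracy of the classifier on the test set
print(metrics.accuracy_score(y_test, y_pred))
# Confusion matrix: rows are actual classes, columns are predicted classes
print(metrics.confusion_matrix(y_test, y_pred))
# Precision, recall, and F1 score for each class
print(metrics.classification_report(y_test, y_pred))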
Check for the most important variables as below.
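One way to assess variable importance, assuming the feature matrix x and target y built above, is to refit the model with statsmodels and inspect the coefficient p-values against an alpha level of 0.05; the sketch below is illustrative rather than the only approach.
import statsmodels.api as sm
# Add an intercept and fit a logistic regression with statsmodels to obtain p-values
X2 = sm.add_constant(x.astype(float))
logit_result = sm.Logit(y.astype(float), X2).fit()
print(logit_result.summary())
# Variables with p-values below 0.05 are treated as the most important
print(logit_result.pvalues[logit_result.pvalues < 0.05])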
Calculating R Squared
# get R squared and adjusted R squared from an OLS fit on the same design matrix
# (ym is assumed to be the numeric target y; X2 is the constant-augmented feature matrix above)
ym = y.astype(float)
result = sm.OLS(ym, X2).fit()
print(result.rsquared, result.rsquared_adj)
The most important variables include the job, the loan reason, the property value, the BAD indicator, years at the present job, the number of recent credit inquiries, the amount due on the existing mortgage, the number of major derogatory reports, the number of credit lines, the debt-to-income ratio, and the age of the oldest credit line in months. The p-values were used to select the most important variables, using an alpha level of 0.05. These variables are important because they operationalize the concepts in this analysis (Kaliyadan and Kulkarni, 2019). They make it possible to build a predictive model of the loan an individual is likely to receive given the customer's credit score (Kaliyadan and Kulkarni, 2019), and they affect the target by helping to determine whether an individual is eligible for a loan and how much should be granted.
To measure the performance of the model, a confusion matrix from the metrics module is used, together with accuracy, precision, and recall. Precision, as seen in the model above, is the percentage of the predicted results that are relevant, whereas recall is the percentage of all relevant results that the algorithm classifies correctly (Powers, 2020). The R squared metric shown above indicates how well the model fits the dependent variable, and the adjusted R squared corrects this metric for the number of predictors, which helps prevent overfitting.
Why Banks Use Logistic Regression
Banks prefer logistic regression as the base classifier because it performs well when determining the credit scores issued to customers. The model helps banks understand the relationships among the variables and predict credit score outcomes, supporting better lending decisions. It is also easy to implement and efficient to train. Its downside is that it may overfit when the number of observations is smaller than the number of features.