CRISP-DM Data Preparation for GE Employee Attrition

Coursework Instructions:

Use Case: GE Employee Attrition

Plan Definition: This should include the CRISP-DM Data Preparation phase:

Create a data analytic architecture pattern; include the details for full implementation. Details will need to address data quality, integrity, and protection specific to the organization, industry, and problem you are addressing.

You will create visualizations representing your solutions for various stakeholders that you need to identify. Develop a project plan detailing the involved stakeholders, the timeline, and strategies for professional and effective collaboration to be used to ensure success.

Coursework Sample Content Preview:

Employee Attrition
Data Understanding
Context
GE is keen to retain its key employees because it has a high churn rate. Losing an employee is estimated to cost GE 80 percent of that employee's annual income. GE invests heavily in its employees to stay competitive in the market, spending long hours and significant financial resources on training and upskilling. GE therefore seeks to build a high-accuracy model that can predict which employees are most likely to churn. The model should evaluate an employee's profile and give a real-time prediction of the likelihood that the employee will leave, so that management can intervene before the employee is lost. As part of the project lifecycle, we undertake the data understanding and data preparation phases of the CRISP-DM methodology.
Data types
GE's human resource data had a total of 1,270 rows and 35 columns. Of the 35 columns, 9 were categorical and 26 were numeric. The categorical columns were 'Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', and 'OverTime'. The numeric columns were 'Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', and 'YearsWithCurrManager'. 'Attrition' is the dependent variable, with only two categorical values, "yes" and "no"; it distinguishes employees who churned from those who did not. The remaining columns are the independent variables. There were no missing values in the dataset.
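The column inventory above can be reproduced with a few lines of pandas. The following is a minimal sketch, not the authors' actual code: the function name `summarize_column_types` and the four-row `sample` frame are hypothetical stand-ins for the real 1,270-row dataset, and non-numeric columns are treated as categorical by assumption.

```python
import pandas as pd


def summarize_column_types(df: pd.DataFrame, target: str = "Attrition") -> dict:
    """Split columns into categorical and numeric and count missing values."""
    # Assumption: every non-numeric column is treated as categorical.
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    numeric = df.select_dtypes(include="number").columns.tolist()
    return {
        "shape": df.shape,
        "categorical": categorical,
        "numeric": numeric,
        "missing_values": int(df.isna().sum().sum()),
        "target_levels": sorted(df[target].unique().tolist()),
    }


# Hypothetical four-row sample; the values are illustrative only.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "OverTime": ["Yes", "No", "Yes", "Yes"],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

summary = summarize_column_types(sample)
print(summary)
```

Run on the full dataset, the same function would be expected to report 9 categorical and 26 numeric columns and zero missing values, matching the counts given above.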
Descriptive Statistics
We computed descriptive statistics for the numeric variables; a snippet of the output is attached below. The descriptive statistics reveal no values that were out of range, so we found no reason to conclude that the data contain outliers. However, some variables do not seem to add value to our dataset (Smart Vision Europe, n.d.): 'EmployeeCount', 'EmployeeNumber', and 'StandardHours'. The descriptive statistics revealed nothing meaningful about these variables. 'EmployeeCount' had a mean, minimum, and maximum of 1 (and hence a standard deviation of 0), so it carries no information. 'EmployeeNumber' serves only as a unique identifier of employees, so it has no predictive value in the dataset. 'StandardHours' likewise had a constant value of 80 across the descriptive metrics. Therefore, these variables will be omitted from the study.
Chart 1: Descriptive statistics
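The screening described above (constant columns and pure identifiers) can be sketched as follows. This is an illustration under stated assumptions: the helper `flag_uninformative_columns` and the `sample` frame are hypothetical, and the "all values unique" rule is only a heuristic for spotting identifier-like columns that still needs manual review.

```python
import pandas as pd


def flag_uninformative_columns(df: pd.DataFrame) -> dict:
    """Flag constant columns and columns whose values are all distinct."""
    n_unique = df.nunique()
    # A column with a single distinct value has zero variance.
    constant = n_unique[n_unique == 1].index.tolist()
    # Heuristic: one distinct value per row suggests an identifier.
    identifier_like = n_unique[n_unique == len(df)].index.tolist()
    return {"constant": constant, "identifier_like": identifier_like}


# Hypothetical sample; the values are illustrative only.
sample = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5],
    "StandardHours": [80, 80, 80, 80],
})

# Descriptive statistics restricted to the metrics discussed in the text.
stats = sample.describe().loc[["mean", "std", "min", "max"]]
flags = flag_uninformative_columns(sample)
print(stats)
print(flags)
```

A constant column shows identical mean, minimum, and maximum and a standard deviation of 0, which is the pattern that motivates dropping 'EmployeeCount' and 'StandardHours'.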
Correlation analysis
We ran a correlation analysis on the numeric variables as part of the descriptive statistics. Our findings reveal a moderately strong positive correlation between 'TotalWorkingYears' and 'Age': the older an employee is, the more years they are likely to have worked. Dropping one of the two variables would save computational resources; however, we retain both since doing so may improve the performance of the chosen algorithm. In addition, our findings reveal a moderate positive correlation among 'JobLevel', 'MonthlyIncome', and 'Age'. Moreover, we observed a weak correlation between 'Age' and each of 'YearsAtCompany', 'NumCompaniesWorked', 'YearsInCurrentRole', 'YearsSinceLastPromotion', and 'YearsWithCurrManager'. All of these variables will be retained since the correlations are weak.
Chart 2: Correlation analysis
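A correlation screen of this kind can be sketched with pandas. The code below is a hedged illustration, not the analysis actually run: the helper `correlated_pairs`, the 0.5 cutoff, and the five-row `sample` frame are assumptions introduced here, and Pearson correlation is used because it is the pandas default.

```python
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return (col_a, col_b, r) for numeric pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    cols = corr.columns
    pairs = []
    for i, a in enumerate(cols):
        # Only visit the upper triangle so each pair is reported once.
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Hypothetical sample; the values are illustrative only.
sample = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "TotalWorkingYears": [2, 7, 10, 18, 22],
    "DistanceFromHome": [5, 1, 9, 2, 6],
})

pairs = correlated_pairs(sample)
print(pairs)
```

Listing only the pairs above a cutoff makes it easier to decide which variables are candidates for dropping; the cutoff itself is a judgment call and should be chosen to suit the modelling algorithm.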
Data Preparation
According to the Data Warehousing Institute (2000), the data preparation process has five steps: data selection, data cleansing, data construction, data integration, and data formatting. A detailed analysis of the steps follows.
1. Data Selection
'Attrition' was identified as the target variable in the GE human resource dataset. The descriptive statistics advised dropping 'EmployeeCount', 'EmployeeNumber', and 'StandardHours', while the rest of the variables were deemed suitable.
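The selection step can be sketched as follows. This is a minimal illustration: the function `select_features` and the two-row `sample` frame are hypothetical, while the dropped column names and the target name come from the text above.

```python
import pandas as pd

# Columns the descriptive statistics flagged as uninformative.
DROP_COLUMNS = ["EmployeeCount", "EmployeeNumber", "StandardHours"]


def select_features(df: pd.DataFrame, target: str = "Attrition"):
    """Drop the uninformative columns and separate features from target."""
    # errors="ignore" keeps the call safe if a column was already removed.
    kept = df.drop(columns=DROP_COLUMNS, errors="ignore")
    y = kept[target]
    X = kept.drop(columns=[target])
    return X, y


# Hypothetical sample; the values are illustrative only.
sample = pd.DataFrame({
    "Age": [41, 49],
    "Attrition": ["Yes", "No"],
    "EmployeeCount": [1, 1],
    "EmployeeNumber": [1, 2],
    "OverTime": ["Yes", "No"],
    "StandardHours": [80, 80],
})

X, y = select_features(sample)
print(list(X.columns))
print(y.tolist())
```

Keeping the drop list in one named constant documents the selection decision and makes it easy to revisit if later modelling suggests a different feature set.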
2. Cleaning Data
The data were checked for outliers using the descriptive statistics, and we concluded intuitively that there were none. There were no missing values in any column, so no values were replaced, deleted, or filled in. In the data quality assessment, we checked for consistency by checking the number of unique values in the catego...

Updated on January 26, 2024
