##### PROJECT: People Analytics (HR Analytics) - What factors mostly cause friction in the workplace? #####
# SUMMARY: One of Google's biggest success was applying Data Science to check if its employees are, in fact, committed to their jobs.
# Google identified several aspects that would influence if the employee had plans of leaving the company and what actually worked as a motivator.
# Through Analysis and ML, the results allowed Google to aim into the right direction, saving up to 7MM USD on recruiting processes per year.
# This project objective is to identify WHAT FACTORS MOSTLY CAUSE FRICTION IN THE WORKPLACE, to support HR development and prevent resignations.
# The dataset is provided by IBM and it is partially original, with some extra synthetic inputs (and some problems, as I decide on how to deal with them below).
# OBJECTIVE: Provide HR insights on how to hire professionals with a better fit and match, and how to maintain current employees.
# SOURCE 1: https://d3.harvard.edu/platform-digit/submission/people-analytics-at-google-using-data-to-make-google-a-great-place-to-work/
# SOURCE 2: https://www.sage.com/en-au/blog/case-study-how-google-uses-people-analytics/
# SOURCE 3: https://www.linkedin.com/pulse/people-analytics-takes-off-ten-things-weve-learned-josh-bersin/
# DATASET: https://github.com/IBM/employee-attrition-aif360/blob/master/data/emp_attrition.csv
# Let's begin!
##### 1. LIBRARIES & DATA LOAD #####
library(caret)
library(ggplot2)
library(gridExtra)
library(data.table)
library(car)
library(caTools)
library(corrplot)
library(rpart)
library(rpart.plot)
# Dataset load, quick browse
data_hr <- fread('data/HR_dataset.csv')
dim(data_hr) # DIM FUNCTION: Always useful to easily understand the dataset dimensions.
## [1] 23058 30
View(data_hr) # VIEW FUNCTION: To view in table format.
str(data_hr) # STR FUNCTION: No factor variable identified.
## Classes 'data.table' and 'data.frame': 23058 obs. of 30 variables:
## $ Age : int 41 37 41 37 37 37 41 41 41 41 ...
## $ Attrition : chr "Voluntary Resignation" "Voluntary Resignation" "Voluntary Resignation" "Voluntary Resignation" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
## $ Department : chr "Sales" "Human Resources" "Sales" "Human Resources" ...
## $ DistanceFromHome : int 1 6 1 6 6 6 1 1 1 1 ...
## $ Education : int 2 4 2 4 4 4 2 2 2 2 ...
## $ EducationField : chr "Life Sciences" "Human Resources" "Life Sciences" "Marketing" ...
## $ EnvironmentSatisfaction : int 2 1 2 1 1 1 2 2 2 4 ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ JobInvolvement : int 3 3 3 3 3 3 3 3 3 3 ...
## $ JobLevel : int 2 2 2 2 2 2 2 2 2 4 ...
## $ JobRole : chr "Sales Executive" "Sales Executive" "Sales Executive" "Sales Executive" ...
## $ JobSatisfaction : int 4 4 4 4 4 4 4 4 4 3 ...
## $ MaritalStatus : chr "Single" "Single" "Single" "Single" ...
## $ MonthlyIncome : int 5993 5993 5993 5993 5993 5993 5993 5993 5993 14756 ...
## $ NumCompaniesWorked : int 8 8 4 5 8 5 8 4 8 2 ...
## $ OverTime : chr "Yes" "Yes" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 11 11 11 11 11 11 11 11 14 ...
## $ PerformanceRating : int 3 4 3 3 3 3 3 3 3 3 ...
## $ RelationshipSatisfaction: int 1 1 1 1 1 1 1 1 1 3 ...
## $ StockOptionLevel : int 0 0 0 0 0 0 0 0 0 3 ...
## $ TotalWorkingYears : int 8 8 8 8 8 8 8 8 8 21 ...
## $ TrainingTimesLastYear : int 0 0 0 0 0 0 0 0 0 2 ...
## $ WorkLifeBalance : int 1 1 1 1 1 1 1 1 1 3 ...
## $ YearsAtCompany : int 6 6 6 6 6 6 6 6 6 5 ...
## $ YearsInCurrentRole : int 4 4 4 4 4 4 4 4 4 0 ...
## $ YearsSinceLastPromotion : int 0 0 0 0 0 0 0 0 0 0 ...
## $ YearsWithCurrManager : int 5 5 5 5 5 5 5 5 5 2 ...
## $ Employee Source : chr "Referral" "Referral" "Referral" "Referral" ...
## $ AgeStartedWorking : int 33 29 33 29 29 29 33 33 33 20 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(data_hr) # SUMMARY FUNCTION: It has been identified that some variables although categorical, have been displayed as numeric, which will require cleaning and adjustment.
## Age Attrition BusinessTravel Department
## Min. :18.00 Length:23058 Length:23058 Length:23058
## 1st Qu.:30.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :37.04
## 3rd Qu.:43.00
## Max. :60.00
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## Min. : 1.000 Min. :1.000 Length:23058 Min. :1.00
## 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:2.00
## Median : 7.000 Median :3.000 Mode :character Median :3.00
## Mean : 9.215 Mean :2.915 Mean :2.72
## 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :29.000 Max. :5.000 Max. :4.00
## Gender JobInvolvement JobLevel JobRole
## Length:23058 Min. :1.00 Min. :1.000 Length:23058
## Class :character 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Mode :character Median :3.00 Median :2.000 Mode :character
## Mean :2.73 Mean :2.044
## 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome NumCompaniesWorked
## Min. :1.000 Length:23058 Min. : 1009 Min. :0.000
## 1st Qu.:2.000 Class :character 1st Qu.: 2900 1st Qu.:1.000
## Median :3.000 Mode :character Median : 4898 Median :2.000
## Mean :2.725 Mean : 6416 Mean :2.691
## 3rd Qu.:4.000 3rd Qu.: 8120 3rd Qu.:4.000
## Max. :4.000 Max. :19999 Max. :9.000
## OverTime PercentSalaryHike PerformanceRating
## Length:23058 Min. :11.00 Min. :3.000
## Class :character 1st Qu.:12.00 1st Qu.:3.000
## Mode :character Median :14.00 Median :3.000
## Mean :15.22 Mean :3.155
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :1.0000 Median :10.00
## Mean :2.713 Mean :0.7944 Mean :11.07
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.00 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.00 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.00 Median : 3.000
## Mean :2.804 Mean :2.762 Mean : 6.91 Mean : 4.201
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.00 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.00 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager Employee Source
## Min. : 0.000 Min. : 0.000 Length:23058
## 1st Qu.: 0.000 1st Qu.: 2.000 Class :character
## Median : 1.000 Median : 3.000 Mode :character
## Mean : 2.164 Mean : 4.091
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
## AgeStartedWorking
## Min. : 0.00
## 1st Qu.:20.00
## Median :25.00
## Mean :25.96
## 3rd Qu.:31.00
## Max. :60.00
##### 2. DATA WRANGLING #####
# Transforming categorical variables to factor type
View(data_hr)
data_hr$Attrition <- as.factor(data_hr$Attrition)
data_hr$BusinessTravel <- as.factor(data_hr$BusinessTravel)
data_hr$Department <- as.factor(data_hr$Department)
data_hr$Education <- as.factor(data_hr$Education)
data_hr$EducationField <- as.factor(data_hr$EducationField)
data_hr$'Employee Source' <- as.factor(data_hr$'Employee Source') # Quote due to space between words
data_hr$EnvironmentSatisfaction <- as.factor(data_hr$EnvironmentSatisfaction)
data_hr$Gender <- as.factor(data_hr$Gender)
data_hr$JobInvolvement <- as.factor(data_hr$JobInvolvement)
data_hr$JobLevel <- as.factor(data_hr$JobLevel)
data_hr$JobRole <- as.factor(data_hr$JobRole)
data_hr$JobSatisfaction <- as.factor(data_hr$JobSatisfaction)
data_hr$MaritalStatus <- as.factor(data_hr$MaritalStatus)
data_hr$OverTime <- as.factor(data_hr$OverTime)
data_hr$PerformanceRating <- as.factor(data_hr$PerformanceRating)
data_hr$RelationshipSatisfaction <- as.factor(data_hr$RelationshipSatisfaction)
data_hr$StockOptionLevel <- as.factor(data_hr$StockOptionLevel)
data_hr$WorkLifeBalance <- as.factor(data_hr$WorkLifeBalance)
str(data_hr) #Checking variables after tranformation.
## Classes 'data.table' and 'data.frame': 23058 obs. of 30 variables:
## $ Age : int 41 37 41 37 37 37 41 41 41 41 ...
## $ Attrition : Factor w/ 3 levels "Current employee",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 1 3 1 1 1 3 3 3 3 ...
## $ DistanceFromHome : int 1 6 1 6 6 6 1 1 1 1 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 2 4 2 4 4 4 2 2 2 2 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 1 2 3 1 3 2 2 2 2 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 1 2 1 1 1 2 2 2 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 3 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 4 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 8 8 8 8 8 8 8 8 4 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ MonthlyIncome : int 5993 5993 5993 5993 5993 5993 5993 5993 5993 14756 ...
## $ NumCompaniesWorked : int 8 8 4 5 8 5 8 4 8 2 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ PercentSalaryHike : int 11 11 11 11 11 11 11 11 11 14 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 1 1 1 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 4 ...
## $ TotalWorkingYears : int 8 8 8 8 8 8 8 8 8 21 ...
## $ TrainingTimesLastYear : int 0 0 0 0 0 0 0 0 0 2 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
## $ YearsAtCompany : int 6 6 6 6 6 6 6 6 6 5 ...
## $ YearsInCurrentRole : int 4 4 4 4 4 4 4 4 4 0 ...
## $ YearsSinceLastPromotion : int 0 0 0 0 0 0 0 0 0 0 ...
## $ YearsWithCurrManager : int 5 5 5 5 5 5 5 5 5 2 ...
## $ Employee Source : Factor w/ 9 levels "Adzuna","Company Website",..: 8 8 8 8 8 8 8 8 8 2 ...
## $ AgeStartedWorking : int 33 29 33 29 29 29 33 33 33 20 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Transforming numeric variables to integer type
View(data_hr)
data_hr$DistanceFromHome <- as.integer(data_hr$DistanceFromHome)
data_hr$MonthlyIncome <- as.integer(data_hr$MonthlyIncome)
data_hr$PercentSalaryHike <- as.integer(data_hr$PercentSalaryHike)
# Drop the factor levels with 0 count
data <- droplevels(data_hr)
str(data_hr)
## Classes 'data.table' and 'data.frame': 23058 obs. of 30 variables:
## $ Age : int 41 37 41 37 37 37 41 41 41 41 ...
## $ Attrition : Factor w/ 3 levels "Current employee",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 1 3 1 1 1 3 3 3 3 ...
## $ DistanceFromHome : int 1 6 1 6 6 6 1 1 1 1 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 2 4 2 4 4 4 2 2 2 2 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 1 2 3 1 3 2 2 2 2 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 1 2 1 1 1 2 2 2 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 3 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 4 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 8 8 8 8 8 8 8 8 4 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ MonthlyIncome : int 5993 5993 5993 5993 5993 5993 5993 5993 5993 14756 ...
## $ NumCompaniesWorked : int 8 8 4 5 8 5 8 4 8 2 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ PercentSalaryHike : int 11 11 11 11 11 11 11 11 11 14 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 1 1 1 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 4 ...
## $ TotalWorkingYears : int 8 8 8 8 8 8 8 8 8 21 ...
## $ TrainingTimesLastYear : int 0 0 0 0 0 0 0 0 0 2 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
## $ YearsAtCompany : int 6 6 6 6 6 6 6 6 6 5 ...
## $ YearsInCurrentRole : int 4 4 4 4 4 4 4 4 4 0 ...
## $ YearsSinceLastPromotion : int 0 0 0 0 0 0 0 0 0 0 ...
## $ YearsWithCurrManager : int 5 5 5 5 5 5 5 5 5 2 ...
## $ Employee Source : Factor w/ 9 levels "Adzuna","Company Website",..: 8 8 8 8 8 8 8 8 8 2 ...
## $ AgeStartedWorking : int 33 29 33 29 29 29 33 33 33 20 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(data_hr)
## Age Attrition BusinessTravel
## Min. :18.00 Current employee :19370 Non-Travel : 2344
## 1st Qu.:30.00 Termination : 87 Travel_Frequently: 4378
## Median :36.00 Voluntary Resignation: 3601 Travel_Rarely :16336
## Mean :37.04
## 3rd Qu.:43.00
## Max. :60.00
##
## Department DistanceFromHome Education
## Human Resources : 1010 Min. : 1.000 1:2659
## Research & Development:15040 1st Qu.: 2.000 2:4436
## Sales : 7008 Median : 7.000 3:8930
## Mean : 9.215 4:6279
## 3rd Qu.:14.000 5: 754
## Max. :29.000
##
## EducationField EnvironmentSatisfaction Gender JobInvolvement
## Human Resources : 442 1:4490 Female: 9205 1: 1287
## Life Sciences :9513 2:4476 Male :13853 2: 5888
## Marketing :2484 3:7091 3:13644
## Medical :7267 4:7001 4: 2239
## Other :1291
## Technical Degree:2061
##
## JobLevel JobRole JobSatisfaction MaritalStatus
## 1:8594 Sales Executive :5067 1:4575 Divorced: 5163
## 2:8448 Research Scientist :4591 2:4371 Married :10543
## 3:3440 Laboratory Technician :4112 3:6938 Single : 7352
## 4:1563 Manufacturing Director :2346 4:7174
## 5:1013 Healthcare Representative:2069
## Manager :1521
## (Other) :3352
## MonthlyIncome NumCompaniesWorked OverTime PercentSalaryHike
## Min. : 1009 Min. :0.000 No :16524 Min. :11.00
## 1st Qu.: 2900 1st Qu.:1.000 Yes: 6534 1st Qu.:12.00
## Median : 4898 Median :2.000 Median :14.00
## Mean : 6416 Mean :2.691 Mean :15.22
## 3rd Qu.: 8120 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :19999 Max. :9.000 Max. :25.00
##
## PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 3:19478 1:4331 0:9873 Min. : 0.00
## 4: 3580 2:4762 1:9370 1st Qu.: 6.00
## 3:7164 2:2497 Median :10.00
## 4:6801 3:1318 Mean :11.07
## 3rd Qu.:15.00
## Max. :40.00
##
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 1: 1263 Min. : 0.00 Min. : 0.000
## 1st Qu.:2.000 2: 5374 1st Qu.: 3.00 1st Qu.: 2.000
## Median :3.000 3:14016 Median : 5.00 Median : 3.000
## Mean :2.804 4: 2405 Mean : 6.91 Mean : 4.201
## 3rd Qu.:3.000 3rd Qu.: 9.00 3rd Qu.: 7.000
## Max. :6.000 Max. :40.00 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager Employee Source
## Min. : 0.000 Min. : 0.000 Company Website:5327
## 1st Qu.: 0.000 1st Qu.: 2.000 Seek :3655
## Median : 1.000 Median : 3.000 Indeed :2471
## Mean : 2.164 Mean : 4.091 Jora :2408
## 3rd Qu.: 3.000 3rd Qu.: 7.000 LinkedIn :2294
## Max. :15.000 Max. :17.000 Recruit.net :2283
## (Other) :4620
## AgeStartedWorking
## Min. : 0.00
## 1st Qu.:20.00
## Median :25.00
## Mean :25.96
## 3rd Qu.:31.00
## Max. :60.00
##
View(data_hr) # I always check the dataset after some interactions to ensure nothing was lost or done incorrectly.
##### 3. FEATURE ENGINEERING #####
# As a Data professional, I always like to work on Feature Engineering because on some analysis, it can simplify and/or boost accuracy of the model, without changing the data.
# I have created a previous years of experience column to better visualize the employee's experience profile.
data_hr$PriorYearsOfExperience <- data_hr$TotalWorkingYears - data_hr$YearsAtCompany
View(data_hr)
# Job tenure is the measure of how long an employee is employed by his or her current employer.
# An employee's job tenure is very important, and employers often consider job tenure a criterion for hiring new employees.
# Job tenure can be long or short.
# We have created a new average tenure feature.
# To understand the average tenure of employees at previous companies.
data_hr$AverageTenure <- data_hr$PriorYearsOfExperience / data_hr$NumCompaniesWorked
View(data_hr)
# The mean of tenure stability produces values like INF due to the nature of its derivation.
# Let's substitute them to zero
summary(data_hr$AverageTenure)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1 Inf 4 Inf 372
data_hr$AverageTenure[!is.finite(data_hr$AverageTenure)] <- 0
summary(data_hr$AverageTenure) # Here I check the variable again to ensure the data has been dealt with in a proper way.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.3333 1.7725 1.5000 40.0000
View(data_hr)
# We analyze and split the data based on the Attrition column,
# Which indicates whether the employee was terminated from the company.
# I want to get rid of the "Terminated" employees in this data set since they represent a very small percentage of entries (0.38%).
data_hr_1 <- data_hr[data_hr$Attrition != 'Termination']
data_hr_1 <- droplevels(data_hr_1)
dim(data_hr_1)
## [1] 22971 32
summary(data_hr_1)
## Age Attrition BusinessTravel
## Min. :18.00 Current employee :19370 Non-Travel : 2344
## 1st Qu.:30.00 Voluntary Resignation: 3601 Travel_Frequently: 4363
## Median :36.00 Travel_Rarely :16264
## Mean :37.06
## 3rd Qu.:43.00
## Max. :60.00
##
## Department DistanceFromHome Education
## Human Resources : 1010 Min. : 1.000 1:2659
## Research & Development:14977 1st Qu.: 2.000 2:4421
## Sales : 6984 Median : 7.000 3:8890
## Mean : 9.191 4:6247
## 3rd Qu.:14.000 5: 754
## Max. :29.000
##
## EducationField EnvironmentSatisfaction Gender JobInvolvement
## Human Resources : 442 1:4482 Female: 9173 1: 1279
## Life Sciences :9494 2:4460 Male :13798 2: 5849
## Marketing :2484 3:7067 3:13612
## Medical :7215 4:6962 4: 2231
## Other :1291
## Technical Degree:2045
##
## JobLevel JobRole JobSatisfaction MaritalStatus
## 1:8547 Sales Executive :5051 1:4543 Divorced: 5148
## 2:8432 Research Scientist :4576 2:4355 Married :10502
## 3:3424 Laboratory Technician :4088 3:6914 Single : 7321
## 4:1563 Manufacturing Director :2338 4:7159
## 5:1005 Healthcare Representative:2061
## Manager :1513
## (Other) :3344
## MonthlyIncome NumCompaniesWorked OverTime PercentSalaryHike
## Min. : 1009 Min. :0.000 No :16476 Min. :11.00
## 1st Qu.: 2909 1st Qu.:1.000 Yes: 6495 1st Qu.:12.00
## Median : 4898 Median :2.000 Median :14.00
## Mean : 6418 Mean :2.688 Mean :15.22
## 3rd Qu.: 8120 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :19999 Max. :9.000 Max. :25.00
##
## PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 3:19407 1:4315 0:9826 Min. : 0.00
## 4: 3564 2:4739 1:9330 1st Qu.: 6.00
## 3:7132 2:2497 Median :10.00
## 4:6785 3:1318 Mean :11.08
## 3rd Qu.:15.00
## Max. :40.00
##
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 1: 1255 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 2: 5359 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 3:13960 Median : 5.000 Median : 3.000
## Mean :2.805 4: 2397 Mean : 6.914 Mean : 4.202
## 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager Employee Source
## Min. : 0.000 Min. : 0.000 Company Website:5307
## 1st Qu.: 0.000 1st Qu.: 2.000 Seek :3622
## Median : 1.000 Median : 3.000 Indeed :2459
## Mean : 2.167 Mean : 4.096 Jora :2398
## 3rd Qu.: 3.000 3rd Qu.: 7.000 LinkedIn :2294
## Max. :15.000 Max. :17.000 Recruit.net :2273
## (Other) :4618
## AgeStartedWorking PriorYearsOfExperience AverageTenure
## Min. : 0.00 Min. : 0.000 Min. : 0.0000
## 1st Qu.:20.00 1st Qu.: 0.000 1st Qu.: 0.0000
## Median :25.00 Median : 2.000 Median : 0.3333
## Mean :25.98 Mean : 4.165 Mean : 1.7700
## 3rd Qu.:31.00 3rd Qu.: 5.000 3rd Qu.: 1.5000
## Max. :60.00 Max. :40.000 Max. :40.0000
##
##### 4. EXPLORATORY DATA ANALYSIS #####
# Univariate Analysis: Plots
ggplot(data_hr) + geom_bar(aes(x = Gender))

ggplot(data_hr) + geom_density(aes(x = Age))

ggplot(data_hr) + geom_bar(aes(x = Attrition))

ggplot(data_hr) + geom_bar(aes(x = Department))

ggplot(data_hr) + geom_bar(aes(x = JobRole))

ggplot(data_hr) + geom_bar(aes(x = Education)) + facet_grid(~EducationField)

# Multiplot Grid
p.TotalWorkingYears <- ggplot(data_hr) + geom_density(aes(TotalWorkingYears))
p.YearsAtCompany <- ggplot(data_hr) + geom_density(aes(YearsAtCompany))
p.YearsSinceLastPromotion <- ggplot(data_hr) + geom_density(aes(YearsSinceLastPromotion))
p.YearsWithCurrManager <- ggplot(data_hr) + geom_density(aes(YearsWithCurrManager))
p.YearsInCurrentRole <- ggplot(data_hr) + geom_density(aes(YearsInCurrentRole))
p.PriorYearsOfExperience <- ggplot(data_hr) + geom_density(aes(PriorYearsOfExperience))
# Grid arrangement
grid.arrange(p.TotalWorkingYears,
p.YearsAtCompany,
p.YearsSinceLastPromotion,
p.YearsWithCurrManager,
p.YearsInCurrentRole,
p.PriorYearsOfExperience,
nrow = 2,
ncol = 3)

# Previous experience
# Let's find out the proportion of employees with less than a few years of experience
# (Chosen values: 1, 3, 5, 7, 10 years)
length(which(data_hr$PriorYearsOfExperience < 1)) / length(data_hr$PriorYearsOfExperience)
## [1] 0.3246596
length(which(data_hr$PriorYearsOfExperience < 3)) / length(data_hr$PriorYearsOfExperience)
## [1] 0.5828346
length(which(data_hr$PriorYearsOfExperience < 5)) / length(data_hr$PriorYearsOfExperience)
## [1] 0.7085177
length(which(data_hr$PriorYearsOfExperience < 7)) / length(data_hr$PriorYearsOfExperience)
## [1] 0.7952121
length(which(data_hr$PriorYearsOfExperience < 10)) / length(data_hr$PriorYearsOfExperience)
## [1] 0.8589644
# Example insight:
# 58% of employees have less than 3 years of work experience before joining IBM
# Possible problems: underdeveloped skill sets, young employee base, immature "work mentality".
# Age
length(which(data_hr$Age < 30)) / length(data_hr$Age)
## [1] 0.2165409
# Example insight:
# Only 22% of employees are under 30, the employee base is not exactly as young as expected.
# Education
summary(data_hr$Education)
## 1 2 3 4 5
## 2659 4436 8930 6279 754
length(which(data_hr$Education == 3)) / length(data_hr$Education)
## [1] 0.3872842
length(which(data_hr$Education == 4)) / length(data_hr$Education)
## [1] 0.2723133
# Example insight:
# About 39% of employees are college graduates and 27% have held a master's degree.
# The pursuit of higher education may have led to a decrease in work experience.
# Boxplot showing the distribution of monthly salary for all 4 levels of job satisfaction from 1-4
ggplot(data = subset(data_hr, !is.na(JobSatisfaction)), aes(JobSatisfaction, MonthlyIncome)) +
geom_boxplot()

# Insight Example:
# There are no obvious signs that a higher salary leads to higher job satisfaction
# Correlation
cor(data_hr$TotalWorkingYears, data_hr$YearsAtCompany, use = "complete.obs")
## [1] 0.624816
cor(data_hr$YearsAtCompany, data_hr$YearsInCurrentRole, use = "complete.obs")
## [1] 0.7670497
cor(data_hr$YearsAtCompany, data_hr$YearsSinceLastPromotion, use = "complete.obs")
## [1] 0.6236737
cor(data_hr$YearsAtCompany, data_hr$YearsWithCurrManager, use = "complete.obs")
## [1] 0.7728072
cor(data_hr$TotalWorkingYears, data_hr$MonthlyIncome, use = "complete.obs")
## [1] 0.7582066
cor(data_hr$YearsAtCompany, data_hr$MonthlyIncome, use = "complete.obs")
## [1] 0.4981578
# Scatterplots
ggplot(data_hr) + geom_point(aes(TotalWorkingYears, MonthlyIncome)) # Total compensation vs. Total Work Years seems to make sense.

ggplot(data_hr) + geom_point(aes(YearsAtCompany, MonthlyIncome)) # There is a significant reduction on amount of workers that earn less than 7.5k monthly after 10 years at the company.

# Let's investigate the relationship between work-life balance and monthly income
ggplot(data = subset(data_hr, !is.na(WorkLifeBalance)), aes(WorkLifeBalance, MonthlyIncome)) +
geom_boxplot()

# Example insight:
# Employees who rated work-life balance equal to 1 also have significantly lower average monthly income .
# Low work-life balance and low pay? A problem that the HR department needs to examine.
# Checking the pay gap between men and women.
ggplot(data = subset(data_hr, !is.na(Gender)), aes(Gender, MonthlyIncome, fill = Gender)) +
geom_boxplot() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5, size = 10)) +
labs(x = "Gender", y = "Monthly Income", title = "Monthly Salary Between Genders") +
coord_flip()

# Example of insight
# There are no signs of gender discrimination; in fact, women earn slightly more, on average, disregarding all other factors.
# Salary per Position
# We could ask the HR why there are so many outliers in this category.
ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, MonthlyIncome)) +
ggtitle("Monthly Salary Per Position")

# Insight Example
# This one is VERY interesting, showing us that there are some people with incorrect data. How one would start working at an age below 15, 10, or even 5?
# This visualization shows us that the information may be incorrect and we could recommend the HR to review the info.
# Decision: I am not using this variable anymore on further analysis.
ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, AgeStartedWorking)) +
ggtitle("Age When Started in the Role")

ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, Age)) +
ggtitle("Age By Role")

ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, YearsAtCompany)) +
ggtitle("Company Time (in years)")

ggplot(data = na.omit(data_hr)) + geom_bar(aes(JobRole, fill = Education), position = "fill") +
ggtitle("Education Level By Role") +
ylab("Proportion")

# Multivariate analysis plots for commonly used variables during the hiring process
ggplot(data = data_hr_1) +
geom_bar(aes(x = Education , fill = Attrition), position = 'fill') +
facet_grid(.~Department)

ggplot(data = data_hr_1) +
geom_bar(aes(x = Education , fill = Attrition), position = 'fill') +
facet_grid(.~JobRole)

ggplot(data = data_hr_1) +
geom_bar(aes(x = EducationField , fill = Attrition), position = 'fill') +
facet_grid(.~JobRole) +
theme(axis.text.x = element_text(angle = -90, hjust = 0))

# Multivariate analysis plots for commonly used variables after the hiring process.
ggplot(data_hr_1) + geom_bar(aes(x = Age, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = Department, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = DistanceFromHome, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = `Employee Source`, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = JobRole, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = MaritalStatus, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = Education, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = EducationField, fill = Attrition),position ='fill')

ggplot(data_hr_1) + geom_bar(aes(x = Gender, fill = Attrition), position = 'fill')

# Multivariate analysis plots between some variables and employee status.
ggplot(data_hr_1) + geom_boxplot(aes(Attrition, MonthlyIncome)) # Insight: Average Salary of people that leave the company is slightly lower than people that stay. That could indicate they are searching for a career progression. HR could study a better progression or salary balance between peers.

ggplot(data_hr_1) + geom_boxplot(aes(Attrition, PercentSalaryHike))

ggplot(data_hr_1) + geom_bar(aes(TrainingTimesLastYear, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(BusinessTravel, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(OverTime, fill = Attrition), position = 'fill') # Insight: Search better ways to compensate employees that work on overtime or better ways to unload them. Overtimes should be temporary.

ggplot(data_hr_1) + geom_bar(aes(StockOptionLevel, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(EnvironmentSatisfaction, fill = Attrition), position = 'fill') # Insight: Run a survey and 1:1 HR interviews on people that are no so satisfied with the working environment and understand what could improve.

ggplot(data_hr_1) + geom_bar(aes(JobSatisfaction, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(JobInvolvement, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(RelationshipSatisfaction, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(WorkLifeBalance, fill = Attrition), position = 'fill')

##### 5. PREDICTIVE MODELING / MACHINE LEARNING #####
# Let's focus our work on trying to help HR recruit better in order to avoid friction and, consequently, resignations
# I will create 5 versions of the model and for each one I will explore the options and interpret the result.
# First version of the model with a few variables
# 5.1 - GLM (GENERALIZED LINEAR MODELS, BINOMIAL (LOGISTIC REGRESSION))
model_v1 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` +
JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience + Gender +
Education + EducationField,
family = binomial,
data = data_hr)
# Here I call a summary of the model which, mind you:
# I did not split the dataset into 70/30 for Training and Testing.
# The reason is that I want to find out what are the most significant variables that lead to employee resignation requests.
summary(model_v1)
##
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome +
## `Employee Source` + JobRole + MaritalStatus + AverageTenure +
## PriorYearsOfExperience + Gender + Education + EducationField,
## family = binomial, data = data_hr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4738 -0.6239 -0.4962 -0.3553 2.7405
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.515415 0.198808 -2.593 0.009527 **
## Age -0.046402 0.002434 -19.062 < 2e-16 ***
## DepartmentResearch & Development -0.402413 0.102837 -3.913 9.11e-05 ***
## DepartmentSales 0.041108 0.106275 0.387 0.698901
## DistanceFromHome 0.022014 0.002497 8.816 < 2e-16 ***
## `Employee Source`Company Website 0.200175 0.074567 2.684 0.007264 **
## `Employee Source`GlassDoor -0.002062 0.089568 -0.023 0.981630
## `Employee Source`Indeed -0.048126 0.088966 -0.541 0.588545
## `Employee Source`Jora 0.202494 0.084534 2.395 0.016602 *
## `Employee Source`LinkedIn -0.086527 0.090292 -0.958 0.337911
## `Employee Source`Recruit.net -0.024145 0.088800 -0.272 0.785699
## `Employee Source`Referral 0.222132 0.147177 1.509 0.131226
## `Employee Source`Seek 0.039192 0.079096 0.495 0.620253
## JobRoleHuman Resources 0.092163 0.125250 0.736 0.461832
## JobRoleLaboratory Technician 0.313456 0.079749 3.931 8.48e-05 ***
## JobRoleManager -0.370055 0.121400 -3.048 0.002302 **
## JobRoleManufacturing Director -0.091942 0.094178 -0.976 0.328937
## JobRoleResearch Director -0.326907 0.125855 -2.597 0.009391 **
## JobRoleResearch Scientist 0.102218 0.078537 1.302 0.193080
## JobRoleSales Executive -0.030434 0.079097 -0.385 0.700414
## JobRoleSales Representative 0.484732 0.095181 5.093 3.53e-07 ***
## MaritalStatusMarried 0.179376 0.053279 3.367 0.000761 ***
## MaritalStatusSingle 0.740422 0.053393 13.867 < 2e-16 ***
## AverageTenure -0.016927 0.009230 -1.834 0.066663 .
## PriorYearsOfExperience 0.018901 0.005353 3.531 0.000414 ***
## GenderMale 0.033768 0.038421 0.879 0.379467
## Education2 0.096221 0.068965 1.395 0.162951
## Education3 0.129656 0.061109 2.122 0.033862 *
## Education4 0.120603 0.066456 1.815 0.069558 .
## Education5 -0.221560 0.134302 -1.650 0.099001 .
## EducationFieldLife Sciences -0.149802 0.143779 -1.042 0.297462
## EducationFieldMarketing -0.122315 0.152984 -0.800 0.423984
## EducationFieldMedical -0.176829 0.145066 -1.219 0.222859
## EducationFieldOther -0.170949 0.161651 -1.058 0.290274
## EducationFieldTechnical Degree 0.183255 0.154276 1.188 0.234898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20272 on 23057 degrees of freedom
## Residual deviance: 18904 on 23023 degrees of freedom
## AIC: 18974
##
## Number of Fisher Scoring iterations: 5
# Another example of insight:
# I might recommend the HR, for instance, to pay attention in recruiting especially to variables with GIVF > 2, such as:
# Department
# Job Role
# Average Tenure
# Prior Years of Experience
# Not only that, the variables tagged as three stars at the Summary '***' are also extra relevant, such as AGE.
vif(model_v1)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.197853 1 1.094465
## Department 2.027501 2 1.193274
## DistanceFromHome 1.321206 1 1.149437
## `Employee Source` 1.107922 8 1.006426
## JobRole 2.564522 8 1.060628
## MaritalStatus 1.042578 2 1.010479
## AverageTenure 2.478002 1 1.574167
## PriorYearsOfExperience 2.440072 1 1.562073
## Gender 1.019571 1 1.009738
## Education 1.121235 4 1.014407
## EducationField 1.648089 5 1.051231
# I will divide the data into training and test. We will work with data without records of fired people.
set.seed(2004)
index_training <- sample.split(Y = data_hr_1$Attrition, SplitRatio = 0.7)
data_hr_1_training <- subset(data_hr_1, train = T)
data_hr_1_test <- subset(data_hr_1, train = F)
# Second version of the model with training data
model_v2 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` +
JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience + Gender +
Education + EducationField,
family = binomial,
data = data_hr_1_training)
summary(model_v2)
##
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome +
## `Employee Source` + JobRole + MaritalStatus + AverageTenure +
## PriorYearsOfExperience + Gender + Education + EducationField,
## family = binomial, data = data_hr_1_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4484 -0.6177 -0.4918 -0.3558 2.7300
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.499751 0.199492 -2.505 0.012241 *
## Age -0.044889 0.002446 -18.348 < 2e-16 ***
## DepartmentResearch & Development -0.427955 0.103053 -4.153 3.28e-05 ***
## DepartmentSales 0.025684 0.106499 0.241 0.809423
## DistanceFromHome 0.020372 0.002522 8.076 6.69e-16 ***
## `Employee Source`Company Website 0.183335 0.074868 2.449 0.014334 *
## `Employee Source`GlassDoor 0.006274 0.089680 0.070 0.944229
## `Employee Source`Indeed -0.080908 0.089734 -0.902 0.367244
## `Employee Source`Jora 0.183678 0.084958 2.162 0.030618 *
## `Employee Source`LinkedIn -0.079145 0.090405 -0.875 0.381325
## `Employee Source`Recruit.net -0.050665 0.089444 -0.566 0.571095
## `Employee Source`Referral 0.230121 0.147168 1.564 0.117897
## `Employee Source`Seek -0.005837 0.079828 -0.073 0.941708
## JobRoleHuman Resources 0.107348 0.125753 0.854 0.393302
## JobRoleLaboratory Technician 0.314968 0.080707 3.903 9.52e-05 ***
## JobRoleManager -0.402633 0.123788 -3.253 0.001144 **
## JobRoleManufacturing Director -0.083426 0.095273 -0.876 0.381221
## JobRoleResearch Director -0.292195 0.126243 -2.315 0.020637 *
## JobRoleResearch Scientist 0.111877 0.079359 1.410 0.158608
## JobRoleSales Executive -0.028140 0.079873 -0.352 0.724611
## JobRoleSales Representative 0.478077 0.096067 4.977 6.47e-07 ***
## MaritalStatusMarried 0.176289 0.053865 3.273 0.001065 **
## MaritalStatusSingle 0.747383 0.053896 13.867 < 2e-16 ***
## AverageTenure -0.021245 0.009467 -2.244 0.024825 *
## PriorYearsOfExperience 0.019787 0.005399 3.665 0.000248 ***
## GenderMale 0.030982 0.038752 0.800 0.424000
## Education2 0.067584 0.069195 0.977 0.328712
## Education3 0.092553 0.061236 1.511 0.130684
## Education4 0.071013 0.066760 1.064 0.287461
## Education5 -0.233758 0.134267 -1.741 0.081685 .
## EducationFieldLife Sciences -0.148858 0.143810 -1.035 0.300620
## EducationFieldMarketing -0.106268 0.152995 -0.695 0.487317
## EducationFieldMedical -0.202212 0.145203 -1.393 0.163736
## EducationFieldOther -0.137807 0.161652 -0.852 0.393940
## EducationFieldTechnical Degree 0.180977 0.154552 1.171 0.241608
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19951 on 22970 degrees of freedom
## Residual deviance: 18626 on 22936 degrees of freedom
## AIC: 18696
##
## Number of Fisher Scoring iterations: 5
# We can verify at this point that our subset "data_hr_1"
# provides a very similar result to the original, the only difference,
# and a significant one at that, is that we do not use the data of those
# who were fired by the company, which makes the analysis more precise.
vif(model_v2)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.196689 1 1.093933
## Department 2.033631 2 1.194175
## DistanceFromHome 1.321839 1 1.149712
## `Employee Source` 1.109202 8 1.006499
## JobRole 2.550173 8 1.060256
## MaritalStatus 1.043120 2 1.010610
## AverageTenure 2.441327 1 1.562475
## PriorYearsOfExperience 2.412397 1 1.553189
## Gender 1.018652 1 1.009283
## Education 1.123257 4 1.014635
## EducationField 1.650765 5 1.051401
# Predictions
threshold <- 0.5
predictions_v2 <- predict(model_v2, type = 'response', newdata = data_hr_1_test)
final_predictions_v2 <- ifelse(predictions_v2 > threshold, 'Voluntary Resignation', 'Current employee')
table(data_hr_1_test$Attrition, final_predictions_v2)
## final_predictions_v2
## Current employee Voluntary Resignation
## Current employee 19328 42
## Voluntary Resignation 3523 78
# There was a high rate of error at this prediction. Why? - Because the dataset is not balanced.
# But when we talk about data science, we talk about decision making and, in this particular scenario, I'm not deploying the ML Algorithm, but I am in fact running analysis to recommend the HR better practices in order to reduce voluntary resignations that would improve HR's efficiency.
# But I'm not done: I want to generate newer versions of this model with less variables, because in an eventual need for a ML model deploy for some application, it would be interesting to run a lighter version of it without sacrificing accuracy.
# Let's get some not so relevant variables out of the way.
# Third version of the model with training data and without education variables
model_v3 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` +
JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience + Gender,
family = binomial,
data = data_hr_1_training)
summary(model_v3)
##
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome +
## `Employee Source` + JobRole + MaritalStatus + AverageTenure +
## PriorYearsOfExperience + Gender, family = binomial, data = data_hr_1_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3428 -0.6201 -0.4941 -0.3619 2.7143
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.594443 0.163302 -3.640 0.000272 ***
## Age -0.044338 0.002361 -18.781 < 2e-16 ***
## DepartmentResearch & Development -0.455831 0.097648 -4.668 3.04e-06 ***
## DepartmentSales 0.006375 0.100798 0.063 0.949567
## DistanceFromHome 0.023945 0.002219 10.792 < 2e-16 ***
## `Employee Source`Company Website 0.185836 0.074684 2.488 0.012835 *
## `Employee Source`GlassDoor 0.004131 0.089469 0.046 0.963174
## `Employee Source`Indeed -0.084488 0.089587 -0.943 0.345638
## `Employee Source`Jora 0.182141 0.084629 2.152 0.031378 *
## `Employee Source`LinkedIn -0.073833 0.090249 -0.818 0.413300
## `Employee Source`Recruit.net -0.058670 0.089241 -0.657 0.510903
## `Employee Source`Referral 0.237922 0.146800 1.621 0.105078
## `Employee Source`Seek -0.006818 0.079571 -0.086 0.931717
## JobRoleHuman Resources 0.099083 0.125594 0.789 0.430163
## JobRoleLaboratory Technician 0.312339 0.080556 3.877 0.000106 ***
## JobRoleManager -0.418085 0.123665 -3.381 0.000723 ***
## JobRoleManufacturing Director -0.079696 0.095061 -0.838 0.401826
## JobRoleResearch Director -0.308958 0.126075 -2.451 0.014263 *
## JobRoleResearch Scientist 0.119993 0.079265 1.514 0.130071
## JobRoleSales Executive -0.023432 0.079774 -0.294 0.768961
## JobRoleSales Representative 0.483836 0.095952 5.042 4.60e-07 ***
## MaritalStatusMarried 0.176480 0.053793 3.281 0.001035 **
## MaritalStatusSingle 0.747665 0.053772 13.904 < 2e-16 ***
## AverageTenure -0.019906 0.009465 -2.103 0.035453 *
## PriorYearsOfExperience 0.019187 0.005400 3.553 0.000381 ***
## GenderMale 0.033764 0.038690 0.873 0.382838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19951 on 22970 degrees of freedom
## Residual deviance: 18668 on 22945 degrees of freedom
## AIC: 18720
##
## Number of Fisher Scoring iterations: 5
vif(model_v3)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.116616 1 1.056701
## Department 1.733091 2 1.147375
## DistanceFromHome 1.008189 1 1.004086
## `Employee Source` 1.080863 8 1.004872
## JobRole 2.519868 8 1.059464
## MaritalStatus 1.037566 2 1.009262
## AverageTenure 2.443412 1 1.563142
## PriorYearsOfExperience 2.418527 1 1.555161
## Gender 1.017671 1 1.008797
# Predictions
threshold <- 0.5
predictions_v3 <- predict(model_v3, type = 'response', newdata = data_hr_1_test)
final_predictions_v3 <- ifelse(predictions_v3 > threshold, 'Voluntary Resignation', 'Current employee')
table(data_hr_1_test$Attrition, final_predictions_v3)
## final_predictions_v3
## Current employee Voluntary Resignation
## Current employee 19328 42
## Voluntary Resignation 3541 60
# With this example, we show that it is in fact possible to increase efficiency without a strong hit on accuracy.
# Now, last example, with even less variables.
# Fourth version of the model with training data and without education and gender variables
model_v4 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` +
JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience,
family = binomial,
data = data_hr_1_training)
summary(model_v4)
##
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome +
## `Employee Source` + JobRole + MaritalStatus + AverageTenure +
## PriorYearsOfExperience, family = binomial, data = data_hr_1_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3360 -0.6192 -0.4939 -0.3622 2.7205
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.569968 0.160865 -3.543 0.000395 ***
## Age -0.044408 0.002359 -18.822 < 2e-16 ***
## DepartmentResearch & Development -0.457114 0.097648 -4.681 2.85e-06 ***
## DepartmentSales 0.004776 0.100790 0.047 0.962208
## DistanceFromHome 0.023979 0.002218 10.810 < 2e-16 ***
## `Employee Source`Company Website 0.185968 0.074691 2.490 0.012780 *
## `Employee Source`GlassDoor 0.004217 0.089473 0.047 0.962404
## `Employee Source`Indeed -0.082065 0.089543 -0.916 0.359412
## `Employee Source`Jora 0.182210 0.084632 2.153 0.031321 *
## `Employee Source`LinkedIn -0.073105 0.090254 -0.810 0.417948
## `Employee Source`Recruit.net -0.058149 0.089234 -0.652 0.514631
## `Employee Source`Referral 0.240776 0.146746 1.641 0.100844
## `Employee Source`Seek -0.006816 0.079577 -0.086 0.931742
## JobRoleHuman Resources 0.100479 0.125614 0.800 0.423769
## JobRoleLaboratory Technician 0.315123 0.080478 3.916 9.02e-05 ***
## JobRoleManager -0.419678 0.123673 -3.393 0.000690 ***
## JobRoleManufacturing Director -0.082962 0.094978 -0.873 0.382397
## JobRoleResearch Director -0.310452 0.126056 -2.463 0.013785 *
## JobRoleResearch Scientist 0.120223 0.079252 1.517 0.129277
## JobRoleSales Executive -0.023015 0.079761 -0.289 0.772925
## JobRoleSales Representative 0.482258 0.095927 5.027 4.97e-07 ***
## MaritalStatusMarried 0.175136 0.053769 3.257 0.001125 **
## MaritalStatusSingle 0.745551 0.053714 13.880 < 2e-16 ***
## AverageTenure -0.019985 0.009465 -2.112 0.034727 *
## PriorYearsOfExperience 0.019266 0.005398 3.569 0.000358 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19951 on 22970 degrees of freedom
## Residual deviance: 18668 on 22946 degrees of freedom
## AIC: 18718
##
## Number of Fisher Scoring iterations: 5
vif(model_v4)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.115570 1 1.056205
## Department 1.732346 2 1.147252
## DistanceFromHome 1.007827 1 1.003906
## `Employee Source` 1.078323 8 1.004724
## JobRole 2.493476 8 1.058767
## MaritalStatus 1.035363 2 1.008726
## AverageTenure 2.443871 1 1.563288
## PriorYearsOfExperience 2.417511 1 1.554835
# Predictions
threshold <- 0.5
predictions_v4 <- predict(model_v4, type = 'response', newdata = data_hr_1_test)
final_predictions_v4 <- ifelse(predictions_v4 > threshold, 'Voluntary Resignation', 'Current employee')
table(data_hr_1_test$Attrition, final_predictions_v4)
## final_predictions_v4
## Current employee Voluntary Resignation
## Current employee 19326 44
## Voluntary Resignation 3545 56
# CONCLUSION: As a Data Scientist, the ideal scenario is to provide the BALANCE between model's speed and accuracy, therefore its efficiency. This is a particular hard task to do and many actions could be applied, such as dataset balancing, use of different parameters, different models and so on.
# The idea here was met with the very first classification algorithm: It was able to be verified the most important variables to the hiring process.
# But now, I want to offer a different approach: I want a provide a Tree for decision making, just to make life easier, something that would allow the HR to come up with a playbook whenever talking about hiring or maintaining people motivated.
# Fifth version of the model with training data and without education and gender variables and another algorithm.
# Recursive Partitioning and Regression Trees
model_v5 <- rpart(Attrition ~ Age + Department + DistanceFromHome + JobRole + MaritalStatus +
AverageTenure + PriorYearsOfExperience,
method = "class",
control = rpart.control(minsplit = 500, cp = 0),
data = data_hr_1_training)
# Now, let's plot the tree. How to read it:
# Every time there's a decision, left means "YES" and right means "NO".
# Example: Is currently employee's age >= 34? If YES, 62% of these employees tend to remain working for the company.
# IF NO, the remaining 38% of employees fall down on other categories that will be analysed and will spit out the probability of having a voluntary resignation or not.
# Amazing how in one dataset, age would be such an important factor. Data-driven decision are indeed powerful.
rpart.plot(model_v5)

##### 6. ENDING & FINAL REMARKS #####
# Now that's the end.
# THANK YOU for sticking around and I hope this project was useful to you.
# Is there room for improvement? Yes.
# Could I plug this into a BI tool and create different sorts of dashboards? Yes.
# Could I write a script and present this to the HR and board of directors? Yes.
# But I guess that already does the trick for an R portfolio project.
# Hope talking to you soon!
# CONTACT INFO:
# Helio Ribeiro
# helioribeiropro@gmail.com
# +55 (11) 9 3932-8049