##### PROJECT: People Analytics (HR Analytics) - What factors mostly cause friction in the workplace? #####

# SUMMARY: One of Google's biggest success was applying Data Science to check if its employees are, in fact, committed to their jobs.

# Google identified several aspects that would influence if the employee had plans of leaving the company and what actually worked as a motivator.

# Through Analysis and ML, the results allowed Google to aim into the right direction, saving up to 7MM USD on recruiting processes per year.

# This project objective is to identify WHAT FACTORS MOSTLY CAUSE FRICTION IN THE WORKPLACE, to support HR development and prevent resignations.

# The dataset is provided by IBM and it is partially original, with some extra synthetic inputs (and some problems, as I decide on how to deal with them below).

# OBJECTIVE: Provide HR insights on how to hire professionals with a better fit and match, and how to maintain current employees.

# SOURCE 1: https://d3.harvard.edu/platform-digit/submission/people-analytics-at-google-using-data-to-make-google-a-great-place-to-work/
# SOURCE 2: https://www.sage.com/en-au/blog/case-study-how-google-uses-people-analytics/
# SOURCE 3: https://www.linkedin.com/pulse/people-analytics-takes-off-ten-things-weve-learned-josh-bersin/
# DATASET: https://github.com/IBM/employee-attrition-aif360/blob/master/data/emp_attrition.csv

# Let's begin!

##### 1. LIBRARIES & DATA LOAD #####
library(caret)
library(ggplot2)
library(gridExtra)
library(data.table)
library(car)
library(caTools)
library(corrplot)
library(rpart)
library(rpart.plot)

# Dataset load, quick browse
data_hr <- fread('data/HR_dataset.csv')
dim(data_hr) # DIM FUNCTION: Always useful to easily understand the dataset dimensions.
## [1] 23058    30
View(data_hr) # VIEW FUNCTION: To view in table format.
str(data_hr) # STR FUNCTION: No factor variable identified.
## Classes 'data.table' and 'data.frame':   23058 obs. of  30 variables:
##  $ Age                     : int  41 37 41 37 37 37 41 41 41 41 ...
##  $ Attrition               : chr  "Voluntary Resignation" "Voluntary Resignation" "Voluntary Resignation" "Voluntary Resignation" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
##  $ Department              : chr  "Sales" "Human Resources" "Sales" "Human Resources" ...
##  $ DistanceFromHome        : int  1 6 1 6 6 6 1 1 1 1 ...
##  $ Education               : int  2 4 2 4 4 4 2 2 2 2 ...
##  $ EducationField          : chr  "Life Sciences" "Human Resources" "Life Sciences" "Marketing" ...
##  $ EnvironmentSatisfaction : int  2 1 2 1 1 1 2 2 2 4 ...
##  $ Gender                  : chr  "Female" "Female" "Female" "Female" ...
##  $ JobInvolvement          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ JobLevel                : int  2 2 2 2 2 2 2 2 2 4 ...
##  $ JobRole                 : chr  "Sales Executive" "Sales Executive" "Sales Executive" "Sales Executive" ...
##  $ JobSatisfaction         : int  4 4 4 4 4 4 4 4 4 3 ...
##  $ MaritalStatus           : chr  "Single" "Single" "Single" "Single" ...
##  $ MonthlyIncome           : int  5993 5993 5993 5993 5993 5993 5993 5993 5993 14756 ...
##  $ NumCompaniesWorked      : int  8 8 4 5 8 5 8 4 8 2 ...
##  $ OverTime                : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ PercentSalaryHike       : int  11 11 11 11 11 11 11 11 11 14 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 3 3 3 3 ...
##  $ RelationshipSatisfaction: int  1 1 1 1 1 1 1 1 1 3 ...
##  $ StockOptionLevel        : int  0 0 0 0 0 0 0 0 0 3 ...
##  $ TotalWorkingYears       : int  8 8 8 8 8 8 8 8 8 21 ...
##  $ TrainingTimesLastYear   : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ WorkLifeBalance         : int  1 1 1 1 1 1 1 1 1 3 ...
##  $ YearsAtCompany          : int  6 6 6 6 6 6 6 6 6 5 ...
##  $ YearsInCurrentRole      : int  4 4 4 4 4 4 4 4 4 0 ...
##  $ YearsSinceLastPromotion : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ YearsWithCurrManager    : int  5 5 5 5 5 5 5 5 5 2 ...
##  $ Employee Source         : chr  "Referral" "Referral" "Referral" "Referral" ...
##  $ AgeStartedWorking       : int  33 29 33 29 29 29 33 33 33 20 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(data_hr) # SUMMARY FUNCTION: It has been identified that some variables although categorical, have been displayed as numeric, which will require cleaning and adjustment.
##       Age         Attrition         BusinessTravel      Department       
##  Min.   :18.00   Length:23058       Length:23058       Length:23058      
##  1st Qu.:30.00   Class :character   Class :character   Class :character  
##  Median :36.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :37.04                                                           
##  3rd Qu.:43.00                                                           
##  Max.   :60.00                                                           
##  DistanceFromHome   Education     EducationField     EnvironmentSatisfaction
##  Min.   : 1.000   Min.   :1.000   Length:23058       Min.   :1.00           
##  1st Qu.: 2.000   1st Qu.:2.000   Class :character   1st Qu.:2.00           
##  Median : 7.000   Median :3.000   Mode  :character   Median :3.00           
##  Mean   : 9.215   Mean   :2.915                      Mean   :2.72           
##  3rd Qu.:14.000   3rd Qu.:4.000                      3rd Qu.:4.00           
##  Max.   :29.000   Max.   :5.000                      Max.   :4.00           
##     Gender          JobInvolvement    JobLevel       JobRole         
##  Length:23058       Min.   :1.00   Min.   :1.000   Length:23058      
##  Class :character   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :3.00   Median :2.000   Mode  :character  
##                     Mean   :2.73   Mean   :2.044                     
##                     3rd Qu.:3.00   3rd Qu.:3.000                     
##                     Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome   NumCompaniesWorked
##  Min.   :1.000   Length:23058       Min.   : 1009   Min.   :0.000     
##  1st Qu.:2.000   Class :character   1st Qu.: 2900   1st Qu.:1.000     
##  Median :3.000   Mode  :character   Median : 4898   Median :2.000     
##  Mean   :2.725                      Mean   : 6416   Mean   :2.691     
##  3rd Qu.:4.000                      3rd Qu.: 8120   3rd Qu.:4.000     
##  Max.   :4.000                      Max.   :19999   Max.   :9.000     
##    OverTime         PercentSalaryHike PerformanceRating
##  Length:23058       Min.   :11.00     Min.   :3.000    
##  Class :character   1st Qu.:12.00     1st Qu.:3.000    
##  Mode  :character   Median :14.00     Median :3.000    
##                     Mean   :15.22     Mean   :3.155    
##                     3rd Qu.:18.00     3rd Qu.:3.000    
##                     Max.   :25.00     Max.   :4.000    
##  RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :1.0000   Median :10.00    
##  Mean   :2.713            Mean   :0.7944   Mean   :11.07    
##  3rd Qu.:4.000            3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :3.0000   Max.   :40.00    
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany  YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.00   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.00   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.00   Median : 3.000    
##  Mean   :2.804         Mean   :2.762   Mean   : 6.91   Mean   : 4.201    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.00   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.00   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager Employee Source   
##  Min.   : 0.000          Min.   : 0.000       Length:23058      
##  1st Qu.: 0.000          1st Qu.: 2.000       Class :character  
##  Median : 1.000          Median : 3.000       Mode  :character  
##  Mean   : 2.164          Mean   : 4.091                         
##  3rd Qu.: 3.000          3rd Qu.: 7.000                         
##  Max.   :15.000          Max.   :17.000                         
##  AgeStartedWorking
##  Min.   : 0.00    
##  1st Qu.:20.00    
##  Median :25.00    
##  Mean   :25.96    
##  3rd Qu.:31.00    
##  Max.   :60.00
##### 2. DATA WRANGLING ##### 

# Transforming categorical variables to factor type
View(data_hr)
data_hr$Attrition                <- as.factor(data_hr$Attrition)
data_hr$BusinessTravel           <- as.factor(data_hr$BusinessTravel)
data_hr$Department               <- as.factor(data_hr$Department)
data_hr$Education                <- as.factor(data_hr$Education)
data_hr$EducationField           <- as.factor(data_hr$EducationField)
data_hr$'Employee Source'        <- as.factor(data_hr$'Employee Source') # Quote due to space between words
data_hr$EnvironmentSatisfaction  <- as.factor(data_hr$EnvironmentSatisfaction)
data_hr$Gender                   <- as.factor(data_hr$Gender)
data_hr$JobInvolvement           <- as.factor(data_hr$JobInvolvement)
data_hr$JobLevel                 <- as.factor(data_hr$JobLevel)
data_hr$JobRole                  <- as.factor(data_hr$JobRole)
data_hr$JobSatisfaction          <- as.factor(data_hr$JobSatisfaction)
data_hr$MaritalStatus            <- as.factor(data_hr$MaritalStatus)
data_hr$OverTime                 <- as.factor(data_hr$OverTime)
data_hr$PerformanceRating        <- as.factor(data_hr$PerformanceRating)
data_hr$RelationshipSatisfaction <- as.factor(data_hr$RelationshipSatisfaction)
data_hr$StockOptionLevel         <- as.factor(data_hr$StockOptionLevel)
data_hr$WorkLifeBalance          <- as.factor(data_hr$WorkLifeBalance)
str(data_hr) #Checking variables after tranformation.
## Classes 'data.table' and 'data.frame':   23058 obs. of  30 variables:
##  $ Age                     : int  41 37 41 37 37 37 41 41 41 41 ...
##  $ Attrition               : Factor w/ 3 levels "Current employee",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 1 3 1 1 1 3 3 3 3 ...
##  $ DistanceFromHome        : int  1 6 1 6 6 6 1 1 1 1 ...
##  $ Education               : Factor w/ 5 levels "1","2","3","4",..: 2 4 2 4 4 4 2 2 2 2 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 1 2 3 1 3 2 2 2 2 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 1 2 1 1 1 2 2 2 4 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ JobInvolvement          : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 3 ...
##  $ JobLevel                : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 4 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 8 8 8 8 8 8 8 8 4 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ MonthlyIncome           : int  5993 5993 5993 5993 5993 5993 5993 5993 5993 14756 ...
##  $ NumCompaniesWorked      : int  8 8 4 5 8 5 8 4 8 2 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ PercentSalaryHike       : int  11 11 11 11 11 11 11 11 11 14 ...
##  $ PerformanceRating       : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 1 1 1 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
##  $ StockOptionLevel        : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 4 ...
##  $ TotalWorkingYears       : int  8 8 8 8 8 8 8 8 8 21 ...
##  $ TrainingTimesLastYear   : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
##  $ YearsAtCompany          : int  6 6 6 6 6 6 6 6 6 5 ...
##  $ YearsInCurrentRole      : int  4 4 4 4 4 4 4 4 4 0 ...
##  $ YearsSinceLastPromotion : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ YearsWithCurrManager    : int  5 5 5 5 5 5 5 5 5 2 ...
##  $ Employee Source         : Factor w/ 9 levels "Adzuna","Company Website",..: 8 8 8 8 8 8 8 8 8 2 ...
##  $ AgeStartedWorking       : int  33 29 33 29 29 29 33 33 33 20 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Transforming numeric variables to integer type
View(data_hr)
data_hr$DistanceFromHome  <- as.integer(data_hr$DistanceFromHome)
data_hr$MonthlyIncome     <- as.integer(data_hr$MonthlyIncome)
data_hr$PercentSalaryHike <- as.integer(data_hr$PercentSalaryHike)

# Drop the factor levels with 0 count
data <- droplevels(data_hr)
str(data_hr)
## Classes 'data.table' and 'data.frame':   23058 obs. of  30 variables:
##  $ Age                     : int  41 37 41 37 37 37 41 41 41 41 ...
##  $ Attrition               : Factor w/ 3 levels "Current employee",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 1 3 1 1 1 3 3 3 3 ...
##  $ DistanceFromHome        : int  1 6 1 6 6 6 1 1 1 1 ...
##  $ Education               : Factor w/ 5 levels "1","2","3","4",..: 2 4 2 4 4 4 2 2 2 2 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 1 2 3 1 3 2 2 2 2 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 1 2 1 1 1 2 2 2 4 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ JobInvolvement          : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 3 ...
##  $ JobLevel                : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 4 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 8 8 8 8 8 8 8 8 4 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ MonthlyIncome           : int  5993 5993 5993 5993 5993 5993 5993 5993 5993 14756 ...
##  $ NumCompaniesWorked      : int  8 8 4 5 8 5 8 4 8 2 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ PercentSalaryHike       : int  11 11 11 11 11 11 11 11 11 14 ...
##  $ PerformanceRating       : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 1 1 1 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
##  $ StockOptionLevel        : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 4 ...
##  $ TotalWorkingYears       : int  8 8 8 8 8 8 8 8 8 21 ...
##  $ TrainingTimesLastYear   : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 3 ...
##  $ YearsAtCompany          : int  6 6 6 6 6 6 6 6 6 5 ...
##  $ YearsInCurrentRole      : int  4 4 4 4 4 4 4 4 4 0 ...
##  $ YearsSinceLastPromotion : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ YearsWithCurrManager    : int  5 5 5 5 5 5 5 5 5 2 ...
##  $ Employee Source         : Factor w/ 9 levels "Adzuna","Company Website",..: 8 8 8 8 8 8 8 8 8 2 ...
##  $ AgeStartedWorking       : int  33 29 33 29 29 29 33 33 33 20 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(data_hr)
##       Age                        Attrition               BusinessTravel 
##  Min.   :18.00   Current employee     :19370   Non-Travel       : 2344  
##  1st Qu.:30.00   Termination          :   87   Travel_Frequently: 4378  
##  Median :36.00   Voluntary Resignation: 3601   Travel_Rarely    :16336  
##  Mean   :37.04                                                          
##  3rd Qu.:43.00                                                          
##  Max.   :60.00                                                          
##                                                                         
##                   Department    DistanceFromHome Education
##  Human Resources       : 1010   Min.   : 1.000   1:2659   
##  Research & Development:15040   1st Qu.: 2.000   2:4436   
##  Sales                 : 7008   Median : 7.000   3:8930   
##                                 Mean   : 9.215   4:6279   
##                                 3rd Qu.:14.000   5: 754   
##                                 Max.   :29.000            
##                                                           
##           EducationField EnvironmentSatisfaction    Gender      JobInvolvement
##  Human Resources : 442   1:4490                  Female: 9205   1: 1287       
##  Life Sciences   :9513   2:4476                  Male  :13853   2: 5888       
##  Marketing       :2484   3:7091                                 3:13644       
##  Medical         :7267   4:7001                                 4: 2239       
##  Other           :1291                                                        
##  Technical Degree:2061                                                        
##                                                                               
##  JobLevel                      JobRole     JobSatisfaction  MaritalStatus  
##  1:8594   Sales Executive          :5067   1:4575          Divorced: 5163  
##  2:8448   Research Scientist       :4591   2:4371          Married :10543  
##  3:3440   Laboratory Technician    :4112   3:6938          Single  : 7352  
##  4:1563   Manufacturing Director   :2346   4:7174                          
##  5:1013   Healthcare Representative:2069                                   
##           Manager                  :1521                                   
##           (Other)                  :3352                                   
##  MonthlyIncome   NumCompaniesWorked OverTime    PercentSalaryHike
##  Min.   : 1009   Min.   :0.000      No :16524   Min.   :11.00    
##  1st Qu.: 2900   1st Qu.:1.000      Yes: 6534   1st Qu.:12.00    
##  Median : 4898   Median :2.000                  Median :14.00    
##  Mean   : 6416   Mean   :2.691                  Mean   :15.22    
##  3rd Qu.: 8120   3rd Qu.:4.000                  3rd Qu.:18.00    
##  Max.   :19999   Max.   :9.000                  Max.   :25.00    
##                                                                  
##  PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  3:19478           1:4331                   0:9873           Min.   : 0.00    
##  4: 3580           2:4762                   1:9370           1st Qu.: 6.00    
##                    3:7164                   2:2497           Median :10.00    
##                    4:6801                   3:1318           Mean   :11.07    
##                                                              3rd Qu.:15.00    
##                                                              Max.   :40.00    
##                                                                               
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany  YearsInCurrentRole
##  Min.   :0.000         1: 1263         Min.   : 0.00   Min.   : 0.000    
##  1st Qu.:2.000         2: 5374         1st Qu.: 3.00   1st Qu.: 2.000    
##  Median :3.000         3:14016         Median : 5.00   Median : 3.000    
##  Mean   :2.804         4: 2405         Mean   : 6.91   Mean   : 4.201    
##  3rd Qu.:3.000                         3rd Qu.: 9.00   3rd Qu.: 7.000    
##  Max.   :6.000                         Max.   :40.00   Max.   :18.000    
##                                                                          
##  YearsSinceLastPromotion YearsWithCurrManager        Employee Source
##  Min.   : 0.000          Min.   : 0.000       Company Website:5327  
##  1st Qu.: 0.000          1st Qu.: 2.000       Seek           :3655  
##  Median : 1.000          Median : 3.000       Indeed         :2471  
##  Mean   : 2.164          Mean   : 4.091       Jora           :2408  
##  3rd Qu.: 3.000          3rd Qu.: 7.000       LinkedIn       :2294  
##  Max.   :15.000          Max.   :17.000       Recruit.net    :2283  
##                                               (Other)        :4620  
##  AgeStartedWorking
##  Min.   : 0.00    
##  1st Qu.:20.00    
##  Median :25.00    
##  Mean   :25.96    
##  3rd Qu.:31.00    
##  Max.   :60.00    
## 
View(data_hr) # I always check the dataset after some interactions to ensure nothing was lost or done incorrectly.

##### 3. FEATURE ENGINEERING ##### 

# As a Data professional, I always like to work on Feature Engineering because on some analysis, it can simplify and/or boost accuracy of the model, without changing the data.

# I have created a previous years of experience column to better visualize the employee's experience profile.
data_hr$PriorYearsOfExperience <- data_hr$TotalWorkingYears - data_hr$YearsAtCompany
View(data_hr)

# Job tenure is the measure of how long an employee is employed by his or her current employer.
# An employee's job tenure is very important, and employers often consider job tenure a criterion for hiring new employees.
# Job tenure can be long or short.

# We have created a new average tenure feature.
# To understand the average tenure of employees at previous companies.
data_hr$AverageTenure <- data_hr$PriorYearsOfExperience / data_hr$NumCompaniesWorked
View(data_hr)

# The mean of tenure stability produces values like INF due to the nature of its derivation.
# Let's substitute them to zero
summary(data_hr$AverageTenure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       1     Inf       4     Inf     372
data_hr$AverageTenure[!is.finite(data_hr$AverageTenure)] <- 0
summary(data_hr$AverageTenure) # Here I check the variable again to ensure the data has been dealt with in a proper way.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.3333  1.7725  1.5000 40.0000
View(data_hr)

# We analyze and split the data based on the Attrition column,
# Which indicates whether the employee was terminated from the company.
# I want to get rid of the "Terminated" employees in this data set since they represent a very small percentage of entries (0.38%).
data_hr_1 <- data_hr[data_hr$Attrition != 'Termination']
data_hr_1 <- droplevels(data_hr_1)
dim(data_hr_1)
## [1] 22971    32
summary(data_hr_1)
##       Age                        Attrition               BusinessTravel 
##  Min.   :18.00   Current employee     :19370   Non-Travel       : 2344  
##  1st Qu.:30.00   Voluntary Resignation: 3601   Travel_Frequently: 4363  
##  Median :36.00                                 Travel_Rarely    :16264  
##  Mean   :37.06                                                          
##  3rd Qu.:43.00                                                          
##  Max.   :60.00                                                          
##                                                                         
##                   Department    DistanceFromHome Education
##  Human Resources       : 1010   Min.   : 1.000   1:2659   
##  Research & Development:14977   1st Qu.: 2.000   2:4421   
##  Sales                 : 6984   Median : 7.000   3:8890   
##                                 Mean   : 9.191   4:6247   
##                                 3rd Qu.:14.000   5: 754   
##                                 Max.   :29.000            
##                                                           
##           EducationField EnvironmentSatisfaction    Gender      JobInvolvement
##  Human Resources : 442   1:4482                  Female: 9173   1: 1279       
##  Life Sciences   :9494   2:4460                  Male  :13798   2: 5849       
##  Marketing       :2484   3:7067                                 3:13612       
##  Medical         :7215   4:6962                                 4: 2231       
##  Other           :1291                                                        
##  Technical Degree:2045                                                        
##                                                                               
##  JobLevel                      JobRole     JobSatisfaction  MaritalStatus  
##  1:8547   Sales Executive          :5051   1:4543          Divorced: 5148  
##  2:8432   Research Scientist       :4576   2:4355          Married :10502  
##  3:3424   Laboratory Technician    :4088   3:6914          Single  : 7321  
##  4:1563   Manufacturing Director   :2338   4:7159                          
##  5:1005   Healthcare Representative:2061                                   
##           Manager                  :1513                                   
##           (Other)                  :3344                                   
##  MonthlyIncome   NumCompaniesWorked OverTime    PercentSalaryHike
##  Min.   : 1009   Min.   :0.000      No :16476   Min.   :11.00    
##  1st Qu.: 2909   1st Qu.:1.000      Yes: 6495   1st Qu.:12.00    
##  Median : 4898   Median :2.000                  Median :14.00    
##  Mean   : 6418   Mean   :2.688                  Mean   :15.22    
##  3rd Qu.: 8120   3rd Qu.:4.000                  3rd Qu.:18.00    
##  Max.   :19999   Max.   :9.000                  Max.   :25.00    
##                                                                  
##  PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  3:19407           1:4315                   0:9826           Min.   : 0.00    
##  4: 3564           2:4739                   1:9330           1st Qu.: 6.00    
##                    3:7132                   2:2497           Median :10.00    
##                    4:6785                   3:1318           Mean   :11.08    
##                                                              3rd Qu.:15.00    
##                                                              Max.   :40.00    
##                                                                               
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         1: 1255         Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         2: 5359         1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         3:13960         Median : 5.000   Median : 3.000    
##  Mean   :2.805         4: 2397         Mean   : 6.914   Mean   : 4.202    
##  3rd Qu.:3.000                         3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :6.000                         Max.   :40.000   Max.   :18.000    
##                                                                           
##  YearsSinceLastPromotion YearsWithCurrManager        Employee Source
##  Min.   : 0.000          Min.   : 0.000       Company Website:5307  
##  1st Qu.: 0.000          1st Qu.: 2.000       Seek           :3622  
##  Median : 1.000          Median : 3.000       Indeed         :2459  
##  Mean   : 2.167          Mean   : 4.096       Jora           :2398  
##  3rd Qu.: 3.000          3rd Qu.: 7.000       LinkedIn       :2294  
##  Max.   :15.000          Max.   :17.000       Recruit.net    :2273  
##                                               (Other)        :4618  
##  AgeStartedWorking PriorYearsOfExperience AverageTenure    
##  Min.   : 0.00     Min.   : 0.000         Min.   : 0.0000  
##  1st Qu.:20.00     1st Qu.: 0.000         1st Qu.: 0.0000  
##  Median :25.00     Median : 2.000         Median : 0.3333  
##  Mean   :25.98     Mean   : 4.165         Mean   : 1.7700  
##  3rd Qu.:31.00     3rd Qu.: 5.000         3rd Qu.: 1.5000  
##  Max.   :60.00     Max.   :40.000         Max.   :40.0000  
## 
##### 4. EXPLORATORY DATA ANALYSIS ##### 

# Univariate Analysis: Plots
ggplot(data_hr) + geom_bar(aes(x = Gender))

ggplot(data_hr) + geom_density(aes(x = Age))

ggplot(data_hr) + geom_bar(aes(x = Attrition))

ggplot(data_hr) + geom_bar(aes(x = Department))

ggplot(data_hr) + geom_bar(aes(x = JobRole))

ggplot(data_hr) + geom_bar(aes(x = Education)) + facet_grid(~EducationField)

# Multiplot Grid
p.TotalWorkingYears       <- ggplot(data_hr) + geom_density(aes(TotalWorkingYears))
p.YearsAtCompany          <- ggplot(data_hr) + geom_density(aes(YearsAtCompany))
p.YearsSinceLastPromotion <- ggplot(data_hr) + geom_density(aes(YearsSinceLastPromotion))
p.YearsWithCurrManager    <- ggplot(data_hr) + geom_density(aes(YearsWithCurrManager))
p.YearsInCurrentRole      <- ggplot(data_hr) + geom_density(aes(YearsInCurrentRole))
p.PriorYearsOfExperience  <- ggplot(data_hr) + geom_density(aes(PriorYearsOfExperience))

# Grid arrangement
grid.arrange(p.TotalWorkingYears, 
             p.YearsAtCompany, 
             p.YearsSinceLastPromotion, 
             p.YearsWithCurrManager, 
             p.YearsInCurrentRole, 
             p.PriorYearsOfExperience, 
             nrow = 2, 
             ncol = 3)

# Previous experience
# Let's find out the proportion of employees with less than a few years of experience 
# (Chosen values: 1, 3, 5, 7, 10 years)
length(which(data_hr$PriorYearsOfExperience < 1)) / length(data_hr$PriorYearsOfExperience)  
## [1] 0.3246596
length(which(data_hr$PriorYearsOfExperience < 3)) / length(data_hr$PriorYearsOfExperience)   
## [1] 0.5828346
length(which(data_hr$PriorYearsOfExperience < 5)) / length(data_hr$PriorYearsOfExperience)   
## [1] 0.7085177
length(which(data_hr$PriorYearsOfExperience < 7)) / length(data_hr$PriorYearsOfExperience)   
## [1] 0.7952121
length(which(data_hr$PriorYearsOfExperience < 10)) / length(data_hr$PriorYearsOfExperience)  
## [1] 0.8589644
# Example insight:
# 58% of employees have less than 3 years of work experience before joining IBM
# Possible problems: underdeveloped skill sets, young employee base, immature "work mentality".

# Age
length(which(data_hr$Age < 30)) / length(data_hr$Age)
## [1] 0.2165409
# Example insight:
# Only 22% of employees are under 30, the employee base is not exactly as young as expected.

# Education
summary(data_hr$Education)
##    1    2    3    4    5 
## 2659 4436 8930 6279  754
length(which(data_hr$Education == 3)) / length(data_hr$Education)
## [1] 0.3872842
length(which(data_hr$Education == 4)) / length(data_hr$Education)
## [1] 0.2723133
# Example insight:
# About 39% of employees are college graduates and 27% have held a master's degree.
# The pursuit of higher education may have led to a decrease in work experience.

# Boxplot showing the distribution of monthly salary for all 4 levels of job satisfaction from 1-4
ggplot(data = subset(data_hr, !is.na(JobSatisfaction)), aes(JobSatisfaction, MonthlyIncome)) + 
  geom_boxplot()

# Insight Example:
# There are no obvious signs that a higher salary leads to higher job satisfaction

# Correlation
cor(data_hr$TotalWorkingYears, data_hr$YearsAtCompany,          use = "complete.obs")
## [1] 0.624816
cor(data_hr$YearsAtCompany,    data_hr$YearsInCurrentRole,      use = "complete.obs")
## [1] 0.7670497
cor(data_hr$YearsAtCompany,    data_hr$YearsSinceLastPromotion, use = "complete.obs")
## [1] 0.6236737
cor(data_hr$YearsAtCompany,    data_hr$YearsWithCurrManager,    use = "complete.obs")
## [1] 0.7728072
cor(data_hr$TotalWorkingYears, data_hr$MonthlyIncome,           use = "complete.obs")
## [1] 0.7582066
cor(data_hr$YearsAtCompany,    data_hr$MonthlyIncome,           use = "complete.obs")  
## [1] 0.4981578
# Scatterplots
ggplot(data_hr) + geom_point(aes(TotalWorkingYears, MonthlyIncome)) # Total compensation vs. Total Work Years seems to make sense.

ggplot(data_hr) + geom_point(aes(YearsAtCompany, MonthlyIncome)) # There is a significant reduction on amount of workers that earn less than 7.5k monthly after 10 years at the company.

# Let's investigate the relationship between work-life balance and monthly income
ggplot(data = subset(data_hr, !is.na(WorkLifeBalance)), aes(WorkLifeBalance, MonthlyIncome)) + 
  geom_boxplot()

# Example insight:
# Employees who rated work-life balance equal to 1 also have significantly lower average monthly income .
# Low work-life balance and low pay? A problem that the HR department needs to examine.

# Checking the pay gap between men and women.
ggplot(data = subset(data_hr, !is.na(Gender)), aes(Gender, MonthlyIncome, fill = Gender)) +
  geom_boxplot() + 
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5, size = 10)) +
  labs(x = "Gender", y = "Monthly Income", title = "Monthly Salary Between Genders") +
  coord_flip()

# Example of insight
# There are no signs of gender discrimination; in fact, women earn slightly more, on average, disregarding all other factors.

# Salary per Position
# We could ask the HR why there are so many outliers in this category.
ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, MonthlyIncome)) +
  ggtitle("Monthly Salary Per Position")

# Insight Example
# This one is VERY interesting, showing us that there are some people with incorrect data. How one would start working at an age below 15, 10, or even 5?
# This visualization shows us that the information may be incorrect and we could recommend the HR to review the info.
# Decision: I am not using this variable anymore on further analysis.
ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, AgeStartedWorking)) +
  ggtitle("Age When Started in the Role")

ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, Age)) +
  ggtitle("Age By Role")

ggplot(data = subset(data_hr, !is.na(JobRole))) + geom_boxplot(aes(JobRole, YearsAtCompany)) +
  ggtitle("Company Time (in years)")

ggplot(data = na.omit(data_hr)) + geom_bar(aes(JobRole, fill = Education), position = "fill") +
  ggtitle("Education Level By Role") + 
  ylab("Proportion")

# Multivariate analysis plots for commonly used variables during the hiring process
ggplot(data = data_hr_1) + 
  geom_bar(aes(x = Education , fill = Attrition), position = 'fill') + 
  facet_grid(.~Department)

ggplot(data = data_hr_1) + 
  geom_bar(aes(x = Education , fill = Attrition), position = 'fill') + 
  facet_grid(.~JobRole)

ggplot(data = data_hr_1) + 
  geom_bar(aes(x = EducationField , fill = Attrition), position = 'fill') + 
  facet_grid(.~JobRole) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0))

# Multivariate analysis plots for commonly used variables after the hiring process.
ggplot(data_hr_1) + geom_bar(aes(x = Age, fill = Attrition), position = 'fill') 

ggplot(data_hr_1) + geom_bar(aes(x = Department, fill = Attrition), position = 'fill') 

ggplot(data_hr_1) + geom_bar(aes(x = DistanceFromHome, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = `Employee Source`, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = JobRole, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = MaritalStatus, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = Education, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(x = EducationField, fill = Attrition),position ='fill')

ggplot(data_hr_1) + geom_bar(aes(x = Gender, fill = Attrition), position = 'fill')

# Multivariate analysis plots between some variables and employee status.
ggplot(data_hr_1) + geom_boxplot(aes(Attrition, MonthlyIncome)) # Insight: Average Salary of people that leave the company is slightly lower than people that stay. That could indicate they are searching for a career progression. HR could study a better progression or salary balance between peers. 

ggplot(data_hr_1) + geom_boxplot(aes(Attrition, PercentSalaryHike))

ggplot(data_hr_1) + geom_bar(aes(TrainingTimesLastYear, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(BusinessTravel, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(OverTime, fill = Attrition), position = 'fill') # Insight: Search better ways to compensate employees that work on overtime or better ways to unload them. Overtimes should be temporary.

ggplot(data_hr_1) + geom_bar(aes(StockOptionLevel, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(EnvironmentSatisfaction, fill = Attrition), position = 'fill') # Insight: Run a survey and 1:1 HR interviews on people that are no so satisfied with the working environment and understand what could improve.

ggplot(data_hr_1) + geom_bar(aes(JobSatisfaction, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(JobInvolvement, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(RelationshipSatisfaction, fill = Attrition), position = 'fill')

ggplot(data_hr_1) + geom_bar(aes(WorkLifeBalance, fill = Attrition), position = 'fill')

##### 5. PREDICTIVE MODELING / MACHINE LEARNING ##### 

# Let's focus our work on trying to help HR recruit better in order to avoid friction and, consequently, resignations

# I will create 5 versions of the model and for each one I will explore the options and interpret the result.

# First version of the model with a few variables
# 5.1 - GLM (GENERALIZED LINEAR MODELS, BINOMIAL (LOGISTIC REGRESSION))
model_v1 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` + 
                   JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience + Gender + 
                   Education + EducationField, 
                 family = binomial, 
                 data = data_hr)

# Here I call a summary of the model which, mind you:
# I did not split the dataset into 70/30 for Training and Testing.
# The reason is that I want to find out what are the most significant variables that lead to employee resignation requests.
summary(model_v1)
## 
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome + 
##     `Employee Source` + JobRole + MaritalStatus + AverageTenure + 
##     PriorYearsOfExperience + Gender + Education + EducationField, 
##     family = binomial, data = data_hr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4738  -0.6239  -0.4962  -0.3553   2.7405  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -0.515415   0.198808  -2.593 0.009527 ** 
## Age                              -0.046402   0.002434 -19.062  < 2e-16 ***
## DepartmentResearch & Development -0.402413   0.102837  -3.913 9.11e-05 ***
## DepartmentSales                   0.041108   0.106275   0.387 0.698901    
## DistanceFromHome                  0.022014   0.002497   8.816  < 2e-16 ***
## `Employee Source`Company Website  0.200175   0.074567   2.684 0.007264 ** 
## `Employee Source`GlassDoor       -0.002062   0.089568  -0.023 0.981630    
## `Employee Source`Indeed          -0.048126   0.088966  -0.541 0.588545    
## `Employee Source`Jora             0.202494   0.084534   2.395 0.016602 *  
## `Employee Source`LinkedIn        -0.086527   0.090292  -0.958 0.337911    
## `Employee Source`Recruit.net     -0.024145   0.088800  -0.272 0.785699    
## `Employee Source`Referral         0.222132   0.147177   1.509 0.131226    
## `Employee Source`Seek             0.039192   0.079096   0.495 0.620253    
## JobRoleHuman Resources            0.092163   0.125250   0.736 0.461832    
## JobRoleLaboratory Technician      0.313456   0.079749   3.931 8.48e-05 ***
## JobRoleManager                   -0.370055   0.121400  -3.048 0.002302 ** 
## JobRoleManufacturing Director    -0.091942   0.094178  -0.976 0.328937    
## JobRoleResearch Director         -0.326907   0.125855  -2.597 0.009391 ** 
## JobRoleResearch Scientist         0.102218   0.078537   1.302 0.193080    
## JobRoleSales Executive           -0.030434   0.079097  -0.385 0.700414    
## JobRoleSales Representative       0.484732   0.095181   5.093 3.53e-07 ***
## MaritalStatusMarried              0.179376   0.053279   3.367 0.000761 ***
## MaritalStatusSingle               0.740422   0.053393  13.867  < 2e-16 ***
## AverageTenure                    -0.016927   0.009230  -1.834 0.066663 .  
## PriorYearsOfExperience            0.018901   0.005353   3.531 0.000414 ***
## GenderMale                        0.033768   0.038421   0.879 0.379467    
## Education2                        0.096221   0.068965   1.395 0.162951    
## Education3                        0.129656   0.061109   2.122 0.033862 *  
## Education4                        0.120603   0.066456   1.815 0.069558 .  
## Education5                       -0.221560   0.134302  -1.650 0.099001 .  
## EducationFieldLife Sciences      -0.149802   0.143779  -1.042 0.297462    
## EducationFieldMarketing          -0.122315   0.152984  -0.800 0.423984    
## EducationFieldMedical            -0.176829   0.145066  -1.219 0.222859    
## EducationFieldOther              -0.170949   0.161651  -1.058 0.290274    
## EducationFieldTechnical Degree    0.183255   0.154276   1.188 0.234898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20272  on 23057  degrees of freedom
## Residual deviance: 18904  on 23023  degrees of freedom
## AIC: 18974
## 
## Number of Fisher Scoring iterations: 5
# Another example of insight:
# I might recommend the HR, for instance, to pay attention in recruiting especially to variables with GIVF > 2, such as:
# Department
# Job Role
# Average Tenure
# Prior Years of Experience

# Not only that, the variables tagged as three stars at the Summary '***' are also extra relevant, such as AGE.
vif(model_v1)
##                            GVIF Df GVIF^(1/(2*Df))
## Age                    1.197853  1        1.094465
## Department             2.027501  2        1.193274
## DistanceFromHome       1.321206  1        1.149437
## `Employee Source`      1.107922  8        1.006426
## JobRole                2.564522  8        1.060628
## MaritalStatus          1.042578  2        1.010479
## AverageTenure          2.478002  1        1.574167
## PriorYearsOfExperience 2.440072  1        1.562073
## Gender                 1.019571  1        1.009738
## Education              1.121235  4        1.014407
## EducationField         1.648089  5        1.051231
# I will divide the data into training and test. We will work with data without records of fired people.
set.seed(2004)
index_training <- sample.split(Y = data_hr_1$Attrition, SplitRatio = 0.7)
data_hr_1_training <- subset(data_hr_1, train = T)
data_hr_1_test <- subset(data_hr_1, train = F)

# Second version of the model with training data
model_v2 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` + 
                   JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience + Gender + 
                   Education + EducationField, 
                 family = binomial, 
                 data = data_hr_1_training)

summary(model_v2)
## 
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome + 
##     `Employee Source` + JobRole + MaritalStatus + AverageTenure + 
##     PriorYearsOfExperience + Gender + Education + EducationField, 
##     family = binomial, data = data_hr_1_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4484  -0.6177  -0.4918  -0.3558   2.7300  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -0.499751   0.199492  -2.505 0.012241 *  
## Age                              -0.044889   0.002446 -18.348  < 2e-16 ***
## DepartmentResearch & Development -0.427955   0.103053  -4.153 3.28e-05 ***
## DepartmentSales                   0.025684   0.106499   0.241 0.809423    
## DistanceFromHome                  0.020372   0.002522   8.076 6.69e-16 ***
## `Employee Source`Company Website  0.183335   0.074868   2.449 0.014334 *  
## `Employee Source`GlassDoor        0.006274   0.089680   0.070 0.944229    
## `Employee Source`Indeed          -0.080908   0.089734  -0.902 0.367244    
## `Employee Source`Jora             0.183678   0.084958   2.162 0.030618 *  
## `Employee Source`LinkedIn        -0.079145   0.090405  -0.875 0.381325    
## `Employee Source`Recruit.net     -0.050665   0.089444  -0.566 0.571095    
## `Employee Source`Referral         0.230121   0.147168   1.564 0.117897    
## `Employee Source`Seek            -0.005837   0.079828  -0.073 0.941708    
## JobRoleHuman Resources            0.107348   0.125753   0.854 0.393302    
## JobRoleLaboratory Technician      0.314968   0.080707   3.903 9.52e-05 ***
## JobRoleManager                   -0.402633   0.123788  -3.253 0.001144 ** 
## JobRoleManufacturing Director    -0.083426   0.095273  -0.876 0.381221    
## JobRoleResearch Director         -0.292195   0.126243  -2.315 0.020637 *  
## JobRoleResearch Scientist         0.111877   0.079359   1.410 0.158608    
## JobRoleSales Executive           -0.028140   0.079873  -0.352 0.724611    
## JobRoleSales Representative       0.478077   0.096067   4.977 6.47e-07 ***
## MaritalStatusMarried              0.176289   0.053865   3.273 0.001065 ** 
## MaritalStatusSingle               0.747383   0.053896  13.867  < 2e-16 ***
## AverageTenure                    -0.021245   0.009467  -2.244 0.024825 *  
## PriorYearsOfExperience            0.019787   0.005399   3.665 0.000248 ***
## GenderMale                        0.030982   0.038752   0.800 0.424000    
## Education2                        0.067584   0.069195   0.977 0.328712    
## Education3                        0.092553   0.061236   1.511 0.130684    
## Education4                        0.071013   0.066760   1.064 0.287461    
## Education5                       -0.233758   0.134267  -1.741 0.081685 .  
## EducationFieldLife Sciences      -0.148858   0.143810  -1.035 0.300620    
## EducationFieldMarketing          -0.106268   0.152995  -0.695 0.487317    
## EducationFieldMedical            -0.202212   0.145203  -1.393 0.163736    
## EducationFieldOther              -0.137807   0.161652  -0.852 0.393940    
## EducationFieldTechnical Degree    0.180977   0.154552   1.171 0.241608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19951  on 22970  degrees of freedom
## Residual deviance: 18626  on 22936  degrees of freedom
## AIC: 18696
## 
## Number of Fisher Scoring iterations: 5
# We can verify at this point that our subset "data_hr_1"
# provides a very similar result to the original, the only difference,
# and a significant one at that, is that we do not use the data of those
# who were fired by the company, which makes the analysis more precise.
vif(model_v2)
##                            GVIF Df GVIF^(1/(2*Df))
## Age                    1.196689  1        1.093933
## Department             2.033631  2        1.194175
## DistanceFromHome       1.321839  1        1.149712
## `Employee Source`      1.109202  8        1.006499
## JobRole                2.550173  8        1.060256
## MaritalStatus          1.043120  2        1.010610
## AverageTenure          2.441327  1        1.562475
## PriorYearsOfExperience 2.412397  1        1.553189
## Gender                 1.018652  1        1.009283
## Education              1.123257  4        1.014635
## EducationField         1.650765  5        1.051401
# Predictions
threshold <- 0.5
predictions_v2 <- predict(model_v2, type = 'response', newdata = data_hr_1_test)
final_predictions_v2 <- ifelse(predictions_v2 > threshold, 'Voluntary Resignation', 'Current employee')
table(data_hr_1_test$Attrition, final_predictions_v2)
##                        final_predictions_v2
##                         Current employee Voluntary Resignation
##   Current employee                 19328                    42
##   Voluntary Resignation             3523                    78
# There was a high rate of error at this prediction. Why? - Because the dataset is not balanced.
# But when we talk about data science, we talk about decision making and, in this particular scenario, I'm not deploying the ML Algorithm, but I am in fact running analysis to recommend the HR better practices in order to reduce voluntary resignations that would improve HR's efficiency.

# But I'm not done: I want to generate newer versions of this model with less variables, because in an eventual need for a ML model deploy for some application, it would be interesting to run a lighter version of it without sacrificing accuracy.

# Let's get some not so relevant variables out of the way.
# Third version of the model with training data and without education variables
model_v3 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` + 
                   JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience + Gender, 
                 family = binomial, 
                 data = data_hr_1_training)

summary(model_v3)
## 
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome + 
##     `Employee Source` + JobRole + MaritalStatus + AverageTenure + 
##     PriorYearsOfExperience + Gender, family = binomial, data = data_hr_1_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3428  -0.6201  -0.4941  -0.3619   2.7143  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -0.594443   0.163302  -3.640 0.000272 ***
## Age                              -0.044338   0.002361 -18.781  < 2e-16 ***
## DepartmentResearch & Development -0.455831   0.097648  -4.668 3.04e-06 ***
## DepartmentSales                   0.006375   0.100798   0.063 0.949567    
## DistanceFromHome                  0.023945   0.002219  10.792  < 2e-16 ***
## `Employee Source`Company Website  0.185836   0.074684   2.488 0.012835 *  
## `Employee Source`GlassDoor        0.004131   0.089469   0.046 0.963174    
## `Employee Source`Indeed          -0.084488   0.089587  -0.943 0.345638    
## `Employee Source`Jora             0.182141   0.084629   2.152 0.031378 *  
## `Employee Source`LinkedIn        -0.073833   0.090249  -0.818 0.413300    
## `Employee Source`Recruit.net     -0.058670   0.089241  -0.657 0.510903    
## `Employee Source`Referral         0.237922   0.146800   1.621 0.105078    
## `Employee Source`Seek            -0.006818   0.079571  -0.086 0.931717    
## JobRoleHuman Resources            0.099083   0.125594   0.789 0.430163    
## JobRoleLaboratory Technician      0.312339   0.080556   3.877 0.000106 ***
## JobRoleManager                   -0.418085   0.123665  -3.381 0.000723 ***
## JobRoleManufacturing Director    -0.079696   0.095061  -0.838 0.401826    
## JobRoleResearch Director         -0.308958   0.126075  -2.451 0.014263 *  
## JobRoleResearch Scientist         0.119993   0.079265   1.514 0.130071    
## JobRoleSales Executive           -0.023432   0.079774  -0.294 0.768961    
## JobRoleSales Representative       0.483836   0.095952   5.042 4.60e-07 ***
## MaritalStatusMarried              0.176480   0.053793   3.281 0.001035 ** 
## MaritalStatusSingle               0.747665   0.053772  13.904  < 2e-16 ***
## AverageTenure                    -0.019906   0.009465  -2.103 0.035453 *  
## PriorYearsOfExperience            0.019187   0.005400   3.553 0.000381 ***
## GenderMale                        0.033764   0.038690   0.873 0.382838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19951  on 22970  degrees of freedom
## Residual deviance: 18668  on 22945  degrees of freedom
## AIC: 18720
## 
## Number of Fisher Scoring iterations: 5
vif(model_v3)
##                            GVIF Df GVIF^(1/(2*Df))
## Age                    1.116616  1        1.056701
## Department             1.733091  2        1.147375
## DistanceFromHome       1.008189  1        1.004086
## `Employee Source`      1.080863  8        1.004872
## JobRole                2.519868  8        1.059464
## MaritalStatus          1.037566  2        1.009262
## AverageTenure          2.443412  1        1.563142
## PriorYearsOfExperience 2.418527  1        1.555161
## Gender                 1.017671  1        1.008797
# Predictions
threshold <- 0.5
predictions_v3 <- predict(model_v3, type = 'response', newdata = data_hr_1_test)
final_predictions_v3 <- ifelse(predictions_v3 > threshold, 'Voluntary Resignation', 'Current employee')
table(data_hr_1_test$Attrition, final_predictions_v3)
##                        final_predictions_v3
##                         Current employee Voluntary Resignation
##   Current employee                 19328                    42
##   Voluntary Resignation             3541                    60
# With this example, we show that it is in fact possible to increase efficiency without a strong hit on accuracy.

# Now, last example, with even less variables.
# Fourth version of the model with training data and without education and gender variables
model_v4 <- glm(Attrition ~ Age + Department + DistanceFromHome + `Employee Source` + 
                   JobRole + MaritalStatus + AverageTenure + PriorYearsOfExperience, 
                 family = binomial, 
                 data = data_hr_1_training)

summary(model_v4)
## 
## Call:
## glm(formula = Attrition ~ Age + Department + DistanceFromHome + 
##     `Employee Source` + JobRole + MaritalStatus + AverageTenure + 
##     PriorYearsOfExperience, family = binomial, data = data_hr_1_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3360  -0.6192  -0.4939  -0.3622   2.7205  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -0.569968   0.160865  -3.543 0.000395 ***
## Age                              -0.044408   0.002359 -18.822  < 2e-16 ***
## DepartmentResearch & Development -0.457114   0.097648  -4.681 2.85e-06 ***
## DepartmentSales                   0.004776   0.100790   0.047 0.962208    
## DistanceFromHome                  0.023979   0.002218  10.810  < 2e-16 ***
## `Employee Source`Company Website  0.185968   0.074691   2.490 0.012780 *  
## `Employee Source`GlassDoor        0.004217   0.089473   0.047 0.962404    
## `Employee Source`Indeed          -0.082065   0.089543  -0.916 0.359412    
## `Employee Source`Jora             0.182210   0.084632   2.153 0.031321 *  
## `Employee Source`LinkedIn        -0.073105   0.090254  -0.810 0.417948    
## `Employee Source`Recruit.net     -0.058149   0.089234  -0.652 0.514631    
## `Employee Source`Referral         0.240776   0.146746   1.641 0.100844    
## `Employee Source`Seek            -0.006816   0.079577  -0.086 0.931742    
## JobRoleHuman Resources            0.100479   0.125614   0.800 0.423769    
## JobRoleLaboratory Technician      0.315123   0.080478   3.916 9.02e-05 ***
## JobRoleManager                   -0.419678   0.123673  -3.393 0.000690 ***
## JobRoleManufacturing Director    -0.082962   0.094978  -0.873 0.382397    
## JobRoleResearch Director         -0.310452   0.126056  -2.463 0.013785 *  
## JobRoleResearch Scientist         0.120223   0.079252   1.517 0.129277    
## JobRoleSales Executive           -0.023015   0.079761  -0.289 0.772925    
## JobRoleSales Representative       0.482258   0.095927   5.027 4.97e-07 ***
## MaritalStatusMarried              0.175136   0.053769   3.257 0.001125 ** 
## MaritalStatusSingle               0.745551   0.053714  13.880  < 2e-16 ***
## AverageTenure                    -0.019985   0.009465  -2.112 0.034727 *  
## PriorYearsOfExperience            0.019266   0.005398   3.569 0.000358 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19951  on 22970  degrees of freedom
## Residual deviance: 18668  on 22946  degrees of freedom
## AIC: 18718
## 
## Number of Fisher Scoring iterations: 5
vif(model_v4)
##                            GVIF Df GVIF^(1/(2*Df))
## Age                    1.115570  1        1.056205
## Department             1.732346  2        1.147252
## DistanceFromHome       1.007827  1        1.003906
## `Employee Source`      1.078323  8        1.004724
## JobRole                2.493476  8        1.058767
## MaritalStatus          1.035363  2        1.008726
## AverageTenure          2.443871  1        1.563288
## PriorYearsOfExperience 2.417511  1        1.554835
# Predictions
threshold <- 0.5
predictions_v4 <- predict(model_v4, type = 'response', newdata = data_hr_1_test)
final_predictions_v4 <- ifelse(predictions_v4 > threshold, 'Voluntary Resignation', 'Current employee')
table(data_hr_1_test$Attrition, final_predictions_v4)
##                        final_predictions_v4
##                         Current employee Voluntary Resignation
##   Current employee                 19326                    44
##   Voluntary Resignation             3545                    56
# CONCLUSION: As a Data Scientist, the ideal scenario is to provide the BALANCE between model's speed and accuracy, therefore its efficiency. This is a particular hard task to do and many actions could be applied, such as dataset balancing, use of different parameters, different models and so on.
# The idea here was met with the very first classification algorithm: It was able to be verified the most important variables to the hiring process.


# But now, I want to offer a different approach: I want a provide a Tree for decision making, just to make life easier, something that would allow the HR to come up with a playbook whenever talking about hiring or maintaining people motivated.
# Fifth version of the model with training data and without education and gender variables and another algorithm.
# Recursive Partitioning and Regression Trees
model_v5 <- rpart(Attrition ~ Age + Department + DistanceFromHome + JobRole + MaritalStatus + 
                     AverageTenure + PriorYearsOfExperience, 
                   method = "class", 
                   control = rpart.control(minsplit = 500, cp = 0),
                   data = data_hr_1_training)

# Now, let's plot the tree. How to read it:
# Every time there's a decision, left means "YES" and right means "NO".
# Example: Is currently employee's age >= 34? If YES, 62% of these employees tend to remain working for the company.
# IF NO, the remaining 38% of employees fall down on other categories that will be analysed and will spit out the probability of having a voluntary resignation or not.

# Amazing how in one dataset, age would be such an important factor. Data-driven decision are indeed powerful.
rpart.plot(model_v5)

##### 6. ENDING & FINAL REMARKS #####

# Now that's the end.

# THANK YOU for sticking around and I hope this project was useful to you.
# Is there room for improvement? Yes.
# Could I plug this into a BI tool and create different sorts of dashboards? Yes.
# Could I write a script and present this to the HR and board of directors? Yes.

# But I guess that already does the trick for an R portfolio project.

# Hope talking to you soon!

# CONTACT INFO:
# Helio Ribeiro
# helioribeiropro@gmail.com
# +55 (11) 9 3932-8049