R project 3
Omega Group plc- Pay Discrimination
At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated in the company, in the sense that the salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (of which 24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.
The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.
Loading the data
omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…
Relationship Salary - Gender ?
The data frame omega contains the salaries for the sample of 50 executives in the company. We will test for a significant difference between the salaries of the male and female executives?
We can perform different types of analyses, and check whether they all lead to the same conclusion
. Confidence intervals . Hypothesis testing . Correlation analysis . Regression
We calculate summary statistics on salary by gender. Also, we create and print a dataframe where, for each gender, we show the mean, SD, sample size, the t-critical, the SE, the margin of error, and the low/high endpoints of a 95% confidence interval.
# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
## gender min Q1 median Q3 max mean sd n missing
## 1 female 47033 60338 64618 70033 78800 64543 7567 26 0
## 2 male 54768 68331 74675 78568 84576 73239 7463 24 0
# Data frame with two rows (male-female) and having as columns gender, mean, SD, sample size,
# the t-critical value, the standard error, the margin of error,
# and the low/high endpoints of a 95% confidence interval
#Create dataframe
salary_gender <- omega %>%
select(gender, salary) %>%
group_by(gender) %>%
summarise(mean=mean(salary),
sd=sd(salary),
n=n(),
t_critical=qt(0.975, n - 1 ),
se= sd/sqrt(n),
margin_of_error = se*t_critical,
low_ci=mean-margin_of_error,
high_ci= mean+margin_of_error)
#Print data frame
salary_gender
## # A tibble: 2 × 9
## gender mean sd n t_critical se margin_of_error low_ci high_ci
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 64543. 7567. 26 2.06 1484. 3056. 61486. 67599.
## 2 male 73239. 7463. 24 2.07 1523. 3151. 70088. 76390.
Conclusion: Looking at the created Confidence Intervals, we can see that they do not overlap. Hence we can conclude that males indeed appear to get a significantly higher salary when compared to women.
We can also run a hypothesis testing, assuming as a null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money.
# hypothesis testing using t.test()
t.test(salary~gender, data=omega)
##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -12973 -4420
## sample estimates:
## mean in group female mean in group male
## 64543 73239
# hypothesis testing using infer package
set.seed(1)
salary_in_null_world <- omega %>%
#Specify varaibles of interest
specify(formula=salary~gender) %>%
#Hypothesize a null of no difference
hypothesize(null="independence") %>%
#Generate simulated samples
generate(reps = 1000, type="permute") %>%
#Find mean difference of each sample
calculate(stat="diff in means",
order=c("female", "male"))
#Calculate diff in means
obs_diff_mean <- omega %>%
specify(formula=salary~gender) %>%
calculate(stat="diff in means", order=c("female", "male"))
obs_diff_mean
## Response: salary (numeric)
## Explanatory: gender (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -8696.
#Calculate CI
percentile_ci <- salary_in_null_world %>%
get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -4846. 4717.
#Visualize
visualize(salary_in_null_world, bins=15) +
shade_p_value(obs_stat=obs_diff_mean, direction="both") +
theme_minimal()

#Get p-value
salary_in_null_world %>%
get_p_value(obs_stat=obs_diff_mean, direction="both")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Conclusion: Both methods of hypothesis testing come to the conclusion that there is a significant difference between the salary of female and male executives. On the one hand side, this can be seen from the small p-value. As the p-value in both cases is smaller than the alpha=0.5 we can reject the null hypothesis (H0: there is no difference). We can also look at the confidence intervals again. Given that the confidence intervals do not contain 0, we can again conclude that there is indeed a significant difference between salary for males and females.
Relationship Experience - Gender?
At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire send out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women only have about 7 years experience on average (see table below).
# Summary Statistics of salary by gender
favstats (experience ~ gender, data=omega)
## gender min Q1 median Q3 max mean sd n missing
## 1 female 0 0.25 3.0 14.0 29 7.38 8.51 26 0
## 2 male 1 15.75 19.5 31.2 44 21.12 10.92 24 0
Conclusion: Looking at the confidence intervals and p-values we can concluded that there is a difference between experience between females and males. However, we have not yet tested the impact of experience on salary, so we cannot yet say how it affects salary or potential discrimination. However, experience appears to be an different variables between females and males, so further analysis would be appropriate.
# Method 1
#Create dataframe
experience_gender <- omega %>%
select(gender, experience) %>%
group_by(gender) %>%
summarise(mean=mean(experience),
sd=sd(experience),
n=n(),
t_critical=qt(0.975, n - 1 ),
se= sd/sqrt(n),
margin_of_error = se*t_critical,
low_ci=mean-margin_of_error,
high_ci= mean+margin_of_error)
#Print data frame
experience_gender
## # A tibble: 2 × 9
## gender mean sd n t_critical se margin_of_error low_ci high_ci
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 7.38 8.51 26 2.06 1.67 3.44 3.95 10.8
## 2 male 21.1 10.9 24 2.07 2.23 4.61 16.5 25.7
#Method 2
# hypothesis testing using t.test()
t.test(experience~gender, data=omega)
##
## Welch Two Sample t-test
##
## data: experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -19.35 -8.13
## sample estimates:
## mean in group female mean in group male
## 7.38 21.12
#Method 3
# hypothesis testing using infer package
set.seed(1)
experience_in_null_world <- omega %>%
#Specify varaibles of interest
specify(formula=experience~gender) %>%
#Hypothesize a null of no difference
hypothesize(null="independence") %>%
#Generate simulated samples
generate(reps = 1000, type="permute") %>%
#Find mean difference of each sample
calculate(stat="diff in means",
order=c("female", "male"))
#Calculate diff in means
obs_diff_mean <- omega %>%
specify(formula=experience~gender) %>%
calculate(stat="diff in means", order=c("female", "male"))
obs_diff_mean
## Response: experience (numeric)
## Explanatory: gender (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -13.7
#Calculate CI
percentile_ci <- experience_in_null_world %>%
get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -7.09 6.29
#Visualize
visualize(experience_in_null_world, bins=15) +
shade_p_value(obs_stat=obs_diff_mean, direction="both") +
theme_minimal()

#Get p-value
experience_in_null_world %>%
get_p_value(obs_stat=obs_diff_mean, direction="both")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Relationship Salary - Experience ?
Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.
We analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data
Conclusion: There does appear to be a positive relationship between experience and salary, hence we can assume that the salary on average increases with more experience.
ggplot(data=omega, aes(x=salary, y=experience)) +
geom_point() +
theme_minimal() +
geom_smooth(method=lm, se=FALSE)+
labs(
title = "Scatterplot of salary vs experience",
subtitle = "There appears to be a positive relationship ",
x = "Salary",
y = "Years of experience",
cex=0.1)

Check correlations between the data
We can use GGally:ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order our variables will appear in and have the dependent variable (Y), salary, as last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make ths plots somewhat transparent (alpha = 0.3).
omega %>%
select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
ggpairs(aes(colour=gender, alpha = 0.3))+
theme_bw()

Conclusion: There does appear to be a positive relationship between experience and salary, hence we can assume that the salary on average increases with more experience. Given that males have significantly more experience that females, it seems plausible that the difference in experience, rather than discrimination, is the reason between the average difference in salary between males and demales.