If you’re asked to evaluate human capital selection systems, tests, or similar procedures, this section provides a brief introduction to the basic statistics you’ll want to know. We aren’t about to launch into a stats course here: there are plenty of quality resources out there if you want to learn this material in depth. But before you read further, take note that you won’t be able to rely entirely upon the techniques described below unless your sample of applicants is large enough. For example, if you want to know how the selection rates or pass rates on a test differ between men and women, it will make your life easier if you’ve screened more than 30 people. (The Uniform Guidelines do not suggest a cutoff at all, while the OFCCP suggests a cutoff of 30.) If you haven’t screened enough people to reach 30, you can combine groups…but that complicates things, as outlined below. And if your business doesn’t have enough people in each similar position to reach these thresholds, these recommendations become largely irrelevant.
Finally, remember the difference between a sample and a population. A sample is a (preferably random) chunk of screened candidates you plan to use to represent an entire population over a set time period. A population is almost exactly what the word suggests in non-stats contexts: every single screened candidate for a particular group over a set time period (or every recorded outcome for the group in existence). The scores of every man and every woman who took a pre-employment test over the past 12 months are a population, not a sample (unless some test-takers did not identify a gender).
[It is important to remember the distinction between your organization's population and the population at large when using statistics and interpreting results. In most situations, you will be utilizing the former. When facing a hiring discrimination claim, plaintiffs may use the latter to claim that your company's population of a particular group is underutilized compared to the available population. For instance, a lawsuit might assert that 27% of the men living in your company's operating and recruitment area meet the basic qualifications of a job you hire regularly -- yet your organization's population of men in that role is only 7%. Though cases like this are difficult to prove in practice, companies do face these types of 'under-utilization' lawsuits occasionally. If the court decides the plaintiff's utilization claim has merit, the employer must counter the claim with statistical proof or by providing a plausible, logical explanation for the disparity.]
Measures of Central Tendency and Spread: Mean, Median, Mode, and Standard Deviation. These stats form the basis used to determine how your outcomes/test scores/etc are playing out. The mean is the statistical average. If you have a series of applicant test scores or selection evaluation scores, calculating the mean provides a ‘snapshot’ of what the ‘average’ applicant scores on your procedure. Of course, the mean can be skewed or influenced by a number of factors that might make the figure deceptive. This is why the mean score of all applicants doesn’t tell you a whole lot by itself. However, calculating means across a specific location, a specific time period, or a specific demographic group can be very useful.
The median is the score that falls in the center of all the scores in your sample when all scores are placed in rank order from the lowest number to the highest. The mode is the most frequently occurring score in your data. Both can be useful figures if your data is not distributed normally. (You might want to find out what a normal distribution is before we go any further. Basically, it’s another way of saying ‘bell curve’. Most data forms a bell curve when analyzed, hence the ‘normal’ moniker.)
Standard deviation is a number that represents how far apart the numbers in a data set are, on average (in an I-O psychology context, this usually means test scores). A large standard deviation suggests a wide ‘spread’ or range of scores, potentially exposing the absence of a normal distribution. A very small standard deviation suggests that the scores in a data set are clustered tightly together.
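The four figures above can be computed directly with Python’s standard library. The scores below are purely hypothetical, made up for illustration:

```python
import statistics

# Hypothetical applicant test scores (illustrative only)
scores = [72, 85, 85, 90, 68, 77, 85, 92, 60, 81]

mean = statistics.mean(scores)      # the statistical average
median = statistics.median(scores)  # middle score in rank order
mode = statistics.mode(scores)      # most frequently occurring score
spread = statistics.stdev(scores)   # sample standard deviation

print(f"mean={mean} median={median} mode={mode} stdev={spread:.1f}")
```

Note that the mean (79.5) and the median (83) differ here, a hint that the data is not perfectly symmetric around its center.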
The Four-Fifths Rule. If you’re asked to determine whether or not your test scores or applicant outcomes are biased in favor of (or against) any particular demographic group(s), you should look for violations of the Four-Fifths Rule before you check anything else. This test is often the first used by the courts or the EEOC if they receive a complaint about your hiring practices; it provides a rough 'initial glance' at selection disparities. By itself, this analysis does not definitively prove or disprove the presence of adverse impact against a particular group. However, a negative four-fifths outcome provides ammunition for any curious investigator to dig deeper.
To perform this check, calculate the applicants' success rate (or each test's pass rate). Include each demographic group that makes up more than 2% of your applicants who meet the basic qualifications (examples include ‘high school diploma or equivalent’ or ‘some college’) for the position. Figure out which group has the highest success rate. Compare the highest success rate to the success rates of every other group. If any other group has a success rate that is less than .8 (80%) of the highest group's success rate, there is a 4/5ths Rule violation. For example: let’s say you have 500 total test-takers applying for a position. This group breaks down ethnically as follows:
- 100 Caucasians with 50 successful outcomes
- 100 Asians with 80 successful outcomes
- 100 Latinos with 71 successful outcomes
- 100 African Americans with 72 successful outcomes
- 100 Unknowns with 73 successful outcomes
Since Asians have the highest success rate (80%, with 4 of every 5 Asian applicants succeeding), the rate of every other group is compared to theirs. Since .8 of 80% equals 64%, any group succeeding at a rate lower than 64 out of 100 applicants (64%) may indicate adverse impact. Given the information above, there seems to be adverse impact against the Caucasian group.
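A quick sketch of this comparison in Python, using the success rates from the example above (group labels follow the example; any real analysis would use your own applicant data):

```python
# Success rates from the worked example: successes / applicants per group
rates = {
    "Caucasian": 50 / 100,
    "Asian": 80 / 100,
    "Latino": 71 / 100,
    "African American": 72 / 100,
    "Unknown": 73 / 100,
}

highest = max(rates.values())   # 0.80 (the Asian group)
threshold = 0.8 * highest       # 0.64: four-fifths of the top rate

# Any group falling below the threshold is a potential 4/5ths violation
flagged = [group for group, rate in rates.items() if rate < threshold]
print(flagged)  # ['Caucasian']
```

Only the Caucasian group (50%) falls below the 64% threshold, matching the conclusion above.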
Beyond the Four-Fifths Rule. But it’s never that simple. If your sample size or overall applicant pool is small, a 4/5ths calculation might not provide any useful information whatsoever. When the sample or population of applicants is small, just a couple of successes added to one group or taken away from another might dramatically change a finding of adverse impact. Re-running the check after hypothetically shifting a success or two between groups is a useful double-check on the practical significance of your adverse impact findings.
Thus, we recommend that you tweak your data a bit and re-try the same statistical analyses. Combine different groups into one group if your numbers are small, and compare that combined group to the top-performing group. Compare the same groups over different spans of time (the past 6 months, the past year, from 2003 through 2006, etc.) to see if any of the effects change. Or try the reverse: split your data to see if the differences in success rates reside with a narrower group, a specific location, or a shorter period of time.
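Combining small groups before re-checking can be sketched as follows. The group names, counts, and the top rate here are hypothetical, invented only to show the mechanics of pooling:

```python
# Hypothetical small groups, each under the 30-applicant threshold,
# stored as (applicants, successes) pairs
small_groups = {"GroupA": (12, 5), "GroupB": (15, 8)}

top_rate = 0.80  # assumed success rate of the top-performing group

# Pool the small groups into one before applying the 4/5ths check
applicants = sum(n for n, _ in small_groups.values())
successes = sum(s for _, s in small_groups.values())
combined_rate = successes / applicants  # 13 / 27, roughly 0.481

violation = combined_rate < 0.8 * top_rate  # threshold is 0.64
print(round(combined_rate, 3), violation)   # 0.481 True
```

Pooling buys you a usable sample size, but at a cost: a combined group can mask a disparity affecting only one of its constituent groups, which is why the text also recommends splitting the data back apart and re-checking.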
Drilling down into your data can help you isolate and identify potential causes of the group differences you found (or didn’t find). Here’s why: deceptive outcomes may present themselves when examining a location, group, or time period that is either too small to reveal a suspected impact or too large to reveal subtle changes over time...[CONTINUED IN PART 2]