[CONTINUED FROM PART 1] ...Thus, combining data haphazardly could lead you to find adverse impact that does not actually exist, or to miss adverse impact that is actually present.

THE PERIL OF COMBINING DATA TO CREATE LARGER SAMPLES
Most professionals will warn you that you must not combine several years of data into a single analysis due to Simpson's Paradox -- and they are correct.  The paradox arises because the number of applicants in each demographic group can shift from month to month or year to year, so pooled selection rates can tell a very different story than the period-by-period rates.  If you have enough data and have stored it in a detailed manner, it is often easier to compare each group year-by-year AND only make single test/process comparisons between groups. (But be warned: running many separate comparisons increases the odds of finding a problem where none exists.)  Combining all lower-scoring groups into a single group, combining multiple selection practices into one evaluation, and/or combining similar jobs into one job for evaluation purposes requires more complex statistical calculations.
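To make the paradox concrete, here is a small Python illustration with entirely made-up applicant counts: within each year, women pass at a higher rate than men, yet the pooled two-year totals make women's pass rate look dramatically lower.

```python
# Hypothetical applicant counts illustrating Simpson's Paradox.
# Within each year women pass at a HIGHER rate than men, yet the
# pooled two-year totals make women's pass rate look far lower.

data = {
    "Year 1": {"men": (90, 100), "women": (19, 20)},    # (passed, applied)
    "Year 2": {"men": (6, 20),   "women": (35, 100)},
}

totals = {"men": [0, 0], "women": [0, 0]}

for year, groups in data.items():
    for group, (passed, applied) in groups.items():
        totals[group][0] += passed
        totals[group][1] += applied
        print(f"{year} {group:>5}: {passed}/{applied} = {passed / applied:.0%}")

for group, (passed, applied) in totals.items():
    print(f"Pooled {group:>5}: {passed}/{applied} = {passed / applied:.0%}")
```

The reversal happens purely because the mix of applicants shifts between the two years, which is exactly why period-by-period comparisons are safer than pooling.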

For example, if all candidates have to pass a pre-hire basic reading test, a basic math test, and a structured interview, and you wish to compare men and women, it is often easier to compare male and female outcomes on the reading test alone, then compare them on the math test alone, then compare their structured interview scores alone – making sure you compare each group’s performance over the same interval of time (perhaps each month’s male/female scores, or each year’s…it depends upon the number of applicants you screen over time).  Employing this strategy may enable you to avoid delving into the more complex pattern consistency analyses used to aggregate and compare combined groups, such as the “Breslow-Day Test” along with “Tarone’s Correction.” The next step would be to perform yet another analysis to check whether or not the overall process produced adverse impact.  Two that we know of include the Mantel-Haenszel Test and the Minimum Risk Weights Test.
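If you do end up needing the aggregated analyses mentioned above, the statsmodels library covers the main two through its StratifiedTable class. The sketch below assumes one 2x2 pass/fail table per year, broken out by sex; all of the counts are invented for illustration.

```python
# Sketch of pattern-consistency and pooled adverse-impact tests using
# statsmodels' StratifiedTable, with hypothetical yearly 2x2 counts.
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One 2x2 table per year: rows = (men, women), cols = (passed, failed).
tables = [
    np.array([[90, 10], [19, 1]]),    # Year 1 (hypothetical counts)
    np.array([[6, 14], [35, 65]]),    # Year 2
]

strat = StratifiedTable(tables)

# Breslow-Day test (with the Tarone adjustment) of whether the male/female
# odds ratio is consistent across years -- i.e., whether pooling is defensible.
homogeneity = strat.test_equal_odds(adjust=True)
print("Breslow-Day/Tarone p-value:", homogeneity.pvalue)

# Cochran-Mantel-Haenszel test of whether the overall process shows a
# male/female difference once the yearly strata are accounted for.
overall = strat.test_null_odds(correction=True)
print("Mantel-Haenszel p-value:", overall.pvalue)
print("Pooled odds ratio:", strat.oddsratio_pooled)
```

In practice you would feed in one table per time period (or per job), but the sequence is the same: check consistency first, then test the pooled result.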

In practice, many practitioners can gather enough data to rely on these more basic procedures and eschew the more complex analyses. However, certain circumstances and claims might make pattern consistency analyses mandatory.

STATISTICAL SIGNIFICANCE
If you find surface differences between two appropriately-defined applicant groups that cause you to raise an eyebrow, you should make sure these differences aren’t due to chance.  Use tests of statistical significance to make this determination.  Two common gauges of statistical significance are…

-The pooled-variance t-test. If you want to determine whether or not the difference you observed between the average scores of two independent groups is statistically significant, this is the test to use.  However, it's important to remember the pooled-variance t-test assumes the population variances of the two compared groups are equal.  Check that assumption first (Levene's test is a common way to do it); if the variances turn out to be unequal, use a version of the t-test that doesn't require equal variances, such as Welch's t-test.  Then there’s the…

-The F-test or ANOVA (analysis of variance). If you are checking to see if the differences between the average scores of multiple groups are statistically significant (for example, comparing the average score of each individual ethnic group on an assessment to the average score of every other group), use this test. It will spare you the agony -- and the inflated false-positive rate -- that comes from performing multiple t-tests, comparing the average scores of only two groups at a time.  (A short scipy sketch of both of these tests follows this list.)
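Here is a minimal sketch of both tests using scipy; the score arrays are simulated stand-ins for real assessment data.

```python
# Sketch of both significance tests using scipy, with simulated score data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
men_scores = rng.normal(75, 10, size=120)      # hypothetical test scores
women_scores = rng.normal(73, 10, size=110)

# Check equality of variances first (Levene's test), then pick the t-test:
# pooled-variance if the variances look equal, Welch's (equal_var=False) if not.
lev_stat, lev_p = stats.levene(men_scores, women_scores)
t_stat, t_p = stats.ttest_ind(men_scores, women_scores, equal_var=(lev_p > 0.05))
print(f"t-test p-value: {t_p:.3f}")

# One-way ANOVA across several groups at once (e.g., ethnic groups),
# instead of running many pairwise t-tests.
group_a = rng.normal(74, 10, size=90)
group_b = rng.normal(76, 10, size=85)
group_c = rng.normal(75, 10, size=95)
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA p-value: {anova_p:.3f}")
```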

Keep in mind that these tests are not always appropriate depending on the nature of your data.  If you are unsure of what these limitations are, then you probably shouldn’t be using them.

When you make statistical significance calculations, the software program will likely spit out a p-value.  A p-value is just a fancy-sounding term for the probability that you would see a difference in selection rates/test scores/etc. at least as large as the one you observed if there were actually no real difference between the two groups (that is, if chance alone were at work).  Industry and legal precedents suggest the following:

Any p-value lower than 0.05, or 5% (for a two-tailed test), or lower than 0.025, or 2.5% (for a one-tailed test), suggests the group differences you found were probably NOT a coincidence.  More precisely, if there were no real difference between the groups, a gap that large would turn up less than 5% (or 2.5%) of the time, so it very likely reflects a real difference in group performance/group outcomes.

In other words, if a statistical significance test gives you a p-value of 0.02, a difference that large would show up by chance only about 2% of the time, so the difference you noticed in your initial analysis is very unlikely to be a fluke.
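For selection-rate comparisons specifically, the same idea can be checked with a two-proportion z-test (statsmodels) or Fisher's exact test (scipy). The counts below are hypothetical; Fisher's exact test is the safer choice when cell counts are small.

```python
# Sketch: is the gap between two groups' selection rates statistically
# significant? Hypothetical counts; two-tailed tests shown.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.proportion import proportions_ztest

selected = np.array([48, 27])     # men selected, women selected (hypothetical)
applicants = np.array([100, 80])  # men applicants, women applicants

# Two-proportion z-test on the selection rates.
z_stat, p_value = proportions_ztest(selected, applicants)
print(f"z = {z_stat:.2f}, two-tailed p = {p_value:.3f}")

# Fisher's exact test on the same data -- safer with small counts.
table = np.array([
    [selected[0], applicants[0] - selected[0]],
    [selected[1], applicants[1] - selected[1]],
])
odds_ratio, fisher_p = fisher_exact(table)
print(f"Fisher's exact two-tailed p = {fisher_p:.3f}")
```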

If you want to tease apart the various potential reasons people get selected or rejected, and determine the amount of influence each factor (education, years of experience, test scores, interview, etc.) wields in your selection process, you can perform a regression analysis.  In the real world, these aren’t whipped out too often (unless you're in compensation).  Frankly, if you get to the point that you have to do a regression analysis, it’s probably due to an investigation and you’re probably already screwed.  That’s not to say it isn’t a valuable exercise; it’s just that it probably won’t be necessary if you adhere to the best practices of job analysis and selection process development we’ve conveniently outlined on this website.
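For the curious, here is what a minimal version of that exercise might look like: a logistic regression of the hire decision on the candidate factors, using simulated data. The column names, coefficients, and data are purely illustrative, not a prescription for how your model should look.

```python
# Minimal sketch of a regression on hire decisions, using simulated data.
# Column names (hired, test_score, years_exp, interview, sex) are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "test_score": rng.normal(75, 10, n),
    "years_exp": rng.integers(0, 15, n),
    "interview": rng.normal(3.5, 0.8, n),
    "sex": rng.choice(["M", "F"], n),
})
# Simulate hire decisions driven mostly by test score and interview rating.
signal = -20 + 0.2 * df["test_score"] + 0.1 * df["years_exp"] + 1.0 * df["interview"]
df["hired"] = (rng.random(n) < 1 / (1 + np.exp(-signal))).astype(int)

# Logistic regression: how much does each factor (and group membership)
# move the odds of being hired?
model = smf.logit("hired ~ test_score + years_exp + interview + C(sex)", data=df).fit()
print(model.summary())
print(np.exp(model.params))   # odds ratios per factor
```

Exponentiating the coefficients gives odds ratios, which is usually the easiest way to describe how much weight each factor carries in the selection decision.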

CONCLUSION
Obviously, there’s a great deal left out of this section, owing to the fact that this isn’t a stats class and TheAppliedPsychologist isn’t in the business of plagiarism.  The field of statistics isn’t rocket science – but it IS fairly complicated math for the uninitiated.  Evaluating the effectiveness of selection assessments and programs requires both specialized software and a user with sophisticated knowledge of how to properly interpret the analysis output.

If all of this seems like gibberish to you, either grab a book to figure it out or pay someone who understands.  As we will repeat ad nauseam, ignoring these issues will eventually come back to bite you.