Thursday, May 5, 2016

Standards for Interpreting Statistics Made Easy

By Ihor Cap
Cartoon by John Landers, Courtesy of Causeweb.org
One of the biggest problems faced by researchers and practitioners alike is the lack of helpful standards for making sense of research data, and of statistical results in particular. The problem becomes more pressing if you are the person who must read that data, analyze it and, even worse, present or discuss it with an audience. There is no time to waste because you needed it yesterday, right? If this sounds like you, this article sums up those standards briefly.
More specifically, this article will expose you to a range of useful statistics and provide you with meaningful “tools”, in the form of short classifying phrases, that will help you easily explain their worth to others. These phrases are limited to interpreting Pearson Product Moment (PPM) correlation coefficients, test-retest stability coefficients, adjusted R-square values, index effect sizes, reliability or internal consistency coefficients, difficulty indices, and item-discrimination indices or point biserial correlation coefficients. They can be applied in any research study.
Interpreting Pearson Product Moment Correlation or PPM Coefficients
Davis (1971, in Chappell, 1984) labeled zero-order Pearson correlation coefficients from .01 to .09 as “negligible”, .10 to .29 as “low”, .30 to .49 as “moderate”, .50 to .69 as “substantial” and .70 or higher as “very strong.” These short classifying phrases help facilitate consistency in interpreting the size (or strength) of all Pearsonian and adjusted R-square values obtained in any study. The interpretations apply to simple correlation coefficients (or PPM correlation coefficients) as well as Pearsonian test-retest stability coefficients.
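As a rough illustration, the Davis labels can be written as a small lookup function. This is only a sketch: the function name davis_label is made up for this example, and two choices in it are assumptions rather than part of Davis's scheme. The sign of the coefficient is ignored, so the label describes strength only, and a value that falls between two published bands (such as .295) is placed in the lower band.

```python
def davis_label(r):
    """Return the Davis descriptive label for a Pearson correlation coefficient."""
    size = abs(r)  # assumption: the label describes strength, not direction
    if size >= 0.70:
        return "very strong"
    if size >= 0.50:
        return "substantial"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "low"
    if size >= 0.01:
        return "negligible"
    # assumption: anything under .01 lies outside the published bands
    return "below the labeled range"


print(davis_label(0.45))   # moderate
print(davis_label(-0.72))  # very strong
```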
Interpreting the index Effect Size (ES)
The index Effect Size (ES), defined as “the mean difference between the treated and control subjects divided by the standard deviation of the control group” (Smith, Glass & Miller, 1980), is often used to evaluate the magnitude of the experimental effect in standard deviation units. Schermer (1988) reviewed ES outcomes and devised a set of standards to facilitate consistency in the interpretation of these outcomes. Any researcher can adopt these standards. In quantitative terms, effect size point estimates of less than .2 are “small” effects, estimates of .5 are “medium” in size, and estimates higher than .5 are “large.” Use these benchmarks to estimate the magnitude of the effect on the posttest or delayed posttest measure.
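The definition quoted above translates directly into a few lines of code. The sketch below is illustrative only. The function names and the sample scores are invented, the control-group standard deviation is computed as a sample standard deviation (n - 1 in the denominator), and estimates from .2 up to .5 are labeled “medium” as an assumption, because the benchmarks above do not name that band explicitly.

```python
from statistics import mean, stdev


def effect_size(treated, control):
    """Mean difference divided by the control group's standard deviation."""
    # assumption: sample standard deviation (n - 1) of the control group
    return (mean(treated) - mean(control)) / stdev(control)


def effect_label(es):
    """Attach a benchmark label to an effect size estimate."""
    size = abs(es)
    if size < 0.2:
        return "small"
    if size <= 0.5:
        return "medium"  # assumption: the .2 to .5 band is treated as medium
    return "large"


# Hypothetical posttest scores, invented for illustration only
treated = [14, 16, 18, 20, 22]
control = [10, 12, 14, 16, 18]

es = effect_size(treated, control)
print(round(es, 2), effect_label(es))
```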
Interpreting Reliability Correlations or Internal Consistency Coefficients of Exams or Teacher-Made Tests
Fox (1969) labeled reliability correlations between 0 and .50 as “low”, .51 to .70 as “moderate”, .71 to .86 as “high”, and above .86 as “very high” for the purposes of educational research. Another researcher’s review of evaluation devices identified minimum reliability coefficient values of .85 for making effective decisions about individuals and .65 for groups (Ridley, 1976). Nothing below .50 for the latter would suffice (Jordan, 1953; Nunnally, 1967). Additionally, Diederich (1960) states that most teacher-made tests “…regarded as good, usable tests achieved reliabilities between .60 and .80.” Lower test reliabilities may be acceptable for group research projects in education (Borg and Gall, 1989). Short-form tests can expect slight drops in reliability in spite of retaining the best test items (Borg and Gall, 1989). If the reduction in length produces only a negligible decrease in reliability, you will gain substantial savings in the time spent writing an exam. These short classifying phrases will facilitate consistency in interpreting the size (or strength) of reliability coefficients for all instruments or exams used in any research project. Some well-known statistics that fall into this category include Cronbach’s Alpha, the Kuder-Richardson Formula 20 (KR-20) and the Kuder-Richardson Formula 21 (KR-21).
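To make the reliability labels concrete, here is a minimal sketch that computes Cronbach’s Alpha from a small score table and attaches the Fox label. The function names and the response data are invented for illustration. Population variances are used throughout, so for items scored 0 or 1 the result matches KR-20. A coefficient falling between two published bands (such as .705) is assigned to the higher band, and a negative coefficient is reported as “low”; both choices are assumptions.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """scores: one row per exam taker, one column per test item."""
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def fox_label(reliability):
    """Attach the Fox label to a reliability coefficient."""
    if reliability > 0.86:
        return "very high"
    if reliability > 0.70:
        return "high"
    if reliability > 0.50:
        return "moderate"
    return "low"  # assumption: negative coefficients are also reported as low


# Hypothetical 0/1 item scores for five exam takers on four items
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2), fox_label(alpha))
```

With real data, a table this small would not support any decision; the point is only to show where the coefficient comes from before a label is applied.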
Interpreting the Quality or Power of each test item (or question) for the purposes of Item or Exam Analysis
The selection of final composite test items (or questions) proceeds with the goal of obtaining a representative range of difficulties and the highest possible item-discrimination values, balanced against the highest possible coverage of the Table of Specifications to maintain content validity. This reduces researcher bias in the selection process (Richter, 1980).
Using the guidelines proposed by Kromhout (1987), test items passed by more than 80 percent of exam takers are extremely easy and items passed by less than 20 percent are extremely difficult for exam takers in a field-trial “test” study. The difficulty index of an item is the proportion of exam takers who pass it. In other words, removing items with difficulty indices above .80 or below .20 ensures that all exam takers receive a test with a moderate range of difficulty.
The item-discrimination index, which analyzes the power of each test item (or question), is the “Point Biserial Correlation Coefficient” (PBCC). Researchers consider this coefficient “… to be the single best measure of the effectiveness of a test item” (Lewis, 1989). Lewis (1989) proposes the following ranges and interpretations. A test item with a PBCC of .30 and above is a very good discriminator between the top 24% and the bottom 24% scoring groups. A test item with a PBCC of .20 to .29 is reasonably good, but subject to improvement. Test items with PBCCs of .09 to .19 are marginal, usually needing improvement, and those below .09 are poor, to be improved or discarded.
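The difficulty screen described above is simple arithmetic: for each item, divide the number of exam takers who passed it by the number who attempted it, then drop items with an index above .80 or below .20. The sketch below assumes items are scored 1 for a pass and 0 for a fail; the function names and the response table are invented for illustration.

```python
def difficulty_indices(responses):
    """Proportion of exam takers passing each item (rows = exam takers)."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(k)]


def keep_item(p, low=0.20, high=0.80):
    """Keep an item unless its difficulty index is above .80 or below .20."""
    return low <= p <= high


# Hypothetical field-trial results: five exam takers, three items
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]

for number, p in enumerate(difficulty_indices(responses), start=1):
    print(number, p, "keep" if keep_item(p) else "remove")
```

In this made-up table the first item is passed by everyone and the third by no one, so both would be removed and only the second item would stay.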
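For readers who want to see where a PBCC comes from, the following sketch computes it for one item using the standard point biserial formula: the difference between the mean total scores of those who passed and those who failed the item, divided by the standard deviation of all total scores, times the square root of p(1 - p). The function names and the data are invented. The total score here includes the item itself, which is an assumption, and the calculation needs at least one pass, at least one fail, and some spread in the total scores. Values between two published bands (such as .295) are placed in the lower band, also an assumption.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item, totals):
    """item: 0/1 scores on one item; totals: total test score per exam taker."""
    passed = [t for x, t in zip(item, totals) if x == 1]
    failed = [t for x, t in zip(item, totals) if x == 0]
    p = len(passed) / len(item)  # proportion passing the item
    return (mean(passed) - mean(failed)) / pstdev(totals) * sqrt(p * (1 - p))


def lewis_label(pbcc):
    """Attach the Lewis interpretation to a point biserial coefficient."""
    if pbcc >= 0.30:
        return "very good"
    if pbcc >= 0.20:
        return "reasonably good"
    if pbcc >= 0.09:
        return "marginal"
    return "poor"


# Hypothetical data: five exam takers, one item, and their total scores
item = [1, 1, 1, 0, 0]
totals = [9, 8, 6, 5, 2]

pbcc = point_biserial(item, totals)
print(round(pbcc, 2), lewis_label(pbcc))
```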
Conclusions and Recommendations
This concludes your reading on “Standards for Interpreting Statistics Made Easy.” Researchers and practitioners are encouraged to use these standards and benchmarks in their future efforts related to analyzing, interpreting or explaining, and presenting their statistical data.  I hope that this read has made you a little less fearful of statistics and a little more confident in your newly acquired knowledge of the meaning and worth of these numerical performance benchmarks.
Author Information:
Ihor Cap, Ph.D. is an Education Research Specialist, Web Author and Marketing & Promotions Manager for EZREKLAMA.
References:
The complete reference to each of the cited sources is available in the document cited below.
Cap, Ihor. (1995). The usefulness and effectiveness of a self-instructional print module on multicultural behaviour change in apprentices in Manitoba. Unpublished doctoral dissertation, Florida State University, Tallahassee. Available from University Microfilms Inc., P.O. Box 1764, Ann Arbor, MI 48106-1764 USA.  (377 pages)
 
This article first appeared August 24, 2009 in http://articlesandblogs.ezreklama.com.