Basic Item Analysis Concepts
This article provides some basic guidelines, concepts and issues important for a proper understanding of item analysis. The material, however, is best appreciated when supplemented with one-on-one or small-group mentoring and "real-world" applications in a safe learning environment.
Correlation
- Correlation measures the relationship between two or more variables; it does not explain why the relationship exists. Everyday expressions about assumed correlations, such as "Birds of a feather flock together" or "Time heals all wounds", reflect this relatedness.
- The range of correlation coefficients is from -1.00 to +1.00, that is, from a perfect negative correlation to a perfect positive correlation. High positive correlations mean that as one variable increases in size, so does the other. High negative correlations mean that as one variable increases, the other decreases; the latter is also known as an inverse relationship. A lack of correlation between variables is expressed by a value of 0.00. Scatterplots are a good means of visually examining the direction and strength of correlations, as well as checking the data for outliers (scores that fall well outside the range of the other scores). An outlier will "stretch" or "shrink" the size of a correlation coefficient, so both coefficients (computed with and without the outlier) can be reported if there is doubt in your decision making.
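As a minimal illustration of this with-and-without-the-outlier check, the sketch below computes Pearson's r for a small set of hypothetical paired scores; the data, variable names, and helper function are invented for illustration only.

```python
# Minimal sketch: Pearson r computed with and without a suspected outlier.
# The paired scores below are hypothetical illustration data.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours_studied = [2, 4, 5, 6, 8, 9, 30]       # the last pair is the suspected outlier
exam_scores   = [55, 60, 64, 70, 78, 82, 60]

print("r with the outlier:   ", round(pearson_r(hours_studied, exam_scores), 2))
print("r without the outlier:", round(pearson_r(hours_studied[:-1], exam_scores[:-1]), 2))
```

Reporting both values, as suggested above, makes the outlier's influence visible.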
- Relationships can get complicated and defy linearity; the curvilinear relationship between age and health care use is one such example. The coefficient of determination, r², explains how much two variables have in common. Nonetheless, it is important to know not only the "magnitude" or "size" and direction of relationships but also their statistical and practical significance.
- As a note of caution, one should never assume that one variable causes a change in the other when explaining relationships with r².
- For example, both the purchase of computers and the designation of trades have increased significantly in the last few decades. A high positive correlation reflects this joint increase, but one cannot assume that purchasing computers causes the Apprenticeship Branch to designate trades.
- The Standard Deviation (SD), or square root of the variance, can be used as a rough estimate of exam quality/reliability. It reflects how well the exam discriminates between exam takers' differing ability levels and is related to N (i.e., sample) size. As such, the optimal SDs for different N sizes are roughly as follows: N=25 (SD about 3.9), N=50 (SD about 4.5), N=100 (SD about 5), N=500 (SD about 6.1). Significant deviations from these values would suggest that the exam does not discriminate between exam takers to the extent desired.
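A small sketch of how the guideline above might be applied in practice; the raw scores are hypothetical, and looking up the nearest tabled N is simply one convenient way to pick a reference value.

```python
# Sketch: compare an exam's standard deviation to the rough guidelines quoted
# above (N=25 -> ~3.9, N=50 -> ~4.5, N=100 -> ~5.0, N=500 -> ~6.1).
# The raw scores below are hypothetical.
from statistics import stdev

GUIDELINE_SD = {25: 3.9, 50: 4.5, 100: 5.0, 500: 6.1}

def nearest_guideline(n):
    """Return the tabled N closest to the actual N and its guideline SD."""
    closest_n = min(GUIDELINE_SD, key=lambda tabled_n: abs(tabled_n - n))
    return closest_n, GUIDELINE_SD[closest_n]

scores = [68, 71, 74, 75, 76, 78, 79, 80, 81, 83, 84, 86, 88, 90, 92]
n, sd = len(scores), stdev(scores)
ref_n, ref_sd = nearest_guideline(n)
print(f"N={n}, SD={sd:.1f}; the guideline for N={ref_n} is about {ref_sd}")
```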
- Like the correlation coefficient, the Mean (i.e., average) and SD are the basic workhorses of descriptive and inferential statistics, of which item analysis is a part.
- Both the mean and SD play a central role in estimating the magnitude of postcourse certification success in standard deviation units. The magnitudes of the resulting effect sizes add meaning to the often elusive interpretation of postcourse significance.
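The article does not spell out the effect-size formula, so the sketch below uses one common choice, a standardized mean difference (post-course mean minus pre-course mean, divided by a reference standard deviation); the function name and the numbers are hypothetical.

```python
# Sketch: a standardized mean difference expresses post-course change in
# standard deviation units (a Cohen's-d-style index). This is one common
# choice, not necessarily the author's; the figures are hypothetical.
def effect_size_sd_units(mean_post, mean_pre, sd_reference):
    """(difference between means) / (reference standard deviation)."""
    return (mean_post - mean_pre) / sd_reference

d = effect_size_sd_units(mean_post=78.1, mean_pre=71.5, sd_reference=9.5)
print(f"Effect size: {d:.2f} SD units")   # about 0.69 SD units for these figures
```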
- Validity is the extent to which a trade test measures what it is supposed to measure.
- Although there are many forms of validity, I would like to focus our attention on two general forms, namely face and content validity.
- Face validity is made more convincing in part by involving such authoritative "experts" as Program Advisory Committee representatives and instructors.
- A test is judged to be valid for a particular trade only if we know which skills we're testing. The tasks, subtasks and supporting knowledge and ability statements identified in the Occupational Analyses provide these skill sets. These are provided by competent tradespersons who in the end are the only people qualified to judge or rate the relevance of test items.
- These authorities and Test Developers ensure that the trade test items fit the “objectives” (i.e., content validity) of our Table of Specifications (TOS) for each subtask (i.e., representative sample).
- An item analysis and the consequent coefficient will reveal the accuracy of this fit.
- Thus, a test item is valid if it requires an apprentice to perform a subtask at a specific taxonomic level as specified in the TOS. You will recall that test items are keyed to a specific category of outcome (e.g., subtask, taxonomic level).
- Pass rates alone do not determine whether a test is valid; they only tell us about a test's difficulty and nothing more. A test that is perfectly valid can have a zero pass rate, and a test that is perfectly invalid can have a 100% pass rate. There is more discussion of pass/fail rates in the Standard Error of Measurement (SEM) section further on.
- Most test scores will be very low when most test items are very difficult. Conversely, if most test items are very easy, then most test scores will be very high.
- An additional quality control procedure may be used to help systematize and facilitate the interpretation of examination pass/fail rates. This procedure plots the mean raw score recorded for each sitting against a range whose limits are defined by standard deviation values. It allows us to determine whether a set of exam results recorded for a given exam or time period should be considered "in control" -- that is, within the range of statistical normalcy. Accordingly, this range provides an easily determined benchmark for assessing normalcy and deviancy in exam behaviour as results fluctuate. Deviant results identified by this method should trigger a thorough review of examination results and contents, conducted by means of the more detailed statistical analysis called item analysis. This procedure assumes that the same version of the exam is used.
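A minimal sketch of this "in control" check, assuming control limits set at the mean of the sitting means plus or minus two standard deviations; both the sitting means and the two-SD choice are illustrative assumptions rather than prescribed values.

```python
# Sketch of the quality-control check described above: flag exam sittings
# whose mean raw score falls outside (overall mean of sitting means) +/- 2 SD.
# The sitting means below are hypothetical.
from statistics import mean, stdev

sitting_means = [76.2, 78.0, 74.9, 77.5, 75.8, 79.1, 66.3, 78.4]
centre = mean(sitting_means)
spread = stdev(sitting_means)
lower, upper = centre - 2 * spread, centre + 2 * spread

for i, m in enumerate(sitting_means, start=1):
    status = "in control" if lower <= m <= upper else "OUT OF CONTROL - review"
    print(f"Sitting {i}: mean = {m:5.1f}  ({status})")
print(f"Control limits: {lower:.1f} to {upper:.1f}")
```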
- Two common statistics used to gain additional insight into the quality of test items are item difficulty (P value) and item discrimination (D value).
- P values can range from 0.0 to 1.0.
- A high P indicates an easy item; a low P a more difficult one.
- In practice, difficulty values tend to regress toward the mean.
- The P value or Difficulty index is determined by dividing the total number of respondents answering an item correctly by the total number of respondents to that item.
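The calculation is a single division, as defined above; a tiny sketch with hypothetical counts:

```python
# Difficulty index (P value): respondents answering the item correctly
# divided by respondents to the item. The counts are hypothetical.
def p_value(num_correct, num_respondents):
    return num_correct / num_respondents

print(p_value(num_correct=43, num_respondents=50))   # 0.86 -> a fairly easy item
```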
- P values say nothing about the quality of an item, only whether it is easy or difficult and, for the incorrect options, to some extent how useful they are as distractors. However, more scores due to chance (guessing) are associated with difficult tests, which negatively affects reliability. Test items, regardless of taxonomic level, can be ordered by increasing difficulty; this lets exam takers complete the easier items immediately and can nudge their confidence forward in the process, although there is no guarantee that items ordered by increasing difficulty work better than randomly ordered ones. More importantly, removing items with difficulty indices above, say, .80 or below .20 through item analysis would give exam takers a more symmetrical distribution of items from test to test (a moderate range of difficulty), consistency from test to test, and would prevent some groups of exam takers from unfairly receiving more difficult tests than others.
- The optimal range of difficulty values is 50% to 80%. This range tends to be associated with point biserial correlation coefficient (PBCC) values of .40 or greater; items outside this range tend to produce lower PBCC values and may need revision. PBCC and D values of +.40 and greater suggest that examinees who did well on the total test performed significantly better on the test item than did examinees who performed poorly.
- The D value contrasts high scorers (the upper 27%) who correctly answered the keyed question with low scorers (the lower 27%) who did so. It helps confirm the validity of an item. My own experience shows that very good items tend to have more high scorers than low scorers answer the item correctly, with each distractor attracting a minimum of 5% of responses.
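A sketch of the upper-27%/lower-27% comparison described above; the (total score, item score) pairs are hypothetical, and the simple sort-based grouping is an assumption about how ties would be handled.

```python
# Sketch of the discrimination (D) index: proportion of the upper 27% of
# scorers answering the item correctly minus the proportion of the lower 27%
# doing so. Data are hypothetical (total test score, 0/1 item score) pairs.
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """D = p(correct | upper group) - p(correct | lower group)."""
    ranked = sorted(zip(total_scores, item_correct), key=lambda pair: pair[0])
    k = max(1, round(len(ranked) * fraction))
    lower_group, upper_group = ranked[:k], ranked[-k:]
    p_lower = sum(item for _, item in lower_group) / k
    p_upper = sum(item for _, item in upper_group) / k
    return p_upper - p_lower

totals = [55, 58, 60, 62, 65, 68, 70, 73, 75, 80, 82, 85]   # total test scores
item   = [0,  0,  1,  0,  1,  1,  0,  1,  1,  1,  1,  1]    # 1 = item answered correctly
print(round(discrimination_index(totals, item), 2))         # 0.67 for this toy data
```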
- Many researchers regard the PBCC as the single best measure for assessing the quality of a test item. This coefficient correlates a correctly scored item with responses on the scale as a whole. Like the D value, it measures the capacity of an item to discriminate between high and low scoring examinees. Unlike the D value, it assesses how much predictive power an item has on total test performance. In other words, it reflects the tendency of examinees who select the correct answer to have higher grades than those who answer the item incorrectly. Point estimates can range from -1.0 to +1.0. Generally speaking, the higher the index, the better the predictive power.
- Items with PBCC values less than .20, or with negative values, will need to be revised or replaced. Values between .20 and .39 are acceptable ("good"), and values of .40 or greater should be bronzed.
- When analyzing test items, PBCC values should be positive for the correct answer, and distractor values should be lower than the key's and, preferably, negative. A value of 0.0 means that the same number of low and high scorers answered the item correctly; a value of -.20 means that more lower-scoring examinees answered it correctly, and +.30 means that more higher-scoring examinees did. More detailed sample interpretations follow.
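A sketch of the point-biserial calculation: the Pearson correlation between a dichotomously scored item (1 = correct, 0 = incorrect) and the total test score. Whether the total includes or excludes the item itself varies by program, so this sketch simply uses the totals as supplied; the data are the same hypothetical pairs used in the D-value sketch above.

```python
# Sketch: point-biserial correlation between a 0/1 item score and the total
# test score. Hypothetical data; a positive value means higher scorers tended
# to answer the item correctly.
from math import sqrt

def point_biserial(item_correct, total_scores):
    n = len(item_correct)
    mi, mt = sum(item_correct) / n, sum(total_scores) / n
    cov = sum((i - mi) * (t - mt) for i, t in zip(item_correct, total_scores))
    si = sqrt(sum((i - mi) ** 2 for i in item_correct))
    st = sqrt(sum((t - mt) ** 2 for t in total_scores))
    return cov / (si * st)

item   = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
totals = [55, 58, 60, 62, 65, 68, 70, 73, 75, 80, 82, 85]
print(round(point_biserial(item, totals), 2))
```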
Item 1
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing: Total | Low | High | Point Biser. | Key |
|---|---|---|---|---|---|---|---|---|
| 86 | .00 | .14 | A | 86 | 100 | 100 | .14 | * |
| | | | B | 14 | 0 | 0 | -.21 | |
| | | | C | 0 | 0 | 0 | | |
| | | | D | 0 | 0 | 0 | | |
- Item 1. This item was too easy (proportion correct=.86). It did not discriminate well (point biserial correlation=.14 and discrimination index=.00, unacceptably low). The key and distractors did not form a very good set: most examinees identified the key correctly, distractors C and D attracted no examinees, and distractor B attracted only 14% of candidates.
Item 2
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing: Total | Low | High | Point Biser. | Key |
|---|---|---|---|---|---|---|---|---|
| 43 | .67 | .69 | A | 0 | 0 | 0 | | |
| | | | B | 43 | 0 | 67 | .69 | * |
| | | | C | 43 | 50 | 33 | -.60 | |
| | | | D | 14 | 50 | 0 | -.26 | |
- Item 2. This was a relatively difficult item (proportion correct=.43). It discriminated well (point biserial correlation=.69 and discrimination index=.67, above the acceptable range). Between them, distractors C and D attracted more examinees than did the key, B. One of the distractors (A) did not attract any examinees, which suggests it was an implausible distractor.
Item 3
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing: Total | Low | High | Point Biser. | Key |
|---|---|---|---|---|---|---|---|---|
| 14 | -.50 | -.80 | A | 14 | 0 | 33 | .39 | |
| | | | B | 14 | 50 | 0 | -.80 | * |
| | | | C | 43 | 0 | 33 | .24 | |
| | | | D | 29 | 50 | 33 | -.09 | |
- Item 3. This was a very difficult item (proportion correct=.14), which discriminated poorly (point biserial correlation=-.80 and discrimination index=-.50). More examinees chose distractor C (43%, point biserial=.24) than chose the key, B (14%, point biserial=-.80), with some also choosing A and D. It seems to have been the stronger candidates who were misled by the distractors, so the item does not appear to work well.
Item 4
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing: Total | Low | High | Point Biser. | Key |
|---|---|---|---|---|---|---|---|---|
| 43 | .67 | .50 | A | 14 | 50 | 0 | -.26 | |
| | | | B | 29 | 0 | 33 | .14 | |
| | | | C | 14 | 50 | 0 | -.80 | |
| | | | D | 43 | 0 | 67 | .50 | * |
- Item 4. This was a relatively difficult item (proportion correct=.43), which discriminated quite well (point biserial correlation=.50 and discrimination index=.67, above the acceptable range). Each distractor attracted some candidates.
Item 5
| Pcnt Correct | Disc. Index | Point Biser. | Alt. | Pcnt Endorsing: Total | Low | High | Point Biser. | Key |
|---|---|---|---|---|---|---|---|---|
| 14 | .00 | -.21 | A | 14 | 0 | 0 | -.21 | * |
| | | | B | 43 | 0 | 100 | .76 | |
| | | | C | 29 | 100 | 0 | -.79 | |
| | | | D | 14 | 0 | 0 | .04 | |
- Item 5. This was a very difficult item (proportion correct=.14). It did not discriminate well (point biserial correlation=-.21 and discrimination index=.00, below the acceptable range). More examinees chose distractor B (43%, point biserial=.76) than chose the keyed option A (14%, point biserial=-.21). A was specified as the correct answer, but B seems to work better. Option C attracted candidates from the lower-scoring group (point biserial=-.79), while D was a relatively weak distractor (.04).
Skewness
- Given no "pre-testing" or piloting procedure (i.e., same level or final exam apprentice scores are used as pilot data after the fact), the probability of obtaining a symmetrical distribution of scores is uncertain.
- Skewness (lopsidedness) of scores is likely. Marked skewness to the left (a tail toward the low end of the scale) tends to reflect rather easy items. Marked skewness to the right (a tail toward the high end) tends to reflect rather difficult items.
- The PBCC doesn't mean much if test items are highly skewed.
- A more symmetrical distribution is required; else one group of test takers may receive an easy exam whereas another receives a more difficult one.
- Skewness tends to decrease as the number of test items increases, regardless of the P values (i.e., difficulty values) obtained.
- The Pearsonian coefficient of skewness describes the distribution of scores. A coefficient of zero or near zero indicates that the test scores are symmetrically distributed. The formula is given by SK = 3(mean - median)/SD.
- Using 78.143 as the mean, 80 as the median and 9.478 as the standard deviation, we get SK = 3(78.143 - 80)/9.478 = -.587, or about -.6. Hence, the test is negatively skewed: an easy test.
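The same calculation as a small sketch, reproducing the figures quoted above:

```python
# Pearsonian skewness coefficient, SK = 3 * (mean - median) / SD,
# using the figures from the bullet above.
def pearson_skewness(mean_score, median_score, sd):
    return 3 * (mean_score - median_score) / sd

sk = pearson_skewness(mean_score=78.143, median_score=80, sd=9.478)
print(round(sk, 3))   # -0.588, i.e. about -.6: negatively skewed, an easy test
```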
- Given satisfactory reliability, a near symmetrical distribution of a test brings an item analysis to closure.
- Reliability is the extent to which a trade test provides consistent, stable and dependable results. It’s an overall indicator of the quality of a test.
- The goal is to achieve high reliability and a symmetrical distribution of items from test to test.
- The primary reliability coefficient (such as Cronbach's alpha, the Spearman-Brown "split-half" correlation, or Iteman's Kuder-Richardson 20 alpha coefficient) is based on the concept of "rational equivalence", which stresses the consistency of subjects' responses to all items on the test and thus provides a measure of the internal consistency of the test.
- The coefficient of internal consistency shows consistency of performance on different parts or items of the test taken at a single sitting, usually computed by item analysis (e.g., Alpha coefficient). The coefficient can range from 0.0 to 1.0.
- The higher the coefficient the better the internal consistency.
- High reliability is not so much a concern for group-level decisions, but it is vital for individual assessment and certification purposes. For making decisions about individuals, a reliability coefficient of .85 or better is required. Generally, coefficients between .65 and .80 are fine for educational research or group comparisons, and anything less than .50 is too low. Such low coefficients may indicate that the test is not measuring the attribute it is designed for, say sheet metal work. Assuming test items are valid, low reliability means that you have to increase the number of items on the exam because the range of items is limited. In addition, you can try adjusting item difficulties to get a larger spread.
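A sketch of one of the internal-consistency coefficients mentioned above, Kuder-Richardson 20, for dichotomously scored items. The tiny 0/1 response matrix is hypothetical and far too small for a serious reliability estimate; it only shows the mechanics.

```python
# Sketch of the Kuder-Richardson 20 (KR-20) internal-consistency coefficient
# for 0/1-scored items. Rows are examinees, columns are items (hypothetical).
from statistics import pvariance

def kr20(responses):
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    # Sum of p*q across items, where p = proportion correct and q = 1 - p.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / len(responses)
        pq_sum += p * (1 - p)
    var_total = pvariance(totals)           # variance of total scores
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

data = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
print(round(kr20(data), 2))   # about .67 for this toy matrix
```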
- Stability coefficients (test-retest reliability) establish the consistency of performance on a test over a period of time. Pearson's product-moment (PPM) correlation coefficient can be used for this purpose. With these coefficients, any instability is traceable to the test itself, or to parts of it. PPM requires two administrations: the initial test and the retest.
- PPM correlation classifications for interpreting the size or strength of correlations are as follows: .01-.09 is negligible; .10-.29 is considered low; .30-.49 is moderate; .50-.69 is substantial and .70 or more is very strong. These are useful for examining construct validity but more relevant for the purposes of test-retest reliability/stability.
- Note that discrimination index/PBCC values increase as the number of test questions increases, regardless of the ability of an individual test question to discriminate between good and poor subjects. In other words, reliability is directly related to the number of items on the exam.
- The reliability coefficient for a 25-item test cannot be compared directly with the same coefficient for a 100-item test. Short-form tests can be expected to show drops in reliability even when the best items are retained. However, this can be a modest sacrifice in exchange for a substantial reduction in the examinees' response burden. The validity of an exam may become a concern as reliability decreases. The objective is to maintain very high reliability (.85 or better) while maintaining content validity. Knowing something about the population sample and the content validity of each test under consideration is important.
- A coefficient of .5 is satisfactory for tests that have 10-15 items, and .8 for 50-item tests. For lengthier exams, .85 or better is still important for making decisions about individuals and certification. Note that reliability increases as the number of items on an exam increases.
- As the length of a test increases, so does the standard error of measurement (SEM). For a 90-item test you may have an SEM of 4.2, whereas for a 25-item test you may get an SEM of 2.
- Another measure related to test reliability is the Standard Error of Measurement (SEM). It estimates the error involved in measuring a specific test taker's observed (raw) grade on a specific exam. Theoretically, the raw score should lie within 1 SEM of the test taker's "true" score more than 2/3 of the time.
- The sources of measurement error are many and can include differences in test administration conditions; factors related to the examinee, such as illness, fatigue, and misreading; and the psychometric properties of the test itself.
- What helps to correct these sources of error? Quality test items, easy-to-understand test instructions, and closely following the prescribed procedures for administering trades examinations all help reduce measurement error.
- Again, its most practical use arises out of the need to interpret an individual test taker's examination results. After all, when confronted with decisions about level placement into a trade program, or T.Q., Provincial, and Interprovincial examination challenges on the basis of past in-school success and/or experience, Apprenticeship Staff need to know how much confidence can be placed in the test taker's score.
- As is often the case, retesting someone a hundred times is impracticable. Instead, decision-makers attempt to estimate, from a single administration, the probable limits between which the individual's true test score will fall. Given high reliability to begin with, the SEM is used to set a range of scores within the probable limits of the normal curve (e.g., about 2 times out of 3, scores will lie within 1 SEM of the observed score in raw-score points; 95% of the time within 2 SEM; and up to 99.73% certainty within 3 SEM). As the desired level of certainty increases, so does the width of the score band; that is why 1 SEM is recommended in our calculations.
- High reliability is especially important when making decisions about individuals and certification. For example, a weighing scale may be a valid and appropriate measure of a person's weight, but if one reading puts you at 95 pounds and a second puts you at 112 pounds, one would quickly question its accuracy and consistency. In other words, someone retaking the same Interprovincial exam, say two months later, could see a huge shift in their score even if their ability level does not change substantially. A 17-point margin of error, representing more than 12 percent of the exam (assuming a 0-to-135 scale), means there is both a high false-pass rate and a high false-failure rate. For example, a person who received a score of 70 on the exam may have scored an 87 or a 53 simply because of the unreliability of the test.
- The SEM formula is given by SEM = SD × √(1 - r), where SD is the standard deviation of the test scores and r is the test's reliability coefficient.
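A sketch of the SEM formula above, together with the plus-or-minus 1 SEM band described earlier; the SD, reliability, and observed score are hypothetical.

```python
# SEM = SD * sqrt(1 - reliability), plus the +/- 1 SEM band within which the
# true score should fall about two times out of three. Figures are hypothetical.
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    return sd * sqrt(1 - reliability)

error = standard_error_of_measurement(sd=9.5, reliability=0.85)
observed = 70
print(f"SEM = {error:.1f}")
print(f"About 2 times out of 3, the true score lies between "
      f"{observed - error:.1f} and {observed + error:.1f}")
```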
- Increasing test length increases reliability, but it does not guarantee efficiency. Improving test reliability requires much time and effort, especially if coefficients fall below .60. For example, if the coefficient of a 140-item test is found to be .60, you will need to lengthen the test about 3.8 times (140 × 3.8, or roughly 530 items) to reach a coefficient of .85. Such a lengthy test is impracticable to administer. Anything below .40 suggests that the test items have little in common with the attribute being measured. In this case, the test constructor's measurement problem should be reconsidered, and a new, more explicit examination (test) plan is in order. Note that a reliability coefficient is satisfactory only within the limits of the time and resources provided to improve the test.
- The formula for the required lengthening factor is given by: (number of times longer) = [(the reliability you want) × (1 - the reliability you got)] ÷ [(the reliability you got) × (1 - the reliability you want)].
- Plugging your own data into the formula, you might get: [.90 × (1 - .80)] ÷ [.80 × (1 - .90)] = .18/.08 = 2.25 times longer.
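The same prophecy-style calculation as a sketch, reproducing the worked example above and the 140-item example mentioned earlier:

```python
# How many times longer a test must be to move from the reliability you got
# to the reliability you want, using the formula above.
def lengthening_factor(r_got, r_want):
    return (r_want * (1 - r_got)) / (r_got * (1 - r_want))

print(round(lengthening_factor(r_got=0.80, r_want=0.90), 2))   # 2.25 times longer
print(round(lengthening_factor(r_got=0.60, r_want=0.85), 1))   # about 3.8 times longer
```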
Recommendations
As you can see, the interpretation of all of these topics and point estimates is limited by one's understanding of, and experience with, them. Here I have presented some of the basic concepts perceived to be important or useful additions to one's "toolbox" for dealing with item analysis. You are encouraged to read further on the topic, take formal courses, and attend workshops or seminars. With experience and time, all of these guidelines can add to your proper understanding, interpretation and communication of the meaning and uses of item analysis.
Author: Ihor Cap, Ph.D.
This article was first published on the http://articlesandblogs.ezreklama.com website on September 24, 2008.