By the Numbers: Anatomy of a Statistical Man
EHRENFELS: Professors use a statistical technique -- or should
I say they have their research assistants use a statistical technique --
known as item analysis. This is only one of dozens
of statistical formulas employed in research, but I would like to make
an example of this one in particular because it is used in grading as
well as in research, and I feel it captures the essence of their M.O. A
questionnaire or an exam contains multiple questions, which are known
as items. Many questionnaires are even subdivided into scales,
groups of items that are purported to tap different aspects of your personality
or knowledge. For example, the Graduate Record Examination and the Scholastic
Aptitude Test each have a group of items constructed to measure your Verbal
skills and another group for your Mathematical skills. Item analysis yields
a coefficient which is a measure of the extent to which one item correlates
with all the other items in the scale collectively or, if there
are no subscales, with all the other items in the questionnaire collectively.
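The coefficient described above can be sketched in a few lines. This is only an illustration, assuming the coefficient is a Pearson correlation between each item and the sum of the remaining items (often called a corrected item-total correlation); the function names and the response data are hypothetical and do not come from the interview.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(responses):
    """Correlate each item with the sum of all the OTHER items.

    responses: one row per respondent, one column per item.
    """
    n_items = len(responses[0])
    coefficients = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # scale total minus this item
        coefficients.append(pearson(item, rest))
    return coefficients


# Hypothetical 1-to-7 ratings from four respondents on a four-item scale.
# The fourth item runs against the other three.
responses = [
    [1, 2, 1, 7],
    [3, 3, 2, 5],
    [5, 4, 5, 3],
    [7, 6, 7, 1],
]
coefficients = item_rest_correlations(responses)
```

In this made-up data the first three coefficients come out positive and the fourth comes out negative, which is the kind of item the technique would flag.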
MOYER: So in other words -- and let's use the example
you used in our earlier conversation -- let's say we have a Shyness
scale. Your answer to the question "On a scale from 1 to 7,
rate the extent to which you prefer privacy to social gatherings"
is correlated with your combined answer to all the other questions in
the scale to determine the extent to which this item is associated with
the other items.
EHRENFELS: Yes. Naturally this is not done on a person-by-person
basis, but the analysis is performed once you have a number of people
who have taken the questionnaire. We MAY find that despite the fact that
this particular question seems to be like the others in the Shyness scale --
seems to measure the same thing, Shyness -- we may find that
people who tend to rate themselves higher on this particular item
end up scoring lower on the scale as a whole. So in effect, by keeping
this particular question in the scale, we deflate or underestimate the
true score on the scale. So we discard the item.
MOYER: Is there a cut-off you use? You have a coefficient for each
item in the scale, right?
EHRENFELS: Correct.
MOYER: So then how do you know when the item is good, or not good?
What does it take to keep it? Or to throw it out?
EHRENFELS: There are no hard and fast rules, except that once you
adopt a cut-off, it should be the same cut-off used for all the items.
Some professors will look at another coefficient, nicknamed alpha,
which is the measure of the internal consistency of the scale,
and by that I mean an average of the correlation of each item with all
the other items. This is the extent to which the scale correlates with
itself, the extent to which it can be said to measure the same thing.
Researchers want this number to be high because supposedly something cannot
correlate with something else to an extent greater than it can correlate
with itself, and researchers WANT to devise a scale that correlates
with other measures, and by that I mean a scale that is predictive
or diagnostic of other things. And what they do is they will recalculate
alpha for the scale WITH and WITHOUT each item, and if they notice that
alpha would be increased without an item, that item is discarded. Ultimately,
they want alpha to be at least .80. Anything below .70 is considered dubious,
and some strive for an alpha greater than .90, because such an alpha is
said to be admissible in court.
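The with-and-without recalculation described above can be sketched as follows. This assumes "alpha" refers to Cronbach's alpha computed with the usual variance formula, which the interview does not spell out; the function names and the data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).

    responses: one row (list) per respondent, one column per item.
    """
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(responses):
    """Recompute alpha once per item, each time leaving that item out."""
    k = len(responses[0])
    return [
        cronbach_alpha([row[:j] + row[j + 1:] for row in responses])
        for j in range(k)
    ]


# Hypothetical 1-to-7 ratings; the fourth item runs against the other three.
responses = [
    [1, 2, 1, 7],
    [3, 3, 2, 5],
    [5, 4, 5, 3],
    [7, 6, 7, 1],
]
alpha_full = cronbach_alpha(responses)
alphas_without = alpha_if_item_deleted(responses)
```

With this made-up data, alpha for the full scale is low, and it rises above .90 once the fourth item is left out, so the procedure described here would discard that item.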
MOYER: So what is the problem with item analysis?
EHRENFELS: I have seen it abused. And by that I mean that when
researchers are expected to apply it universally, it has a price.
I had a scale which I claimed measured x. Now I defined x in such a way
that the items themselves were synonymous with the definition of x. So
in my opinion, these items measure x as I define it regardless of what
the alpha coefficient or individual item analyses tell me. Now imagine
that one of the items significantly reduces the alpha.
MOYER: Above or below .70?
EHRENFELS: Doesn't matter. Should I throw out the item? Well,
no one will publish any research that involves a scale with a lower alpha.
So I am advised to drop the item. But if I do, I change the meaning of
the scale. The scale no longer measures x but some variation of x. This
is all well and good except I WANT to measure x.
MOYER: But according to alpha, you are not measuring x, right?
EHRENFELS: Not true. I will contend that x may not be as TIGHT
or internally consistent a construct as those we are used to dealing with,
but x is still x. You see, in our field, we are used to scales with .80
and even .90 coefficients. Have you ever SEEN one of these scales
that meets these criteria? The questions all look alike. It is a foregone
conclusion that people will respond to them similarly because they are
all variations of the same question. And THAT is how they are usually
created. Someone thinks of a question and then thinks how to re-word it
several ways. BORING!
MOYER: Not to mention artificial, right? This has been your complaint
against the field.
EHRENFELS: I have also complained that the field demands consensus
from its members. Well, it would also seem it demands consensus from
its subject matter.
MOYER: How do you think they would respond to your criticism?
EHRENFELS: They would tell me any scientist who does not revise
a theory to fit the data would be irresponsible.
MOYER: And how would you --
EHRENFELS: Item analysis IS NOT data. If your hypothesis is that
people who score high on scale x behave in y way or make z kind of decisions
or experience w type of dreams, you have to TEST the hypothesis before
you throw it out by changing or discarding scale x. The real data is in
y, z, and w, not in x alone. Now I know that x can only correlate with
y, z, and w to the extent that it correlates with itself, but we demand a
lot in the way of self-correlation. You don't need a .80 or .90,
but this is what you are told you need in a scale if you want your research
to compete for publication. This may explain in part why our literature
is so lifeless and repetitive. It excludes too much, and what it
does include correlates with itself, so to speak. We are also too quick
in this field to throw out or revise theories to conform to the first
signs of data. Let me tell you something. If we understood our data, we
wouldn't need theories. There is a little piece of every theory that
is supposed to transcend the data, that is reserved to help us make
sense of something as fickle and variable and contradictory as DATA. So
I think we should give our theories the benefit of the doubt and stop
attempting as quickly as possible to develop theories that are duplicates
or analog maps of the data. We are not reassembling engines
here. Hell, we use GRE scores as criteria in the admission of graduate
students. You want to know how well the GRE scores predict performance
in graduate school? The Verbal scale, the most predictive scale,
accounts for only 16 percent of the variation in academic performance --
16 percent! And yet we continue to look at it.
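As a rough aid to the two figures in this exchange, the arithmetic can be written out. This assumes that "16 percent of the variation" refers to a squared Pearson correlation and that a scale "correlating with itself" refers to reliability in the classical test theory sense; neither assumption is stated in the interview.

```latex
% Variance explained is the squared correlation, so 16 percent implies:
r^2 = 0.16 \quad\Longrightarrow\quad |r| = \sqrt{0.16} = 0.40

% Classical attenuation bound: the observed correlation between scale x
% and criterion y cannot exceed the geometric mean of their reliabilities.
|r_{xy}| \le \sqrt{r_{xx}\, r_{yy}}
```

Here r_xx and r_yy stand for the reliabilities of the scale and of the criterion. Under these assumptions, a scale with a reliability of .70 and a perfectly reliable criterion could still correlate with that criterion as high as about .84, well above the .40 implied by the 16 percent figure.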
MOYER: And why do you suppose that is?
EHRENFELS: Probably because it tells us a little about what kind
of person the applicant is. The applicant may have good comprehension
and writing skills, but we all know that doesn't make the difference
between a successful and unsuccessful graduate student. Now if we invented
a conformity inventory with subscales for compliance, sycophancy,
acquiescence, obsequiousness, and cadence-and-imitation, then we
would really have something. But my point is that sometimes we want to
devise scales that measure circumstances we expect to vary or fluctuate.
I may not want to measure something stable and internally consistent.
I may want a barometer of sorts, and sometimes not even that. I
may want a scale on which most people do not score HIGH or LOW most of
the time, but when they do, it tells me something about the
state they are in. I may want a series of scales that profile a state
such that I expect most scale scores to be neither high nor low most of
the time. But I expect the topography of the profile -- the variation
among the scale scores and the scores that are RELATIVELY higher or lower --
to tell me something. But these kinds of statistics are just not
standard. No one has devised special rules or formats for them,
so they would be overlooked. The field as a whole has fallen into such
routines, and routines are in and of themselves conducive to
biases and prejudices. We preclude a range of possibilities concerning
what may be learned and how it may be presented. Both beauty and truth
suffer as well as freedom. But like I mentioned earlier, what we
do with items that do not fit into the scale is no different from what
we do with researchers who do not live in the fold.
MOYER: You mentioned that item analysis is also used in grading
exams.
EHRENFELS: Professors use item analysis to hunt for multiple choice
questions that are answered correctly with as much success by students
who score poorly on the exam as a whole as by students who score well
on the exam. Such items are said to be negatively discriminating,
which means they discriminate against good students, and
they are discarded. Now I agree that item analysis here has some limited
benefits. I would like to use item analysis to make sure I didn't
accidentally key in the wrong answer for a multiple-choice item. And I
may even double-check the item to see if it was ambivalent or ambiguous.
But if I check the item and it seems fine to me on the surface,
and if the answer is keyed in correctly, I will not discard
it regardless of the item coefficient. Poor students or not, they
are STILL students. And I will give them what they earned. And let's
face it -- sometimes good students, especially these hyper-memorizing,
over-achieving types --
MOYER: Careerists?
EHRENFELS: Perhaps. Sometimes they study in ways that are conducive
not to learning but to performing well on multiple-choice tests. Sometimes
a question comes along that discriminates against the bullshit memorization,
artificial achievement, and pseudo-understanding. I will not punish the
rest of the students for this. But the professors DO this because the
technique itself is scientific and exacting, and because it gives them
over time a collection of the best test items. Some of the professors
archive this data, thinking they are creating the perfect test or test
bank. Some of them do this with an eye to publishing their test one day,
that is, if they are not already using a test bank developed
for the textbook by the publisher or the author's graduate assistants.
Some of these items can be bad too, despite the research. But hell,
it's easy.
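The exam check described in this exchange, comparing how often strong and weak students get an item right, can be sketched as an upper-minus-lower discrimination index. This is one common way to compute such a coefficient and not necessarily the one these professors use; the function name, the 27 percent group size, and the answer data are assumptions made for illustration.

```python
def discrimination_index(answers, item, fraction=0.27):
    """Proportion correct in the top-scoring group minus the bottom-scoring group.

    answers: one row per student, 1 = correct and 0 = incorrect for each item.
    fraction: share of students placed in each group (0.27 is assumed here).
    """
    ranked = sorted(answers, key=sum, reverse=True)  # best total score first
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return p_upper - p_lower


# Hypothetical results for six students on a five-item exam.
answers = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
indices = [discrimination_index(answers, item) for item in range(5)]
```

In this made-up data the fourth item has an index of zero, since both groups answer it equally well, which is the case described in the interview. The fifth item has a negative index, meaning the lower-scoring group did better on it.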
MOYER: So some professors don't even make up their own questions.
EHRENFELS: Most don't. And why should they? They don't
design their own lectures. Those are designed by the textbook and supplemental
teacher manuals, and some of the lectures may be delivered by graduate
teaching assistants. So why not use the test bank that accompanies these
materials?
MOYER: I bet you'll tell me why.
EHRENFELS: Well, I'm getting a little off subject here,
but I would say that it is lazy and anti-intellectual. I would say it
leads to this ONE monolithic view of the field. I bet I could convince
some professors of this, but I don't think they would care.
Professors don't value their General Psychology courses because it
is General or Intro Psychology. Professors want
to teach courses in the material in which they specialize, and they want
to teach these courses to ADVANCED students, not Psych 101,
with which professors are unfamiliar, to a bunch of college freshmen,
most of whom may not even be Psychology majors. But I will deal
with this more in our interview on teaching. What I am talking about NOW
is what professors are willing to do to bring teaching, grading, and test-taking
under the rubric of science, professionalism, and research. Some of the
items they would discard are not even wildly discriminating. In other
words, they do not just throw out extremely negative coefficients, but
also coefficients which are mildly negative, and some even discard items
with mildly POSITIVE coefficients in search of that perfect exam and that
perfect bell curve. Sometimes I think they were conditioned to see beauty
in that normal bell curve shape. Now this practice in and of itself is
not that consequential. That is why it has escaped everyone's notice.
But it is symptomatic of some of their more consequential choices,
and of a consequential PATTERN of behaviors which, taken collectively,
introduces a credentialism that favors careerists and discriminates,
at more advanced levels of education, against the true scholars. They really
are creating a race of super-scientists and administrators. And what you
really end up doing is narrowing the range of skills tapped -- or narrowing
the range of tapped skills that are reported -- such that you end
up with this yardstick that measures JUST ONE THING. And if you are not
in the top x percent of this ONE skill or quality, your odds of
making it are very small. We really don't pay much attention to the
fact -- and I imagine this flaw dogs every field to a certain extent --
that there are people who aspire to be members of the field, who
are bright and creative, probably brighter and more creative than
most in the field, who never really get the chance to make it in.
They are weeded out at some point without a fair hearing. I think the
half of the public that DOES see this ACCEPTS it as the work of that chance
component that is part of life. I am here to put quite a different face
on that chance element, to tell you that it really ISN'T chance
at all but the work of something very systematic you may not see.