Metrics for Admission to Graduate School – GREs and all that

This is an old and perhaps contentious topic. But there has been buzz around this recent article in Nature: A test that fails : Naturejobs. And, there are already several blog articles, but I will give a pointer to just one: The GRE: A test that fails, which has some interesting analysis. The Nature article is centered around the figure below which shows how applicant GRE-math scores are distributed.

GRE-math scores of PhD applicants, stratified by ethnicity

The main point of the article is, and I quote:

“De-emphasizing the GRE and augmenting admissions procedures with measures of other attributes — such as drive, diligence and the willingness to take scientific risks — would not only make graduate admissions more predictive of the ability to do well but would also increase diversity in STEM.”

I generally agree, and so does The GRE: A test that fails, saying:

“We at Harvard have downgraded the importance of GRE test scores in our admissions process and the quality and diversity of our admitted students has increased as a result.”

However, beyond my superficial agreement with the high-level claim, the Nature article is singularly disappointing because there is no hard evidence given, and the claims are strong. For an article in a flagship journal on a contentious topic like how race plays a role in access to education, I was expecting data to be unambiguous and precise in order to lend support for strong, possibly controversial claims.

The figure presents data on the preparation of students. Applicants of different ethnicities have different preparation; that is not a shocker to me, though one must carefully consider the (self) selection biases: today we actively encourage applications from under-represented groups whereas Asians and Whites are largely left to themselves when deciding to go for a PhD. See my article on “We’re growing taller” for an example of the strange things that can happen when there are biases in the data.

From the figure, we infer that students applying in different disciplines have different preparation – no surprise that a chef who doesn’t need a particular knife will not waste time sharpening it. As to the male-female discrepancy, The GRE: A test that fails offers the well known stereotype-threat stress as an explanation. It could be true, but I don’t really want to attempt explanations. Rather, let’s see what the data can explain. In particular, is there evidence for the following strong claim:

“In simple terms, the GRE is a better indicator of sex and skin colour than of ability and ultimate success.”

Let me be a very conceited devil and reason as follows (beyond conceit, there is no hidden trick in the reasoning).

Magdon’s Thought Experiment. Suppose, heaven forgive the outrageous thought, that we professors are actually doing a good job at admissions. For example, I look at far more than just the GRE score. I will admit that the GRE-math is an important indicator, and a low score there means you had better impress me somewhere else. I don’t care about the GRE-verbal too much. What else do I like to see? Published articles in my area are a big deal (there are some at my institute who say that published papers are not an indication of PhD potential, and they are entitled to their opinion if they will also advise the student for me); types of courses the student seems to excel at (does the student prefer math or history?); very specific recommendation letters; whether the applicant has anything intelligent to say about my papers – which means they are able to read and understand papers. But, getting back to the main thought, we’re assuming the uncanny, namely that we professors are actually trying to admit students who will succeed at a PhD – what a bizarre thought (did I already say this?). And so, we get the admit pool, and we are now going to analyze this admit pool to see if there is any predictive value in the GRE-math score, as to whether the student will succeed. Why are we analyzing this admit pool? Because this is all we can do. Where else are we going to get data on who successfully completes a PhD and who does not? (I call this the convenient data trap – beware of data bearing gifts.)

When we analyze this admit pool we will find that GRE-math has nothing to do with whether or not the student completes the PhD and we will arrive at the stunning conclusion as claimed in the Nature article. It is precisely because as professors we did such a great job in admissions that GRE-math has no predictive value. And why is that? It is because we compensated. If the student was bad in GRE, they must have been good somewhere else. We admitted students precisely because we thought that, given the information available, they would do a good PhD. Those who failed must have done so for some unforeseen reason (maybe they got married, had a kid and needed more money than the graduate stipend could provide them; and that has very little to do with their GRE-math). The process of admitting students creates a very biased sample. When you reason from a biased sample you had better be careful. In this case, a more precise statement we could make is:

Conditioned on being admitted to the PhD, a student with high GRE may be as likely to complete as one with low GRE.

Now, this is a very weak statement, and it is all because of those two important words “Conditioned on”, two words which send a bright warning signal: BEWARE of sampling bias.
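The thought experiment can be played out in a toy simulation. What follows is only a sketch under made-up assumptions, none of which come from the Nature article: GRE-math and "everything else" are independent standard normal scores, potential is their sum, completion depends on potential plus unforeseen "luck", and the committee admits anyone whose potential exceeds a hypothetical cutoff of 1.5. The names `simulate` and `completion_gap` are invented for this illustration. The "all applicants" number is a counterfactual in which everybody is admitted.

```python
import random
from statistics import median

def completion_gap(records):
    """Completion rate of above-median-GRE students minus that of the rest."""
    mid = median(gre for gre, _ in records)
    high = [done for gre, done in records if gre > mid]
    low = [done for gre, done in records if gre <= mid]
    return sum(high) / len(high) - sum(low) / len(low)

def simulate(n=100_000, cutoff=1.5, seed=0):
    """Toy model: all distributions and the cutoff are illustrative assumptions."""
    rng = random.Random(seed)
    everyone, admitted = [], []
    for _ in range(n):
        gre = rng.gauss(0, 1)      # standardized GRE-math score
        other = rng.gauss(0, 1)    # papers, letters, courses, ...
        potential = gre + other    # what the committee tries to judge
        luck = rng.gauss(0, 1)     # unforeseen life events
        completes = potential + luck > 0
        everyone.append((gre, completes))
        if potential > cutoff:     # a low GRE must be offset by other evidence
            admitted.append((gre, completes))
    return completion_gap(everyone), completion_gap(admitted)

gap_everyone, gap_admitted = simulate()
print(f"completion gap, all applicants: {gap_everyone:.3f}")
print(f"completion gap, admits only:    {gap_admitted:.3f}")
```

In this toy world the GRE matters a great deal across all applicants, yet the gap between high-GRE and low-GRE students nearly vanishes once you look only at the admits. The GRE did not stop mattering; the admission step already used the information it carries.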

The strong claims in the Nature article need data from a very different experiment:

  1. Select a random set of 21-year-olds and administer the GRE to them.
  2. Admit them to your PhD program no matter what.
  3. Treat them all equally (how the heck could we do that?).
  4. See which ones get a PhD and demonstrate lack of correlation with GRE-math.

Clearly this experiment is never going to be done; if it were, I suspect the results would prove the article wrong. We will never have the data to substantiate strong conclusions as claimed in the article. And anyway, the article did not give an analysis of the outgoing statistics of students to support its claims (which would still have to be qualified by the all powerful “conditioned on”). The authors looked at applicant scores.

  1. So there is no data on successful completion of PhD versus GRE. And, even if there were data, we would still have to deal with the sampling bias issue.
  2. What about “quality of PhD” versus GRE – notwithstanding the difficulty of quantifying the quality of a PhD?
  3. What about controlling for (say) ethnicity and then asking how quality of PhD or success probability varies with GRE?

The authors try to address my point 1 with statistics on their bridge-PhD program, which has an emphasis on under-represented groups and an interview process for determining admission. They report 81% PhD-completion versus 50% for the national average. There is nothing wrong with what they did and it is great that they did it, hats off to them. But don’t be too wowed by this outcome either: the authors have a point to prove, and they did an experiment that proves their point. They admitted a group of students, likely gave those students all the nurturing and care needed to push through to a PhD, and reaped the result: an 81% completion rate versus a national average of 50%. But how scalable is that, and what is the cost? A professor who invests that much resource into a single student will necessarily train fewer students; and anyway, a PhD is supposed to represent largely independent creative work. To my eyes, the point being made is that if you nurture students more, they are more likely to get a PhD; and that is not a bad thing, it is just not scalable.

Now for my opinion. Yes, we can improve admission processes; yes, we can improve the GRE; yes, the GRE-math is not relevant to every discipline, so only use it if math is relevant to your research; and so on. But the system is not that far off center. However, at the same time, I am worried. I sense a growing trend in the US to discredit any tool which highlights significant racial differences. Rather than accept the fact that there may be significant disparity in mathematical preparation when viewed by race, we try to find a problem with the GRE-tool and argue that it is useless in predicting PhD potential (should we also argue to stop using level of education to determine whom to hire in companies because (my guess) there is significant racial disparity in levels of education?). No, instead we should just fix the real issue with the preparation differences. And, these arguments hit at nerves: inequity is a hot topic, and articles that promote “equity”, even if they use flawed data, are fashionable and get popular press. The hypothesis, to be extreme, is that mathematical preparation as measured by the GRE is no good at screening PhD applicants in math. Nonsense. The tool is indeed not perfect, but it says something. Preparation cannot be divorced from PhD-success. And, surely the GRE measures some dimension of that preparation. Now we must use the tool carefully, because we want to predict success. We want a measure of potential, not preparation.

As the Nature article suggests, we may choose to diminish reliance on a student’s GRE when other proven markers of achievement potential are available, such as evidence of grit and diligence (and I would add motivation and creativity). But even with grit, diligence, motivation and creativity, you still need a knife to cut the meat; and in most “mathematical” disciplines, that knife is, big surprise, math. An incoming student who has no expertise in mathematical induction will have difficulty with proofs; there is no getting around that. Further, absent direct evidence of these other markers for potential, the GRE does provide some indirect evidence. The GRE is a test, and you have to study for it. To some extent, you will do well if you are good at taking the test but also if you spent significant time preparing for the test. That, to me, smells of grit, diligence and motivation.

The real problem I have with this article is the potential 🙂 to encourage a double standard. For White and Asian students, it’s OK to use the GRE but for African American and Hispanic and other underrepresented groups, you should delve deeper. That is not what the article recommends but in practice that is what will happen. And the (mis)-justification of such double standards will be attributed to articles like this which have presented wild conclusions without proper data to back them up – I am very surprised that the article lived up to Nature’s review standards. Ultimately, advisors and counselors will adapt, and begin to tell certain groups of students that it is OK to not do so well in SATs or GREs, allowances will be made. No, it’s not okay! Counselors and college advisors are good at identifying grit, motivation and diligence. Let them find those students and encourage, no, insist that they study for the GRE. Is that such a bad thing? Give them books for preparation and tutoring if necessary. Give them opportunities to excel, but let’s not diminish the excellence of other students who have high scores by insinuating things like “because you are Asian or Hispanic you are predisposed to do better in the GRE-math because of biases in the test”. No! It’s because you were able, worked hard, were motivated and had grit that you did well on the GRE-math; and anyone with those qualities can and should do well, and it does not matter what race you come from.

We must strive for equal opportunity, proportionate representation will follow; but, beware of proportionate representation at the expense of equal opportunity.


We’re growing taller

The post title is based on this Yahoo article:

Taller, Fatter, Older: How Humans Have Changed in 100 Years – Yahoo News.

which gets its data from an article that surveyed British army recruits 100 years ago versus now:

In the study of British recruits, the average height of British men, who had an average age of 20, was about 5 feet 6 inches (168 centimeters) at the turn of the century, whereas now they stand on average at about 5 feet 10 inches (178 cm). The increase can be attributed, most likely, to improved nutrition, health services and hygiene, said the researchers from the University of Essex in Colchester.

Well, the question is whether there is anything that needs attributing here, or whether it is just a misinterpretation of the data. The smell of a rat begins to emerge once you realize that recruits into the British army are not your average Joe. They tend to be young, strong, athletic and perhaps tall men.

Let’s start with some data. The male population of England and Wales in 1911 was about 17,000,000 and currently it is about 29,000,000. The British armed forces, meanwhile, have shrunk from about 500,000 in 1900 to about 180,000 now.

Now for some simplifying assumptions just to illustrate what can happen when you analyze this data. Suppose heights have the usual bell shaped distribution with an average height of 162cm and a standard deviation (variability) of 10cm. So roughly speaking heights are spread from 120cm to 200cm (a decent approximation). Also, clearly, it’s not the small and weak who will try out and get recruited into the British armed forces, heavens no. Suppose it’s tall men that are likely to be in the armed forces.

Perform a little experiment. Generate 17,000,000 heights in a bell-shaped distribution with mean 162cm and std=10cm – this represents the population of men in Britain around the 1900s. Start with the tallest man and add him to the army with some probability, say 5%. Keep going down the list from tallest to shortest until you fill the recruitment quota of 500,000. This little experiment simulates building the British Army in 1900. By going from tallest to shortest you are making it more likely that taller men are getting into the army. It’s a simple experiment, do it. Now compute the average height of the men in the Army. You will get approximately 168cm. Wow! So the men in the army in 1900 are about 6cm taller than the average man. (That is called sampling bias.)

Let’s perform the same experiment today, but without changing the population statistics. So generate 29,000,000 heights in a bell-shaped distribution with mean 162cm and std=10cm. Again, start with the tallest and pick him with the same probability, 5%, and keep going until you have today’s army of 180,000. What is the average height of today’s army? About 178cm. WOW!
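Here is one way the two experiments could be coded. It is a sketch, not a definitive implementation: to keep it quick in plain Python, both the population and the army are scaled down by a factor of 100 (an assumption of mine that preserves the ratios), and the function name `army_mean_height` and the seed are made up for the illustration.

```python
import random

def army_mean_height(population, quota, accept_prob=0.05,
                     mean_cm=162.0, sd_cm=10.0, seed=0):
    """Walk from tallest to shortest, recruiting each man with accept_prob."""
    rng = random.Random(seed)
    heights = sorted((rng.gauss(mean_cm, sd_cm) for _ in range(population)),
                     reverse=True)
    recruits = []
    for height in heights:
        if len(recruits) == quota:   # quota filled, stop recruiting
            break
        if rng.random() < accept_prob:
            recruits.append(height)
    return sum(recruits) / len(recruits)

SCALE = 100  # shrink both population and army; the ratios stay the same
avg_1900 = army_mean_height(17_000_000 // SCALE, 500_000 // SCALE)
avg_now = army_mean_height(29_000_000 // SCALE, 180_000 // SCALE)
print(f"army of 1900:  {avg_1900:.1f} cm")
print(f"army of today: {avg_now:.1f} cm")
```

Both calls draw from the same 162cm population and use the same 5% acceptance rule; only the population size and the quota differ. Under these assumptions the two averages should land close to the 168cm and 178cm quoted above.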

Stop. Go back to the quote from the article and see what you think of it now. Is this some voodoo statistical fluke? No. If you repeat the experiment, you will get the same result again and again. Is it magic? No. It is the power of sampling bias. We didn’t change the population – same average height. We didn’t even change the way the army is ‘sampled’. It is clearly simplistic, but that is not the point. It is just one way to obtain a sampling of taller men for the army, and it is a reasonable attempt. Of course, this is not exactly what is going on because we made some modeling assumptions, but you get the idea. The golden nugget here is:

Be careful when you reason about a population from a biased sample.

Recruits into the British army are, likely, a sample biased toward the taller men. Very strange things can happen when you take averages from a biased sample. The small bias in our simple experiment above explains the entire 10cm of the height difference without needing to resort to genetics, health, hygiene or wealth. Perhaps there is some truth to health and hygiene leading to bigger, better, taller people, but to get to this conclusion one must reason from a random unbiased sample. Unfortunately, the reality is that most such social experiments are done with conveniently available data (such as British recruits whose bio-data happen to be collected), not theoretically sound data. And, often conclusions are made without paying attention to the biases in the data.
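The numbers from the simple experiment can also be checked without generating a single random height. Walking down the sorted list with a 5% acceptance rate until the quota is filled amounts, approximately, to sampling uniformly from the tallest quota/0.05 men, and the mean of the top fraction q of a bell curve is the standard truncated-normal mean: mean + std × φ(z)/q, where z is the cut point that leaves a fraction q above it. The sketch below uses that approximation; the function name `mean_of_tallest` is invented for the illustration.

```python
from statistics import NormalDist

def mean_of_tallest(fraction, mean_cm=162.0, sd_cm=10.0):
    """Average height of the tallest `fraction` of a normal population."""
    std_normal = NormalDist()                 # mean 0, std 1
    z = std_normal.inv_cdf(1.0 - fraction)    # cut point leaving `fraction` above
    return mean_cm + sd_cm * std_normal.pdf(z) / fraction

ACCEPT_PROB = 0.05
frac_1900 = (500_000 / ACCEPT_PROB) / 17_000_000   # about the top 59% of men
frac_now = (180_000 / ACCEPT_PROB) / 29_000_000    # about the top 12% of men
mean_1900 = mean_of_tallest(frac_1900)
mean_now = mean_of_tallest(frac_now)
print(f"1900: top {frac_1900:.1%} of men, mean {mean_1900:.1f} cm")
print(f"now:  top {frac_now:.1%} of men, mean {mean_now:.1f} cm")
```

Seen this way, the whole effect comes from how deep into the population the army has to reach: a large army from a small population dips below the median height, while a small army from a large population never leaves the tall tail.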

Effective learning from data should adhere to 3 basic principles (see Chapter 5 of Learning From Data), the second of which is, and I quote:

“If the data was sampled in a biased way, learning will produce a similarly biased outcome”

In short, if there are biases in the way the data is collected, then anything can happen. BEWARE.