Metrics for Admission to Graduate School – GREs and all that

This is an old and perhaps contentious topic. But there has been buzz around this recent article in Nature: A test that fails (Naturejobs). There are already several blog articles about it, but I will give a pointer to just one: The GRE: A test that fails, which has some interesting analysis. The Nature article is centered around the figure below, which shows how applicant GRE-math scores are distributed.

GRE-math scores of PhD applicants, stratified by ethnicity

The main point of the article is, and I quote:

“De-emphasizing the GRE and augmenting admissions procedures with measures of other attributes — such as drive, diligence and the willingness to take scientific risks — would not only make graduate admissions more predictive of the ability to do well but would also increase diversity in STEM.”

I generally agree, and so does The GRE: A test that fails, saying:

“We at Harvard have downgraded the importance of GRE test scores in our admissions process and the quality and diversity of our admitted students has increased as a result.”

However, beyond my superficial agreement with the high-level claim, the Nature article is singularly disappointing because there is no hard evidence given, and the claims are strong. For an article in a flagship journal on a contentious topic like how race plays a role in access to education, I was expecting data to be unambiguous and precise in order to lend support for strong, possibly controversial claims.

The figure presents data on the preparation of students. Applicants of different ethnicities have different preparation; that is not a shocker to me, though one must carefully consider the (self) selection biases: today we actively encourage applications from under-represented groups, whereas Asians and Whites are largely left to themselves when deciding to go for a PhD. See my article on “We’re growing taller” for an example of the strange things that can happen when there are biases in the data.

From the figure, we infer that students applying in different disciplines have different preparation – no surprise that a chef who doesn’t need a particular knife will not waste time sharpening it. As to the male-female discrepancy, The GRE: A test that fails offers the well-known stereotype-threat stress as an explanation. It could be true, but I don’t really want to attempt explanations. Rather, let’s see what the data can explain. In particular, is there evidence for the following strong claim:

“In simple terms, the GRE is a better indicator of sex and skin colour than of ability and ultimate success.”

Let me be a very conceited devil and reason as follows (beyond conceit, there is no hidden trick in the reasoning).

Magdon’s Thought Experiment. Suppose, heaven forgive the outrageous thought, that we professors are actually doing a good job at admissions. For example, I look at far more than just the GRE score. I will admit that the GRE-math is an important indicator, and a low score there means you had better impress me somewhere else. I don’t care about the GRE-verbal too much. What else do I like to see? Published articles in my area are a big deal (there are some at my institute who say that published papers are not an indication of PhD potential, and they are entitled to their opinion if they will also advise the student for me); types of courses the student seems to excel at (does the student prefer math or history); very specific recommendation letters; whether the applicant has anything intelligent to say about my papers – which means they are able to read and understand papers. But, getting back to the main thought, we’re assuming the uncanny, namely that we professors are actually trying to admit students who will succeed at a PhD – what a bizarre thought (did I already say this?). And so, we get the admit pool, and we are now going to analyze this admit pool to see if there is any predictive value in the GRE-math score, as to whether the student will succeed. Why are we analyzing this admit pool? Because this is all we can do. Where else are we going to get data on who successfully completes a PhD and who does not? (I call this the convenient data trap – beware of data bearing gifts.)

When we analyze this admit pool we will find that GRE-math has nothing to do with whether or not the student completes the PhD, and we will arrive at the stunning conclusion claimed in the Nature article. It is precisely because we professors did such a great job in admissions that GRE-math has no predictive value. And why is that? It is because we compensated. If the student was bad in GRE, they must have been good somewhere else. We admitted students precisely because we thought that, given the information available, they would do a good PhD. Those who failed must have done so for some unforeseen reason (maybe they got married, had a kid and needed more money than the graduate stipend could provide them; and that has very little to do with their GRE-math). The process of admitting students creates a very biased sample. When you reason from a biased sample you had better be careful. In this case, a more precise statement we could make is:

Conditioned on being admitted to the PhD, a student with high GRE may be as likely to complete as one with low GRE.

Now, this is a very weak statement, and it is all because of those two important words “Conditioned on”, two words which send a bright warning signal: BEWARE of sampling bias.
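To make the “conditioned on” effect concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration, not data from the Nature article: the GRE score and the “everything else” in an application are independent standard normal numbers, the committee admits the top 10% by their sum (so a low GRE must be compensated elsewhere), and completion depends on that same sum plus unforeseen luck. The names `simulate` and `pearson` and all the numeric settings are hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=50_000, admit_fraction=0.10, seed=0):
    """Return (corr in whole population, corr among admitted students).

    All distributions and thresholds are illustrative assumptions.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        gre = rng.gauss(0, 1)      # assumed standardized GRE-math score
        other = rng.gauss(0, 1)    # assumed papers, letters, courses, ...
        luck = rng.gauss(0, 1.5)   # unforeseen life events
        composite = gre + other    # what the committee looks at
        # Hypothetical outcome if this person were enrolled regardless.
        completes = 1.0 if composite + luck > 2.5 else 0.0
        people.append((gre, composite, completes))

    # Admit the top fraction by composite: a low GRE needs a high "other".
    cutoff = sorted(p[1] for p in people)[int(n * (1 - admit_fraction))]
    admitted = [p for p in people if p[1] >= cutoff]

    corr_all = pearson([p[0] for p in people], [p[2] for p in people])
    corr_admitted = pearson([p[0] for p in admitted], [p[2] for p in admitted])
    return corr_all, corr_admitted


corr_all, corr_admitted = simulate()
print(f"GRE vs completion, everyone enrolled: {corr_all:.2f}")
print(f"GRE vs completion, admitted only:     {corr_admitted:.2f}")
```

Under these assumptions the correlation among admitted students should come out smaller than in the whole population, even though GRE feeds into completion for everyone by construction. The sketch shows attenuation, not a complete disappearance; how close it gets to zero depends on how selective admission is and how much luck is assumed.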

The strong claims in the Nature article need data from a very different experiment:

  1. Select a random set of 21-year-olds and administer the GRE test to them.
  2. Admit them to your PhD program no matter what.
  3. Treat them all equally (how the heck could we do that).
  4. See which ones get a PhD and demonstrate lack of correlation with GRE-math.

Clearly this experiment is never going to be done, and if it were, the results would likely prove the article wrong. We will never have the data to substantiate strong conclusions like those claimed in the article. And anyway, the authors did not analyze the outgoing statistics of students to support their claims (which would still have to be qualified by the all-powerful “conditioned on”). They looked at applicant scores.

  1. So there is no data on successful completion of PhD versus GRE. And, even if there were data, we would still have to deal with the sampling bias issue.
  2. What about “quality of PhD” versus GRE – notwithstanding the difficulty of quantifying the quality of a PhD?
  3. What about controlling for (say) ethnicity and then asking how quality of PhD or success probability varies with GRE?

The authors try to address my point 1 with statistics on their bridge-PhD program, which has an emphasis on under-represented groups and an interview process for determining admission. They report 81% PhD-completion versus 50% for the national average. There is nothing wrong with what they did and it is great that they did it, hats off to them. But don’t be too wowed by this outcome either: the authors have a point to prove, and they did an experiment that proves their point: they admitted a group of students and likely gave those students all the nurturing and care needed to push through to a PhD, and they reaped an 81% completion rate versus a national average of 50%. But how scalable is that, and what is the cost? A professor who invests that many resources in a single student will necessarily train fewer students; and anyway, a PhD is supposed to represent largely independent creative work. To my eyes, the point being made is that if you nurture students more, they are more likely to get a PhD; and that is not a bad thing, it is just not scalable.

Now for my opinion. Yes, we can improve admission processes; yes, we can improve the GRE; yes, the GRE-math is not relevant to every discipline, so only use it if math is relevant to your research; and so on. But the system is not that far off center. However, at the same time, I am worried. I sense a growing trend in the US to discredit any tool which highlights significant racial differences. Rather than accept the fact that there may be significant disparity in mathematical preparation when viewed by race, we try to find a problem with the GRE-tool and argue that it is useless in predicting PhD potential (should we also argue to stop using level of education to determine who to hire in companies because (my guess) there is significant racial disparity in levels of education?). No, instead we should just fix the real issue with the preparation differences. And these arguments hit nerves: inequity is a hot topic, and articles that promote “equity”, even if they use flawed data, are fashionable and get popular press. The hypothesis, to be extreme, is that mathematical preparation as measured by the GRE is no good at screening PhD applicants in math. Nonsense. The tool is indeed not perfect, but it says something. Preparation cannot be divorced from PhD-success. And, surely the GRE measures some dimension of that preparation. Now we must use the tool carefully, because we want to predict success. We want a measure of potential, not preparation.

As the Nature article suggests, we may choose to diminish reliance on a student’s GRE when other proven markers of achievement potential are available, such as evidence of grit and diligence (and I would add motivation and creativity). But even with grit, diligence, motivation and creativity, you still need a knife to cut the meat; and in most “mathematical” disciplines, that knife is, big surprise, math. An incoming student who has no expertise in mathematical induction will have difficulty with proofs; there is no getting around that. Further, absent direct evidence of these other markers for potential, the GRE does provide some indirect evidence. The GRE is a test, and you have to study for it. To some extent, you will do well if you are good at taking the test, but also if you spent significant time preparing for the test. That, to me, smells of grit, diligence and motivation.

The real problem I have with this article is the potential 🙂 to encourage a double standard. For White and Asian students, it’s OK to use the GRE, but for African American and Hispanic and other underrepresented groups, you should delve deeper. That is not what the article recommends, but in practice that is what will happen. And the (mis)-justification of such double standards will be attributed to articles like this which have presented wild conclusions without proper data to back them up – I am very surprised that the article lived up to Nature’s review standards. Ultimately, advisors and counselors will adapt, and begin to tell certain groups of students that it is OK to not do so well in SATs or GREs, allowances will be made. No, it’s not okay! Counselors and college advisors are good at identifying grit, motivation and diligence. Let them find those students and encourage, no, insist that they study for the GRE. Is that such a bad thing? Give them books for preparation and tutoring if necessary. Give them opportunities to excel, but let’s not diminish the excellence of other students who have high scores by insinuating things like “because you are Asian or Hispanic you are predisposed to do better in the GRE-math because of biases in the test”. No! It’s because you were able, worked hard, were motivated and had grit that you did well on the GRE-math; and anyone with those qualities can and should do well, and it does not matter what race you come from.

We must strive for equal opportunity; proportionate representation will follow. But beware of proportionate representation at the expense of equal opportunity.