Metrics for Admission to Graduate School – GREs and all that

This is an old and perhaps contentious topic. But there has been buzz around this recent article in Nature: A test that fails (Naturejobs). There are already several blog articles about it, but I will give a pointer to just one: The GRE: A test that fails, which has some interesting analysis. The Nature article is centered around the figure below, which shows how applicant GRE-math scores are distributed.

GRE-math scores of PhD applicants, stratified by ethnicity

The main point of the article is, and I quote:

“De-emphasizing the GRE and augmenting admissions procedures with measures of other attributes — such as drive, diligence and the willingness to take scientific risks — would not only make graduate admissions more predictive of the ability to do well but would also increase diversity in STEM.”

I generally agree, and so does The GRE: A test that fails, saying:

“We at Harvard have downgraded the importance of GRE test scores in our admissions process and the quality and diversity of our admitted students has increased as a result.”

However, beyond my superficial agreement with the high-level claim, the Nature article is singularly disappointing because there is no hard evidence given, and the claims are strong. For an article in a flagship journal on a contentious topic like how race plays a role in access to education, I was expecting data to be unambiguous and precise in order to lend support for strong, possibly controversial claims.

The figure presents data on the preparation of students. Students applying from different ethnicities have different preparations; that is not a shocker to me, though one must carefully consider the (self) selection biases: today we actively encourage applications from under-represented groups whereas Asians and Whites are largely left to themselves when deciding to go for a PhD. See my article on “We’re growing taller” for an example of the strange things that can happen when there are biases in the data.

From the figure, we infer that students applying in different disciplines have different preparation – no surprise that a chef who doesn’t need a particular knife will not waste time sharpening it. As to the male-female discrepancy, The GRE: A test that fails offers the well-known stereotype-threat stress as an explanation. It could be true, but I don’t really want to attempt explanations. Rather, let’s see what the data can explain. In particular, is there evidence for the following strong claim:

“In simple terms, the GRE is a better indicator of sex and skin colour than of ability and ultimate success.”

Let me be a very conceited devil and reason as follows (beyond conceit, there is no hidden trick in the reasoning).

Magdon’s Thought Experiment. Suppose, heaven forgive the outrageous thought, that we professors are actually doing a good job at admissions. For example, I look at far more than just the GRE score. I will admit that the GRE-math is an important indicator, and a low score there means you had better impress me somewhere else. I don’t care about the GRE-verbal too much. What else do I like to see? Published articles in my area are a big deal (there are some at my institute who say that published papers are not an indication of PhD potential, and they are entitled to their opinion if they will also advise the student for me); types of courses the student seems to excel at (does the student prefer math or history); very specific recommendation letters; has the applicant anything intelligent to say about my papers – which means they are able to read and understand papers. But, getting back to the main thought, we’re assuming the uncanny, namely that we professors are actually trying to admit students who will succeed at a PhD – what a bizarre thought (did I already say this?). And so, we get the admit pool, and we are now going to analyze this admit pool to see if there is any predictive value in the GRE-math score, as to whether the student will succeed. Why are we analyzing this admit pool? Because this is all we can do. Where else are we going to get data on who successfully completes a PhD and who does not? (I call this the convenient data trap – beware of data bearing gifts.)

When we analyze this admit pool we will find that GRE-math has nothing to do with whether or not the student completes the PhD and we will arrive at the stunning conclusion as claimed in the Nature article. It is precisely because as professors we did such a great job in admissions that GRE-math has no predictive value. And why is that? It is because we compensated. If the student was bad in GRE, they must have been good somewhere else. We admitted students precisely because we thought that given the information available, they would do a good PhD. Those who failed must have done so for some unforeseen reason (maybe they got married, had a kid and needed more money than the graduate stipend could provide them; and that has very little to do with their GRE-math). The process of admitting students creates a very biased sample. When you reason from a biased sample you had better be careful. In this case, a more precise statement we could make is:

Conditioned on being admitted to the PhD, a student with high GRE may be as likely to complete as one with low GRE.

Now, this is a very weak statement, and it is all because of those two important words “Conditioned on”, two words which send a bright warning signal: BEWARE of sampling bias.
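To see what “conditioned on” does in numbers, here is a minimal simulation sketch of the thought experiment. Everything in it is an illustrative assumption, not real admissions data: each student gets a GRE-math score, an "other" score standing in for papers, letters and the rest, and a luck term for the unforeseen; completion depends on all three, and admission compensates by looking at GRE plus other. The cutoff of 1.5 and the function name `completion_gap` are hypothetical choices.

```python
import random

def completion_gap(students):
    """Completion rate of the higher-GRE half minus that of the lower-GRE half."""
    ranked = sorted(students, key=lambda s: s[0])
    half = len(ranked) // 2
    low, high = ranked[:half], ranked[half:]

    def rate(group):
        return sum(done for _, done in group) / len(group)

    return rate(high) - rate(low)

rng = random.Random(0)  # fixed seed so the run is repeatable
everyone, admitted = [], []
for _ in range(100_000):
    gre = rng.gauss(0, 1)    # assumed: standardized GRE-math score
    other = rng.gauss(0, 1)  # assumed: everything else the committee sees
    luck = rng.gauss(0, 1)   # assumed: unforeseen events after admission
    completes = (gre + other + luck) > 0
    record = (gre, completes)
    everyone.append(record)
    # Compensating admission rule: a low GRE must be offset elsewhere.
    if gre + other > 1.5:
        admitted.append(record)

gap_all = completion_gap(everyone)
gap_admitted = completion_gap(admitted)
print(f"gap in completion rate, whole population: {gap_all:.3f}")
print(f"gap in completion rate, admit pool only:  {gap_admitted:.3f}")
```

In this toy model GRE-math matters by construction, yet the gap between high-GRE and low-GRE students largely disappears once you look only at the admit pool, which is the only place real completion data exists.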

The strong claims in the Nature article need data from a very different experiment:

  1. Select a random set of 21-year-olds and administer the GRE to them.
  2. Admit them to your PhD program no matter what.
  3. Treat them all equally (how the heck could we do that?).
  4. See which ones get a PhD and demonstrate lack of correlation with GRE-math.

Clearly this experiment is never going to be done, and if it were, the results would likely prove the article wrong. We will never have the data to substantiate strong conclusions as claimed in the article. And anyway, the article did not give an analysis of the outgoing statistics of students to make their claims (which would still have to be qualified by the all-powerful “conditioned on”). They looked at applicant scores.

  1. So there is no data on successful completion of PhD versus GRE. And, even if there were data, we would still have to deal with the sampling bias issue.
  2. What about “quality of PhD” versus GRE – notwithstanding the difficulty of quantifying the quality of a PhD?
  3. What about controlling for (say) ethnicity and then asking how quality of PhD or success probability varies with GRE?

The authors try to address my point 1 with statistics on their bridge-PhD program which has an emphasis on under-represented groups and an interview process for determining admission. They report 81% PhD-completion versus 50% for the national average. There is nothing wrong with what they did and it is great that they did it, hats off to them. But don’t be too wowed by this outcome either: the authors have a point to prove, and they did an experiment that proves their point. They admitted a group of students, likely gave those students all the nurturing and care needed to push through to a PhD, and reaped an 81% completion rate versus a national average of 50%. But how scalable is that, and what is the cost? A professor who invests that much in a single student will necessarily train fewer students; and anyway, a PhD is supposed to represent largely independent creative work. To my eyes, the point being made is that if you nurture students more, they are more likely to get a PhD; and that is not a bad thing, it is just not scalable.

Now for my opinion. Yes, we can improve admission processes; yes, we can improve the GRE; yes, the GRE-math is not relevant to every discipline, so only use it if math is relevant to your research; and so on. But the system is not that far off center. However, at the same time, I am worried. I sense a growing trend in the US to discredit any tool which highlights significant racial differences. Rather than accept the fact that there may be significant disparity in mathematical preparation when viewed by race, we try to find a problem with the GRE-tool and argue that it is useless in predicting PhD potential (should we also argue to stop using level of education to determine who to hire in companies because (my guess) there is significant racial disparity in levels of education?). No, instead we should just fix the real issue with the preparation differences. And, these arguments hit at nerves: inequity is a hot topic, and articles that promote “equity” even if they use flawed data are fashionable, and get popular press. The hypothesis, to be extreme, is that mathematical preparation as measured by the GRE is no good at screening PhD applicants in math. Nonsense. The tool is indeed not perfect, but it says something. Preparation cannot be divorced from PhD-success. And, surely the GRE measures some dimension of that preparation. Now we must use the tool carefully, because we want to predict success. We want a measure of potential, not preparation.

As the Nature article suggests, we may choose to diminish reliance on a student’s GRE when other proven markers of achievement potential are available, such as evidence of grit and diligence (and I would add motivation and creativity). But even with grit, diligence, motivation and creativity, you still need a knife to cut the meat; and in most “mathematical” disciplines, that knife is, big surprise, math. An incoming student who has no expertise in mathematical induction will have difficulty with proofs; there is no getting around that. Further, absent direct evidence of these other markers for potential, the GRE does provide some indirect evidence. The GRE is a test, and you have to study for it. To some extent, you will do well if you are good at taking the test but also if you spent significant time preparing for the test. That, to me, smells of grit, diligence and motivation.

The real problem I have with this article is the potential 🙂 to encourage a double standard. For White and Asian students, it’s OK to use the GRE but for African American and Hispanic and other underrepresented groups, you should delve deeper. That is not what the article recommends but in practice that is what will happen. And the (mis)-justification of such double standards will be attributed to articles like this which have presented wild conclusions without proper data to back them up – I am very surprised that the article lived up to Nature’s review standards. Ultimately, advisors and counselors will adapt, and begin to tell certain groups of students that it is OK to not do so well in SATs or GREs, allowances will be made. No, it’s not okay! Counselors and college advisors are good at identifying grit, motivation and diligence. Let them find those students and encourage, no, insist that they study for the GRE. Is that such a bad thing? Give them books for preparation and tutoring if necessary. Give them opportunities to excel, but let’s not diminish the excellence of other students who have high scores by insinuating things like “because you are Asian or Hispanic you are predisposed to do better in the GRE-math because of biases in the test”. No! It’s because you were able, worked hard, were motivated and had grit that you did well on the GRE-math; and anyone with those qualities can and should do well, and it does not matter what race you come from.

We must strive for equal opportunity, and proportionate representation will follow; but beware of proportionate representation at the expense of equal opportunity.

Undergrad Mentors in Undergrad Classes

Everywhere you read about how important it is for faculty to mentor undergrad students; to nurture them to careers in research, and so on. That is important. Very little is said about the importance of undergraduate students mentoring each other. That is important too.

I am sure the discussion could be generalized, but I am talking specifically about the intro-sequence in computer science, which in some form exists in most curricula:

Intro2CS -> DataStructures -> Intro2Algorithms.

Each class is accompanied by a lab, and we use undergrad mentors to roam about during lab helping with questions that students have about compiler errors, small conceptual questions, tricks and shortcuts, etc. I am going to speak from my own experience, having taught Intro2CS and Intro2Algorithms, as well as having been an undergraduate mentor way back when. Without undergrad mentors, these two classes would have been completely different experiences for the students – worse. Being an undergraduate mentor helped me consolidate my knowledge in a field, and understand what is important and what is not.

Do undergrad mentors improve the undergrad experience? Yes!

  1. Professors and TAs are more of a guide as to what should be done. The undergrad mentor provides hands-on “real-time” expertise about what they did do when they tried to solve the problem, and what tricks and tips to keep in mind.
  2. Undergrad mentors see it from the undergrad point of view and what is difficult from that point of view.
  3. Undergrad mentors have more of a rapport with the students and a student is more likely to ask a “silly” question without fear of embarrassment.
  4. Writing programs can be very frustrating because small errors result in non-working code, and students cannot continue without overcoming these small bugs. An army of undergrad mentors is ideal for solving these kinds of issues and keeping the students moving forward; much more so than one or two TAs.
  5. Peer-to-peer motivation is a strong force: when a student sees that someone just like them can master the concept, the concept is not so daunting.

In a survey of in-class students, 41% of those replying said they would keep the undergrad mentors even if they had to pay for it themselves. (But wait, didn’t they already pay for it in that hefty tuition bill?)

In a survey of employers and recruiters ranging from Google, Amazon and LinkedIn to Wall Street, TAing or mentoring is viewed as a significant positive in identifying strong candidates with potential leadership capabilities.

Is it good for the undergrad mentors? Yes!

  1. Not everyone can be an undergrad mentor. Only the best students get to do it, so it is something to strive for. It’s voluntary but it gives status.
  2. A student who has learned the concept well enough to explain it to another has become a better computer scientist; they have mastered the true subtlety of the problem and why it can be hard for someone else. This is all part of the learning they accomplish when they simply try to explain the concept to others.
  3. Computer science students will build skills that will enable them to be better managers of software projects and teams in the future.

In a survey of alumni who had mentored, all those who replied said it was a rewarding experience and they enjoyed working with the students and helping them to learn; often, their fellow undergrad mentors became close friends. The money was the ultimate enticer though – mentoring for credit isn’t appealing to such students because they have no problems finding interesting courses to fill up their schedules.

Is it good for the pocket? Yes!

At (say) $10/hr for 2 hrs/week and 15 weeks, that is just $300 a semester. You can have 100 undergrad mentors for the typical cost of a TA (that’s not to say we don’t need TAs). You can manage any potential conflicts of interest by having undergrad mentors only help, not grade. What’s not to love?

There was a time when top graduate schools required PhDs to TA at least 1 semester to graduate because it builds an important skill. Doesn’t the same philosophy extend to the undergraduate level?

Undergraduate mentors are an important dimension of any good CS-program and, I am sure, of many other disciplines. Pay them well and keep them!

We’re growing taller

The post title is based on this Yahoo article:

Taller, Fatter, Older: How Humans Have Changed in 100 Years – Yahoo News.

which gets its data from an article that surveyed British army recruits 100 years ago versus now:

In the study of British recruits, the average height of British men, who had an average age of 20, was about 5 feet 6 inches (168 centimeters) at the turn of the century, whereas now they stand on average at about 5 feet 10 inches (178 cm). The increase can be attributed, most likely, to improved nutrition, health services and hygiene, said the researchers from the University of Essex in Colchester.

Well, the question is whether there is anything that needs attributing here, or whether it is just a misinterpretation of the data. The smell of a rat begins to emerge once you realize that recruits into the British army are not your average Joe. They tend to be young, strong, athletic and perhaps tall men.

Let’s start with some data. The male population of England and Wales in 1911 was about 17,000,000 and currently it is about 29,000,000. The British armed forces, meanwhile, have shrunk from about 500,000 in 1900 to about 180,000 now.

Now for some simplifying assumptions just to illustrate what can happen when you analyze this data. Suppose heights have the usual bell-shaped distribution with an average height of 162cm and a standard deviation (variability) of 10cm. So roughly speaking heights are spread from 120cm to 200cm (a decent approximation). Also, clearly, it’s not the small and weak who will try out and get recruited into the British armed forces, heavens no. Suppose it’s tall men that are likely to be in the armed forces.

Perform a little experiment. Generate 17,000,000 heights in a bell-shaped distribution with mean 162cm and std=10cm – this represents the population of men in Britain around the 1900s. Start with the tallest man and add him to the army with some probability, say 50%. Keep going down the list from tallest to shortest until you fill the recruitment quota of 500,000. This little experiment simulates building the British Army in 1900. By going from tallest to shortest you are making it more likely that taller men are getting into the army. It’s a simple experiment, do it. Now compute the average height of the men in the Army. You will get approximately 168cm. Wow! So the men in the army in 1900 are about 6cm taller than the average man. (That is called sampling bias.)

Let’s perform the same experiment today, but without changing the population statistics. So generate 29,000,000 heights in a bell-shaped distribution with mean 162cm and std=10cm. Again, start with the tallest and pick him with the same probability 50%, and keep going until you have today’s army of 180,000. What is the average height of today’s army? About 178cm. WOW!
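The experiment is easy to try. Below is a minimal sketch of it in plain Python, scaled down by a factor of 100 (an assumption made only to keep the run time short; the ratio of army to population is unchanged). The function name, the seeds and the `SCALE` constant are illustrative choices. The averages it prints depend on the modeling choices: with the strict tallest-first rule coded here, every recruit comes from the top few percent of the population, so the printed averages can sit above the figures quoted above. What the sketch is meant to show is the gap between the two armies while the population itself is held fixed.

```python
import random

def simulate_army_mean_height(population, army_size, accept_prob=0.5,
                              mean_cm=162.0, std_cm=10.0, seed=0):
    """Average height of an army recruited tallest-first from a normal population."""
    rng = random.Random(seed)
    # Draw the population's heights and line the men up from tallest to shortest.
    heights = sorted((rng.gauss(mean_cm, std_cm) for _ in range(population)),
                     reverse=True)
    army = []
    for h in heights:
        if len(army) == army_size:
            break  # recruitment quota filled
        if rng.random() < accept_prob:
            army.append(h)  # this man joins the army
    return sum(army) / len(army)

SCALE = 100  # assumed shrink factor; population and army shrink together

army_1900 = simulate_army_mean_height(17_000_000 // SCALE, 500_000 // SCALE, seed=1)
army_now = simulate_army_mean_height(29_000_000 // SCALE, 180_000 // SCALE, seed=2)

print(f"population mean (both eras): 162.0 cm")
print(f"army of 1900, average height: {army_1900:.1f} cm")
print(f"army of today, average height: {army_now:.1f} cm")
print(f"apparent growth: {army_now - army_1900:.1f} cm")
```

Changing `accept_prob` or the seeds moves the individual averages, but a smaller army drawn from a larger population stays taller on average.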

Stop. Go back to the quote from the article and see what you think of it now. Is this some voodoo statistical fluke? No. If you repeat the experiment, you will get the same result again and again. Is it magic? No. It is the power of sampling bias. We didn’t change the population – same average height. We didn’t even change the way the army is ‘sampled’. It is clearly simplistic, but that is not the point. It is just one way to obtain a sampling of taller men for the army, and it is a reasonable attempt. Of course, this is not exactly what is going on because we made some modeling assumptions, but you get the idea. The golden nugget here is:

Be careful when you reason about a population from a biased sample.

Recruits into the British army are, most likely, a sample biased toward taller men. Very strange things can happen when you take averages from a biased sample. The small bias in our simple experiment above explains the entire 10cm of the height difference without needing to resort to genetics, health, hygiene or wealth. Perhaps there is some truth to health and hygiene leading to bigger, better, taller people, but to get to this conclusion one must reason from a random unbiased sample. Unfortunately, the reality is that most such social experiments are done with conveniently available data (such as British recruits whose bio-data happen to be collected), not theoretically sound data. And, often conclusions are made without paying attention to the biases in the data.

Effective learning from data should adhere to 3 basic principles (see Chapter 5 of Learning From Data), the second of which is, and I quote:

“If the data was sampled in a biased way, learning will produce a similarly biased outcome”

In short, if there are biases in the way the data is collected, then anything can happen. BEWARE.

Novel Content Delivery For Classes

I am teaching FOCS (Foundations of Computer Science) in Fall 2014 and I was stunned at how hard it is to find just the right teaching environment. Several novel teaching technologies abound, typically tending toward online approaches such as flipped classrooms. All this technology is geared toward making the process of learning fun and easy while the student is away from the class. I would like to propose a new ground-breaking technology with the in-class student in mind. (Are there any of those students floating around these days?) Some of the virtues of the proposed new technology are:

  1. It’s convenient (you don’t have to bring any hardware to class). Takes up very little space and near-zero maintenance. Will work even in a power outage!
  2. You don’t need PowerPoint or pdf slides. It’s a high contrast substrate (white on black).
  3. Material is revealed to the student at a pace they can assimilate.
  4. The content is located at the same spot that the lecturer is speaking so the students’ attention (audio and video) can be focused to the same spot. Contrast that with PowerPoint, for example, where when the lecturer speaks the students must turn their attention away from the visual content (slide) to focus on the lecturer’s dynamics.
  5. Easily reference previous results or quantities in a derivation by keeping them on view as you progress through the derivation.
  6. Walk around the class as the content is revealed and be more engaging.

It’s a BIG blackboard.

Yet with so little downside to keeping this option alive in any classroom, it’s going the way of the dodo. Why? Is it so that curricula can claim modernism; innovation; high-tech?

Well, here is one emphatic vote for keeping the blackboard from extinction. It always works, and flawlessly.