If you’re reading this article, you likely know something about AI. You probably also have an opinion about what model’s the best, or at least your favorite one to use, even with the options changing all the time.
Development is underway at a furious pace, with integrations appearing all around us, and common sense dictates that no single AI Large Language Model (LLM) is likely to be the best tool for every job.
So which are the best AI tools for coding website backends? Are they equal at Java and Python? Which one’s the best for research, with the most specific results and the least (or least dangerous) hallucinations?
We know the popular AI tools, but which one is the least likely to offend in customer service? And though most can do it now, which model is the most likely to pass the Turing test, by fooling us into thinking it’s a person?
It turns out that most AI experts can’t answer these questions. That is, unless they work for one of the handful of big AI companies whose closed, massive models dominate the space, and in that case it’s hard to trust that the answers are unbiased.
And speaking of bias, Google’s newly integrated AI search feature, AI Overviews, claimed that 13 US presidents attended UW-Madison (with John Kennedy graduating a staggering six times, between 1930 and 1993), and that Barack Obama was America’s first Muslim president.
It would have been nice if Google had known ahead of time that failures like these would surface so quickly.
In this week’s The PTP Report, we look for objective AI tool assessment, or how we know what’s best (aside from personal instinct). We’ll look at which tests are being used now, new means coming online, and we’ll consider why better AI evaluation standards will soon be critical.
Testing and Comparison Now
When you research AI tool performance metrics, you probably encounter stats like this:
At a glance, you see a column of scores, ranging from unspecialized human test-takers up to human expert level, with a group of AI offerings clustered nearer the top.
Clearly a bigger number is better.
(This was taken from the LMSYS Chatbot Arena Full Leaderboard, for relative scores on the MMLU, and we’ll get into what all of that means in just a moment.)
Whenever we read about new AI products hitting the market, we usually get some kind of score like this, a so-called benchmark, or at least a comparison (that it outperformed other models, and so on), meant to standardize evaluation across the board.
AI companies don’t usually put out release notes or significant documentation to help us understand what’s gone into a new model (where we can expect to see changes from the last version, or why). We only get that it’s the newest flavor, produced at greater scale and expense, maybe in a new integration with a traditional technology (like search, or a PC’s onboard chips), or speaking with a controversial new voice.
[For an AI tool comparison, check out The PTP Report for the top AI software development tools on the market now.]
Just last year, OpenAI’s GPT-4 made news for its performance on standardized tests, with headlines such as: GPT-4 Beats 90% Of Lawyers Trying To Pass The Bar (Forbes), Bar exam score shows AI can keep up with ‘human lawyers,’ researchers say (Reuters), and GPT-4 can ace the bar, but it only has a decent chance of passing the CFA exams. Here’s a list of difficult exams the ChatGPT and GPT-4 have passed (Business Insider).
These exams, by the way, included the SAT, GRE, Biology Olympiad, numerous AP exams, sections of the Wharton MBA exam, and the US medical licensing exams, and heady scores were listed for many, even hitting the 90th percentile on the Bar Exam.
These are tests we all know, and have likely taken at least some of ourselves, which makes this an extremely impressive achievement on the surface. But of course, it’s more complicated than it first appears.
New research by MIT’s Eric Martínez found that OpenAI’s estimates of GPT-4’s scores were extremely misleading: the model actually scored closer to the 48th percentile overall (and the 15th percentile on the essays). The study called for AI companies to use more transparent and rigorous testing, and it directly questioned the assertion that AI is as ready to tackle legal tasks as we were led to believe.
Benchmarking AI models allows us to compare them over time, and is generally considered a better approach than relying on standardized tests made for humans. The most widely used of these benchmarks is the Massive Multitask Language Understanding (MMLU) test: a multiple-choice exam spanning 57 tasks, including math, US history, computer science, and law, that was devised with LLMs in mind.
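To make the format concrete, here is a minimal sketch of how an MMLU-style multiple-choice item might be turned into a prompt and scored. The `ask_model` function is a hypothetical placeholder for whatever LLM API is being evaluated, and the sample item is invented for illustration, not drawn from the actual benchmark.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# `ask_model` is a hypothetical placeholder, not a real API.

def format_question(question: str, choices: list[str]) -> str:
    """Render one multiple-choice item as a prompt ending in 'Answer:'."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def ask_model(prompt: str) -> str:
    """Placeholder: return the model's single-letter answer (A-D)."""
    raise NotImplementedError("wire up your LLM API here")

def score(items: list[dict]) -> float:
    """Fraction of items where the model's letter matches the answer key."""
    correct = 0
    for item in items:
        prompt = format_question(item["question"], item["choices"])
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)

# Invented sample item in the style of the benchmark:
sample = [{
    "question": "Which data structure offers O(1) average-case lookup by key?",
    "choices": ["Linked list", "Hash table", "Binary heap", "Stack"],
    "answer": "B",
}]
```

The headline number you see on a leaderboard is essentially this accuracy, aggregated across all 57 subject areas.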
But Dan Hendrycks, one of the test’s authors, who helped develop it while at UC Berkeley, told The New York Times’s Kevin Roose that it was never meant for this kind of public scrutiny or publicity, but instead to get researchers to take the challenge seriously:
“All of these benchmarks are wrong, but some are useful… Some of them can serve some utility for a fixed amount of time, but at some point, there’s so much pressure put on it that it reaches its breaking point.”
While the test is widely used, and a higher score is generally believed to indicate greater model competence, it also has several issues as a reliable benchmark, especially in how it’s administered.
An Anthropic blog post from October, entitled Challenges in evaluating AI systems, details several of these issues, including:
- The MMLU’s heavy use means its test questions have likely made their way into the training data of newer LLMs
- Formatting changes on the test can impact the scores by as much as 5%
- AI developers are highly inconsistent in how they administer the test, using few-shot prompting (providing a handful of solved examples) and chain-of-thought reasoning (prompting the model through a series of steps rather than the whole ask at once), which can greatly elevate their own scores, as sketched below
- The MMLU itself has inconsistencies with some mislabeled or unanswerable questions
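To see why administration matters so much, here is a rough sketch of the same invented question posed zero-shot, few-shot, and with a chain-of-thought instruction; the items are made up for illustration and are not actual MMLU questions.

```python
# Rough illustration of how differently the same benchmark item can be
# administered. The questions below are invented, not actual MMLU items.

QUESTION = (
    "What is the time complexity of binary search on a sorted array?\n"
    "A. O(n)\nB. O(log n)\nC. O(n log n)\nD. O(1)\n"
    "Answer:"
)

# Zero-shot: the model sees only the question.
zero_shot_prompt = QUESTION

# Few-shot: a handful of solved examples precede the question, which
# typically raises the measured score for the very same model.
FEW_SHOT_EXAMPLES = (
    "Which sorting algorithm has O(n log n) worst-case time?\n"
    "A. Quicksort\nB. Merge sort\nC. Bubble sort\nD. Insertion sort\n"
    "Answer: B\n\n"
)
few_shot_prompt = FEW_SHOT_EXAMPLES + QUESTION

# Chain-of-thought: the model is told to reason step by step before
# answering, which can raise scores further but changes what is measured.
cot_prompt = QUESTION.replace(
    "Answer:",
    "Think through the options step by step, then give the answer letter.\nAnswer:",
)

print(zero_shot_prompt, few_shot_prompt, cot_prompt, sep="\n\n---\n\n")
```

Two labs both reporting an “MMLU score” may be running any of these variants, which is part of why the numbers are so hard to compare across announcements.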
In other words, we don’t administer (or have our families administer) our own SATs, GREs, or Bar Exams, and for the same reason, it’s probably best that the AI companies themselves don’t measure their own benchmarks for public use.
New Tests, Testers, and Approaches Needed
The Anthropic blog post directly calls for more funding and support for third-party evaluations, and we see new means of testing popping up all the time, with several coming from academia, including the following:
BIG-bench (Beyond the Imitation Game) is a rigorous testing method, but it also requires a lot from the companies using it. Both time-consuming and engineering-intensive, it’s not really practical for private enterprise benchmarking, but it can be used for additional insight. (Anthropic, for example, dropped it after a single experiment.)
HELM (Holistic Evaluation of Language Models) was introduced by Stanford in 2022, and uses evaluation methods determined by experts, across diverse scenarios like reading comprehension, language understanding, and mathematical reasoning. It also uses API access, making it easier for the companies to use than BIG-bench, and has been employed by many of the top companies, like Anthropic, Google, Meta, and OpenAI.
But as Anthropic notes, HELM is also slow (it can take months), since it’s volunteer-run, and they found that it did not evaluate all models fairly, given its requirement for identical formatting.
In the arena of crowdsourcing, LMSYS, an open research project started by researchers at UC Berkeley, recently launched the Chatbot Arena, a head-to-head leaderboard for AI models. Users pit chatbots against each other, and the Arena ranks the models from their votes using the Elo rating system originally created for chess, with “wins” determined by user preference. It’s also an excellent resource for comparing results from other administered tests.
(Check it out here!)
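For a sense of how the Arena turns head-to-head votes into a leaderboard, here is a minimal sketch of the classic Elo update rule. The starting rating of 1000 and the K-factor of 32 are illustrative defaults, not LMSYS’s exact parameters, and the votes are hypothetical.

```python
# Minimal sketch of an Elo-style ranking built from pairwise user votes.
# The starting rating (1000) and K-factor (32) are illustrative defaults.
from collections import defaultdict

K = 32
ratings = defaultdict(lambda: 1000.0)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(winner: str, loser: str) -> None:
    """Update both models' ratings after one user preference vote."""
    e_win = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser] -= K * (1.0 - e_win)

# Hypothetical votes: (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for winner, loser in votes:
    record_vote(winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)  # highest-rated model first
```

Because the ranking emerges from many blind, head-to-head user preferences rather than a fixed question set, it is harder for a model to “train to the test” than with a static benchmark like the MMLU.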
Conclusion
As AI investor Nathan Benaich recently told Kevin Roose of The New York Times:
“Despite the appearance of science, most developers really judge models based on vibes or instinct… That might be fine for the moment, but as these models grow in power and social relevance, it won’t suffice.”
Stanford’s AI Index Report for 2024 agrees, including as a key takeaway point that:
“Robust and standardized evaluations for LLM responsibility are seriously lacking… Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.”
Testing for bias, something now mandated by a new Colorado law, is also rare and extremely difficult, as evidenced by Anthropic’s account of implementing the Bias Benchmark for QA (BBQ). For Anthropic, standing up this evaluation proved exceedingly costly and sometimes returned questionable values, though they can be applauded for the effort.
Without more funding, or third-party testing services of sufficient scale and objectivity, it remains unlikely that truly objective AI model evaluation will occur anywhere near the pace of development, leaving us to compare test results across a smattering of mismatched tests, wherever we can find them. (LMSYS’s full Arena Leaderboard is a great place to start.)
And without it, how can any of us know that we’re really using the best tool available for any given job?
References
Re-evaluating GPT-4’s bar exam performance, Martínez, E., Springer Link
Measuring Massive Multitask Language Understanding, arXiv:2009.03300
A.I. Has a Measurement Problem, The New York Times
Challenges in evaluating AI systems, Anthropic