The Open LLM Leaderboard: A Deep Dive into AI Evaluation Metrics

Unraveling the Conundrum of the Open LLM Leaderboard

The Twitter-verse got quite the jolt recently, thanks to a new feathered friend named Falcon. Its debut atop the Open LLM Leaderboard – a battleground for AI language models – sparked quite the debate. The center of controversy? The lower-than-expected scores of the leaderboard’s previous champion, the LLaMA model, on a key evaluation benchmark: Massive Multitask Language Understanding (MMLU).

As tech insiders, we couldn’t let a mystery like this slide. So, we decided to roll up our sleeves and make sense of the chaos.

AI Leaderboard: The Nuts and Bolts

Let’s take a step back for a moment to better understand what we’re dealing with. The Open LLM Leaderboard runs open models through the EleutherAI LM Evaluation Harness and reports their scores on a series of benchmark tasks. It’s like a report card for AI models, but there’s a snag: the Leaderboard’s MMLU scores for the LLaMA models don’t match the numbers reported in the LLaMA paper. Something’s up, folks.
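For the curious, here’s roughly what such a run looks like from the outside. This is a minimal sketch, not the Leaderboard’s actual pipeline: it assumes a recent (0.4+) release of the lm-eval package, where the "mmlu" task group and simple_evaluate are available, and uses the tiny gpt2 checkpoint purely as a stand-in model.

```python
# Minimal sketch: evaluating a Hugging Face model on MMLU with the
# EleutherAI LM Evaluation Harness (lm-eval >= 0.4 assumed; "gpt2" is a
# stand-in checkpoint, not something the Leaderboard actually ranks).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any causal LM on the Hub works here
    tasks=["mmlu"],                # the aggregated MMLU task group
    num_fewshot=5,                 # MMLU is conventionally reported 5-shot
    batch_size=8,
)

# Per-task and aggregate accuracies live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```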

MMLU: More than Meets the Eye

Here’s where things get interesting. The LLaMA team didn’t write their evaluation code from scratch; they adapted it from the original MMLU benchmark implementation out of UC Berkeley. But wait, there’s more! Yet another flavor of MMLU evaluation lives in Stanford CRFM’s HELM benchmark. To get to the bottom of this, we ran all three MMLU implementations – the Eval Harness, the original MMLU code, and HELM – on several models. And boy, did the results vary!
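Why would three implementations of the same benchmark disagree? One big lever is how the answer is scored: for example, one setup might compare the probability the model assigns to just the answer letter, while another scores the full answer text. The sketch below is our own illustration, not the code of any of the three benchmarks; it uses the tiny gpt2 checkpoint and an invented question purely to show how the two conventions can pick different answers from the very same model.

```python
# Toy illustration (not any benchmark's real code): the same model can prefer
# different answers depending on whether we score only the answer letter or
# the full answer text. "gpt2" and the question are stand-ins for demonstration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = (
    "The following is a multiple choice question.\n"
    "Question: Which organelle produces most of a cell's ATP?\n"
    "A. the cell membrane\n"
    "B. the nucleus\n"
    "C. the mitochondrion\n"
    "D. the ribosome\n"
    "Answer:"
)

def continuation_logprob(prompt_text, continuation):
    """Sum of log-probabilities the model assigns to `continuation` after `prompt_text`."""
    prompt_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probabilities of each continuation token, each predicted from the previous position.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, cont_ids[0].unsqueeze(1)).sum().item()

choices = {
    "A": " A. the cell membrane",
    "B": " B. the nucleus",
    "C": " C. the mitochondrion",
    "D": " D. the ribosome",
}

# Convention 1: compare the log-probability of the answer letter alone.
letter_scores = {k: continuation_logprob(prompt, f" {k}") for k in choices}
# Convention 2: compare the log-probability of the full answer text.
full_scores = {k: continuation_logprob(prompt, text) for k, text in choices.items()}

print("letter-only scoring picks:", max(letter_scores, key=letter_scores.get))
print("full-answer scoring picks:", max(full_scores, key=full_scores.get))
```

Whether the two conventions agree depends on the model, the prompt, and the question; the point is simply that this choice belongs to the benchmark implementation, not to the model being graded.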

Are All MMLU Scores Created Equal?

The same model can have vastly different scores depending on the evaluation method used. Talk about a kick in the teeth for anyone trying to compare models! Imagine you’ve trained a perfect replica of the LLaMA 65B model and your evaluation score is 30% lower than the published number. It’s not you, it’s the inconsistent scoring!

Is there a “best” way to evaluate a model? That’s a question for the ages. The setup that shows one model at its best might not suit another, and any attempt to level the playing field with a single fixed setup will inevitably flatter some models more than others.

The Big Picture

What we’ve learned from this AI rollercoaster ride is that even minute details can drastically affect evaluations. That’s why we need open, standardized, and reproducible evaluation suites like the EleutherAI LM Evaluation Harness or Stanford’s HELM. The Open LLM Leaderboard is committed to using these community-maintained libraries, and we’re currently updating the leaderboard to the latest version of the EleutherAI Eval Harness.

Look, the AI space is a wild west of innovation and change. But with transparency, collaboration, and a good dose of skepticism, we can keep making strides forward.

Source: huggingface.co