we’re still not sure how to test for human levels of intelligence

Written by Andrew Rogoyski, Innovation Director - Surrey Institute of People-Centred AI, University of Surrey

Two of San Francisco’s leading players in artificial intelligence have challenged ^[1] the public to come up with questions capable of testing the capabilities of large language models (LLMs) like Google Gemini and OpenAI’s o1. Scale AI, which specialises in preparing the vast tracts of data on which the LLMs are trained, teamed up with the Center for AI Safety (CAIS) to launch the initiative, Humanity’s Last Exam.

Offering prizes of US$5,000 (£3,800) to those who come up with the top 50 questions selected for the test, Scale and CAIS say the goal is to test how close we are to achieving “expert-level AI systems” using the “largest, broadest coalition of experts in history”.

Why do this? The leading LLMs are already acing many established tests in intelligence, mathematics ^[2] and law ^[3], but it’s hard to be sure how meaningful this is. In many cases, they may have pre-learned the answers due to the gargantuan quantities of data on which they are trained, including a significant percentage of everything on the internet.

Data is fundamental to this whole area. It is behind the paradigm shift from conventional computing to AI, from “telling” to “showing” these machines what to do. This requires good training datasets, but also good tests. Developers typically do this using data that hasn’t already been used for training, known in the jargon as “test datasets”.
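To make the idea of a “test dataset” concrete, here is a minimal sketch in Python. It assumes a toy list of questions and a hypothetical helper, split_train_test, neither of which comes from any particular developer’s pipeline. The point is simply that the held-back examples must not overlap with what the model trained on.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold back a fraction for testing.

    Hypothetical helper for illustration only; real pipelines are far larger
    and also have to check for near-duplicates, not just exact matches.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # First n_test items become the test set, the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

def leaked_items(train, test):
    """Return any test items that also appear in the training data."""
    return set(train) & set(test)

# Toy data: ten made-up "exam questions".
questions = [f"question {i}" for i in range(10)]
train_set, test_set = split_train_test(questions)
overlap = leaked_items(train_set, test_set)
```

If overlap is not empty, the model may have already seen the exam paper, which is exactly the worry with tests that have been published on the internet.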

If LLMs are not already able to pre-learn the answer to established tests like bar exams, they probably will soon. The AI analytics site Epoch estimates ^[4] that 2028 will mark the point at which the AIs will effectively have read everything ever written by humans. An equally important challenge is how to keep assessing AIs once that Rubicon has been crossed.

Of course, the internet is expanding all the time, with millions of new items being added daily. Could that take care of these problems?

Perhaps, but this bleeds into another insidious difficulty, referred to as “model collapse ^[5]”. As the internet becomes increasingly flooded by AI-generated material which recirculates into future AI training sets, this may cause AIs to perform increasingly poorly. To overcome this problem, many developers are already collecting data from their AIs’ human interactions, adding fresh data for training and testing.

Some specialists argue that AIs also need to become “embodied”: moving around in the real world and acquiring their own experiences, as humans do. This might sound far-fetched until you realise that Tesla has been doing it for years with its cars. Another opportunity is human wearables, such as Meta’s popular smart glasses by Ray-Ban ^[6]. These are equipped with cameras and microphones, and can be used ^[7] to collect vast quantities of human-centric video and audio data.

Yet even if such products guarantee enough training data in future, there is still the conundrum of how to define and measure intelligence – particularly artificial general intelligence (AGI), meaning an AI that equals or surpasses human intelligence.

Traditional human IQ tests have long been controversial ^[8] for failing to capture the multifaceted nature ^[9] of intelligence, encompassing everything from language to mathematics to empathy to sense of direction.

There’s an analogous problem with the tests used on AIs. There are many well-established tests covering such tasks as summarising text, understanding it, drawing correct inferences ^[10] from information, recognising human poses and gestures, and machine vision.

Some tests are being retired, usually because ^[11] the AIs are doing so well at them, but they’re so task-specific as to be very narrow measures of intelligence. For instance, the chess-playing AI Stockfish ^[12] is way ahead of Magnus Carlsen, the highest scoring human player of all time, on the Elo ^[13] rating system. Yet Stockfish is incapable of doing other tasks such as understanding language. Clearly it would be wrong to conflate its chess capabilities with broader intelligence.

Magnus Carlsen is no match for Stockfish. Lilyana Vynogradova/Alamy ^[14]

But with AIs now demonstrating broader intelligent behaviour, the challenge is to devise new benchmarks for comparing and measuring their progress. One notable approach has come from French Google engineer François Chollet. He argues ^[15] that true intelligence lies in the ability to adapt and generalise learning to new, unseen situations. In 2019, he came up with the “abstraction and reasoning corpus” (ARC), a collection of puzzles in the form of simple visual grids designed to test an AI’s ability to infer and apply abstract rules.

Unlike previous benchmarks ^[16] that test visual object recognition by training an AI on millions of images, each with information about the objects contained, ARC gives it minimal examples in advance. The AI has to figure out the puzzle logic and can’t just learn all the possible answers.
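A toy version of this kind of puzzle can be sketched in a few lines of Python. Everything here is an illustrative assumption: the grids, the four candidate rules and the infer_rule function are invented for this example, and real ARC tasks involve far richer rules than simple flips. The sketch only shows the shape of the problem: a couple of input/output examples, a hidden rule, and a new grid to which the rule must be applied.

```python
def identity(grid):
    """Leave the grid unchanged."""
    return [row[:] for row in grid]

def flip_horizontal(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]

def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]

# A deliberately tiny, hypothetical rule set. A real solver cannot
# enumerate the rules in advance like this.
CANDIDATE_RULES = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def infer_rule(examples):
    """Return the name of the first rule consistent with every example."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(grid_in) == grid_out for grid_in, grid_out in examples):
            return name
    return None

# Two demonstration pairs (input grid, output grid); numbers stand for colours.
examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 0, 2], [0, 3, 0]], [[2, 0, 0], [0, 3, 0]]),
]

rule_name = infer_rule(examples)
test_input = [[5, 0, 0], [0, 0, 0], [0, 6, 0]]
prediction = CANDIDATE_RULES[rule_name](test_input)
```

Here the only rule that fits both examples is the left-to-right mirror, so the prediction mirrors the test grid. The hard part of ARC is that the rule is different for every puzzle and is not drawn from a short, known list.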

Though the ARC tests aren’t particularly difficult ^[17] for humans to solve, there’s a prize of US$600,000 to the first AI system to reach a score of 85%. At the time of writing, we’re a long way from that point. Two recent leading LLMs, OpenAI’s o1 preview and Anthropic’s Sonnet 3.5, both score ^[18] 21% on the ARC public leaderboard (known as the ARC-AGI-Pub ^[19]).

Another recent attempt ^[20] using OpenAI’s GPT-4o scored 50%^[21], but somewhat controversially because the approach generated thousands of possible solutions before choosing the one that gave the best answer for the test. Even then, this was still reassuringly far from triggering the prize – or matching human performances of over 90%^[22].

While ARC remains one of the most credible attempts to test for genuine intelligence in AI today, the Scale/CAIS initiative shows that the search continues for compelling alternatives. (Fascinatingly, we may never see some of the prize-winning questions. They won’t be published on the internet, to ensure the AIs don’t get a peek at the exam papers.)

We need to know when machines are getting close to human-level reasoning, with all the safety, ethical and moral questions this raises. At that point, we’ll presumably be left with an even harder exam question: how to test for a superintelligence. That is a still more mind-bending task, and one we need to figure out.

References

^[1] have challenged (scale.com)
^[2] mathematics (www.nature.com)
^[3] law (law.stanford.edu)
^[4] Epoch estimates (epochai.org)
^[5] model collapse (www.nature.com)
^[6] smart glasses by Ray-Ban (theconversation.com)
^[7] can be used (me.mashable.com)
^[8] long been controversial (theconversation.com)
^[9] multifaceted nature (www.scirp.org)
^[10] correct inferences (nlpprogress.com)
^[11] usually because (hai.stanford.edu)
^[12] Stockfish (rkrippetoe.medium.com)
^[13] Elo (en.wikipedia.org)
^[14] Lilyana Vynogradova/Alamy (www.alamy.com)
^[15] He argues (arxiv.org)
^[16] previous benchmarks (image-net.org)
^[17] aren’t particularly difficult (www.hindustantimes.com)
^[18] both score (arcprize.org)
^[19] ARC-AGI-Pub (arcprize.org)
^[20] recent attempt (www.lesswrong.com)
^[21] scored 50% (www.lesswrong.com)
^[22] over 90% (openreview.net)
