Business Daily Media

AI has a stupid secret: we’re still not sure how to test for human levels of intelligence

  • Written by Andrew Rogoyski, Innovation Director - Surrey Institute of People-Centred AI, University of Surrey

Two of San Francisco’s leading players in artificial intelligence have challenged[1] the public to come up with questions capable of testing the capabilities of large language models (LLMs) like Google Gemini and OpenAI’s o1. Scale AI, which specialises in preparing the vast tracts of data on which the LLMs are trained, teamed up with the Center for AI Safety (CAIS) to launch the initiative, Humanity’s Last Exam.

Featuring prizes of US$5,000 (£3,800) for those who come up with the top 50 questions selected for the test, Scale and CAIS say the goal is to test how close we are to achieving “expert-level AI systems” using the “largest, broadest coalition of experts in history”.

Why do this? The leading LLMs are already acing many established tests in intelligence, mathematics[2] and law[3], but it’s hard to be sure how meaningful this is. In many cases, they may have pre-learned the answers due to the gargantuan quantities of data on which they are trained, including a significant percentage of everything on the internet.

Data is fundamental to this whole area. It is behind the paradigm shift from conventional computing to AI, from “telling” to “showing” these machines what to do. This requires good training datasets, but also good tests. Developers typically run these tests on data that hasn’t already been used for training, known in the jargon as “test datasets”.
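To make the idea of a held-out test dataset concrete, here is a minimal sketch in Python. The function names and the placeholder questions are invented for illustration and do not describe any particular developer's pipeline. It holds back a fraction of the data for testing, then checks whether any test item has leaked into the training data.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold back a fraction that training never sees."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

def leaked_items(train, test):
    """Return the test items that also appear in the training data."""
    seen = set(train)
    return [item for item in test if item in seen]

questions = [f"question {i}" for i in range(10)]   # placeholder data
train, test = split_train_test(questions)

# A clean split shares nothing between the two sets.
clean = leaked_items(train, test)

# If a test question slips into the training data, the check flags it.
leaky = leaked_items(train + test[:1], test)
print(len(train), len(test), clean, leaky)
```

The leak check is the part that matters for LLMs: once an exam question has been published online and scraped into a training set, a high score on it no longer shows that the model can reason its way to the answer.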

If LLMs are not already able to pre-learn the answers to established tests like bar exams, they probably will be soon. The AI analytics site Epoch estimates[4] that 2028 will mark the point at which the AIs will effectively have read everything ever written by humans. An equally important challenge is how to keep assessing AIs once that Rubicon has been crossed.

Of course, the internet is expanding all the time, with millions of new items being added daily. Could that take care of these problems?

Perhaps, but this bleeds into another insidious difficulty, referred to as “model collapse[5]”. As the internet becomes increasingly flooded by AI-generated material which recirculates into future AI training sets, this may cause AIs to perform increasingly poorly. To overcome this problem, many developers are already collecting data from their AIs’ human interactions, adding fresh data for training and testing.

Some specialists argue that AIs also need to become “embodied”: moving around in the real world and acquiring their own experiences, as humans do. This might sound far-fetched until you realise that Tesla has been doing it for years with its cars. Another opportunity is human wearables, such as Meta’s popular smart glasses by Ray-Ban[6]. These are equipped with cameras and microphones, and can be used[7] to collect vast quantities of human-centric video and audio data.

Yet even if such products guarantee enough training data in future, there is still the conundrum of how to define and measure intelligence – particularly artificial general intelligence (AGI), meaning an AI that equals or surpasses human intelligence.

Traditional human IQ tests have long been controversial[8] for failing to capture the multifaceted nature[9] of intelligence, encompassing everything from language to mathematics to empathy to sense of direction.

There’s an analogous problem with the tests used on AIs. There are many well-established tests covering such tasks as summarising text, understanding it, drawing correct inferences[10] from information, recognising human poses and gestures, and machine vision.

Some tests are being retired, usually because[11] the AIs are doing so well at them, but they’re so task-specific as to be very narrow measures of intelligence. For instance, the chess-playing AI Stockfish[12] is way ahead of Magnus Carlsen, the highest scoring human player of all time, on the Elo[13] rating system. Yet Stockfish is incapable of doing other tasks such as understanding language. Clearly it would be wrong to conflate its chess capabilities with broader intelligence.
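For readers unfamiliar with Elo, the rating gap translates into an expected score per game through a standard formula. The sketch below applies it in Python. Carlsen's peak rating of 2882 is a matter of record, but the engine figure is an assumed round number for illustration only: engine ratings come from engine-versus-engine lists and are not directly comparable with human ratings.

```python
def expected_score(rating_a, rating_b):
    """Expected score for player A against B under the Elo model.

    A win counts as 1, a draw as 0.5 and a loss as 0.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

CARLSEN_PEAK = 2882     # Carlsen's peak classical rating
ENGINE_RATING = 3600    # ASSUMPTION: illustrative figure for a top engine

human = expected_score(CARLSEN_PEAK, ENGINE_RATING)
engine = expected_score(ENGINE_RATING, CARLSEN_PEAK)
print(f"Expected score per game: human {human:.3f}, engine {engine:.3f}")
```

Under these assumed numbers the human's expected score is only a few points per hundred games. The figure describes chess and nothing else, which is the limitation the paragraph above points to.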

Image: Magnus Carlsen thinking about a chess move. Magnus Carlsen is no match for Stockfish. Lilyana Vynogradova/Alamy[14]

But with AIs now demonstrating broader intelligent behaviour, the challenge is to devise new benchmarks for comparing and measuring their progress. One notable approach has come from French Google engineer François Chollet. He argues[15] that true intelligence lies in the ability to adapt and generalise learning to new, unseen situations. In 2019, he came up with the “abstraction and reasoning corpus” (ARC), a collection of puzzles in the form of simple visual grids designed to test an AI’s ability to infer and apply abstract rules.

Unlike previous benchmarks[16] that test visual object recognition by training an AI on millions of images, each with information about the objects contained, ARC gives it minimal examples in advance. The AI has to figure out the puzzle logic and can’t just learn all the possible answers.
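A toy version of such a puzzle shows why memorising answers does not help. The sketch below is loosely modelled on the idea of a few worked examples plus an unseen test grid. The task, the hidden rule (mirror each row left to right) and all the names are invented for illustration and are not taken from the ARC dataset.

```python
def as_key(grid):
    """Make a grid hashable so it can be used as a dictionary key."""
    return tuple(tuple(row) for row in grid)

# Hypothetical puzzle: two worked examples and one unseen test grid.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
}

def mirror_left_right(grid):
    """The abstract rule behind this puzzle: reverse every row."""
    return [row[::-1] for row in grid]

def make_memoriser(pairs):
    """A 'solver' that only looks up answers it has already seen."""
    table = {as_key(p["input"]): p["output"] for p in pairs}
    return lambda grid: table.get(as_key(grid))   # None for unseen grids

def solves(solver, pair):
    """Scoring is all or nothing: the output grid must match exactly."""
    return solver(pair["input"]) == pair["output"]

memoriser = make_memoriser(task["train"])
memoriser_on_train = all(solves(memoriser, p) for p in task["train"])
memoriser_on_test = solves(memoriser, task["test"])
rule_on_test = solves(mirror_left_right, task["test"])
print(memoriser_on_train, memoriser_on_test, rule_on_test)
```

The lookup table is perfect on the examples it has seen and useless on the new grid, while the inferred rule carries over.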

Though the ARC tests aren’t particularly difficult[17] for humans to solve, there’s a prize of US$600,000 for the first AI system to reach a score of 85%. At the time of writing, we’re a long way from that point. Two recent leading LLMs, OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet, both score[18] 21% on the ARC public leaderboard (known as the ARC-AGI-Pub[19]).

Another recent attempt[20] using OpenAI’s GPT-4o scored 50%[21], but somewhat controversially because the approach generated thousands of possible solutions before choosing the one that gave the best answer for the test. Even then, this was still reassuringly far from triggering the prize – or matching human performance of over 90%[22].
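The general shape of a generate-and-select strategy can be sketched in a few lines of Python. This is not the method used in the attempt cited above, which relied on a language model to write its candidates. Here a tiny hand-written set of grid operations stands in for the generator, and every name is hypothetical. Candidates are enumerated in bulk and only those that reproduce the worked example are kept.

```python
from itertools import product

# Hypothetical building blocks for candidate solutions.
PRIMITIVES = {
    "identity": lambda g: [row[:] for row in g],
    "mirror": lambda g: [row[::-1] for row in g],          # reverse each row
    "flip": lambda g: [row[:] for row in g[::-1]],         # reverse row order
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def candidates(max_len=2):
    """Yield every sequence of primitive names up to max_len long."""
    for length in range(1, max_len + 1):
        yield from product(PRIMITIVES, repeat=length)

def run(names, grid):
    """Apply the named primitives to the grid, in order."""
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid

def select(demos):
    """Keep only the candidates that reproduce every worked example."""
    return [names for names in candidates()
            if all(run(names, given) == wanted for given, wanted in demos)]

# One worked example whose hidden rule is a quarter turn clockwise.
demos = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
survivors = select(demos)
answer = run(survivors[0], [[5, 0, 0], [0, 6, 0]])
print(survivors, answer)
```

The controversy is visible even at this scale: most candidates are wrong, and success comes from trying many and filtering rather than from grasping the rule at the first attempt.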

While ARC remains one of the most credible attempts to test for genuine intelligence in AI today, the Scale/CAIS initiative shows that the search continues for compelling alternatives. (Fascinatingly, we may never see some of the prize-winning questions. They won’t be published on the internet, to ensure the AIs don’t get a peek at the exam papers.)

We need to know when machines are getting close to human-level reasoning, with all the safety, ethical and moral questions this raises. At that point, we’ll presumably be left with an even harder exam question: how to test for a superintelligence. That is a far more mind-bending task, and one we still need to figure out.

References

  1. ^ have challenged (scale.com)
  2. ^ mathematics (www.nature.com)
  3. ^ law (law.stanford.edu)
  4. ^ Epoch estimates (epochai.org)
  5. ^ model collapse (www.nature.com)
  6. ^ smart glasses by Ray-Ban (theconversation.com)
  7. ^ can be used (me.mashable.com)
  8. ^ long been controversial (theconversation.com)
  9. ^ multifaceted nature (www.scirp.org)
  10. ^ correct inferences (nlpprogress.com)
  11. ^ usually because (hai.stanford.edu)
  12. ^ Stockfish (rkrippetoe.medium.com)
  13. ^ Elo (en.wikipedia.org)
  14. ^ Lilyana Vynogradova/Alamy (www.alamy.com)
  15. ^ He argues (arxiv.org)
  16. ^ previous benchmarks (image-net.org)
  17. ^ aren’t particularly difficult (www.hindustantimes.com)
  18. ^ both score (arcprize.org)
  19. ^ ARC-AGI-Pub (arcprize.org)
  20. ^ recent attempt (www.lesswrong.com)
  21. ^ scored 50% (www.lesswrong.com)
  22. ^ over 90% (openreview.net)

Read more https://theconversation.com/ai-has-a-stupid-secret-were-still-not-sure-how-to-test-for-human-levels-of-intelligence-240469
