OpenAI’s new o3 artificial intelligence model has achieved a groundbreaking high score on a prestigious AI reasoning test called the ARC Challenge, prompting some AI enthusiasts to speculate that o3 has achieved artificial general intelligence (AGI). But while the ARC Challenge organizers described o3’s achievement as a major milestone, they also noted that it has not won the competition’s top prize – and that it is just one step on the road to AGI, a term for hypothetical future AI with human-like intelligence.
The o3 model is the latest in a series of AI releases that build on the large language models powering ChatGPT. “This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation capabilities never seen before in the GPT-family models,” said François Chollet, an engineer at Google and lead creator of the ARC Challenge, in a blog post.
What does OpenAI’s o3 model actually do?
Chollet designed the Abstraction and Reasoning Corpus (ARC) Challenge in 2019 to test how well AIs can find the correct patterns connecting pairs of colored grids. Such visual puzzles are intended to force AIs to demonstrate a form of general intelligence through basic reasoning. But throwing enough computing power at the puzzles could let even a program with no real reasoning ability solve them through brute force. To prevent this, the competition requires official score submissions to stay within certain computing-power limits.
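For readers unfamiliar with the format, here is a minimal sketch of what an ARC-style task looks like. The real tasks are distributed as JSON files containing “train” demonstration pairs and held-out “test” inputs, with grids encoded as 2D arrays of integers 0–9 standing for colors; the toy task and its “mirror” rule below are invented purely for illustration.

```python
# A tiny, invented ARC-style task. Its hidden rule: mirror the grid left-to-right.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4], [0, 0]], "output": [[4, 3], [0, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]]}],
}

def mirror(grid):
    """One candidate rule: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# A solver must infer the rule from the train pairs alone...
assert all(mirror(p["input"]) == p["output"] for p in toy_task["train"])
# ...then apply it to the unseen test input.
print(mirror(toy_task["test"][0]["input"]))  # [[0, 5], [6, 0]]
```

Real ARC tasks use much less obvious transformations, which is why solving them with a hand-coded rule per task, or with brute-force search, sidesteps the reasoning the benchmark is meant to measure.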
OpenAI’s recently announced o3 model – which is slated for release in early 2025 – achieved its official breakthrough score of 75.7 percent on the ARC Challenge’s “semi-private” test, which is used to rank competitors on a public leaderboard. Its performance cost approximately $20 in computing per visual puzzle task, meeting the competition’s limit of less than $10,000 in total. But the tougher “private” test used to determine grand prize winners has an even stricter computing-power limit, equivalent to spending just 10 cents on each task – a limit that o3 did not meet.
The o3 model also achieved an unofficial score of 87.5 percent, using approximately 172 times more computing power than it used for its official score. By comparison, the typical human score is 84 percent, and a score of 85 percent is enough to win the ARC Challenge’s $600,000 top prize – provided the model can also keep its computing costs within the required limits.
But to achieve its unofficial score, o3 spent thousands of dollars’ worth of computing on each task. OpenAI asked the challenge organizers not to publish the exact computing cost.
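Although the exact figure was not published, the numbers reported above support a rough back-of-the-envelope estimate – a sketch only, assuming cost scales roughly linearly with compute:

```python
# Back-of-the-envelope only: assumes cost scales linearly with compute,
# which may not hold for OpenAI's unpublished high-compute configuration.
official_cost_per_task = 20   # dollars per task, reported for the official score
compute_multiplier = 172      # reported compute factor for the unofficial score
estimate = official_cost_per_task * compute_multiplier
print(f"~${estimate:,} per task")  # ~$3,440 per task, i.e. "thousands of dollars"
```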
Does this o3 performance show that AGI has been reached?
No. The ARC Challenge organizers have explicitly said that they do not consider beating this competition benchmark to be an indicator that AGI has been achieved.
The o3 model also failed to solve more than 100 of the visual puzzle tasks, even with the very large amount of computing power OpenAI applied for the unofficial score, Mike Knoop, an ARC Challenge organizer at the software company Zapier, said in a social media post on X.
In a social media post on Bluesky, Melanie Mitchell of the Santa Fe Institute in New Mexico had the following to say about o3’s progress on the ARC benchmark: “I think solving these tasks using brute-force compute defeats the original purpose”.
“Although the new model is very impressive and represents a major milestone on the road to AGI, I do not believe this is AGI – there are still a lot of very easy [ARC Challenge] tasks that o3 cannot solve,” Chollet said in another X post.
However, Chollet has described how we will know when some form of AGI has demonstrated human-level intelligence. “You know AGI is here when the exercise of creating tasks that are easy for ordinary people but difficult for AI simply becomes impossible,” he said in the blog post.
Thomas Dietterich at Oregon State University suggests another way to recognize AGI: measure AI systems against cognitive architectures, proposed designs that model human cognition. “These architectures claim to include all the functional components required for human cognition,” he says. “By this measure, the commercial AI systems lack episodic memory, planning, logical reasoning and, most importantly, meta-cognition.”
So what does o3’s high score really mean?
The o3 model’s high score comes as the technology industry and AI researchers have anticipated a slower pace of progress in the newest AI models in 2024, compared with the explosive pace of development in 2023.
Although it did not win the ARC Challenge, o3’s high score suggests that AI models could beat the competition’s benchmark in the near future. Beyond o3’s unofficial high score, Chollet says many official low-compute submissions have already scored above 81 percent on the private evaluation test set.
Dietterich also believes that “it’s a very impressive leap in performance”. However, he cautions that without knowing more about how OpenAI’s o1 and o3 models work, it is impossible to judge just how impressive the high score is. For example, if o3 was able to practice on the ARC problems beforehand, that would make its score much easier to achieve. “We’ll have to wait for an open-source replication to understand the full significance of this,” says Dietterich.
The organizers of the ARC Challenge are already looking to launch a second, more difficult set of benchmark tests sometime in 2025. They will also keep the ARC Prize 2025 challenge running until someone takes the grand prize and open-sources their solution.