
“Superhuman” Go AIs still have trouble defending against these simple exploits

Man vs. machine in a sea of stones.

Getty Images

In the ancient Chinese game of Go, state-of-the-art AI has generally been able to beat the best human players since at least 2016. But over the past few years, researchers have discovered flaws in these top-level Go algorithms that give humans a fighting chance. By using unorthodox “cyclic” strategies, ones that even a novice human player could detect and defeat, a cunning person can often exploit gaps in a top-level AI’s strategy and fool the algorithm into losing.

Researchers at MIT and FAR AI wanted to see if they could improve this “worst-case” performance in otherwise “superhuman” Go AI algorithms, testing a trio of methods to harden the top-level KataGo algorithm’s defenses against adversarial attacks. The results show that creating truly robust, unexploitable AI can be difficult, even in domains as tightly controlled as board games.

Three failed strategies

In the preprint paper “Can Go AIs be adversarially robust?”, the researchers aim to create a Go AI that is truly “robust” against all kinds of attacks. That means an algorithm that can’t be fooled into “game-losing blunders that a human wouldn’t make,” but also one that would require any competing AI algorithm to expend significant computing resources to defeat it. Ideally, a robust algorithm should also be able to overcome potential exploits by using additional computing resources when faced with unfamiliar situations.

An example of the original cyclic attack in action.

The researchers tried three methods to produce such a robust Go algorithm. In the first, they simply fine-tuned the KataGo model using more examples of the unorthodox cyclic strategies that had previously defeated it, hoping that KataGo could learn to detect and overcome these patterns after seeing more of them.

This strategy initially looked promising, allowing KataGo to win 100 percent of games against a cyclic “attacker.” But after the attacker itself was fine-tuned (a process that uses much less computing power than fine-tuning KataGo), that win rate dropped back down to just 9 percent against a slight variation of the original attack.
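For readers who want a more concrete picture of this first defense, here is a minimal sketch of the idea in PyTorch. It is not the researchers’ code or KataGo’s actual training pipeline: the tiny network, data shapes, and loss are simplified stand-ins, and the only point is that positions from games lost to the cyclic adversary get mixed back into ordinary training data.

```python
# Minimal sketch of "defense by fine-tuning" -- not KataGo's real pipeline.
# Board positions are [N, 3, 19, 19] feature planes; targets are move indices.
import torch
import torch.nn as nn

BOARD = 19

class TinyPolicyNet(nn.Module):
    """Toy stand-in for KataGo's much larger policy/value network."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32 * BOARD * BOARD, BOARD * BOARD + 1)  # moves + pass

    def forward(self, x):
        return self.head(self.body(x).flatten(1))

def fine_tune_on_attacks(net, normal_batches, adversarial_batches, lr=1e-4):
    """Continue training on ordinary data mixed with positions from lost games."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for (boards_n, moves_n), (boards_a, moves_a) in zip(normal_batches,
                                                        adversarial_batches):
        boards = torch.cat([boards_n, boards_a])  # mix in the exploit positions
        moves = torch.cat([moves_n, moves_a])
        opt.zero_grad()
        loss_fn(net(boards), moves).backward()
        opt.step()
    return net
```

The catch, as the results above show, is that the attacker can be fine-tuned in exactly the same way, for far less compute, until it finds a variation the defender has never seen.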

For their second defense attempt, the researchers iterated an “arms race” in which new adversarial models discover novel exploits and new defensive models seek to plug those newly discovered holes. After 10 rounds of such iterative training, the final defending algorithm still won only 19 percent of games against a final attacking algorithm that had discovered a never-before-seen variation of the exploit. That was true even though the updated algorithm maintained an edge against the earlier attackers it had been trained on.
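The overall shape of that iterated defense is easy to sketch. The loop below is only an illustration of the “arms race” idea; the three callables (training an attacker, fine-tuning the defender, measuring a win rate) are hypothetical placeholders supplied by the caller, not real KataGo tooling.

```python
# Illustrative "arms race" loop: alternate between training a fresh attacker
# against the current defender and patching the defender against every attacker
# found so far. The callables are hypothetical placeholders, not real tools.
def iterated_adversarial_training(defender, train_attacker, fine_tune_defender,
                                  win_rate, rounds=10, eval_games=200):
    attackers = []
    for r in range(rounds):
        # 1. Find a new exploit against the current defender.
        attacker = train_attacker(defender)
        attackers.append(attacker)
        # 2. Patch the defender against all exploits discovered so far.
        defender = fine_tune_defender(defender, attackers)
        # 3. Track worst-case performance against the newest attack.
        print(f"round {r}: defender win rate vs newest attacker = "
              f"{win_rate(defender, attacker, eval_games):.2f}")
    return defender
```

The 19 percent figure above is essentially the output of that last measurement after the final round: the defender was evaluated against an attacker trained after all the patching was done.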

Even a kid can beat a world-class Go AI if they know the right strategy to use against the algorithm.

Getty Images

In their final attempt, the researchers tried an entirely new type of training using vision transformers, in an effort to avoid what might be a “bad inductive bias” in the convolutional neural networks that KataGo was originally trained with. This method also failed, winning only 22 percent of the time against a variation of the cyclic attack that “could be reproduced by a human expert,” the researchers wrote.
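For the curious, the snippet below shows roughly what swapping convolutions for a vision transformer means in this setting. It is a toy PyTorch illustration, not KataGo’s actual architecture: each board intersection becomes a token, a standard transformer encoder processes the sequence, and small heads produce per-move logits and a position value.

```python
# Toy vision-transformer-style board network -- not KataGo's real architecture.
import torch
import torch.nn as nn

class BoardViT(nn.Module):
    def __init__(self, board=19, planes=3, dim=64, heads=4, layers=4):
        super().__init__()
        # Treat each intersection as a token with a learned position embedding.
        self.embed = nn.Linear(planes, dim)
        self.pos = nn.Parameter(torch.zeros(1, board * board, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.policy = nn.Linear(dim, 1)   # per-intersection move logit
        self.value = nn.Linear(dim, 1)    # whole-board value after pooling

    def forward(self, x):                          # x: [N, planes, 19, 19]
        tokens = x.flatten(2).transpose(1, 2)      # [N, 361, planes]
        h = self.encoder(self.embed(tokens) + self.pos)
        policy_logits = self.policy(h).squeeze(-1)        # [N, 361]
        value = torch.tanh(self.value(h.mean(dim=1)))     # [N, 1]
        return policy_logits, value
```

The hope was that attention over the whole board, rather than stacked local convolutions, would remove whatever bias makes the cyclic shapes hard to evaluate; in practice the transformer-based agent fell to a similar exploit.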

Will anything work?

In all three defense attempts, the adversaries that beat KataGo did not represent some new, never-before-seen peak of general Go-playing ability. Instead, these attacking algorithms were laser-focused on finding exploitable weaknesses in an otherwise effective AI algorithm, even though those simple attack strategies would lose to most human players.

These exploitable loopholes highlight the importance of estimating “worst-case” performance in AI systems, even when “average-case” performance may seem downright superhuman. On average, KataGo can dominate even high-level human players using traditional strategies. But in the worst case, otherwise “weak” adversaries can find holes in the system to cause it to collapse.
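In evaluation terms, the distinction is simply the difference between averaging a model’s win rate across opponents and taking the minimum. The numbers below are made up for illustration (only the 9 percent echoes the figure reported above), but they show how a system can look superhuman on average while collapsing in the worst case.

```python
# Illustrative only: made-up win rates (0.09 echoes the article's 9 percent).
win_rates = {
    "typical human opponent":  0.99,
    "strong traditional bot":  0.97,
    "cyclic-attack adversary": 0.09,  # a "weak" but targeted attacker
}

average_case = sum(win_rates.values()) / len(win_rates)  # looks dominant
worst_case = min(win_rates.values())                     # reveals the exploit

print(f"average-case win rate: {average_case:.2f}")
print(f"worst-case win rate:   {worst_case:.2f}")
```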

It is easy to extend this kind of thinking to other kinds of generative AI systems. LLMs that can succeed at some complex creative and reference tasks may still fail completely when faced with trivial math problems (or even be “poisoned” by malicious prompts). Vision AI models that can describe and analyze complex images can nevertheless fail miserably when presented with basic geometric shapes.

If you can solve these types of puzzles, you may have better visual reasoning than state-of-the-art AI.

Improving this kind of “worst-case” performance is key to avoiding embarrassing mistakes when releasing an AI system to the public. But this new research shows that determined “adversaries” can often discover new performance gaps in an AI algorithm much more quickly and easily than that algorithm can evolve to fix them.

And if this is true in Go, a monstrously complex game that nevertheless has strictly defined rules, it may be even more true in less controlled environments. “The key takeaway for AI is that these vulnerabilities will be difficult to remove,” FAR AI chief executive Adam Gleave told Nature. “If we can’t solve the problem in a simple domain like Go, then there seems little prospect of fixing such problems as ChatGPT jailbreaks in the near future.”

Still, the researchers aren’t despairing. While none of their methods managed to make “[new] attacks impossible” in Go, their strategies were able to plug the “fixed” exploits that had previously been identified. That suggests “it may be possible to fully protect a Go AI through training against a sufficiently large set of attacks,” they write, with suggestions for future research that could make that happen.

Either way, this new research shows that making AI systems more robust against worst-case scenarios can be at least as valuable as chasing new, more human-like or superhuman abilities.
