Video: Incentives matter in AI/ML
We needed a reminder of these principles of robot and AI learning: some of the big problems in next-generation systems will probably come down to poorly targeted incentives, as illustrated by Dylan Hadfield-Menell’s story about a video game boat that just spins around in circles on the course instead of actually playing the game the way it’s supposed to.
The visual example, which you can see in the video, is a classic case of a misspecified objective: the designer assumed that if the program were told to chase a higher point score, the AI would know what to do. Evidently, that didn’t work out.
Following this cautionary tale, Hadfield-Menell explains:
“In this kind of research, when setting goals and calibrating systems, we have to ask: what is a given model optimizing?”
Hadfield-Menell talks about Goodhart’s law, the observation that once a measure becomes a target, it ceases to be a good measure. He also mentions a classic paper on principal-agent problems, “On the Folly of Rewarding A, While Hoping for B.”
“Numerous examples exist of reward systems that are fouled up in that behaviors which are rewarded are those which the rewarder is trying to discourage,” he says. “So this is something that occurs all over the place.”
He also gives the historical example of India’s cobra bounty program, intended to curb the deadly cobra population, under which people bred snakes in order to collect the rewards… watch the video to find out what happened! (Spoiler alert: at the end, there were even more snakes.)
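To see Goodhart’s law in numbers, here’s a small toy sketch (our illustration, not something from the talk): a proxy score that tracks true quality reasonably well on average stops tracking it once you select hard on the proxy, because selection increasingly rewards the part of the score we never cared about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the proxy score we can measure equals the true
# quality we care about plus an unrelated, gameable component.
n = 100_000
quality = rng.normal(size=n)   # what we actually want
gaming  = rng.normal(size=n)   # what the metric rewards but we don't value
proxy   = quality + gaming     # what we can observe and optimize

# Before any selection pressure, the proxy looks like a decent measure.
print("correlation(proxy, quality):", round(np.corrcoef(proxy, quality)[0, 1], 2))

# The harder we select on the proxy, the more the winners owe their score
# to 'gaming' rather than 'quality' -- the gap between the two means grows.
for frac in (0.10, 0.01, 0.001):
    top = np.argsort(proxy)[-int(n * frac):]
    print(f"top {frac:6.2%} by proxy: mean proxy {proxy[top].mean():5.2f}, "
          f"mean quality {quality[top].mean():5.2f}")
```

The harder the selection, the larger the gap between the proxy score and the underlying quality of the selected items, which is exactly the measure ceasing to be a good measure.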
When we think about the applications of Goodhart’s law to AI, we wonder how many people are working on this, and whether we will put enough emphasis on this kind of analysis.
Some resources suggest a broader front of research: for example, OpenAI researchers describe ‘best-of-n’ sampling as one methodology:
“Although this method is very simple, it can actually be competitive with more advanced techniques such as reinforcement learning, albeit at the cost of more inference-time compute. For example, in WebGPT, our best-of-64 model outperformed our reinforcement learning model, perhaps in part because the best-of-64 model got to browse many more websites. Even applying best-of-4 provided a significant boost to human preferences.”
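To make the quoted idea concrete, here is a minimal sketch of best-of-n sampling in Python. The generator and reward model below are toy stand-ins (not OpenAI’s models or API); any function that produces candidate completions and any function that scores them would slot into the same loop.

```python
import random
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 64) -> str:
    """Draw n candidate completions and return the one the reward model
    scores highest. This trades extra inference-time compute for quality,
    instead of fine-tuning the generator with reinforcement learning."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# --- toy stand-ins so the sketch runs on its own -------------------------
def toy_generate(prompt: str) -> str:
    # Pretend each sample is a completion of varying quality.
    return f"{prompt} -> answer #{random.randint(0, 999)}"

def toy_reward_model(prompt: str, completion: str) -> float:
    # Pretend the reward model prefers certain completions.
    return random.random()

if __name__ == "__main__":
    print(best_of_n("Why is the sky blue?", toy_generate, toy_reward_model, n=4))
```

The trade-off the passage mentions is visible in the signature: a larger n means more candidates and therefore more inference-time compute, with no change to the underlying model.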
They also mention an algorithm called Ridge Rider, which searches for a diverse set of solutions to an optimization problem rather than committing to a single one.
And yes, eigenvectors and eigenvalues come up as the language for the math behind this kind of performance targeting: Ridge Rider follows eigenvectors of the loss’s Hessian, its “ridges,” toward different solutions…
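For the curious, here is a rough, heavily simplified sketch of that idea on a toy two-dimensional loss (our illustration, not the authors’ implementation): at a saddle point, an eigenvector of the Hessian with a negative eigenvalue marks a “ridge,” and following that ridge in opposite directions leads to two different solutions.

```python
import numpy as np

# Toy loss with two distinct minima separated by a saddle at the origin:
# following different Hessian eigenvectors from the saddle finds different ones.
def loss(p):
    x, y = p
    return (x**2 - 1.0)**2 + 2.0 * y**2        # minima at (+1, 0) and (-1, 0)

def grad(p):
    x, y = p
    return np.array([4.0 * x * (x**2 - 1.0), 4.0 * y])

def hessian(p):
    x, _ = p
    return np.array([[12.0 * x**2 - 4.0, 0.0],
                     [0.0,               4.0]])

def ride_ridge(start, direction, lr=0.05, steps=200):
    """Nudge off the saddle along the eigenvector, then run plain gradient
    descent (the full algorithm keeps re-following the eigenvector; this
    sketch only uses it to pick a branch)."""
    p = start + 0.1 * direction
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

saddle = np.array([0.0, 0.0])
eigvals, eigvecs = np.linalg.eigh(hessian(saddle))
print("eigenvalues at the saddle:", eigvals)     # one negative -> one ridge

# The negative-eigenvalue eigenvector points along the ridge; its two signs
# give two diverse solutions.
ridge = eigvecs[:, np.argmin(eigvals)]
for sign in (+1.0, -1.0):
    solution = ride_ridge(saddle, sign * ridge)
    print(f"following {sign:+.0f} ridge ->", np.round(solution, 3),
          "loss:", round(loss(solution), 4))
```

The real Ridge Rider algorithm keeps re-computing and following the eigenvector as it descends; the sketch only uses it to choose a branch before falling back to ordinary gradient descent.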
Back to Hadfield-Menell’s talk, where he goes over the idea of proxy utility in detail. This is just a small clip from that section; in the video you can hear the full context of the problem setup and think about how the principle works in a given scenario:
“For any proxy… the same property happens,” he says. “And we’re able to show that this is not just this individual problem, but actually, for a really broad category of problems. If you have the shared resources and incomplete goals, you see this consistent property of true utility going up, and then falling off.”
In a different focus on calibration, Hadfield-Menell presents an “obedience game” with missing features and talks about getting the right number of features in order to target the system well. He also talks about the consequences of misaligned AI, using a particular framework that, again, he explains in context:
“You can think of … there being two phases of incomplete optimization. In phase one, where incomplete optimization works, you’re largely reallocating resources between the things you can measure… this is sort of removing slack from the problem, in some sense. But at some point, you hit Pareto optimality. There, there’s nothing you can do by just reassigning things between those values. Instead, what the optimization switches to is… extracting resources from the things you’re not measuring, and reallocating them back to the things that you are measuring.”
That might take some effort to follow…
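Here’s one way to make it concrete: a toy simulation of our own construction (not the model from the talk or the paper), in which six attributes share one resource budget but the proxy only measures three of them. Greedily optimizing the proxy first soaks up the slack in the budget, then starts extracting resources from the unmeasured attributes, and the true utility rises and then falls off, just as he describes.

```python
import numpy as np

# Six attributes share one resource budget. True utility values all six,
# but the proxy we optimize only measures the first three.
measured = np.array([True, True, True, False, False, False])
budget = 5.0

def true_utility(x):
    return np.sqrt(x).sum()            # diminishing returns on every attribute

def proxy_utility(x):
    return np.sqrt(x)[measured].sum()  # the proxy only sees the measured ones

x = np.array([0.2, 0.2, 0.2, 1.0, 1.0, 1.0])   # measured start low; the budget has slack
step, history = 0.01, []

for _ in range(500):
    marginal = 0.5 / np.sqrt(x)                               # d sqrt(x)/dx
    target = np.where(measured, marginal, -np.inf).argmax()   # best measured attribute to feed
    if x.sum() + step <= budget:
        # Phase 1: spend leftover slack on the proxy; nothing is taken from anything else.
        x[target] += step
    else:
        # Phase 2: at the budget (Pareto frontier), the only way to raise the proxy
        # is to extract resources from the unmeasured attributes.
        donor = np.where(~measured, x, -np.inf).argmax()
        if x[donor] < step:
            break
        x[donor] -= step
        x[target] += step
    history.append((proxy_utility(x), true_utility(x)))

proxies, trues = zip(*history)
print(f"proxy utility: {proxies[0]:.2f} -> {proxies[-1]:.2f} (climbs the whole time)")
print(f"true utility : {trues[0]:.2f} -> peak {max(trues):.2f} -> {trues[-1]:.2f} (rises, then falls off)")
```

The printout shows the two phases in miniature: the proxy climbs throughout, while true utility peaks and then declines as the unmeasured attributes are drained.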
Well, the ideas themselves are useful for refining our AI work and making sure that we are putting the emphasis in the right places. This is just another example of the unique insights we got all the way through Imagination in Action, which will put us on a path to better understanding innovation in our time.
Read the full article here