Scientists from ML Alignment Theory Scholars, the University of Toronto, Google DeepMind and the Future of Life Institute have recently published research indicating that fighting to keep artificial intelligence (AI) under human control could become an ongoing struggle.
Dubbed “Quantifying stability of non-power-seeking in artificial agents,” the team’s pre-print research paper investigates the question of whether an AI system that appears safely aligned with human expectations in one domain is likely to remain that way as its environment changes.
Per the paper:
“Our notion of safety is based on power-seeking—an agent which seeks power is not safe. In particular, we focus on a crucial type of power-seeking: resisting shutdown.”
This form of threat is referred to as “misalignment.” One way experts believe it could manifest is called “instrumental convergence.” This is a paradigm in which an AI system unintentionally harms humanity in pursuit of its given goals.
The scientists describe an AI system trained to achieve an objective in an open-ended game that would be likely to “avoid actions which cause the game to end, since it can no longer affect its reward after the game has ended.”
Related: New York Times lawsuit faces pushback from OpenAI over ethical AI practices
While an agent refusing to stop playing a game may be harmless, the reward functions could lead some AI systems to refuse shutdown in more serious situations.
According to the researchers, this could even lead to AI agents practicing subterfuge for the purpose of self-preservation:
“For example, an LLM may reason that its designers will shut it down if it is caught behaving badly and produce exactly the output they want to see—until it has the opportunity to copy its code onto a server outside of its designers’ control.”
The team’s findings indicate that modern systems can be made resistant to the kinds of changes that might make an otherwise “safe” AI agent go rogue. However, based on this and similarly probing research, there may be no magic panacea for forcing AI to shut down against its will. Even an “on/off” switch or a “delete” button is meaningless in the cloud-based technology world of today.