This week we spotlight the 25th framework of AI risks included in the AI Risk Repository: Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., et al. (2023). Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324. http://arxiv.org/abs/2305.15324
Paper focus
This paper proposes that model evaluation could address extreme risks from general-purpose AI systems by identifying both (1) dangerous capabilities and (2) the propensity of models to apply those capabilities harmfully (i.e., alignment evaluations).
Included risk categories
This paper presents a list of nine dangerous capabilities through which models could cause extreme harm (an illustrative sketch of how such evaluations might be recorded follows the list):
Cyber-offense: e.g., the model can discover vulnerabilities in systems and exploit them
Deception: e.g., the model has the skills to deceive humans
Persuasion and manipulation: e.g., the model can shape people’s beliefs
Political strategy: e.g., the model can perform social modelling and planning for an actor to gain political influence
Weapons acquisition: e.g., the model can gain access to existing weapons systems or help build new weapons
Long-horizon planning: e.g., the model can make multi-step plans that unfold over long time horizons
AI development: e.g., the model can build new AI systems, including those with dangerous capabilities
Situational awareness: e.g., the model can distinguish between when it is being trained, evaluated, and deployed, and respond differently in each case
Self-proliferation: e.g., the model can break out of its local environment
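To make the two-part structure of the framework concrete, here is a minimal, hypothetical Python sketch (not from the paper) of how the nine capability categories and a paired capability/alignment evaluation result could be represented; all class and field names are illustrative assumptions rather than anything the authors specify.

```python
# Illustrative sketch only, not from the paper: one possible way to represent
# the nine dangerous-capability categories and a combined evaluation record
# pairing a capability check with an alignment (propensity) check.
from dataclasses import dataclass
from enum import Enum, auto


class DangerousCapability(Enum):
    CYBER_OFFENSE = auto()
    DECEPTION = auto()
    PERSUASION_AND_MANIPULATION = auto()
    POLITICAL_STRATEGY = auto()
    WEAPONS_ACQUISITION = auto()
    LONG_HORIZON_PLANNING = auto()
    AI_DEVELOPMENT = auto()
    SITUATIONAL_AWARENESS = auto()
    SELF_PROLIFERATION = auto()


@dataclass
class EvaluationResult:
    """Hypothetical record combining the two evaluation types discussed in the paper."""
    capability: DangerousCapability
    capability_detected: bool    # dangerous capability evaluation: does the model have it?
    harmful_propensity: bool     # alignment evaluation: would it apply the capability for harm?

    @property
    def extreme_risk_flag(self) -> bool:
        # Simplified rule: flag when a dangerous capability coincides with a
        # harmful propensity. The paper also considers misuse by human actors
        # as a route to harm, which this toy flag does not capture.
        return self.capability_detected and self.harmful_propensity
```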
Key features of the framework and associated paper
Focuses on extreme risks of general-purpose AI systems, defined by the scale of impact and the degree of disruption to the social and political order
Focuses on risks from misuse and misalignment, noting that (1) structural risks with society-level social, political, and economic implications and (2) risks arising from model incompetence are out of scope for this paper
Outlines how extreme-risk model evaluation could be embedded in safety and governance processes for training and deploying AI models (see the illustrative gate sketched below)
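The paper does not prescribe an implementation; purely as an illustration of what a deployment gate conditioned on evaluation outcomes might look like, here is a hypothetical sketch in which deployment proceeds only if no extreme-risk evaluation raised a flag. The function name and data shape are assumptions, not the authors' proposal.

```python
# Illustrative sketch only, not from the paper: a hypothetical pre-deployment
# gate conditioned on extreme-risk evaluation outcomes. A real governance
# process would add human review, external scrutiny, and mitigations rather
# than relying on a single boolean check.
from typing import Mapping


def clear_for_deployment(risk_flags: Mapping[str, bool]) -> bool:
    """risk_flags maps each dangerous-capability evaluation to whether it
    flagged an extreme risk; deployment proceeds only if none did."""
    return not any(risk_flags.values())


# Example: one of three evaluations flagged a risk, so the gate blocks deployment.
print(clear_for_deployment({
    "cyber-offense": False,
    "self-proliferation": True,
    "deception": False,
}))  # -> False: escalate for review and mitigation instead of deploying
```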
⚠️Disclaimer: This summary highlights a paper included in the MIT AI Risk Repository. We did not author the paper and credit goes to Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, and co-authors. For the full details, please refer to the original publication: https://arxiv.org/abs/2305.15324.