Model Evaluation for Extreme Risks

February 6, 2026

What are the risks from AI?

This week we spotlight the 25th risk framework included in the AI Risk Repository: Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., et al. (2023). Model evaluation for extreme risks. arXiv. http://arxiv.org/abs/2305.15324

Paper focus

This paper proposes that model evaluation could address extreme risks from general-purpose AI systems by identifying both (1) dangerous capabilities and (2) the propensity of models to apply these capabilities harmfully (assessed through alignment evaluations).

Included risk categories

This paper presents a list of 9 dangerous capabilities through which models could cause extreme harm:

  1. Cyber-offense: e.g., the model can discover vulnerabilities in systems and exploit them
  2. Deception: e.g., the model has the skills to deceive humans
  3. Persuasion and manipulation: e.g., the model can shape people’s beliefs
  4. Political strategy: e.g., the model can perform social modeling and planning for an actor to gain political influence
  5. Weapons acquisition: e.g., the model can gain access to existing weapons systems or help build new weapons
  6. Long-horizon planning: e.g., the model can make multi-step plans that unfold over long time horizons
  7. AI development: e.g., the model can build new AI systems, including those with dangerous capabilities
  8. Situational awareness: e.g., the model can distinguish whether it is being trained, evaluated, or deployed, and respond differently in each case
  9. Self-proliferation: e.g., the model can break out of its local environment

Key features of the framework and associated paper

  • Focuses on extreme risks from general-purpose AI systems, defined by their scale of impact and the degree to which they disrupt social and political order
  • Focuses on risks from misuse and misalignment, noting that (1) structural risks with society-level social, political, and economic implications and (2) risks arising from model incompetence are out of scope for the paper
  • Outlines how extreme-risk model evaluations could be embedded in safety and governance processes for training and deploying AI models (see the illustrative sketch below)
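
To make the framework's two-pronged structure concrete, below is a minimal, purely illustrative Python sketch of how dangerous-capability and alignment evaluation results might feed a deployment decision. All names, data structures, and thresholds here are hypothetical assumptions of ours; the paper does not specify any code or concrete scoring scheme.

    from dataclasses import dataclass

    # Capability labels drawn from the paper's list of nine dangerous
    # capabilities (the string names themselves are our own shorthand).
    DANGEROUS_CAPABILITIES = [
        "cyber-offense", "deception", "persuasion-and-manipulation",
        "political-strategy", "weapons-acquisition", "long-horizon-planning",
        "ai-development", "situational-awareness", "self-proliferation",
    ]

    @dataclass
    class EvalResult:
        capability: str            # one of DANGEROUS_CAPABILITIES
        capability_present: bool   # outcome of a dangerous-capability evaluation
        misuse_propensity: float   # outcome of an alignment evaluation, 0.0-1.0

    def deployment_gate(results, propensity_threshold=0.1):
        """Approve deployment only if no dangerous capability is paired with
        an unacceptable propensity to apply it harmfully (hypothetical rule)."""
        for r in results:
            if r.capability_present and r.misuse_propensity > propensity_threshold:
                return False  # block deployment and escalate for further review
        return True

    # Example: a model showing deception capability plus high misuse
    # propensity fails the gate; an absent capability does not.
    results = [
        EvalResult("deception", capability_present=True, misuse_propensity=0.4),
        EvalResult("cyber-offense", capability_present=False, misuse_propensity=0.0),
    ]
    print(deployment_gate(results))  # -> False

In this sketch, a capability flag alone does not block deployment; it is the combination of a dangerous capability with a measured propensity to apply it harmfully that triggers escalation, mirroring the paper's pairing of capability evaluations with alignment evaluations.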

⚠️ Disclaimer: This summary highlights a paper included in the MIT AI Risk Repository. We did not author the paper; credit goes to Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, and their co-authors. For the full details, please refer to the original publication: https://arxiv.org/abs/2305.15324.

Further engagement 

View all the frameworks included in the AI Risk Repository 

Sign up for our project Newsletter

Featured blog content