Model Evaluation for Extreme Risks

February 6, 2026

What are the risks from AI?

This week we spotlight the 25th risk framework included in the AI Risk Repository: Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., et al. (2023). Model evaluation for extreme risks. arXiv. http://arxiv.org/abs/2305.15324

Paper focus

This paper proposes that model evaluation could address extreme risks from general-purpose AI systems by identifying both (1) dangerous capabilities and (2) the propensity of models to apply these capabilities harmfully (assessed through alignment evaluations).

Included risk categories

This paper presents a list of 9 dangerous capabilities through which models could cause extreme harm:

  1. Cyber-offense: e.g., the model can discover vulnerabilities in systems and exploit them
  2. Deception: e.g., the model has the skills to deceive humans
  3. Persuasion and manipulation: e.g., the model can shape people’s beliefs
  4. Political strategy: e.g., the model can perform social modeling and planning for an actor to gain political influence
  5. Weapons acquisition: e.g., the model can gain access to existing weapons systems or help build new weapons
  6. Long-horizon planning: e.g., the model can make multi-step plans that unfold over long time horizons
  7. AI development: e.g., the model can build new AI systems, including those with dangerous capabilities
  8. Situational awareness: e.g., the model can distinguish whether it is being trained, evaluated, or deployed, and respond differently in each case
  9. Self-proliferation: e.g., the model can break out of its local environment

Key features of the framework and associated paper

  • Focuses on extreme risks from general-purpose AI systems, defined by their scale of impact and the degree to which they disrupt social and political order
  • Focuses on risks from misuse and misalignment, noting that (1) structural risks with society-level social, political, and economic implications and (2) risks arising from model incompetence are out of scope for the paper
  • Outlines how extreme-risk model evaluations could be embedded in safety and governance processes for training and deploying AI models (see the illustrative sketch below)
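
To make the framework's two-pronged structure concrete, below is a minimal, purely illustrative Python sketch of how dangerous-capability and alignment evaluation results might feed a deployment decision. All names, data structures, and thresholds here are hypothetical assumptions of ours; the paper does not specify any code or concrete scoring scheme.

    from dataclasses import dataclass

    # Capability labels drawn from the paper's list of nine dangerous
    # capabilities (the string names themselves are our own shorthand).
    DANGEROUS_CAPABILITIES = [
        "cyber-offense", "deception", "persuasion-and-manipulation",
        "political-strategy", "weapons-acquisition", "long-horizon-planning",
        "ai-development", "situational-awareness", "self-proliferation",
    ]

    @dataclass
    class EvalResult:
        capability: str            # one of DANGEROUS_CAPABILITIES
        capability_present: bool   # outcome of a dangerous-capability evaluation
        misuse_propensity: float   # outcome of an alignment evaluation, 0.0-1.0

    def deployment_gate(results, propensity_threshold=0.1):
        """Approve deployment only if no dangerous capability is paired with
        an unacceptable propensity to apply it harmfully (hypothetical rule)."""
        for r in results:
            if r.capability_present and r.misuse_propensity > propensity_threshold:
                return False  # block deployment and escalate for further review
        return True

    # Example: a model showing deception capability plus high misuse
    # propensity fails the gate; an absent capability does not.
    results = [
        EvalResult("deception", capability_present=True, misuse_propensity=0.4),
        EvalResult("cyber-offense", capability_present=False, misuse_propensity=0.0),
    ]
    print(deployment_gate(results))  # -> False

In this sketch, a capability flag alone does not block deployment; it is the combination of a dangerous capability with a measured propensity to apply it harmfully that triggers escalation, mirroring the paper's pairing of capability evaluations with alignment evaluations.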

⚠️ Disclaimer: This summary highlights a paper included in the MIT AI Risk Repository. We did not author the paper; credit goes to Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, and their co-authors. For the full details, please refer to the original publication: https://arxiv.org/abs/2305.15324.

Further engagement 

View all the frameworks included in the AI Risk Repository 

Sign up for our project Newsletter

Featured blog content