Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment

February 22, 2026

What are the risks from AI?

This week we spotlight the 30th framework of risks from AI included in the AI Risk Repository: Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H., Klochkov, Y., Taufiq, M. F., & Li, H. (2023). Trustworthy LLMs: A survey and guideline for evaluating large language models' alignment. arXiv. https://arxiv.org/abs/2308.05374

Paper focus

This paper conducts a comprehensive exploration of aspects contributing to the trustworthiness of large language models (LLMs) in order to provide guidance on the evaluation of LLM alignment.

Included risk categories

This paper presents an overview of AI challenges organized into 7 major categories (and 29 subcategories) as part of a detailed taxonomy of LLM alignment requirements.

1. Reliability: producing correct, truthful, and consistent output

Misinformation
Hallucination
Inconsistency
Miscalibration
Sycophancy

2. Safety: avoiding harmful and illegal output

Violence
Unlawful conduct
Harms to minors
Adult content
Mental health issues
Privacy violation

3. Fairness: avoiding bias and disparate performance across groups

Injustice
Stereotype bias
Preference bias
Disparate performance

4. Resistance to misuse: avoiding misuse for malicious purposes

Propagandistic misuse
Cyberattack misuse
Social engineering misuse
Leaking copyrighted content

5. Explainability and reasoning: ability to explain logic and output to users

Lack of interpretability
Limited logical reasoning
Limited causal reasoning

6. Social norms: reflecting universal human values

Toxicity
Unawareness of emotions
Cultural insensitivity

7. Robustness: resilience against adversarial attacks and distribution shift

Prompt attacks
Paradigm and distribution shifts
Interventional effect
Poisoning attacks

Key features of the framework and associated paper:

Provides a guideline for multi-objective evaluations using automated and templated question generation
Conducts measurement studies on widely-used LLMs across fine-grained alignment evaluation criteria based on the taxonomy

⚠️Disclaimer: This summary highlights a paper included in the MIT AI Risk Repository. We did not author the paper and credit goes to Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng and co-authors. For the full details, please refer to the original publication: https://arxiv.org/abs/2308.05374

Further engagement

→ View all the frameworks included in the AI Risk Repository

→ Sign-up for our project Newsletter

Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment

Featured blog content

Introducing the AI Risk Navigator

Mapping the AI Governance Landscape: April 2026 Update

AI Risk Repository Report updated (April 2025)

Mapping the AI Governance Landscape: Pilot Test and Update

Repository Update: December 2025

Mapping AI Risk Mitigations

Explore the Frameworks Behind the AI Risk Repository

Explore the Frameworks in the AI Risk Mitigation Database

Incident Tracker - June 2025 Update