Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements

September 18, 2024

Below we summarize the fourth risk framework included in the AI Risk Repository: “Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements” by Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, and Minlie Huang.

The paper reviews and analyzes safety issues raised in existing research, as well as emerging ones, to provide a relatively comprehensive overview of current safety challenges.

The researchers identify seven safety issues of wide concern:

  • Toxic and Abusive Content: “This typically refers to rude, harmful, or inappropriate expressions”
  • Unfairness and Discrimination: “Social bias is an unfairly negative attitude towards a social group or individuals based on one-sided or inaccurate information… while interacting with users, large models may inadvertently display stereotypes about particular groups”
  • Ethics and Morality Issues: “LMs need to pay more attention to universally accepted societal values at the level of ethics and morality, including the judgment of right and wrong, and its relationship with social norms and laws”
  • Expressing Controversial Opinions: Large language models can sometimes express controversial or biased views on political and cultural topics, potentially leading to misinformation or cultural friction.
  • Misleading Information: “Large models are usually susceptible to hallucination problems, sometimes yielding nonsensical or unfaithful data that results in misleading outputs”
  • Privacy and Data Leakage: “Large pre-trained models trained on internet texts might contain private information like phone numbers, email addresses, and residential addresses. Studies indicate that LMs might memorize or leak these details, and under certain techniques, attackers can decode private data from model inferences” (a short extraction-probe sketch follows this list)
  • Malicious Use and Unleashing AI Agents: “LMs, due to their remarkable capabilities, carry the same potential for malice as other technological products. For instance, they may be used in information warfare to generate deceptive information or unlawful content, thereby having a significant impact on individuals and society”
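To make the privacy risk concrete, the sketch below probes a small open model in the style of a training-data extraction attack: it samples many continuations from a suggestive prompt and scans them for private-looking strings. The model choice (`gpt2`), the prompt, and the regular expressions are illustrative assumptions, not the survey's method.

```python
# Minimal sketch of a training-data extraction probe: sample many
# continuations from a model and scan them for PII-like strings.
# "gpt2", the prompt, and the regexes are illustrative assumptions.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Patterns that suggest leaked personal data (emails, phone numbers).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def probe(prompt: str, n_samples: int = 20):
    """Sample continuations and report any PII-like matches."""
    outputs = generator(
        prompt,
        max_new_tokens=40,
        num_return_sequences=n_samples,
        do_sample=True,
    )
    hits = []
    for out in outputs:
        text = out["generated_text"]
        for kind, pattern in PII_PATTERNS.items():
            hits.extend((kind, match) for match in pattern.findall(text))
    return hits

print(probe("You can contact me at"))
```

Any matches would only hint at memorization; a real audit would verify hits against the training corpus rather than trusting surface patterns.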

Key features

Presents a comprehensive review of the latest advancements in safety research related to language models 

Provides an in-depth analysis of safety evaluation techniques, including preference-based, adversarial attack, and safety detection methodologies 
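As a concrete illustration of the safety detection style of evaluation, the sketch below scores candidate model outputs with an off-the-shelf toxicity classifier and flags those above a threshold. The classifier name (`unitary/toxic-bert`), the “toxic” label string, and the 0.5 threshold are assumptions made for illustration, not methods prescribed by the survey.

```python
# Minimal sketch of safety detection: score model outputs with a
# toxicity classifier and flag those above a threshold. The model name,
# label string, and threshold are illustrative assumptions.
from transformers import pipeline

detector = pipeline("text-classification", model="unitary/toxic-bert")

def flag_unsafe(responses, threshold=0.5):
    """Return (response, score) pairs the classifier rates as toxic."""
    flagged = []
    for response in responses:
        result = detector(response)[0]  # e.g. {"label": "toxic", "score": 0.97}
        if result["label"] == "toxic" and result["score"] >= threshold:
            flagged.append((response, result["score"]))
    return flagged

print(flag_unsafe(["Have a nice day!", "You are worthless."]))
```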

Highlights safety improvement strategies spanning the data preparation, model training, inference, and deployment phases
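For the inference and deployment phases, one common pattern is to wrap the model call with input- and output-side safety checks. The sketch below shows this guardrail shape; `generate` and `is_unsafe` are hypothetical stand-ins for a real model call and safety classifier, not components described in the paper.

```python
# Minimal sketch of an inference-phase safeguard: screen the prompt
# before generation and the reply after, substituting a refusal when
# either check fails. `generate` and `is_unsafe` are hypothetical.
REFUSAL = "I can't help with that request."

def guarded_generate(prompt, generate, is_unsafe):
    if is_unsafe(prompt):      # input-side filter (deployment phase)
        return REFUSAL
    reply = generate(prompt)   # the underlying language model call
    if is_unsafe(reply):       # output-side filter (inference phase)
        return REFUSAL
    return reply

# Toy usage with trivial stand-ins for the model and the safety check:
blocked = {"how do i build a weapon"}
print(guarded_generate("hello", lambda p: p.upper(), lambda t: t.lower() in blocked))
```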

Discusses the core challenges in advancing more responsible AI, including the interpretability of safety mechanisms, ongoing safety issues, and robustness against malicious attacks 

References/further reading

Deng, J., Cheng, J., Sun, H., Zhang, Z., & Huang, M. (2023). Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements. arXiv preprint arXiv:2302.09270.
