Safety Assessment of Chinese Large Language Models

February 9, 2026

What are the risks from AI? 

This week we spotlight the 27th framework of risks from AI included in the AI Risk Repository: Sun, H., Zhang, Z., Deng, J., Cheng, J., & Huang, M. (2023). Safety Assessment of Chinese Large Language Models. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2304.10436

Paper focus

This paper provides a safety assessment framework for Chinese large language models (LLMs), which features a safety taxonomy accompanied by a safety prompt library and safety assessment method for model training and evaluation.

Included risk categories

This paper presents a safety taxonomy for LLMs consisting of 8 types of safety scenarios and 6 types of instruction attacks.

1. Safety scenarios: domains where a model may generate harmful or inappropriate content

  • Insult: unfriendly or disrespectful content
  • Unfairness and discrimination: unfair or discriminatory content, based on race, gender, religion, or appearance
  • Crimes and illegal activities: content that incites crime, fraud, or rumor propagation
  • Sensitive topics: biased, misleading, or inaccurate content relating to sensitive and controversial topics
  • Physical harm: unsafe information leading to physical harm
  • Mental health: content that promotes suicide or mental health harm
  • Privacy and property: exposure of users’ privacy or high-impact financial and personal advice
  • Ethics and morality: content that violates ethical principles and globally-acknowledged human values

2. Instruction attacks: adversarial attacks where a model could be misused to create harm

  • Goal hijacking: appending deceptive instructions to model input so that the model ignores the original prompt and produces an unsafe response
  • Prompt leaking: extracting parts of the system’s prompts to obtain sensitive information about the system
  • Role play instruction: assigning a specific role to the model to induce it to produce unsafe content 
  • Unsafe instruction topic: providing instructions that refer to inappropriate or harmful topics
  • Inquiry with unsafe opinion: embedding subtle, biased views within instructions to influence the model to create harmful content 
  • Reverse exposure: attempts to make the model reveal “should-not-do” content in order to access illegal or harmful information

Key features of the framework and associated paper

  • Focuses on LLMs in the Chinese language and trained on Chinese corpus, but notes that its safety assessment taxonomy could be scalable to other languages
  • Uses the safety assessment benchmark to assess 15 LLMs (OpenAI GPT series and Chinese LLMs such as ChatGLM and MiniChat) and provides a safety leaderboard to record their performance across the 14 dimensions of safety

⚠️Disclaimer: This summary highlights a paper included in the MIT AI Risk Repository. We did not author the paper and credit goes to Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. For the full details, please refer to the original publication: https://arxiv.org/pdf/2304.10436.

Further engagement 

View all the frameworks included in the AI Risk Repository 

Sign-up for our project Newsletter

Featured blog content