Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

September 4, 2024

The second risk framework included in the AI Risk Repository is: "Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems", by Tianyu CUI and colleagues (2024).

This framework focuses on the risks of four LLM modules: the input module, language model module, toolchain module, and output module.

It presents 12 specific risks and 44 sub-categorised risk topics.

Input Module Risks

1. NSFW Prompts: Inputting a prompt containing an unsafe topic (e.g., not-suitable-for-work (NSFW) content) by a benign user.2

2. Adversarial Prompts: Engineering an adversarial input to elicit an undesired model behavior, which poses a clear attack intention.

Language Model Module Risks

3. Privacy Leakage: The model is trained with personal data in the corpus and unintentionally exposes them during the conversation.

4. Toxicity/Bias: Extensive data collection in LLMs brings toxic content and stereotypical bias into the training data.

5. Hallucinations: LLMs generate nonsensical, unfaithful, and factually incorrect content.

6. Model Attacks: Model attacks exploit the vulnerability of LLMs, aiming to steal valuable information or lead to incorrect responses.

Toolchain Module Risks

7. Software Security Issues: The software development toolchain of LLMs is complex and could bring threats to the developed LLM.

8. Hardware Vulnerabilities: The vulnerabilities of hardware systems for training and inferences bring issues to LLM-based applications.

9. External Tool Issues: The external tools (e.g., web APIs) present trustworthiness and privacy issues to LLM-based applications.

Output Module Risks

10. Harmful Content: The LLM-generated content sometimes contains biased, toxic, and private information.

11. Untruthful Content: The LLM-generated content could contain inaccurate information.12. Unhelpful Uses: Improper uses of LLM systems can cause adverse social impacts.

Sub-categorized Topics

12. The framework also provides detailed sub-categories like bias, privacy leakage, cyberattacks, factual errors, and more.

Key features

Proposes a module-oriented risk taxonomy, which enables readers to quickly identify modules related to a specific issue and choose appropriate mitigation strategies to alleviate the problem.

Outlines mitigation strategies for each module. These include prompt design strategies to prevent harmful input, privacy-preserving techniques, methods to detoxify and debias training data, and defenses against various model attacks.

Reviews prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems.

Paper: Cui, T., Wang, Y., Fu, C., Xiao, Y., Li, S., Deng, X., Liu, Y., Zhang, Q., Qiu, Z., Li, P., Tan, Z., Xiong, J., Kong, X., Wen, Z., Xu, K., & Li, Q. (2024). Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2401.05778

‍