The second risk framework included in the AI Risk Repository is: "Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems", by Tianyu CUI and colleagues (2024).
This framework focuses on the risks of four LLM modules: the input module, language model module, toolchain module, and output module.
It presents 12 specific risks and 44 sub-categorised risk topics.
Input Module Risks
1. NSFW Prompts: Inputting a prompt containing an unsafe topic (e.g., not-suitable-for-work (NSFW) content) by a benign user.2
2. Adversarial Prompts: Engineering an adversarial input to elicit an undesired model behavior, which poses a clear attack intention.
Language Model Module Risks
3. Privacy Leakage: The model is trained with personal data in the corpus and unintentionally exposes them during the conversation.
4. Toxicity/Bias: Extensive data collection in LLMs brings toxic content and stereotypical bias into the training data.
5. Hallucinations: LLMs generate nonsensical, unfaithful, and factually incorrect content.
6. Model Attacks: Model attacks exploit the vulnerability of LLMs, aiming to steal valuable information or lead to incorrect responses.
Toolchain Module Risks
7. Software Security Issues: The software development toolchain of LLMs is complex and could bring threats to the developed LLM.
8. Hardware Vulnerabilities: The vulnerabilities of hardware systems for training and inferences bring issues to LLM-based applications.
9. External Tool Issues: The external tools (e.g., web APIs) present trustworthiness and privacy issues to LLM-based applications.
Output Module Risks
10. Harmful Content: The LLM-generated content sometimes contains biased, toxic, and private information.
11. Untruthful Content: The LLM-generated content could contain inaccurate information.12. Unhelpful Uses: Improper uses of LLM systems can cause adverse social impacts.
Sub-categorized Topics
12. The framework also provides detailed sub-categories like bias, privacy leakage, cyberattacks, factual errors, and more.
Proposes a module-oriented risk taxonomy, which enables readers to quickly identify modules related to a specific issue and choose appropriate mitigation strategies to alleviate the problem.
Outlines mitigation strategies for each module. These include prompt design strategies to prevent harmful input, privacy-preserving techniques, methods to detoxify and debias training data, and defenses against various model attacks.
Reviews prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems.
Paper: Cui, T., Wang, Y., Fu, C., Xiao, Y., Li, S., Deng, X., Liu, Y., Zhang, Q., Qiu, Z., Li, P., Tan, Z., Xiong, J., Kong, X., Wen, Z., Xu, K., & Li, Q. (2024). Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2401.05778