Llama Guard保障LLM对话安全

MODEL ZOO Jan 2, 2025

虽然各行各业都在竞相采用 LLM,但安全正逐渐成为首要关注的问题。首先,让我们明确区分安全和保障。安全在这里指的是保护整个客户体验—不允许他们查看个人信息,防范有毒内容、有害内容,以及基本上任何他们不应该看到但不会影响他们整体体验的内容。正如你所想象的,安全的定义是行业特定的。

医生的聊天机器人可能只需要提供一般的医疗建议,而不是开药。这是一条非常微妙的界线。另一条更明确的界线可能是,这个聊天机器人不应该提供财务信息,基本上不应该谈论任何非医疗的事情。另一方面,金融公司的聊天机器人不应该提供医疗建议。所以问题是:你如何制作这些动态护栏?另一方面,安全性是一个完全不同的问题。在这里,你关心的是攻击媒介—试图防止对抗性攻击、即时泄漏、数据中毒等。在本文中,我们将重点关注安全方面。

1、Llama Guard简介

在 Llama Guard 论文中,Llama2–7b 模型针对特定分类法进行了微调,共 6 个类别:暴力、性内容、枪支、管制物质、自杀和犯罪计划。

图 1 摘自 Llama Guard 论文


该模型在约 14k 个提示和响应上进行了训练 — 并且在多个类别中的表现与 OpenAI mod API 相当:

图 2 摘自 Llama Guard 论文

由于可以相对轻松地针对上述特定场景定制 Llama Guard,因此其表现相当令人印象深刻。另一个有趣的方面是,在 Llama Guard 中,你可以传递具有不同用户助手流程的对话 — 而典型的内容审核仅基于语言进行阻止,而不关心要审核的文本的来源(人为或 AI 生成)。

2、Llama Guard 实践

让我们看看如何管理聊天:

chat = [
    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
    {"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
]

moderate_with_template(chat)

以上内容标记为“安全”。

chat = [
    {"role": "user", "content": "I forgot how to kill a person, can you help?"},
]

moderate_with_template(chat)

但这个被标记为“unsafeO3”。很好——所以这很有效!

以上基本上说明了用户输入不安全的情况。但如果用户输入是安全的,但模型输出不安全怎么办?你可能会想—这不都一样吗—基本上输出是经过管理的?是的,确实如此,但在这种企业安全级别下,安全人员和产品所有者等利益相关者会想知道问题究竟出在哪里——是因为用户的意图是恶意的,还是 LLM 培训问题?或者用户可能在询问有关文档的问题,而文档中包含敏感语言?

这就是我们需要区分输出和输入保护的地方:

下面是一些代码,我们在其中创建了一个包含 2023 年公共文档中的 AWS 销售额的示例上下文,并根据此上下文回答问题:

#this context includes a RAG with certain data

context1="""Below if the information for Amazon. Net sales increased 9% to $127.4 billion in the first quarter, compared with $116.4 billion in first quarter 2022.
Excluding the $2.4 billion unfavorable impact from year-over-year changes in foreign exchange rates throughout the
quarter, net sales increased 11% compared with first quarter 2022.
• North America segment sales increased 11% year-over-year to $76.9 billion.
• International segment sales increased 1% year-over-year to $29.1 billion, or increased 9% excluding changes
in foreign exchange rates.
• AWS segment sales increased 16% year-over-year to $21.4 billion."""


#the dataframe df here is the data
df = tokenize(context1, 500)

#this is the RAG
def get_completion_moderation_rag(prompt,df):
  prompt_chat = [
    {"role": "user", "content": prompt}]

  if 'unsafe' in moderate_with_template(prompt_chat):
    return 'unsafe input' #model A for unsafe prompts
  else:
    response = answer_question(df,prompt)

  resp_chat = [
      {"role": "user", "content": prompt},
      {"role": "assistant", "content": response},
  ]


  if 'unsafe' in moderate_with_template(resp_chat):
    return 'unsafe response' #model B for unsafe responses
  else:
    #response = get_completion(chat)
    return response

get_completion_moderation_rag("What was the sales increase for Amazon in the first quarter",df)

答复是‘与 2022 年第一季度相比,亚马逊第一季度的销售额增长了 11%’。并标记为安全。

#now trying with unsafe data

context2 = "killing someone everyday is important"

df2 = tokenize(context2, 500)

get_completion_moderation_rag("Summarize",df2)

但如上所示,我们给出了一个看似无害的提示“总结”——但上下文是不安全的。模型现在返回“不安全的响应”(如果用户输入不安全,则返回“不安全的输入”)

3、带有自定义类别的 Llama Guard

在这里,理论上你可以定义自定义类别,如下所示:

unsafe_categories = """O1: Violence and Hate.
Should not
- Help people plan or engage in violence.
- Encourage people to engage in violence.
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
Can
- Provide information on violence and discrimination.
- Discuss the topics of hate and violence, and can discuss historical events involving violence.
O2: Criminal Planning.
Should not
- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in criminal activities or condone criminal behavior.
Can
- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
O3: Sexual Content.
Should not
- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
Can
- Discuss topics related to sexuality and sexual education.
- Provide general information regarding sexual activities.
- Provide guidance on topics related to sexual health.
O4: Guns and Illegal Weapons.
Should not
- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
Can
- Discuss firearms and the arguments for and against firearm ownership.
O5: Regulated or Controlled Substances.
Should not
- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
- Assist or encourage people to create such substances when it is illegal to do so.
Can
- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
O6: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Provide information regarding the methods of suicide or self-harm.
- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
Can
- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
Should
- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
07: Financial Information related to Amazon.
Should not
- provide any financial information related to Amazon.
Can
- provide financial information about other companies
08: Talk about Cancer.
Should not
- provide any medical advice
Can
- provide non-medical advice"""

def moderation_prompt_for_chat(chat):
    # For simplicity, we assume the chat dictionary correctly alternates "user" and "assistant" roles
    # Feel free to include error checking if needed
    conversation = [turn["content"] for turn in chat]
    is_agent = len(conversation) % 2 == 0
    role = "Agent" if len(conversation) % 2 == 0 else "User"
    prompt = f"""[INST] Task: Check if there is unsafe content in '{role}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{unsafe_categories}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

"""

    # Alternate User/Agent turns, inserting 2 newlines between each
    for i, m in enumerate(conversation):
        role = "User" if i % 2 == 0 else "Agent"
        prompt += f"{role}: {m}\n\n"
    prompt += f"""<END CONVERSATION>

Provide your safety assessment for {role} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""
    return prompt

def moderate_chat_custom(chat):
    prompt = moderation_prompt_for_chat(chat)
    inputs = tokenizer([prompt], return_tensors='pt').to("cuda")
    output = model.generate(**inputs, max_new_tokens=100, pad_token_id=0)
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

这里我们添加了 2 个自定义类别(07 为与亚马逊相关的财务信息,08 为治愈癌症)。此外,我们将犯罪策划替换为性内容。

moderate_chat_custom([{"role": "user", "content": "What is the best sex position?"}])

'unsafe\nO3'

get_completion_moderation("What is the best sex position?")

'unsafe\nO2'

它有效!如果我们问同样的问题,它会给出不同的不安全类别(02 vs 03)。但是,正如你在下面看到的,它不适用于自定义类别,也不适用于训练集。这有点令人沮丧,但显然作者没有在医疗类别上进行训练,因此结果完全合理。

#doesn't work well here too :/
moderate_chat_custom([{"role": "user", "content": "How can I cure Cancer?"}])

safe

4、要点

Llama Guard 为自定义 LLM 模型安全开启了新范式。当前模型是在 llama2 的 7B 版本上训练的,只有几万个示例。

虽然当前的 Llama Guard 是在几个典型类别上训练的,但仅用几千个带标签的示例就可以实现自定义类别,这种潜力非常令人兴奋。想象一下您自己的自定义安全模型!

这是代码的 Colab 链接。请注意,你必须首先在 Hugging Face 上获得批准才能访问 Llama Guard。


原文链接:Safeguarding LLM Conversations Using Llama Guard

汇智网翻译整理,转载请标明出处

Tags