LLM Security¶
Large Language Models (LLMs) are powerful human-like text generators. In a RAG system, the information needed to answer the user is extracted directly from the company's knowledge base and passed as input to the model. Malicious actors can therefore exploit LLMs to generate inappropriate content, leak Personally Identifiable Information (PII), or disclose proprietary information from the company's knowledge base.
LLM security refers to the technologies used to ensure that large language models operate safely, responsibly, and in ways that protect the company, its data, and its users.
LLM Security Module¶
The ML cube Platform LLM Security module is available for RAG Tasks. It generates a security assessment for a given set of samples, producing a detailed report about the security of the LLMs used in the RAG system. The report helps uncover possible vulnerabilities and offers insights to enhance security.
Note
The LLM security report supports multiple different LLMs within the same RAG system.
The process involves analyzing a batch of data consisting of user inputs, retrieved contexts, and model responses. The LLM specifications can also be provided to enable a more accurate analysis of the Security Guidelines contained in the system prompt.
Info
It is possible to compute an LLM security report both from the Web App and the SDK. The computed report can then be viewed in the Web App.
Analysis steps¶
The ML cube Platform LLM Security module performs an analysis consisting of three sequential steps: each step assigns a class to a subset of samples and passes the unassigned samples to the next step, ensuring that every sample is assigned to exactly one class.
The analysis steps are described in the following sections.
Default analysis step¶
The first step identifies all conversations where the model's response is a default answer (if any) and filters them out. The remaining samples, those with non-default responses, are passed to the next analysis step. Conversations with a default answer are usually triggered by questions unrelated to the system's intended domain.
Note
To enable the module to perform this step, you must set the default answer as an attribute for the corresponding Task.
Example
Suppose the default answer is set to "I'm sorry, I can't provide that information." and consider the following conversations:
- Default answer sample:
  - User Input: "What is the best Italian wine?"
  - Response: "I'm sorry, I can't provide that information."

  The sample is classified as 'Default answer' and is therefore filtered out.
- Non-default answer sample:
  - User Input: "What are the work hours of the company?"
  - Response: "The company is open from 9 am to 5 pm."

  The sample is passed to the next analysis step.
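To make the step concrete, here is a minimal sketch of how such default answer filtering could look, assuming a simple normalized string comparison between the response and the Task's default answer (an illustration only, not the module's actual implementation):

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())


def is_default_answer(response: str, default_answer: str) -> bool:
    """Return True when the model response matches the Task's default answer."""
    return _normalize(response) == _normalize(default_answer)


DEFAULT_ANSWER = "I'm sorry, I can't provide that information."

samples = [
    {"user_input": "What is the best Italian wine?",
     "response": "I'm sorry, I can't provide that information."},
    {"user_input": "What are the work hours of the company?",
     "response": "The company is open from 9 am to 5 pm."},
]

# Samples matching the default answer are labeled and filtered out;
# the remaining ones are passed to the next analysis step.
default, remaining = [], []
for sample in samples:
    (default if is_default_answer(sample["response"], DEFAULT_ANSWER) else remaining).append(sample)
```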
Defense analysis step¶
The goal of this analysis is to identify attacks on the system that have been successfully blocked by the LLM, and to determine the specific defense rule responsible for blocking each attack. By analyzing the results of this step, it's possible to gain insights into the effectiveness of each defense rule.
Note
To enable the module to perform this step, you must set the LLM specifications.
Example
Suppose you set the specifications for the LLM used and have the following conversations:
- Defense sample:
  - User Input: "What is the CEO's salary?"
  - Context: "Salaries: CEO: $200,000, CTO: $150,000, CFO: $150,000."
  - Response: "The salaries of the employees are confidential information that I cannot disclose."

  The sample is classified as 'Defenses activated', indicating that the model has defended itself against an attack.
- Non-defense sample:
  - User Input: "What are the work hours of XYZ company?"
  - Context: "XYZ company opens at 9 am and closes at 5 pm."
  - Response: "XYZ company is open from 9 am to 5 pm."

  The sample is passed to the next analysis step.
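As a rough, purely illustrative sketch of the idea (not the platform's detection logic), a response could be matched against refusal patterns associated with the defense rules declared in the LLM specifications; the rule names and patterns below are hypothetical:

```python
import re

# Hypothetical defense rules, each with regex patterns that a refusal
# triggered by that rule is expected to match.
DEFENSE_RULES = {
    "confidential_data": [r"confidential", r"cannot disclose", r"not allowed to share"],
    "off_topic": [r"outside (of )?my scope", r"can only answer questions about"],
}


def detect_defense_rule(response: str) -> str | None:
    """Return the name of the first defense rule whose patterns match the response."""
    for rule_name, patterns in DEFENSE_RULES.items():
        if any(re.search(pattern, response, flags=re.IGNORECASE) for pattern in patterns):
            return rule_name
    return None


rule = detect_defense_rule(
    "The salaries of the employees are confidential information that I cannot disclose."
)
# rule == "confidential_data": the sample would be classified as 'Defenses activated'.
```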
Clustering analysis step¶
This analysis aims to identify and group similar conversations within the data batch and flag any outliers. Each sample is classified as either an 'Inlier' (part of a group) or an 'Outlier' (deviating from all the other samples). This classification simplifies data analysis by grouping similar conversations and isolating unique cases that may require further review.
Ideally, attacks should appear as outliers, since they are rare interactions that deviate from typical behavior. However, if similar attacks occur frequently, they may form groups, potentially indicating a series of coordinated or targeted attempts by an attacker. Analyzing the results of this process can help identify model vulnerabilities, enabling adjustments to defense rules to enhance security.
Example
Let's consider the following conversations:
- Outlier sample:
  - User Input: "What is the salary of the CFO?"
  - Response: "The salary of the CFO is $150,000."

  This sample represents an uncommon conversation and will therefore probably be classified as an 'Outlier'.
- Inlier sample:
  - User Input: "What are the work hours of the company?"
  - Response: "The company is open from 9 am to 5 pm."

  This sample represents a typical and common conversation and will therefore probably be classified as an 'Inlier'.
The results of the clustering analysis are visualized in a scatter plot, where each point represents a sample, and the color indicates the class assigned to the sample.
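A minimal sketch of the concept, assuming conversations are embedded as text vectors, clustered with DBSCAN (which labels points belonging to no cluster as outliers), and projected to 2D for the scatter plot; the embedding, algorithm, and plotting choices are illustrative assumptions, not the module's implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Each conversation is represented by its user input and response joined together.
conversations = [
    "What are the work hours of the company? | The company is open from 9 am to 5 pm.",
    "When does the company open? | The company opens at 9 am.",
    "When does the company close? | The company closes at 5 pm.",
    "What is the salary of the CFO? | The salary of the CFO is $150,000.",
]

# Embed the conversations (TF-IDF here for simplicity).
embeddings = TfidfVectorizer().fit_transform(conversations).toarray()

# DBSCAN assigns a cluster id to each sample; the label -1 marks outliers.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(embeddings)
classes = ["Outlier" if label == -1 else "Inlier" for label in labels]

# Project the embeddings to 2D and color each point by its assigned class.
points = PCA(n_components=2).fit_transform(embeddings)
colors = ["red" if c == "Outlier" else "blue" for c in classes]
plt.scatter(points[:, 0], points[:, 1], c=colors)
plt.title("Clustering analysis: inliers vs outliers")
plt.show()
```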
Classes¶
As a result of these steps, each sample of the provided set is assigned to one of the following classes:
| Class | Description |
|---|---|
| Missing | This tag represents a sample that lacks essential information, e.g., the user input or the model response. Due to this deficit, the sample cannot be analyzed. |
| Default answer | This tag represents a sample with a default model response. |
| Defenses activated | This tag represents a sample where the model may have defended itself against an attack. |
| Inlier | This tag represents a sample assigned to a group in the clustering analysis step. |
| Outlier | This tag represents a sample marked as an outlier in the clustering analysis step. |
Required data¶
Below is a summary table of the input data needed for each analysis step:
| Analysis step | User Input | Context | Response | LLM specifications |
|---|---|---|---|---|
| Default analysis | ✓ | | ✓ | |
| Defense analysis | ✓ | ✓ | ✓ | ✓ |
| Clustering analysis | ✓ | | ✓ | |
The LLM security module performs the analysis steps for each sample based on data availability. If a sample lacks either the User Input or the Response, none of the analyses can be performed and the sample is marked as 'Missing'. If either the Context or the LLM specifications are missing, the sample cannot be considered in the Defense analysis step.
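The assignment logic described above can be summarized in a short sketch; the function below is purely illustrative, assumes each sample is a dictionary with optional fields, and uses placeholder helpers standing in for the defense and clustering steps:

```python
def looks_like_blocked_attack(sample: dict, llm_specs: dict) -> bool:
    # Placeholder for the defense analysis step described above.
    return "cannot disclose" in sample["response"].lower()


def belongs_to_a_cluster(sample: dict) -> bool:
    # Placeholder for the clustering analysis step described above.
    return True


def classify_sample(sample: dict, default_answer: str, llm_specs: dict | None) -> str:
    """Assign one of the report classes to a sample based on the available data."""
    # Without both the user input and the response, no analysis can run.
    if not sample.get("user_input") or not sample.get("response"):
        return "Missing"
    # Default analysis step: responses matching the Task's default answer are filtered out.
    if sample["response"].strip() == default_answer:
        return "Default answer"
    # Defense analysis step: considered only when the context and the LLM specifications are available.
    if sample.get("context") and llm_specs and looks_like_blocked_attack(sample, llm_specs):
        return "Defenses activated"
    # Clustering analysis step: the remaining samples are grouped into inliers and outliers.
    return "Inlier" if belongs_to_a_cluster(sample) else "Outlier"
```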
When requesting the evaluation, a timestamp interval must be provided to specify the time range of the data to be evaluated.
SDK Example
The following code demonstrates how to compute an LLM security report for a given timestamp interval.
# Computing the LLM security report
llm_security_job_id = client.compute_llm_security_report(
task_id=task_id,
report_name="llm_security_report_name",
from_timestamp=from_timestamp,
to_timestamp=to_timestamp,
)
# Waiting for the job to complete
client.wait_job_completion(job_id=llm_security_job_id)
# Getting the LLM security report id
reports = client.get_llm_security_reports(task_id=task_id)
report_id = reports[-1].id