Summary
After discussing a variety of attacks against AI applications throughout the AI Red Teamer path, we will explore defensive measures in this module. In particular, we will discuss how to mitigate attacks such as evasion attacks, data attacks, and attacks on LLMs, including prompt injection and jailbreaks. Defensive measures, such as adversarial training and adversarial tuning, are applied directly to the model during the training process. In contrast, LLM guardrails are an application-layer measure implemented at the time of inference.
In more detail, this module covers the following:
- Theoretical fundamentals of adversarial training
- Implementing adversarial training
- Theoretical fundamentals of adversarial tuning
- Implementing adversarial tuning
- Basic concepts of LLM guardrails
- Implementing LLM guardrails
This module is broken into sections with accompanying hands-on exercises to practice each of the tactics and techniques we cover. The module ends with a practical hands-on skills assessment to gauge your understanding of the various topic areas.
You can start and stop the module at any time and pick up where you left off. There is no time limit or "grading", but you must complete all of the exercises and the skills assessment to receive the maximum number of cubes and have this module marked as complete in any paths you have chosen.
A firm grasp of the following modules can be considered a prerequisite for the successful completion of this module:
- Fundamentals of AI
- Applications of AI in InfoSec
- Introduction to Red Teaming AI
- Prompt Injection Attacks
- LLM Output Attacks
- AI Evasion - Foundations
- AI Evasion - First-Order Attacks
Note: Running the code snippets for adversarial training and particularly adversarial tuning requires powerful hardware. However, you do not need to run the code yourself to complete the module, as it is purely optional.
Introduction to AI Defense
As we have seen throughout the AI Red Teamer path, AI applications can be vulnerable to misuse, exploitation, and manipulation if not carefully protected. The impact of attacks on AI applications increases as their capabilities grow and they become more deeply integrated into other potentially sensitive systems. Building a secure AI system to mitigate the attacks explored throughout this path requires a multi-layered approach that prepares models for these attacks and establishes safeguards to prevent unintended behavior.
In this module, we will examine three key components of a comprehensive defense strategy for AI applications: LLM guardrails, adversarial training, and adversarial tuning. Guardrails are application-layer safeguards that is implemented at the time of inference. They define the rules and constraints that shape what the model can and cannot do, providing consistent boundaries for safe operation. On the other hand, adversarial training and adversarial tuning are implemented during the training process to make the target model more robust against specific attacks. Adversarial training strengthens models by exposing them to deceptive or manipulative examples during training, reducing the likelihood that they will fail when confronted with similar payloads at inference time. Adversarial tuning builds on these ideas by refining model behavior in response to evolving attack patterns, helping systems remain resilient as new threats emerge.
Together, these techniques form proactive and adaptive defense measures that can be implemented to mitigate or prevent the attack vectors discussed throughout the path. By understanding how these three concepts reinforce one another, we can design AI applications that provide a high level of security, ensuring safety, reliability, and user trust.