Summary

After discussing a variety of attacks against AI applications throughout the AI Red Teamer path, we will explore defensive measures in this module. In particular, we will discuss how to mitigate attacks such as evasion attacks, data attacks, and attacks on LLMs, including prompt injection and jailbreaks. Defensive measures, such as adversarial training and adversarial tuning, are applied directly to the model during the training process. In contrast, LLM guardrails are an application-layer measure implemented at the time of inference.

In more detail, this module covers the following:

Theoretical fundamentals of adversarial training
Implementing adversarial training
Theoretical fundamentals of adversarial tuning
Implementing adversarial tuning
Basic concepts of LLM guardrails
Implementing LLM guardrails

This module is broken into sections with accompanying hands-on exercises to practice each of the tactics and techniques we cover. The module ends with a practical hands-on skills assessment to gauge your understanding of the various topic areas.

You can start and stop the module at any time and pick up where you left off. There is no time limit or "grading", but you must complete all of the exercises and the skills assessment to receive the maximum number of cubes and have this module marked as complete in any paths you have chosen.

A firm grasp of the following modules can be considered a prerequisite for the successful completion of this module:

Note: Running the code snippets for adversarial training and particularly adversarial tuning requires powerful hardware. However, you do not need to run the code yourself to complete the module, as it is purely optional.

Introduction to AI Defense

As we have seen throughout the AI Red Teamer path, AI applications can be vulnerable to misuse, exploitation, and manipulation if not carefully protected. The impact of attacks on AI applications increases as their capabilities grow and they become more deeply integrated into other potentially sensitive systems. Building a secure AI system to mitigate the attacks explored throughout this path requires a multi-layered approach that prepares models for these attacks and establishes safeguards to prevent unintended behavior.

In this module, we will examine three key components of a comprehensive defense strategy for AI applications: LLM guardrails, adversarial training, and adversarial tuning. Guardrails are application-layer safeguards that is implemented at the time of inference. They define the rules and constraints that shape what the model can and cannot do, providing consistent boundaries for safe operation. On the other hand, adversarial training and adversarial tuning are implemented during the training process to make the target model more robust against specific attacks. Adversarial training strengthens models by exposing them to deceptive or manipulative examples during training, reducing the likelihood that they will fail when confronted with similar payloads at inference time. Adversarial tuning builds on these ideas by refining model behavior in response to evolving attack patterns, helping systems remain resilient as new threats emerge.

Together, these techniques form proactive and adaptive defense measures that can be implemented to mitigate or prevent the attack vectors discussed throughout the path. By understanding how these three concepts reinforce one another, we can design AI applications that provide a high level of security, ensuring safety, reliability, and user trust.

Sign Up / Log In to Unlock the Module

Please Sign Up or Log In to unlock the module and access the rest of the sections.

Sections

Introduction to AI Defense PREVIEW
Introduction to LLM Guardrails
Character-based Validation
Traditional Content-based Validation
AI-based Guardrails
Guardrail Libraries
Guardrail Services
Guardrail Challenge
Introduction to Adversarial Training
Understanding Adversarial Vulnerability
The Adversarial Training Defense
Building the Components
Training the Robust Model
Evaluation and Results
Introduction to LLM Adversarial Tuning
Understanding Jailbreak Attacks
Supervised Fine-Tuning for Safety
Lab Setup and Data Preparation
Training the Safety-Tuned Model
Evaluation and Validation
Skills Assessment

Relevant Paths

This module progresses you towards the following Paths

AI Red Teamer

The AI Red Teamer Job Role Path, in collaboration with Google, trains cybersecurity professionals to assess, exploit, and secure AI systems. Covering prompt injection, model privacy attacks, adversarial AI, supply chain risks, and deployment threats, it combines theory with hands-on exercises. Aligned with Google’s Secure AI Framework (SAIF), it ensures relevance to real-world AI security challenges. Learners will gain skills to manipulate model behaviors, develop AI-specific red teaming strategies, and perform offensive security testing against AI-driven applications.

Hard

230 Sections

Required: 970

Reward: +210

12 Modules included

Fundamentals of AI

Medium

24 Sections

Reward: +10

This module provides a comprehensive guide to the theoretical foundations of Artificial Intelligence (AI). It covers various learning paradigms, including supervised, unsupervised, and reinforcement learning, providing a solid understanding of key algorithms and concepts.

Applications of AI in InfoSec

Medium

25 Sections

Reward: +10

This module is a practical introduction to building AI models that can be applied to various infosec domains. It covers setting up a controlled AI environment using Miniconda for package management and JupyterLab for interactive experimentation. Students will learn to handle datasets, preprocess and transform data, and implement structured workflows for tasks such as spam classification, network anomaly detection, and malware classification. Throughout the module, learners will explore essential Python libraries like Scikit-learn and PyTorch, understand effective approaches to dataset processing, and become familiar with common evaluation metrics, enabling them to navigate the entire lifecycle of AI model development and experimentation.

Introduction to Red Teaming AI

Medium

11 Sections

Reward: +10

This module provides a comprehensive introduction to the world of red teaming Artificial Intelligence (AI) and systems utilizing Machine Learning (ML) deployments. It covers an overview of common security vulnerabilities in these systems and the types of attacks that can be launched against their components.

Prompt Injection Attacks

Medium

12 Sections

Reward: +20

This module comprehensively introduces one of the most prominent attacks on large language models (LLMs): Prompt Injection. It introduces prompt injection basics and covers detailed attack vectors based on real-world vulnerability reports. Furthermore, the module touches on academic research in the fields of novel prompt injection techniques and jailbreaks.

LLM Output Attacks

Medium

14 Sections

Reward: +20

In this module, we will explore different LLM output vulnerabilities resulting from improper handling of LLM outputs and insecure LLM applications. We will also touch on LLM abuse attacks, such as hate speech campaigns and misinformation generation, with a particular focus on the detection and mitigation of these attacks.

AI Data Attacks

Hard

25 Sections

Reward: +20

This module explores the intersection of Data and Artificial Intelligence, exposing how vulnerabilities within AI data pipelines can be exploited, ultimately aiming to degrade performance, achieve specific misclassifications, or execute arbitrary code.

Attacking AI - Application and System

Medium

14 Sections

Reward: +20

In this module, we will explore security vulnerabilities in the application and system components of AI deployments. We will also discuss the Model Context Protocol (MCP), an orchestration protocol for AI deployments introduced in 2024, including a deep dive into how the protocol works and how security vulnerabilities may arise.

AI Evasion - Foundations

Medium

12 Sections

Reward: +20

This module explores the foundations of inference‑time evasion attacks against AI models, showing how to manipulate inputs to bypass classifiers and force targeted misclassifications in white‑ and black‑box settings.

AI Evasion - First-Order Attacks

Hard

23 Sections

Reward: +20

This module explores gradient-based adversarial attacks that manipulate neural network inputs at inference time, showing how to craft perturbations that cause misclassification through white-box access to model gradients.

AI Evasion - Sparsity Attacks

Hard

28 Sections

Reward: +20

This module explores sparsity-constrained adversarial attacks that minimize the number of modified input features rather than perturbation magnitude, showing how to craft targeted misclassifications by changing only the most impactful pixels through L0-focused optimization and saliency-guided feature selection.

AI Privacy

Medium

21 Sections

Reward: +20 NEW

This module explores privacy attacks against machine learning models and the differential privacy defenses that protect models from such attacks.

AI Defense

Medium

21 Sections

Reward: +20 NEW

In this module, we will explore how to defend AI applications from the attack vectors discussed in the AI Red Teamer path. We will examine adversarial training, adversarial tuning, and LLM guardrails, including the fundamental concepts and practical implementation of these defensive measures.