
AI Evasion - First-Order Attacks

This module explores gradient-based adversarial attacks that manipulate neural network inputs at inference time, showing how to craft perturbations that cause misclassification through white-box access to model gradients.

Created by PandaSt0rm

Hard Offensive

Summary

Neural networks achieve high accuracy on clean data but remain vulnerable to carefully crafted perturbations that exploit their differentiable structure. When an attacker has access to model gradients, tiny input modifications aligned with the loss surface can cross decision boundaries and force misclassifications. These first-order attacks use gradient information to generate adversarial examples efficiently, exposing fundamental vulnerabilities of gradient-trained models.

This module provides a deep exploration of gradient-based evasion techniques that target neural network classifiers:

  • Foundations of gradient-based attacks, including norm constraints, local linearity, and high-dimensional accumulation effects.
  • Fast Gradient Sign Method (FGSM), which generates adversarial examples through a single gradient step under an L∞ budget.
  • Iterative FGSM (I-FGSM), which refines perturbations through multiple smaller steps with projection to improve attack success (see the sketch after this list).
  • DeepFool, which finds minimal perturbations through iterative linearization and geometric projection onto decision boundaries.
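
For orientation, here is a minimal sketch of the I-FGSM loop mentioned above. It assumes a PyTorch classifier with inputs scaled to [0, 1]; model, x, y, and the step sizes are illustrative placeholders rather than anything specific to this module.

```python
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, eps=0.1, alpha=0.01, steps=10):
    """Iterative FGSM: repeated small signed-gradient steps, projected back into the L-infinity ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()           # small step in the loss-increasing direction
            x_adv = x + (x_adv - x).clamp(-eps, eps)      # project back into the eps-ball around x
            x_adv = x_adv.clamp(0.0, 1.0)                 # keep pixels in a valid range
    return x_adv.detach()
```

The projection step after each update is what keeps the accumulated perturbation inside the L∞ budget eps, no matter how many iterations run.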

This module is broken into sections with hands-on exercises for each attack method. It concludes with a practical skills assessment to validate your understanding.

You can start and stop at any time and resume where you left off. There is no time limit or grading, but you must complete all exercises and the skills assessment to receive the maximum cubes and have the module marked as complete in any selected paths.

To ensure a smooth learning experience, the following skills are mandatory: solid Python proficiency, familiarity with Jupyter Notebooks, and basic understanding of neural networks and gradient computation.

A firm grasp of the prerequisite modules is recommended before starting.

It is HIGHLY recommended to use your own PC/Laptop for the practicals.

Introduction to First-Order Evasion Attacks

Machine learning models appear robust when tested on normal data, often achieving 98-99% accuracy on held-out test sets. Yet these same models can be fooled by adversarial examples: inputs carefully modified to cause misclassification while remaining nearly identical to the original (that is the goal, at least). A digit image that clearly shows a "7" to human eyes can be tweaked by less than 1% and suddenly the model confidently predicts "2."

What makes these attacks possible? Neural networks learn by following gradients during training, adjusting weights to minimize loss. The same gradient information that trains the model also reveals its vulnerabilities. First-order attacks exploit this by computing gradients with respect to inputs rather than parameters, using calculus to find directions that maximally increase prediction error.

What Are First-Order Attacks?

First-order attacks use gradient information to craft adversarial examples. During normal training, gradients tell us how to adjust model weights to reduce error. During an attack, gradients tell us how to adjust inputs to increase error.

Think of it as asking the model: "If I change this pixel slightly, how much does your prediction change?" The gradient provides that answer for every pixel simultaneously. An attacker uses those answers to make coordinated changes across all pixels, pushing the input across a decision boundary while keeping the total modification small.
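
Concretely, that question is answered by backpropagating the loss down to the input pixels instead of the weights. The snippet below is a minimal sketch assuming a PyTorch classifier; the tiny model, the random image x, and the label y are stand-ins for a real target.

```python
import torch
import torch.nn.functional as F

# Placeholder model and data -- in practice these come from the target classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
model.eval()

x = torch.rand(1, 1, 28, 28)          # input image (batch of one)
y = torch.tensor([7])                 # true label

x.requires_grad_(True)                # track gradients w.r.t. the INPUT, not the weights
loss = F.cross_entropy(model(x), y)   # how wrong is the model on this input?
loss.backward()                       # backpropagate all the way down to the pixels

input_grad = x.grad                   # one value per pixel: how the loss changes if that pixel changes
print(input_grad.shape)               # torch.Size([1, 1, 28, 28])
```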

These attacks work in different scenarios. In white-box attacks, the attacker has complete access to the model and can compute exact gradients. In black-box attacks, the attacker only queries the model's outputs, either estimating gradients numerically or crafting examples on a similar surrogate model and relying on transferability.

The key constraint is keeping perturbations as small as possible. We measure this using norms (covered in detail in the next chapter).
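
As a brief preview, the norms most commonly used in this setting (L∞, L2, and L0) summarize a perturbation in different ways; the tensor below is a made-up example rather than anything from the module.

```python
import torch

delta = torch.randn(1, 1, 28, 28) * 0.01   # a made-up perturbation added to a 28x28 image

l_inf = delta.abs().max()                  # L-infinity: the largest change to any single pixel
l_2 = delta.flatten().norm(p=2)            # L2: the overall size ("energy") of the change
l_0 = (delta != 0).sum()                   # L0: how many pixels were changed at all
print(l_inf.item(), l_2.item(), l_0.item())
```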

Two Fundamental Approaches

This module explores two foundational gradient-based attacks that represent different attack philosophies.

FGSM (Fast Gradient Sign Method) takes the direct approach. You decide upfront how much you're willing to change the input, then compute the gradient and move each pixel in the direction that increases loss. The "sign" part means you only look at whether each gradient component is positive or negative, not how large it is. This makes the attack extremely fast: one gradient computation, one step, done. It's simple enough to implement in a few lines of code, yet effective enough that it became the baseline for measuring adversarial robustness.
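
In code, the whole attack really does fit in a few lines. The sketch below assumes a PyTorch classifier with inputs scaled to [0, 1]; model, x, y, and eps are placeholders rather than the module's exact setup.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.1):
    """One signed-gradient step of size eps (the L-infinity budget)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # loss on the clean input
    loss.backward()                          # gradient of the loss w.r.t. each pixel
    x_adv = x + eps * x.grad.sign()          # move every pixel eps in the loss-increasing direction
    return x_adv.clamp(0.0, 1.0).detach()    # stay in the valid pixel range
```

Because sign() discards the gradient's magnitude, every pixel moves by exactly eps, so the perturbation saturates the L∞ budget in a single step.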

DeepFool asks a fundamentally different question: what's the smallest change that fools the model? Instead of choosing a budget upfront, DeepFool iteratively searches for the closest decision boundary. It approximates the boundary as a flat surface locally, takes the shortest step to that surface, then repeats with a fresh approximation. After a few iterations, it reaches the true boundary with minimal perturbation. This tells you exactly how vulnerable each input is, making it valuable for measuring model robustness quantitatively.
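
The sketch below illustrates that idea for a multiclass PyTorch classifier. It is a minimal, unoptimized version; model, the class count, and the overshoot factor are assumptions, not the module's reference implementation.

```python
import torch

def deepfool(model, x, num_classes=10, max_iter=50, overshoot=0.02):
    """Minimal multiclass DeepFool sketch: repeatedly linearize the classifier
    and take the shortest step to the nearest approximated decision boundary."""
    x_adv = x.clone().detach()
    orig_label = model(x_adv).argmax(dim=1).item()

    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        logits = model(x_adv)[0]
        if logits.argmax().item() != orig_label:
            break                                              # the real boundary has been crossed

        grad_orig, = torch.autograd.grad(logits[orig_label], x_adv, retain_graph=True)

        best_dist, best_step = None, None
        for k in range(num_classes):
            if k == orig_label:
                continue
            grad_k, = torch.autograd.grad(logits[k], x_adv, retain_graph=True)
            w_k = grad_k - grad_orig                           # normal of the linearized boundary (k vs. original class)
            f_k = (logits[k] - logits[orig_label]).item()      # logit gap to class k
            dist = abs(f_k) / (w_k.norm().item() + 1e-8)       # distance to that linearized boundary
            if best_dist is None or dist < best_dist:
                best_dist = dist
                best_step = (abs(f_k) / (w_k.norm().item() ** 2 + 1e-8)) * w_k

        with torch.no_grad():
            x_adv = x_adv + (1 + overshoot) * best_step        # shortest step, slightly overshot

    return x_adv.detach()
```

The size of the returned perturbation, (x_adv - x).norm(), is itself the robustness measurement: smaller values mean the input sits closer to a decision boundary.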

Why This Matters

Adversarial examples expose a clear gap between high accuracy and true robustness. A model can correctly classify 99% of normal test images while catastrophically failing under adversarial examples. This vulnerability matters in security-critical applications.

First-order attacks are also remarkably transferable. An adversarial example crafted against one model often fools other models, even those with different architectures or training procedures. This enables realistic attacks where the adversary doesn't need full access to the target model.

For defenders, understanding these attacks is essential. They provide baselines for evaluating defensive techniques like adversarial training, input preprocessing, and detection systems. They also help identify which inputs lie close to decision boundaries and might be vulnerable to natural perturbations.

Security Frameworks

Both of the major AI security frameworks (OWASP and SAIF) recognize evasion attacks as a major threat. OWASP's Machine Learning Security Top 10 lists input manipulation as ML01:2023, the highest-ranked risk for traditional ML systems. The framework distinguishes these inference-time attacks from training-time threats like data poisoning, noting that evasion requires only the ability to query a deployed model.

Google's Secure AI Framework (SAIF) addresses evasion through defense in depth: adversarial training during development, robustness evaluation before deployment, and input filtering during operation. SAIF recommends that security teams establish red teams to continuously test production models with adversarial examples, measuring how much perturbation is needed to achieve target attack success rates.

Both frameworks emphasize that defending against evasion requires technical understanding of how these attacks work. The techniques in this module provide that foundation, enabling you to implement OWASP's recommended defenses and execute SAIF's evaluation protocols.
