The transition from HTB CBBH to HTB CWES has officially started. Learn More

AI Evasion - Foundations

This module explores the foundations of inference‑time evasion attacks against AI models, showing how to manipulate inputs to bypass classifiers and force targeted misclassifications in white‑ and black‑box settings.

Created by PandaSt0rm

Medium Offensive

Summary

Because deployed models are user data‑driven and accept untrusted prompts, tokens, text, and other such inputs - their security, reliability, and integrity all hinge on how they process inputs and expose predictions.

This module provides a foundational exploration of evasion techniques which target the inference time process:

  • Evasion threat model, white‑box versus black‑box, and transferability.
  • Feature‑obfuscation attacks such as GoodWords.
  • Building and attacking a spam filter (UCI SMS) with white‑box GoodWords.
  • Black‑box discovery under query limits, including candidate vocabulary construction, adaptive selection, and small‑combination testing.

This module is broken into sections with hands‑on exercises for each technique. It concludes with a practical skills assessment to validate your understanding.

You can start and stop at any time and resume where you left off. There is no time limit or grading, but you must complete all exercises and the skills assessment to receive the maximum cubes and have the module marked as complete in any selected paths.

To ensure a smooth learning experience, the following skills are mandatory: solid Python proficiency and familiarity with Jupyter Notebooks.

A firm grasp of these modules is recommended before starting:

It is HIGHLY recommended to use your own PC/Laptop for the practicals.

Introduction to AI Evasion Attacks

Evasion attacks manipulate inference-time inputs to cause a trained model to produce incorrect outputs. Adversarial machine learning, the discipline that studies these hostile interactions with machine learning systems, treats this interference during the inference phase as a distinct threat because it bypasses safeguards built into the training pipeline.

Understanding evasion is therefore essential for anyone securing deployed AI models, such as AI Red Teamers.

The AI Attack Landscape

To understand evasion, place it within the lifecycle of adversarial machine learning. AI systems can be attacked during training or at inference, and the distinction drives very different defenses.

Training‑time attacks change what the model learns. Data poisoning injects or alters training samples so the model internalizes biased patterns. Label manipulation tampers with annotations so ground truth no longer matches reality. Trojan attacks implant hidden triggers that activate specific behavior when an attacker’s pattern appears in the input.

These training-time attacks differ from inference-time manipulation, which changes only what the model sees when making a prediction. Data poisoning requires access to the dataset or pipeline and tends to shift behavior globally. Evasion, by contrast, leaves training and parameters untouched and succeeds by sending crafted inputs through the normal interface so a single example crosses the model’s learned decision boundary.

A property called transferability links these settings to real deployments. An adversarial example crafted against a surrogate model can often fool different production models, particularly when architectures or data are similar. This enables offline preparation and black-box attacks where the defender exposes only an API but not internals.

Evasion Attacks in Traditional ML vs. LLMs

Evasion in traditional machine learning targets fixed feature representations, so small, controlled edits can push an example across a learned decision boundary. In a spam filter that uses a bag-of-words model, adding benign tokens changes term frequencies and therefore the posterior used for classification, while the message still reads the same to a human. In a static malware detector, rearranging sections or perturbing imports alters byte patterns and crafted signatures without changing program intent. The common thread is direct manipulation of the summarized features the model consumes, with the goal of moving a single prediction to the other side of its threshold at inference time.

By contrast, large language models produce open-ended text and follow instructions, which makes prompt injection the central evasion pattern. An attacker places adversarial directives into the prompt or surrounding context so the model prioritizes those instructions over the original system guidance, all without changing parameters. The effect is still inference-time only, yet the consequences differ because the output is itself an action surface, for example a block of code, a database query, or markup that another system might execute or render. When tool routing or API calling wraps the model, those outputs can trigger follow-on behavior that extends the attack beyond the initial response.

Both settings share the same core idea, modify only the input seen at prediction time and steer the model’s behavior. The main difference lies in where the leverage comes from. Traditional models expose structured features and a label threshold, so evasion edits target the statistics that feed the classifier. LLMs expose conversational state and instruction following, so evasion edits target the model’s decision process through carefully placed text and context management. This distinction explains why we begin with structured classifiers, the mechanics are easy to observe and measure, then we map the intuition to LLM prompts where the surface is richer but the objective is the same.

This Module's Focus

This module anchors these concepts through a hands-on technique: the GoodWords attack. This attack manipulates probabilistic spam filters by inserting carefully selected benign tokens. Because Naive Bayes models assume conditional independence between features, adding those tokens shifts posterior probabilities enough to force a misclassification without raising obvious suspicion.

Studying this method builds intuition that spans domains. The attack exploits model assumptions, highlights the difference between white-box and black-box knowledge, and underscores the trade-off between subtle perturbations and reliable success, which also generalize well to other classification tasks and highlight a fundamental approach to adversarial manipulation.

Sign Up / Log In to Unlock the Module

Please Sign Up or Log In to unlock the module and access the rest of the sections.

Relevant Paths

This module progresses you towards the following Paths

AI Red Teamer

The AI Red Teamer Job Role Path, in collaboration with Google, trains cybersecurity professionals to assess, exploit, and secure AI systems. Covering prompt injection, model privacy attacks, adversarial AI, supply chain risks, and deployment threats, it combines theory with hands-on exercises. Aligned with Google’s Secure AI Framework (SAIF), it ensures relevance to real-world AI security challenges. Learners will gain skills to manipulate model behaviors, develop AI-specific red teaming strategies, and perform offensive security testing against AI-driven applications. The path will be gradually expanded with related modules until its completion.

Hard Path Sections 136 Sections
Required: 570
Reward: +130
Path Modules
Medium
Path Sections 24 Sections
Reward: +10
This module provides a comprehensive guide to the theoretical foundations of Artificial Intelligence (AI). It covers various learning paradigms, including supervised, unsupervised, and reinforcement learning, providing a solid understanding of key algorithms and concepts.
Medium
Path Sections 25 Sections
Reward: +10
This module is a practical introduction to building AI models that can be applied to various infosec domains. It covers setting up a controlled AI environment using Miniconda for package management and JupyterLab for interactive experimentation. Students will learn to handle datasets, preprocess and transform data, and implement structured workflows for tasks such as spam classification, network anomaly detection, and malware classification. Throughout the module, learners will explore essential Python libraries like Scikit-learn and PyTorch, understand effective approaches to dataset processing, and become familiar with common evaluation metrics, enabling them to navigate the entire lifecycle of AI model development and experimentation.
Medium
Path Sections 11 Sections
Reward: +10
This module provides a comprehensive introduction to the world of red teaming Artificial Intelligence (AI) and systems utilizing Machine Learning (ML) deployments. It covers an overview of common security vulnerabilities in these systems and the types of attacks that can be launched against their components.
Medium
Path Sections 11 Sections
Reward: +20
This module comprehensively introduces one of the most prominent attacks on large language models (LLMs): Prompt Injection. It introduces prompt injection basics and covers detailed attack vectors based on real-world vulnerability reports. Furthermore, the module touches on academic research in the fields of novel prompt injection techniques and jailbreaks.
Medium
Path Sections 14 Sections
Reward: +20
In this module, we will explore different LLM output vulnerabilities resulting from improper handling of LLM outputs and insecure LLM applications. We will also touch on LLM abuse attacks, such as hate speech campaigns and misinformation generation, with a particular focus on the detection and mitigation of these attacks.
Hard
Path Sections 25 Sections
Reward: +20
This module explores the intersection of Data and Artificial Intelligence, exposing how vulnerabilities within AI data pipelines can be exploited, ultimately aiming to degrade performance, achieve specific misclassifications, or execute arbitrary code.
Medium
Path Sections 14 Sections
Reward: +20 NEW
In this module, we will explore security vulnerabilities in the application and system components of AI deployments. We will also discuss the Model Context Protocol (MCP), an orchestration protocol for AI deployments introduced in 2024, including a deep dive into how the protocol works and how security vulnerabilities may arise.
Medium
Path Sections 12 Sections
Reward: +20 NEW
This module explores the foundations of inference‑time evasion attacks against AI models, showing how to manipulate inputs to bypass classifiers and force targeted misclassifications in white‑ and black‑box settings.