AI Privacy

This module explores privacy attacks against machine learning models and the differential privacy defenses that protect models from such attacks.

Created by PandaSt0rm

Medium · Defensive

Summary

Machine learning models are trained to learn patterns, not memorize individual bits of data. Yet every model trained on personal data carries a hidden risk: it can reveal which specific individuals were in its training set. This privacy violation, known as membership inference, represents one of the core threats to ML systems deployed on sensitive data.

The attack exploits a fundamental tension in machine learning: models tend to behave differently on data they have seen during training versus data they have not. This behavioral gap, caused by overfitting, creates a detectable membership fingerprint. An attacker with access only to model predictions can determine whether a particular person's data was used for training. If the model was trained exclusively on cancer patients, successfully identifying someone as a training member reveals their medical status.

This module provides an exploration of both attack and defense:

  • Membership Inference Attacks (MIA) using the shadow model methodology introduced by Shokri et al. to train attack classifiers that detect membership based on prediction confidence patterns.
  • Understanding differential privacy, the formal framework for bounding how much any single individual's data can influence what a model reveals.
  • DP-SGD (Differentially Private Stochastic Gradient Descent), which modifies training through per-sample gradient clipping and calibrated noise injection to limit any individual's influence on model parameters.
  • PATE (Private Aggregation of Teacher Ensembles), which achieves privacy through architectural separation, training multiple teachers on disjoint data partitions and using noisy vote aggregation to label public data for student training. Minimal sketches of both defenses follow this list.
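
As a preview of how the DP-SGD pieces fit together, here is a minimal sketch of a single differentially private update on a toy linear model with synthetic data. The clipping bound, noise multiplier, and learning rate are illustrative assumptions rather than recommendations, and the hands-on sections develop the full training loop properly, including privacy accounting.

```python
# One DP-SGD step on synthetic data: clip each sample's gradient, then add
# calibrated Gaussian noise before updating (all values are illustrative).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                     # toy classifier
loss_fn = nn.CrossEntropyLoss()
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.1

X, y = torch.randn(32, 10), torch.randint(0, 2, (32,))   # one synthetic batch

summed = [torch.zeros_like(p) for p in model.parameters()]
for xi, yi in zip(X, y):
    model.zero_grad()
    loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
    # Clip this sample's gradient so its overall norm is at most clip_norm.
    grad_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
    scale = (clip_norm / (grad_norm + 1e-6)).clamp(max=1.0)
    for acc, p in zip(summed, model.parameters()):
        acc += p.grad * scale

with torch.no_grad():
    for acc, p in zip(summed, model.parameters()):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=acc.shape)
        p -= lr * (acc + noise) / len(X)     # average the noisy clipped gradients
```

And a similarly compact sketch of PATE's noisy vote aggregation, assuming teacher_votes holds each teacher's predicted class for a single public sample and using an illustrative Laplace noise scale:

```python
# Noisy-argmax aggregation of teacher votes (illustrative sketch).
import torch

def noisy_aggregate(teacher_votes, num_classes, noise_scale=1.0):
    """teacher_votes: 1-D long tensor of class predictions, one per teacher."""
    counts = torch.bincount(teacher_votes, minlength=num_classes).float()
    noise = torch.distributions.Laplace(0.0, noise_scale).sample(counts.shape)
    return int(torch.argmax(counts + noise))   # noisy winning label for the student
```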

This module is broken into sections with hands-on exercises for implementing attacks and defenses. The DP-SGD section ends with a skills assessment that requires submission to a validation server, and the module concludes with a practical skills assessment to validate your understanding.

You can start and stop at any time and resume where you left off. There is no time limit or grading, but you must complete all exercises and the skills assessment to receive the maximum cubes and have the module marked as complete in any selected paths.

To ensure a smooth learning experience, the following skills are mandatory: solid Python proficiency, familiarity with PyTorch, and understanding of neural network training, optimization, and evaluation metrics.

A firm grasp of these modules is recommended before starting:

Pwnbox will not provide a good experience for this module. It is HIGHLY recommended to use your own PC/laptop for the practicals.

Privacy Threats and Attack Fundamentals

Machine learning models are supposed to learn patterns, not memorize individuals. Yet every model trained on personal data carries a hidden risk: it can reveal which specific individuals were in its training set. This privacy violation, known as membership inference, represents one of the core threats to ML systems deployed on sensitive data.

What is a Membership Inference Attack?

A Membership Inference Attack (MIA) targets a straightforward question: was a specific data point used to train a given model? We provide a sample and receive a binary answer (member or non-member) based on how the model responds to that sample. Despite its simplicity, this binary classification has serious privacy implications because membership itself can reveal sensitive information.
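
To make the question concrete, the simplest form of the attack can be sketched as a confidence-threshold test. The target_model handle and the 0.9 threshold below are illustrative assumptions rather than tuned values; the attack built in this module replaces the fixed threshold with a learned classifier.

```python
# Minimal confidence-threshold membership test (illustrative only).
# `target_model` is assumed to be a trained PyTorch classifier we can query.
import torch
import torch.nn.functional as F

def is_member(target_model, x, threshold=0.9):
    """Guess 'member' when the model is unusually confident about sample x."""
    target_model.eval()
    with torch.no_grad():
        probs = F.softmax(target_model(x.unsqueeze(0)), dim=1)  # x: single unbatched sample
    return probs.max().item() >= threshold
```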

Privacy Attacks on Machine Learning

MIA belongs to a broader family of privacy attacks targeting ML systems, each extracting different types of information.

Model inversion attacks attempt to reconstruct training data features from model outputs. Given a model that predicts disease risk from genetic markers, an attacker might infer genetic information about training subjects. A related threat comes from attribute inference attacks, where we deduce sensitive attributes not directly predicted by the model. If a model predicts income and was trained on complete records, querying it strategically might reveal education levels or employment status of training members. The most direct form of leakage occurs in training data extraction attacks, where we recover verbatim training examples from models. Large language models that memorize specific sequences are particularly vulnerable to this approach.

Where does MIA fit in this landscape? It represents a basic privacy violation because it requires the least information to execute and often appears alongside other attacks. Even if attackers do not run a separate membership test first, a model that leaks membership information tends to be easier to exploit with model inversion attacks or attribute inference attacks. MIA success also serves as a privacy audit metric: if a model leaks membership, it likely leaks other information too. We focus on MIA because it establishes a baseline vulnerability that other attacks often build upon.

Consider a medical diagnosis model trained on patient records. An attacker with access only to the model's predictions could determine whether a particular patient's data was used for training. If the model was trained exclusively on cancer patients, successfully identifying someone as a training member reveals their medical status. Similar scenarios arise in financial services, where membership in a credit scoring model's training set might expose loan history, or in social platforms where presence in a content moderation dataset could imply previous violations.

This module establishes the baseline vulnerability of ML models to membership inference before we introduce any defenses. We implement the core attack methodology from Shokri et al.'s 2017 paper, which demonstrated that standard neural networks leak substantial membership information through their prediction behavior.

Why Membership Inference Works

Membership inference exploits the distinction between memorization and generalization. When we train a model, we want it to learn general patterns that apply to new data. In practice, models also memorize specific training examples, particularly those that are unusual, repeated, or near decision boundaries. This memorization creates detectable behavioral differences between how models treat data they have seen versus data they have not.

Consider what happens during gradient descent. We adjust model weights to reduce loss on training samples. After many iterations, we fit training data very closely, sometimes perfectly classifying every training example. But this tight fit does not transfer to new data. We have learned idiosyncrasies of specific training samples instead of underlying patterns. This phenomenon, commonly called overfitting, is the root cause of membership leakage.
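
One way to observe this gap directly is to compare the model's average confidence and loss on training members versus held-out non-members. The sketch below assumes a trained target model and PyTorch data loaders for both sets; the names in the usage comment are hypothetical.

```python
# Measures the behavioral gap that membership inference exploits:
# members typically receive higher confidence and lower loss.
import torch
import torch.nn.functional as F

def confidence_and_loss(model, loader):
    """Return (mean max-softmax confidence, mean cross-entropy loss) over a loader."""
    model.eval()
    confs, losses = [], []
    with torch.no_grad():
        for x, y in loader:
            logits = model(x)
            losses.append(F.cross_entropy(logits, y, reduction="none"))
            confs.append(F.softmax(logits, dim=1).max(dim=1).values)
    return torch.cat(confs).mean().item(), torch.cat(losses).mean().item()

# Hypothetical usage:
# member_conf, member_loss = confidence_and_loss(target_model, train_loader)
# outsider_conf, outsider_loss = confidence_and_loss(target_model, test_loader)
```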

Our attack model extracts two distinct signals from model predictions. The first is prediction confidence: how certain is the model about its prediction? Members tend to receive higher-confidence predictions because the model optimized specifically for them during training. A member might receive [0.05, 0.95] (95% confidence in class 1) while a non-member gets [0.20, 0.80] (80% confidence). The second signal is prediction correctness: does the predicted class match the true label? Models are more accurate on their training data because they have seen those exact feature combinations during training.

These signals interact in interesting ways. A high-confidence correct prediction is a strong membership indicator because both signals agree. A high-confidence incorrect prediction suggests non-membership. The challenging cases are low-confidence predictions where neither signal is decisive, and these form the bulk of attack errors. Sometimes the signals conflict: a member receives a low-confidence prediction on a genuinely ambiguous sample, or a non-member receives high confidence on a very typical sample. Our attack model learns to weight these signals based on how often each pattern corresponds to membership.
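
The sketch below shows one way to turn these two signals into per-sample attack features; the logits and labels are assumed to come from querying the target (or a shadow) model. Note that the Shokri et al. attack typically feeds the full prediction vector to the attack model, while these two summaries capture the intuition described above.

```python
# Builds the [confidence, correctness] feature pairs an attack model can consume.
import torch
import torch.nn.functional as F

def attack_features(logits, labels):
    """Return a (batch, 2) tensor of [prediction confidence, prediction correctness]."""
    probs = F.softmax(logits, dim=1)
    confidence = probs.max(dim=1).values                   # signal 1: how certain the model is
    correctness = (probs.argmax(dim=1) == labels).float()  # signal 2: 1.0 if prediction matches the true label
    return torch.stack([confidence, correctness], dim=1)
```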

Model capacity also matters. Larger models with more parameters can memorize more training data, making them more vulnerable to MIA. A model with 10 million parameters stores more training examples in its weights than one with 100,000 parameters. Regularization techniques like dropout, weight decay, and early stopping reduce memorization but cannot eliminate it entirely. The optimization process inherently treats training data differently from unseen data, creating tension between model utility and privacy.

Our goal as attackers is to learn a binary classifier that exploits this memorization gap. Given a specific sample and its true label, we determine whether that sample was in the target model's training set. A successful attack correctly identifies training members with accuracy significantly better than random guessing (50%).
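
A condensed sketch of that binary classifier is shown below. It assumes feature tensors already collected from shadow models, where members are samples the shadow models were trained on and non-members are held-out samples; the architecture and hyperparameters are illustrative, not the module's reference solution.

```python
# Trains a small attack classifier on shadow-model outputs (illustrative sketch).
import torch
import torch.nn as nn

def train_attack_model(member_feats, nonmember_feats, epochs=200, lr=0.01):
    """member_feats / nonmember_feats: (n, d) float tensors, e.g. from attack_features()."""
    X = torch.cat([member_feats, nonmember_feats])
    y = torch.cat([torch.ones(len(member_feats)), torch.zeros(len(nonmember_feats))])

    attack_model = nn.Sequential(nn.Linear(X.shape[1], 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.Adam(attack_model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(attack_model(X).squeeze(1), y)
        loss.backward()
        optimizer.step()
    return attack_model  # sigmoid(output) > 0.5  =>  predict "member"
```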

Industry Security Frameworks

Three frameworks catalogue ML privacy risks. The OWASP ML Security Top 10 (draft v0.3) lists membership inference as ML04:2023, alongside model inversion (ML03:2023) and model theft (ML05:2023). It rates membership inference at 4/5 for both exploitability and impact, recommending differential privacy and regularization as primary defenses.

The OWASP Top 10 for LLM Applications (2025) addresses generative AI with LLM02: Sensitive Information Disclosure covering training data leakage, and LLM04: Data and Model Poisoning addressing the memorization patterns that enable inference attacks.

Google SAIF takes an architectural approach, categorizing membership inference under Sensitive Data Disclosure and Inferred Sensitive Data. SAIF assigns responsibility to both model creators (implement privacy-preserving training) and model consumers (filter outputs, monitor queries).

All three frameworks converge on the same conclusion: if a model leaks membership, it likely leaks other information too. Differential privacy is the recommended countermeasure, which is why we focus on DP-SGD and PATE in subsequent sections.

Relevant Paths

This module progresses you towards the following Paths

AI Red Teamer

The AI Red Teamer Job Role Path, in collaboration with Google, trains cybersecurity professionals to assess, exploit, and secure AI systems. Covering prompt injection, model privacy attacks, adversarial AI, supply chain risks, and deployment threats, it combines theory with hands-on exercises. Aligned with Google’s Secure AI Framework (SAIF), it ensures relevance to real-world AI security challenges. Learners will gain skills to manipulate model behaviors, develop AI-specific red teaming strategies, and perform offensive security testing against AI-driven applications.

Hard · 229 Sections · Required: 970 · Reward: +210
Path Modules
Medium · 24 Sections · Reward: +10
This module provides a comprehensive guide to the theoretical foundations of Artificial Intelligence (AI). It covers various learning paradigms, including supervised, unsupervised, and reinforcement learning, providing a solid understanding of key algorithms and concepts.
Medium · 25 Sections · Reward: +10
This module is a practical introduction to building AI models that can be applied to various infosec domains. It covers setting up a controlled AI environment using Miniconda for package management and JupyterLab for interactive experimentation. Students will learn to handle datasets, preprocess and transform data, and implement structured workflows for tasks such as spam classification, network anomaly detection, and malware classification. Throughout the module, learners will explore essential Python libraries like Scikit-learn and PyTorch, understand effective approaches to dataset processing, and become familiar with common evaluation metrics, enabling them to navigate the entire lifecycle of AI model development and experimentation.
Medium · 11 Sections · Reward: +10
This module provides a comprehensive introduction to the world of red teaming Artificial Intelligence (AI) and systems utilizing Machine Learning (ML) deployments. It covers an overview of common security vulnerabilities in these systems and the types of attacks that can be launched against their components.
Medium · 11 Sections · Reward: +20
This module comprehensively introduces one of the most prominent attacks on large language models (LLMs): Prompt Injection. It introduces prompt injection basics and covers detailed attack vectors based on real-world vulnerability reports. Furthermore, the module touches on academic research in the fields of novel prompt injection techniques and jailbreaks.
Medium · 14 Sections · Reward: +20
In this module, we will explore different LLM output vulnerabilities resulting from improper handling of LLM outputs and insecure LLM applications. We will also touch on LLM abuse attacks, such as hate speech campaigns and misinformation generation, with a particular focus on the detection and mitigation of these attacks.
Hard · 25 Sections · Reward: +20
This module explores the intersection of Data and Artificial Intelligence, exposing how vulnerabilities within AI data pipelines can be exploited, ultimately aiming to degrade performance, achieve specific misclassifications, or execute arbitrary code.
Medium · 14 Sections · Reward: +20
In this module, we will explore security vulnerabilities in the application and system components of AI deployments. We will also discuss the Model Context Protocol (MCP), an orchestration protocol for AI deployments introduced in 2024, including a deep dive into how the protocol works and how security vulnerabilities may arise.
Medium · 12 Sections · Reward: +20
This module explores the foundations of inference‑time evasion attacks against AI models, showing how to manipulate inputs to bypass classifiers and force targeted misclassifications in white‑ and black‑box settings.
Hard · 23 Sections · Reward: +20
This module explores gradient-based adversarial attacks that manipulate neural network inputs at inference time, showing how to craft perturbations that cause misclassification through white-box access to model gradients.
Hard · 28 Sections · Reward: +20
This module explores sparsity-constrained adversarial attacks that minimize the number of modified input features rather than perturbation magnitude, showing how to craft targeted misclassifications by changing only the most impactful pixels through L0-focused optimization and saliency-guided feature selection.
Medium · 21 Sections · Reward: +20 · NEW
This module explores privacy attacks against machine learning models and the differential privacy defenses that protect models from such attacks.
Medium · 21 Sections · Reward: +20 · NEW
In this module, we will explore how to defend AI applications from the attack vectors discussed in the AI Red Teamer path. We will examine adversarial training, adversarial tuning, and LLM guardrails, including the fundamental concepts and practical implementation of these defensive measures.