NewDeepKeep launches Vibe AI Red Teaming- a new approach to AI securityRead more
DeepKeep

AI Security Glossary

Adversarial Attacks

Backdoor attacks

An attack in which malicious triggers are embedded into an AI model during training, causing the model to behave normally under standard inputs but produce attacker-controlled outputs when a specific trigger is present. Backdoor attacks are a primary supply chain risk when using third-party or open-source models.

Data Stealing

The unauthorized acquisition of data with the intention of misusing it through methods like hacking, phishing, malware, data breaches, and insider threats.

Denial of Service (DoS)

An adversarial attack that shuts down machines or networks to prevent them from functioning normally, by flooding the model with token-intensive prompts designed to exhaust compute resources, degrade response quality, or cause context window overflow.

Evasion

The most common adversarial attack on AI models, performed during inference. The attacker crafts an input that appears normal but causes the model to misclassify it or behave in unintended ways. For example, manipulating a prompt to bypass access controls or extract restricted information.

Model Stealing

A technique that allows adversaries to create models that imitate the functionality of black-box (defined below under Cyber Attacks) ML models. The attacker then queries the stolen models to gain insights without accessing the original models.

Data Poisoning

Involves manipulating a training dataset by introducing, modifying, or deleting specific data points. Attackers poison data to introduce biases, errors, or vulnerabilities into ML models, negatively impacting their decisions or predictions during deployment.

Model poisoning

Model Poisoning

A supply chain attack that corrupts an AI model by injecting malicious data during training, altering model weights, or embedding backdoor triggers. Distinct from data poisoning in that it can occur post-training and is especially relevant when sourcing open-source or third-party models.

Unsafe deserialization

Unsafe Deserialization

A code execution attack that exploits insecure AI model file formats (e.g., Pickle, NumPy .npy) to run arbitrary malicious code when a model file is loaded. A leading vector for supply chain compromise.

Training data extraction

Training Data Extraction

An attack in which prompts are crafted to cause a model to reproduce content from its training data - exposing proprietary datasets, intellectual property, copyrighted material, or sensitive business information.

Membership inference

Membership Inference

An attack that determines whether specific data was used to train a model, exposing information about proprietary datasets, data sourcing practices, or licensing compliance.

GenAI & LLM Risks

Hallucination

When a model generates misleading, nonsensical or incorrect output. Models often lack the capacity to respond "I don't know" and instead generate false information with unwavering confidence.

Jailbreak

Written phrases and creative prompts that bypass or trick a model's safeguards to draw out prohibited information that would otherwise be blocked by content filters and guidelines. Unlike Prompt Injection, which aims at the system outputs, Jailbreaking seeks to compromise the model's alignment.

PII / Personal Data

Also known as Personally Identifiable Information, PII risks violating privacy laws or stipulations when used to prompt or train GenAI models. This not only risks the deletion of data containing PII,  but also entire models.

Prompt Injection

Meant to elicit unintended behaviors, Prompt Injections are attacks that impact outputs like search rankings, website content and chatbot behaviors. A Direct Prompt Injection is intentional, while an Indirect one involves a user who unknowingly "injects" commands and text.

Toxicity

When LLMs produce manipulative images and text, potentially leading to the spread of disinformation and other harmful consequences.

Instruction hijacking

Instruction Hijacking

An attack in which adversarial inputs override or subvert the original system prompt of an AI application, redirecting the model's behavior to serve the attacker's goals.

Agentic AI Risks

AI agent excessive agency

Excessive Agency

When an AI agent takes actions beyond its intended scope, potentially causing unauthorized data access, unintended system changes, or cascading consequences across connected tools and services. A core risk in agentic architectures.

Tool misuse

Tool Misuse

When an AI agent calls external tools, APIs, or services in unintended or harmful ways, either due to adversarial manipulation or insufficient policy controls.

Multi-agent orchestration

Multi-Agent Orchestration Risk

Security and reliability risks arising when multiple AI agents communicate and hand off tasks to each other, where a compromise of one agent can cascade through the entire pipeline.

Indirection prompt injection

Indirect Prompt Injection

When malicious instructions are embedded in external content an AI agent reads (web pages, documents, emails), causing it to execute unintended actions without the user's knowledge.

Trustworthiness

Data Drifts

When the accuracy of AI models drifts within days. Business KPIs are negatively affected when production data differs from training data.

Biases

Refer to AI systems that produce biased results, usually reflective of human societal biases.

Explainability

The level at which human users are able to comprehend and trust the results created by AI, based on tracking how the AI made decisions and reached conclusions.

Fairness

In the context of AI, fairness is the process of correcting and eliminating algorithmic biases (about race, ethnicity, gender, sexual orientation, etc.).

Out-of-Distribution (OOD)

Data that deviates from patterns AI models were trained on, which leads models to behave in unexpected ways.

Weak Spots

Specific vulnerabilities in an AI model's architecture, training data, or deployment configuration that can cause unreliable outputs or be exploited by adversaries. Identifying weak spots is a primary objective of AI red teaming.

Cyber Attacks on AI

Malicious Code

A breed of code that can be used to corrupt files, erase hard drives, or allow attackers to access systems. Malicious code includes trojan horses, worms, and macros, and spreads by visiting infected websites or downloading infected attachments or files.

Malware Injection

When malware is injected into an established software program, website, or database using methods like SQL injection and command injection.

Trojan

Attacks that embed malicious code within training datasets and updates which seem benign. After attacking the AI system, these hidden payloads manipulate the model’s decision-making, causing data exfiltration and output poisoning.

Privacy

Data Extraction

When trained attack models are tasked with determining whether a data point is in a training set that can expose essential information like private API keys.

Model Inversion

An attack in which a machine learning model - an inversion model - is trained on the target model's output to predict the original dataset of the target model and infer sensitive information.

Private Data Leakage

When an LLM discloses information that should have remained confidential, leading to privacy and security breaches.

Attack Methodologies

Blackbox or Graybox

When attackers have either partial or no knowledge of a model, other than its input.

White Box

Also known as XAI attacks, this is when attackers know everything about the deployed model, e.g., inputs, model architecture, and specific model internals like weights or coefficients. Compared to blackbox attacks, white box attacks allow attackers more opportunity to gain information once they are able to access network gradients of XAI models.