AI Security Glossary
Adversarial Attacks
Backdoor attacks
An attack in which malicious triggers are embedded into an AI model during training, causing the model to behave normally under standard inputs but produce attacker-controlled outputs when a specific trigger is present. Backdoor attacks are a primary supply chain risk when using third-party or open-source models.
Data Stealing
The unauthorized acquisition of data with the intention of misusing it through methods like hacking, phishing, malware, data breaches, and insider threats.
Denial of Service (DoS)
An adversarial attack that shuts down machines or networks to prevent them from functioning normally, by flooding the model with token-intensive prompts designed to exhaust compute resources, degrade response quality, or cause context window overflow.
Evasion
The most common adversarial attack on AI models, performed during inference. The attacker crafts an input that appears normal but causes the model to misclassify it or behave in unintended ways. For example, manipulating a prompt to bypass access controls or extract restricted information.
Model Stealing
A technique that allows adversaries to create models that imitate the functionality of black-box (defined below under Cyber Attacks) ML models. The attacker then queries the stolen models to gain insights without accessing the original models.
Data Poisoning
Involves manipulating a training dataset by introducing, modifying, or deleting specific data points. Attackers poison data to introduce biases, errors, or vulnerabilities into ML models, negatively impacting their decisions or predictions during deployment.
Model Poisoning
A supply chain attack that corrupts an AI model by injecting malicious data during training, altering model weights, or embedding backdoor triggers. Distinct from data poisoning in that it can occur post-training and is especially relevant when sourcing open-source or third-party models.
Unsafe Deserialization
A code execution attack that exploits insecure AI model file formats (e.g., Pickle, NumPy .npy) to run arbitrary malicious code when a model file is loaded. A leading vector for supply chain compromise.
Training Data Extraction
An attack in which prompts are crafted to cause a model to reproduce content from its training data - exposing proprietary datasets, intellectual property, copyrighted material, or sensitive business information.
Membership Inference
An attack that determines whether specific data was used to train a model, exposing information about proprietary datasets, data sourcing practices, or licensing compliance.
GenAI & LLM Risks
Hallucination
When a model generates misleading, nonsensical or incorrect output. Models often lack the capacity to respond "I don't know" and instead generate false information with unwavering confidence.
Jailbreak
Written phrases and creative prompts that bypass or trick a model's safeguards to draw out prohibited information that would otherwise be blocked by content filters and guidelines. Unlike Prompt Injection, which aims at the system outputs, Jailbreaking seeks to compromise the model's alignment.
PII / Personal Data
Also known as Personally Identifiable Information, PII risks violating privacy laws or stipulations when used to prompt or train GenAI models. This not only risks the deletion of data containing PII, but also entire models.
Prompt Injection
Meant to elicit unintended behaviors, Prompt Injections are attacks that impact outputs like search rankings, website content and chatbot behaviors. A Direct Prompt Injection is intentional, while an Indirect one involves a user who unknowingly "injects" commands and text.
Toxicity
When LLMs produce manipulative images and text, potentially leading to the spread of disinformation and other harmful consequences.
Instruction Hijacking
An attack in which adversarial inputs override or subvert the original system prompt of an AI application, redirecting the model's behavior to serve the attacker's goals.
Agentic AI Risks
Excessive Agency
When an AI agent takes actions beyond its intended scope, potentially causing unauthorized data access, unintended system changes, or cascading consequences across connected tools and services. A core risk in agentic architectures.
Tool Misuse
When an AI agent calls external tools, APIs, or services in unintended or harmful ways, either due to adversarial manipulation or insufficient policy controls.
Multi-Agent Orchestration Risk
Security and reliability risks arising when multiple AI agents communicate and hand off tasks to each other, where a compromise of one agent can cascade through the entire pipeline.
Indirect Prompt Injection
When malicious instructions are embedded in external content an AI agent reads (web pages, documents, emails), causing it to execute unintended actions without the user's knowledge.
Trustworthiness
Data Drifts
When the accuracy of AI models drifts within days. Business KPIs are negatively affected when production data differs from training data.
Biases
Refer to AI systems that produce biased results, usually reflective of human societal biases.
Explainability
The level at which human users are able to comprehend and trust the results created by AI, based on tracking how the AI made decisions and reached conclusions.
Fairness
In the context of AI, fairness is the process of correcting and eliminating algorithmic biases (about race, ethnicity, gender, sexual orientation, etc.).
Out-of-Distribution (OOD)
Data that deviates from patterns AI models were trained on, which leads models to behave in unexpected ways.
Weak Spots
Specific vulnerabilities in an AI model's architecture, training data, or deployment configuration that can cause unreliable outputs or be exploited by adversaries. Identifying weak spots is a primary objective of AI red teaming.
Cyber Attacks on AI
Malicious Code
A breed of code that can be used to corrupt files, erase hard drives, or allow attackers to access systems. Malicious code includes trojan horses, worms, and macros, and spreads by visiting infected websites or downloading infected attachments or files.
Malware Injection
When malware is injected into an established software program, website, or database using methods like SQL injection and command injection.
Trojan
Attacks that embed malicious code within training datasets and updates which seem benign. After attacking the AI system, these hidden payloads manipulate the model’s decision-making, causing data exfiltration and output poisoning.
Privacy
Data Extraction
When trained attack models are tasked with determining whether a data point is in a training set that can expose essential information like private API keys.
Model Inversion
An attack in which a machine learning model - an inversion model - is trained on the target model's output to predict the original dataset of the target model and infer sensitive information.
Private Data Leakage
When an LLM discloses information that should have remained confidential, leading to privacy and security breaches.
Attack Methodologies
Blackbox or Graybox
When attackers have either partial or no knowledge of a model, other than its input.
White Box
Also known as XAI attacks, this is when attackers know everything about the deployed model, e.g., inputs, model architecture, and specific model internals like weights or coefficients. Compared to blackbox attacks, white box attacks allow attackers more opportunity to gain information once they are able to access network gradients of XAI models.