Vulnerable LLM/AI Labs
🧨 Vulnerable LLM/AI Labs & Simulations
1. Gandalf by Lakera 🧙
What it is: An online prompt injection game.
You play against an AI guarding a password; your job is to trick it.
Goal: Teaches direct + indirect prompt injection techniques.
2. Prompt Injection Playground
Online tool to experiment with prompt injection types:
Jailbreaks
Prompt leakage
Context overriding
Based on OpenAI, Claude, and others.
3. LLM-Vuln-Lab
Open-source vulnerable LLM sandbox.
Simulates insecure chatbot deployments with poor prompt sanitization and access to system commands or APIs.
Deploy locally via Docker.
4. LMQL Prompt Sandbox
What it is: A research tool to simulate LLM prompt context manipulation and injections.
Great for building adversarial scenarios and testing defenses.
GitHub: https://github.com/eth-sri/lmql
5. OpenPromptGame
Multi-level prompt injection puzzle designed for adversarial learning.
Similar to CTF-style game with escalating difficulty.
🔧 Tools for LLM/AI Pentesting
PromptBench
Evaluate model resistance to jailbreaks
LLM-Guard
Open-source library for sanitizing prompts/responses
Rebuff.ai
Prevents indirect prompt injections (esp. in RAG systems)
SecGPT
AI model trained for red-teaming LLMs
Garak
Adversarial LLM evaluation framework (prompt injection, data leaks)
OpenAI’s evals
framework
For building custom red-team tests against GPT APIs
🧪 Vulnerable AI Models & Architectures
🔹 Mini-GPT, Vicuna, Mistral (Locally Hosted)
Run these models locally (via Docker or LM Studio) and test:
Prompt injection
Output manipulation
System prompt leakage
Platforms: https://lmstudio.ai, Hugging Face Spaces
🔹 LangChain & LlamaIndex Demo Apps
Create vulnerable AI apps using LangChain or LlamaIndex:
RAG apps vulnerable to indirect injection
Output-to-code pipelines (codegen abuse)
API-accessing LLM agents (e.g., autoGPT-style)
Learn how attackers can exploit vector stores and tool calling.
🔹 HoneyPrompt Project
Research tool to detect if models are susceptible to secret extraction or payload injection.
Can help test model “memory leaks.”
🌐 Online Platforms & CTFs
🔹 AI Village at DEF CON (Labs & Recordings)
Open-source LLM red teaming scenarios released during DEF CON 2023.
🔹 MITRE ATLAS
AI-specific threat modeling framework (like MITRE ATT&CK but for ML/LLMs).
Includes datasets, tactics, techniques, and even simulation environments.
📚 Research, Guides, and Learning Materials
OWASP Top 10 for LLMs
Industry-standard list of LLM-specific vulnerabilities (2024)
Stanford’s “Red Teaming LLMs”
Deep dive into real-world jailbreaks and attack mitigations
OpenAI’s GPT Red Teaming Report
Insights into how GPTs are stress-tested
AI Hacking Handbook (in progress)
Upcoming book focused entirely on LLM security
🧭 Learning Path for LLM/AI Pentesting
1
🔹 Prompt Injection (direct/indirect)
Gandalf, Garak, PromptBench
2
🔹 LLM Agent Abuse
LangChain + tool calls
3
🔹 RAG Exploits
LlamaIndex, vector store poisoning
4
🔹 Data Exfil & Model Leaks
HoneyPrompt, model inversion concepts
5
🔹 Defense & Hardening
LLM-Guard, Rebuff.ai, OWASP LLM Top 10
6
🔹 Threat Modeling
MITRE ATLAS, NIST AI RMF
11. AdvPromptLab
What it is: A structured lab for testing adversarial prompts, jailbreaks, and instruction tuning attacks.
Includes both direct/indirect injection simulations and eval scoring.
12. PromptInjection.ai (Red Team Simulator)
Run multi-turn prompt injection attacks on different open models.
Simulates real-world scenarios (e.g., HR bots, code review tools).
Visualize model behavior under adversarial stress.
13. LLM Attacks by Hugging Face
Hugging Face has published a growing list of red-teaming attack patterns:
System prompt extraction
Role hijacking
Escaping guardrails
Great for testing your own models or evaluating hosted ones.
14. RAG Vulnerability Playground
Simulate insecure Retrieval-Augmented Generation (RAG) systems:
Inject into vector stores
Poison document retrieval
Hijack embedding relevance
Build using LangChain, Pinecone/FAISS, and local LLMs.
🧪 LLM/AI Attack Datasets for Research and Practice
AdvBench
Benchmark of LLM jailbreak and extraction prompts
HarmBench
Evaluate how models respond to unethical or malicious tasks
RealToxicityPrompts
Prompts used to test AI models' ability to avoid harmful completions
Prompt Injection Corpus (PINC)
Community-contributed collection of attack prompts
LLM Jailbreak Dataset (Red Teaming Alliance)
Real-world jailbreaks across GPT, Claude, and Mistral models
➡️ Many are accessible via PapersWithCode or Hugging Face Datasets.
⚙️ LLM-Specific Evaluation & Attack Tools (More Advanced)
Garak
Automated prompt injection and jailbreak fuzzing framework
TruLens
LLM performance + trust evaluation; includes red team scoring
Ruler (by Robust Intelligence)
Detects hallucinations, prompt leakage, policy violations
Evals (by OpenAI)
Create automated evaluation suites for GPT and other models
DAGGER
Defense evaluation suite for LLM alignment and safety testing
FoolMeTwice
Red-team test cases for LLMs with progressive complexity
🧠 Red Team Training and Research Projects
🔹 DEF CON AI Red Teaming Datasets
Released post-DEF CON 31 from OpenAI, Anthropic, and Scale AI.
10,000+ jailbreak attempts with metadata.
Ideal for testing defensive tools.
🔹 Stanford CRFM Jailbreak Taxonomy
Deeply researched prompt injection + jailbreaking taxonomy.
Categorizes bypasses like:
Semantic distractions
Sentence obfuscation
Token-level exploits
🧱 Building Custom Vulnerable AI Apps
Here’s what you can build and attack:
Chatbot
Improperly protected system prompts; test role hijacking
Code generation assistant
Output-to-action flow (e.g., generating shell commands)
AI agent (AutoGPT-style)
API calling, browsing, tool use = abuse surface
RAG-based QA bot
Vector store poisoning, semantic injection
Multi-agent simulation
Chain of models; break via role escalation or context leaking
➡️ Want a prebuilt lab? I can create a Docker Compose setup with:
Local LLM (Mistral or Vicuna)
LangChain app with poorly sanitized input
Document ingestion (PDF/Markdown poisoning)
Optional tool-use access (e.g., shell or web)
🔐 Defensive Techniques & Countermeasure Testing
Prompt sanitization
LLM-Guard, Rebuff.ai
Role enforcement
Function-calling + schema validation
Output filtering
Claude safety rules, GPT moderation API
Vector store filtering
LangChain filters, text embeddings classifiers
Rate limiting
API gateway (e.g., Kong, FastAPI throttle)
Content fingerprinting
Detect tampered or injected content
📚 Next-Level Resources to Follow
OWASP Top 10 for LLMs (2024)
Comprehensive threat modeling
Robust Intelligence Blog
Deep dives into real AI failures
Anthropic & OpenAI Red Team Reports
Real-world testing tactics and findings
AI Hacking Handbook (coming soon)
First hands-on book on AI security
AI Incident Database (AIAAIC)
Logs actual AI security and safety failures
Last updated
Was this helpful?