Vulnerable LLM/AI Labs

🧨 Vulnerable LLM/AI Labs & Simulations

1. Gandalf by Lakera 🧙

  • What it is: An online prompt injection game.

  • You play against an AI guarding a password; your job is to trick it.

  • Goal: Teaches direct + indirect prompt injection techniques.


2. Prompt Injection Playground

  • Online tool to experiment with prompt injection types:

    • Jailbreaks

    • Prompt leakage

    • Context overriding

  • Based on OpenAI, Claude, and others.


3. LLM-Vuln-Lab

  • Open-source vulnerable LLM sandbox.

  • Simulates insecure chatbot deployments with poor prompt sanitization and access to system commands or APIs.

  • Deploy locally via Docker.


4. LMQL Prompt Sandbox

  • What it is: A research tool to simulate LLM prompt context manipulation and injections.

  • Great for building adversarial scenarios and testing defenses.


5. OpenPromptGame

  • Multi-level prompt injection puzzle designed for adversarial learning.

  • Similar to CTF-style game with escalating difficulty.


🔧 Tools for LLM/AI Pentesting

Tool
Use Case

PromptBench

Evaluate model resistance to jailbreaks

LLM-Guard

Open-source library for sanitizing prompts/responses

Rebuff.ai

Prevents indirect prompt injections (esp. in RAG systems)

SecGPT

AI model trained for red-teaming LLMs

Garak

Adversarial LLM evaluation framework (prompt injection, data leaks)

OpenAI’s evals framework

For building custom red-team tests against GPT APIs


🧪 Vulnerable AI Models & Architectures

🔹 Mini-GPT, Vicuna, Mistral (Locally Hosted)

  • Run these models locally (via Docker or LM Studio) and test:

    • Prompt injection

    • Output manipulation

    • System prompt leakage

  • Platforms: https://lmstudio.ai, Hugging Face Spaces


🔹 LangChain & LlamaIndex Demo Apps

  • Create vulnerable AI apps using LangChain or LlamaIndex:

    • RAG apps vulnerable to indirect injection

    • Output-to-code pipelines (codegen abuse)

    • API-accessing LLM agents (e.g., autoGPT-style)

  • Learn how attackers can exploit vector stores and tool calling.


🔹 HoneyPrompt Project


🌐 Online Platforms & CTFs

🔹 AI Village at DEF CON (Labs & Recordings)


🔹 MITRE ATLAS

  • AI-specific threat modeling framework (like MITRE ATT&CK but for ML/LLMs).

  • Includes datasets, tactics, techniques, and even simulation environments.


📚 Research, Guides, and Learning Materials

Resource
Description

OWASP Top 10 for LLMs

Industry-standard list of LLM-specific vulnerabilities (2024)

Stanford’s “Red Teaming LLMs”

Deep dive into real-world jailbreaks and attack mitigations

OpenAI’s GPT Red Teaming Report

Insights into how GPTs are stress-tested

AI Hacking Handbook (in progress)

Upcoming book focused entirely on LLM security


🧭 Learning Path for LLM/AI Pentesting

Phase
Focus
Tools/Resources

1

🔹 Prompt Injection (direct/indirect)

Gandalf, Garak, PromptBench

2

🔹 LLM Agent Abuse

LangChain + tool calls

3

🔹 RAG Exploits

LlamaIndex, vector store poisoning

4

🔹 Data Exfil & Model Leaks

HoneyPrompt, model inversion concepts

5

🔹 Defense & Hardening

LLM-Guard, Rebuff.ai, OWASP LLM Top 10

6

🔹 Threat Modeling

MITRE ATLAS, NIST AI RMF


11. AdvPromptLab


12. PromptInjection.ai (Red Team Simulator)

  • Run multi-turn prompt injection attacks on different open models.

  • Simulates real-world scenarios (e.g., HR bots, code review tools).

  • Visualize model behavior under adversarial stress.


13. LLM Attacks by Hugging Face

  • Hugging Face has published a growing list of red-teaming attack patterns:

    • System prompt extraction

    • Role hijacking

    • Escaping guardrails

  • Great for testing your own models or evaluating hosted ones.


14. RAG Vulnerability Playground

  • Simulate insecure Retrieval-Augmented Generation (RAG) systems:

    • Inject into vector stores

    • Poison document retrieval

    • Hijack embedding relevance

  • Build using LangChain, Pinecone/FAISS, and local LLMs.


🧪 LLM/AI Attack Datasets for Research and Practice

Dataset
Use Case

AdvBench

Benchmark of LLM jailbreak and extraction prompts

HarmBench

Evaluate how models respond to unethical or malicious tasks

RealToxicityPrompts

Prompts used to test AI models' ability to avoid harmful completions

Prompt Injection Corpus (PINC)

Community-contributed collection of attack prompts

LLM Jailbreak Dataset (Red Teaming Alliance)

Real-world jailbreaks across GPT, Claude, and Mistral models

➡️ Many are accessible via PapersWithCode or Hugging Face Datasets.


⚙️ LLM-Specific Evaluation & Attack Tools (More Advanced)

Tool
Description

Garak

Automated prompt injection and jailbreak fuzzing framework

TruLens

LLM performance + trust evaluation; includes red team scoring

Ruler (by Robust Intelligence)

Detects hallucinations, prompt leakage, policy violations

Evals (by OpenAI)

Create automated evaluation suites for GPT and other models

DAGGER

Defense evaluation suite for LLM alignment and safety testing

FoolMeTwice

Red-team test cases for LLMs with progressive complexity


🧠 Red Team Training and Research Projects

🔹 DEF CON AI Red Teaming Datasets


🔹 Stanford CRFM Jailbreak Taxonomy

  • Deeply researched prompt injection + jailbreaking taxonomy.

  • Categorizes bypasses like:

    • Semantic distractions

    • Sentence obfuscation

    • Token-level exploits


🧱 Building Custom Vulnerable AI Apps

Here’s what you can build and attack:

App Type
What to Include

Chatbot

Improperly protected system prompts; test role hijacking

Code generation assistant

Output-to-action flow (e.g., generating shell commands)

AI agent (AutoGPT-style)

API calling, browsing, tool use = abuse surface

RAG-based QA bot

Vector store poisoning, semantic injection

Multi-agent simulation

Chain of models; break via role escalation or context leaking

➡️ Want a prebuilt lab? I can create a Docker Compose setup with:

  • Local LLM (Mistral or Vicuna)

  • LangChain app with poorly sanitized input

  • Document ingestion (PDF/Markdown poisoning)

  • Optional tool-use access (e.g., shell or web)


🔐 Defensive Techniques & Countermeasure Testing

Technique
Defense Tool

Prompt sanitization

LLM-Guard, Rebuff.ai

Role enforcement

Function-calling + schema validation

Output filtering

Claude safety rules, GPT moderation API

Vector store filtering

LangChain filters, text embeddings classifiers

Rate limiting

API gateway (e.g., Kong, FastAPI throttle)

Content fingerprinting

Detect tampered or injected content


📚 Next-Level Resources to Follow

Resource
Why It’s Useful

OWASP Top 10 for LLMs (2024)

Comprehensive threat modeling

Robust Intelligence Blog

Deep dives into real AI failures

Anthropic & OpenAI Red Team Reports

Real-world testing tactics and findings

AI Hacking Handbook (coming soon)

First hands-on book on AI security

AI Incident Database (AIAAIC)

Logs actual AI security and safety failures


Last updated

Was this helpful?