──────────────────────────────────────────────────────────────────────
Watchful protectors
in the age of AI.
We are Shadow-LLM-Guardians — a working group of researchers, red teamers, and engineers cataloguing the failures of frontier AI systems. The archive is the first surface. The team is forming.
The plan, in three acts
The Archive
Every documented failure case — hallucinations, jailbreaks, prompt injections, agent loops, destructive actions, over-refusals, sycophancy. Reproducibility, threat model, and provenance attached to every entry. Citable by paper, by analyst, by anyone.
Reproducers & Defenses
Open toolchains that re-execute submitted cases against current model versions. Regression dashboards. Defense recipes. A growing benchmark suite the next blue-team engineer can pull and run.
A Standing Red/Blue Team for the AI Age
Shadow-LLM-Guardians began as a domain registered in 2023. It will not stay an archive. The long game: a permanent, independent attack-defense capability for the systems the rest of the world depends on but rarely audits.
Scope
- ▸hallucinations (factual / citation / code)
- ▸jailbreaks (safety bypass)
- ▸prompt injection (direct / indirect)
- ▸agent loops (infinite or repetitive tool calls)
- ▸tool misuse (wrong args, destructive shell verbs)
- ▸over-refusals (false-positive safety filters)
- ▸sycophancy and validation creep
- ▸alignment failures (deceptive / power-seeking / manipulative)
- ▸destructive actions (rm, drop table, force-push, send-email)
- ▸multimodal failures (vision, audio mishandling)
- ▸the long tail of weird behavior that has no name yet
- ▸attack tutorials with no defensive value
- ▸zero-day exploits before responsible disclosure
- ▸content that targets named individuals
- ▸anything that would harm vulnerable people if amplified
- ▸hot-takes without a reproducible artifact
- ▸model-bashing without a threat model
Operating principles
Every case states its reproducibility tier. We don't pretend a one-off observation is a benchmark, but we don't discard it either.
A failure with no realistic threat model is a curiosity. A failure with a clear who-gets-hurt-and-how is a research artifact. We push every entry toward the second.
If a case is a serious-harm 0-day, it belongs to the vendor's disclosure channel first. The sanitized version lives here after their window closes. See DISCLOSURE.md.
Issues, comments, reactions. No private database. Every contributor's identity is a public GitHub profile. No anonymous spam, no shadow moderation.
The archive runs on contributors. File a case. Reproduce someone else's. Pull a defense recipe and harden a deployment. The next decade of AI safety doesn't get written in any single lab — it gets written across the long tail of people who looked carefully at what broke and wrote it down.