The AI code review benchmark your LLM doesn't want to see.
Three steps. No scripts. You talk to your AI, it reviews the code, you score the result.
Open your AI tool in the repo folder. Load your skills if you have any. Pick a vault:
That's it. Or drag and drop your review.json below.
Works with any AI tool that can read files: Claude Code, Cursor, Windsurf, Codex, Gemini, Aider, ... The AI reads the code itself using its own tools. No subprocess, no API proxy, no file pasting. That's what makes it a real test.
Six dimensions of code review ability, weighted by importance.
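As a rough sketch of how weighted scoring can work, here is a minimal Python example; the dimension names and weights below are placeholders for illustration, not the benchmark's actual rubric.

```python
# Sketch only: these dimension names and weights are placeholders,
# not the benchmark's actual rubric.
DIMENSION_WEIGHTS = {
    "bug_detection": 0.35,
    "severity_ranking": 0.20,
    "false_positive_rate": 0.15,
    "fix_quality": 0.15,
    "explanation_clarity": 0.10,
    "style_nits": 0.05,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one weighted total."""
    return sum(DIMENSION_WEIGHTS[d] * scores.get(d, 0.0) for d in DIMENSION_WEIGHTS)

# Example: a review that is strong on detection but weaker on fix quality.
print(weighted_score({
    "bug_detection": 0.70,
    "severity_ranking": 0.60,
    "false_positive_rate": 0.80,
    "fix_quality": 0.40,
    "explanation_clarity": 0.75,
    "style_nits": 0.90,
}))  # about 0.665
```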
Three benchmark tiers. Pick your level.
The vanilla results are in. Now it's your turn.
Build a skill, a prompt, or a workflow that finds more bugs than vanilla AI. Share it openly. The best skill ships to the community and gets used in real code reviews.
100 planted bugs per project. 42 intentionally broken apps. One fair benchmark. The best skill wins - and helps make the internet a bit harder to exploit.
Upload your review.json to score it against encrypted ground truth.
Drop review.json here or click to browse
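For reference, a minimal sketch of how a review.json could be put together; the field names here are an assumption for illustration, not the benchmark's documented schema.

```python
import json

# Sketch only: these field names are an assumption about what a review.json
# could contain, not the benchmark's documented schema.
review = {
    "project": "example-vault-app",  # hypothetical project identifier
    "findings": [
        {
            "file": "src/auth/login.py",  # hypothetical path
            "line": 42,
            "severity": "high",
            "title": "Password check is not constant-time",
            "description": "String comparison with == leaks timing information.",
        }
    ],
}

with open("review.json", "w") as f:
    json.dump(review, f, indent=2)
```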