Welcome to Tolexty's Blog: Show HN: I benchmarked LLM agents on fixing real-world security vulnerabilities https://ift.tt/KVkMXgL

I built a benchmark with 20 real CVEs across 18 Python projects (Pillow, GitPython, yt-dlp, urllib3, etc.). I ran it over 5 LLM agents (3 OpenAI, 2 poolside) and 3 different prompts (full advisory, locate, diagnose), for a total of 300 runs. The agents are tasked with fixing security vulnerabilities in a sandboxed environment, and they are scored against hidden security tests taken from the maintainers' own fix. The best solve rate was 50%. In the other 50%, some fixes look coherent and pass all regression tests, yet the vulnerability is still present. The main differentiator I found between models is cost: gpt-5.5 is 12× more expensive than gpt-5.4-mini while producing statistically similar results. Within-family performance gaps are small, which suggests that the difference between models is likely due to model training data. I also ran a power analysis, which puts the task count needed to detect a meaningful within-family edge at ~700.

Full write-up: https://giovannigatti.github.io/cve-bench Code: https://ift.tt/8yeu63I

June 4, 2026 at 09:43PM
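For a rough sense of why a 20-task benchmark cannot separate close models, here is a minimal sketch of a sample-size calculation for comparing two solve rates. It is not the method used in the write-up: it assumes a standard two-sided, two-proportion z-test with independent samples, and the solve rates passed in are hypothetical placeholders rather than measured results. The function name `tasks_per_arm` is made up for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def tasks_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Tasks needed per model to tell solve rate p1 from p2.

    Assumption: two-sided z-test on two independent proportions.
    A paired design (same tasks for both models) would need a different formula.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; a zero gap is undetectable")

    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power

    p_bar = (p1 + p2) / 2                        # pooled rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical gaps below a 50% baseline, not figures from the benchmark.
    for gap in (0.20, 0.10, 0.05):
        n = tasks_per_arm(0.50, 0.50 - gap)
        print(f"gap of {gap:.0%}: about {n} tasks per model")
```

The required task count grows roughly with the inverse square of the gap, so halving the edge you want to detect about quadruples the number of tasks needed.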
