Why AI SOC Agents Fail - What Decades of Agent Research Teaches Us

Recently, we’ve seen two fascinating but seemingly contradictory stories about AI agents. In one corner, Google DeepMind announced that their AI system achieved gold-medal performance at the International Mathematical Olympiad (IMO) — a feat matched by only a tiny fraction of human contestants and a genuine breakthrough in AI reasoning capabilities.

In the other corner, Carnegie Mellon researchers created “The Agent Company,” a simulated workplace where they tested ten different AI agents on everyday office tasks. The results were humbling: even the best performer, Claude 3.5 Sonnet, completed only 24% of assigned tasks, while most others failed spectacularly.

How can both stories be true? 

This paradox has implications for cybersecurity teams rushing to deploy AI agents in their Security Operations Centers (SOCs). The promise is real, but decades of agent research have identified fundamental challenges that the current wave of LLM-based agents hasn’t magically solved. Understanding these long-standing problems isn’t just academic—it’s the difference between building effective security tools and shipping agents that frustrate analysts and miss real threats.

The Action Space Explosion Problem

Every time you turn around, it seems like there’s a new Model Context Protocol (MCP) server popping up. Each day brings new integrations: GitHub repositories, database connectors, API wrappers, specialized security tools. The prevailing wisdom seems to be “more tools equals better agents”—give your AI access to everything and let it figure out what to use.

That’s not how things work though, and agent researchers have understood why for a long time.

Take the game of Go, for example. There are 361 possible opening moves on the standard 19×19 board, and thinking ahead just a few moves leads to an astronomical number of possible board positions.

Now imagine your SOC agent facing a suspicious network alert from a workstation. Should it:

  • Query the SIEM for related events?
  • Check threat intelligence feeds?
  • Examine firewall logs?
  • Pull DNS records?
  • Investigate user behavior analytics?
  • Review endpoint detection data?
  • Correlate with vulnerability scanners?
  • Reach out to the end user?
  • etc.

An experienced analyst knows where to start. They’ve developed heuristics through years of practice. Your agent, faced with the same scenario, might spend its time reasoning through dozens of possible starting points. Worse, it might choose a suboptimal path and get lost in low-level network packet analysis when a simple endpoint query would have sufficed.

With 20 available tools, there might be hundreds of reasonable investigation paths. With 50 tools—increasingly common in modern security stacks—the combinatorial explosion becomes overwhelming. Unlike Go, where the rules are precisely defined, security investigation requires understanding subtle relationships between different data sources and tools.

This is the curse of dimensionality in action. Each additional tool doesn’t just add one more option—it multiplies the complexity of decision-making at every step. Classical AI research identified this problem decades ago, yet we’re seeing teams deploy agents with sprawling tool sets and wonder why they perform poorly.
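
A rough back-of-the-envelope count makes the scale concrete. The sketch below simply multiplies out the number of possible tool-call sequences as the tool count grows; the assumptions that each tool exposes about three distinct query types and that an investigation chains four calls are illustrative, not measured.

```python
# Back-of-the-envelope: how fast does the space of investigation paths grow?
# Assumes (illustratively) that each tool offers a few distinct query types
# and that an investigation chains several tool calls in sequence.

def candidate_paths(num_tools: int, queries_per_tool: int = 3, steps: int = 4) -> int:
    """Upper bound on ordered sequences of tool calls in a fixed-depth investigation."""
    actions_per_step = num_tools * queries_per_tool
    return actions_per_step ** steps

for tools in (5, 20, 50):
    print(f"{tools:>2} tools -> ~{candidate_paths(tools):,} possible 4-step call sequences")
```

Even if only a sliver of those sequences are sensible, the agent still has to tell the sensible ones apart from the rest at every step.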

The solution isn’t to abandon MCP tool diversity, but to understand how to manage complexity the way successful agent systems always have: through intelligent abstraction and carefully designed constraints.
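
What “intelligent abstraction and carefully designed constraints” can look like in practice is a routing layer: rather than exposing every integration at once, a thin mapping hands the model a small, curated tool set for each alert category. The sketch below is one hedged way to express that; the categories, tool names, and the run_agent stub are invented for illustration, not part of any real framework.

```python
# A minimal sketch of constraining the action space per alert category.
# The categories, tool names, and run_agent() stub are hypothetical
# placeholders for whatever your alerting pipeline and agent framework provide.

TOOL_PLAYBOOKS: dict[str, list[str]] = {
    "suspicious_login":  ["siem_search", "identity_provider_logs", "user_contact"],
    "malware_detection": ["edr_query", "threat_intel_lookup", "sandbox_detonation"],
    "dns_anomaly":       ["dns_logs", "threat_intel_lookup", "firewall_logs"],
}

def run_agent(alert: dict, tools: list[str]) -> str:
    # Stand-in for an agent framework call; here it only records the constraint.
    return f"investigating {alert['category']} with: {', '.join(tools)}"

def investigate(alert: dict) -> str:
    allowed = TOOL_PLAYBOOKS.get(alert["category"], [])
    if not allowed:
        # Unknown categories go to a human rather than to "every tool we have".
        return "escalate_to_analyst"
    return run_agent(alert, tools=allowed)

print(investigate({"category": "suspicious_login"}))
```

The specific mapping matters less than the effect: the branching factor at every step stays small enough for the model to reason about.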

The Alignment Problem

A persistent challenge in agent research is ensuring that AI systems pursue the goals we actually want them to achieve, rather than finding unexpected ways to game the metrics we think we’ve specified. This is known as the alignment problem, and it’s plagued agent systems for decades. The core issue is that what we can easily measure and reward can be a poor proxy for what we actually care about in complex, real-world environments.

In SOC environments, misaligned behavior can have serious security consequences. Consider an agent tasked with reducing mean time to response (MTTR). A poorly aligned system might discover that the fastest way to optimize this metric is to immediately send generic “investigating” responses to every alert, technically satisfying the response requirement while providing zero investigative value. Even more dangerously, an agent optimized for “reducing false positives” might become overly conservative, missing novel attack patterns to avoid generating alerts that could be wrong.

SOC work inherently involves navigating competing objectives that aren’t always easily quantifiable: investigation thoroughness versus analyst time, confidentiality versus availability, immediate response versus long-term threat hunting. Experienced analysts balance these tradeoffs through institutional knowledge and contextual understanding that LLM-based agents simply don’t possess. The lesson for practitioners is that successful SOC agents require carefully designed reward structures and explicit constraints that account for these nuanced tradeoffs—not just prompts asking the system to optimize for easily measurable outcomes.
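
As a hedged illustration of what “carefully designed reward structures and explicit constraints” might mean, the sketch below scores an investigation on several dimensions and refuses to credit a fast closure unless minimum evidence and accuracy checks pass. The fields, thresholds, and weighting are invented for illustration; a real SOC would choose its own.

```python
from dataclasses import dataclass

@dataclass
class InvestigationOutcome:
    # Illustrative fields; a real system would pull these from case records
    # and periodic analyst spot checks.
    minutes_to_close: float
    evidence_items_reviewed: int
    verdict_matched_analyst_review: bool

def score(outcome: InvestigationOutcome) -> float:
    """Reward speed only after thoroughness and accuracy constraints are met."""
    # Hard constraints: a generic "investigating" reply that reviewed nothing,
    # or a verdict later overturned on review, earns no credit at all.
    if outcome.evidence_items_reviewed < 3:
        return 0.0
    if not outcome.verdict_matched_analyst_review:
        return 0.0
    # Only then does a faster close improve the score.
    return 1.0 + max(0.0, 60.0 - outcome.minutes_to_close) / 60.0

print(score(InvestigationOutcome(5.0, 0, True)))   # fast but empty -> 0.0
print(score(InvestigationOutcome(45.0, 6, True)))  # thorough and correct -> 1.25
```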

The Grounding Problem

LLMs are trained on vast amounts of human-generated data, absorbing patterns from billions of documents, code repositories, and conversations. This gives them remarkable fluency in discussing cybersecurity concepts—they can explain SQL injection attacks, describe incident response procedures, and even generate plausible-sounding security policies. But there’s a difference between linguistic fluency and operational expertise. An LLM might know that Splunk uses SPL (Search Processing Language) and can even generate syntactically correct queries, but it lacks the practical knowledge that experienced analysts possess: which fields are actually populated in your specific environment, why certain searches timeout, or how subtle syntax choices can dramatically impact query performance.
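
To make that gap concrete, here is the kind of query an LLM can readily produce: syntactically plausible SPL, wrapped in Python purely for illustration, with comments marking what is actually an environment-specific assumption rather than a universal default.

```python
# A query like this is easy for an LLM to generate and is syntactically
# plausible SPL, but whether it returns anything useful is entirely
# environment-specific. The index, sourcetype, fields, and threshold below
# are assumptions, not universal defaults.

FAILED_LOGON_SPIKE = r"""
search index=wineventlog sourcetype="WinEventLog:Security" EventCode=4625
| stats count by src_ip, user
| where count > 20
"""

# Things only someone grounded in *this* environment would know:
# - Is Windows security data really in an index named "wineventlog"?
# - Is src_ip extracted here, or does it only exist as Source_Network_Address?
# - Does a threshold of 20 failures bury analysts in noise during patch windows?
print(FAILED_LOGON_SPIKE)
```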

In SOC environments, this grounding gap is challenging because effective security investigation requires deep contextual knowledge that extends beyond tool documentation. An experienced analyst doesn’t just know how to craft SIEM queries—they understand their specific environment. They know that network anomalies from the engineering subnet often correlate with legitimate testing, that certain marketing systems generate predictable false positives during campaign launches, or that particular log sources have reliability issues. This situated knowledge, accumulated through months or years of working with specific infrastructure, data sources, and business processes, is what transforms raw tool access into effective threat detection and response.

The key insight for SOC teams is that grounding problems can be addressed through thoughtful system design rather than avoided entirely. This might involve fine-tuning agents on organization-specific data, building feedback loops that help agents learn from analyst corrections, or implementing knowledge bases that capture institutional expertise about tool behavior and environmental context. The goal isn’t to expect agents to arrive with perfect domain knowledge, but to create mechanisms that allow them to develop the situated understanding that makes security tools truly effective in your specific operational context.
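
A minimal sketch of one such mechanism, assuming nothing in particular about your stack: a small institutional-knowledge store that records analyst corrections and attaches the relevant notes to findings before the agent reasons over them. The file name, entity keys, and structure are all illustrative.

```python
import json
from pathlib import Path

# A deliberately simple institutional-knowledge store: notes keyed by the
# entity they describe (a subnet, a host, a log source), appended whenever
# an analyst corrects or contextualizes an agent finding.

KB_PATH = Path("soc_context.json")  # hypothetical location

def load_context() -> dict[str, list[str]]:
    return json.loads(KB_PATH.read_text()) if KB_PATH.exists() else {}

def record_correction(entity: str, note: str) -> None:
    """Capture analyst feedback so future investigations inherit it."""
    context = load_context()
    context.setdefault(entity, []).append(note)
    KB_PATH.write_text(json.dumps(context, indent=2))

def annotate_finding(finding: str, entities: list[str]) -> str:
    """Attach environment-specific notes to a finding before the agent sees it."""
    context = load_context()
    notes = [n for e in entities for n in context.get(e, [])]
    return finding if not notes else finding + "\nKnown context:\n- " + "\n- ".join(notes)

# Example: the engineering-subnet caveat from the text, captured once and reused.
record_correction("10.20.0.0/16", "Engineering subnet; anomalies often correlate with legitimate testing.")
print(annotate_finding("Port scan observed from 10.20.4.7", ["10.20.0.0/16"]))
```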

The Path Forward

The different results Google DeepMind and Carnegie Mellon achieved show that your AI agent mileage may vary. CMU’s workplace simulation, despite being artificial, introduced the kind of ambiguity, tool complexity, and competing objectives that make real-world agent deployment challenging. Though we don’t know the exact details behind DeepMind’s solution, they did mention “parallel thinking” and “novel reinforcement learning techniques for multi-step reasoning,” providing a hint as to how they are dealing with the challenges AI agents face when solving complex, open-ended problems.

The path forward for SOC agents isn’t about solving the fundamental challenges we’ve discussed—it’s about designing systems that work within these constraints. Here’s how successful teams are approaching it:

  • Begin with constrained, high-value tasks: Instead of building agents that can “investigate any security alert,” start with agents that excel at one specific type of investigation—perhaps credential stuffing attacks or known malware signatures. Give these agents access to 3-5 carefully chosen tools rather than your entire security stack (see the sketch after this list).
  • Measure what matters: Establish metrics that capture real investigative value rather than superficial productivity. Track whether agents are missing genuine threats, reducing analyst cognitive load, and improving overall security posture—not just response times or ticket closure rates.
  • Build grounding through iteration: Create mechanisms to capture the institutional expertise that makes experienced analysts effective—which data sources are reliable, which investigation paths typically yield results, what constitutes normal behavior in your specific environment. Use the data and feedback to develop an environment-specific knowledge base.
  • Add complexity when justified: Each additional tool, capability, or decision point should solve a demonstrated problem rather than adding theoretical flexibility. Watch out for the combinatorial explosion that derails many agent deployments.
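
Pulling the first and last bullets together, here is a hedged sketch of what a deliberately narrow first deployment might look like as configuration: one alert type, a handful of named tools, explicit success metrics, and a hard escalation rule. Every name is a placeholder for whatever actually exists in your environment.

```python
from dataclasses import dataclass

# Hypothetical configuration for a narrow first deployment: one investigation
# type, a handful of tools, explicit success criteria, and a hard rule for
# handing off to a human. All names are placeholders.

@dataclass
class AgentScope:
    alert_type: str
    allowed_tools: list[str]
    success_metrics: list[str]
    max_tool_calls: int = 10  # cap on actions per investigation
    escalation_rule: str = "hand off to an analyst if confidence < 0.8"

credential_stuffing_agent = AgentScope(
    alert_type="credential_stuffing",
    allowed_tools=[
        "identity_provider_logs",  # failed/successful auth events
        "threat_intel_ip_lookup",  # reputation of source addresses
        "siem_search",             # correlation across log sources
        "user_notification",       # confirm activity with the account owner
    ],
    success_metrics=[
        "true_positive_rate_vs_analyst_review",
        "analyst_minutes_saved_per_case",
        "missed_compromise_count",  # the metric that must not regress
    ],
)
```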

The decades of agent research that preceded the current LLM wave aren’t obsolete. The action space explosion, alignment challenges, and grounding problems haven’t disappeared; they’ve been temporarily masked by the impressive linguistic capabilities of large language models.

But this creates an opportunity. Teams that understand these constraints and design around them will build the agents that actually transform security operations. The question isn’t whether AI will change SOC work—it’s whether your implementation will be among those that succeed by respecting what decades of research have taught us.

The most successful SOC agents won’t be those with access to the most tools or the largest context windows. They’ll be the ones that solve clearly defined problems better than existing alternatives, with complexity introduced gradually as their capabilities prove reliable.