Forecasting the Rise of AI in Offensive Cybersecurity: From Prediction to Reality


Forecasting and Prediction in Innovation and Cybersecurity

Forecasting in a business innovation context involves predicting future trends, technologies, and market shifts to guide strategic decisions. In rapidly evolving fields like cybersecurity, forecasting is especially critical. Organizations regularly conduct predictive analyses to anticipate emerging threats and disruptive technologies, enabling them to innovate proactively and enhance their defenses. For example, forecasting in cybersecurity is deemed "crucial due to the ever-evolving nature of threats and the rapid pace of technological change" (Toner & Eckersley, 2025). Businesses and security professionals can develop innovations and countermeasures by predicting what attackers might do next or anticipating the emergence of new tools. This alignment of foresight with strategy ensures that innovation addresses not only current needs and challenges in the cyber threat landscape but also those that may emerge in the future.

Within cybersecurity, forecasting often takes the form of threat intelligence reports and trend analyses published by experts. These predictions help organizations prepare for "the next wave of attacks" (Toner & Eckersley, 2025), driving innovation in security products and practices. Crucially, effective forecasting in this domain is not merely a technical exercise, but a key component of business strategy. Companies anticipating shifts—such as the rise of cloud computing or the misuse of artificial intelligence—can innovate their services and policies early, gaining a competitive and protective edge. In sum, forecasting and prediction bridge innovation (envisioning and building future solutions) and cybersecurity (shielding against future threats), enabling stakeholders to navigate uncertainty with informed strategies.

Predicted Emergence of AI-Driven Offensive Cyber Operations

One significant forecast in recent years has been that artificial intelligence (AI) will become a game-changer in offensive cyber operations. Before the advent of today's advanced AI systems, security experts had warned that AI could amplify cyberattacks. Toner and Eckersley (2025) foresaw that "the straightforward application of contemporary and near-term AI to cybersecurity offense can be expected to increase the number, scale, and diversity of attacks." In other words, applying AI to hacking was expected to boost attackers' capabilities significantly. The same report noted that "cybersecurity is an arena that will see early and enthusiastic deployment of AI technologies, both for offense and defense" (Toner & Eckersley, 2025) – a prescient observation that offensive actors would likely embrace AI as soon as it became viable.

By the early 2020s, this prediction zeroed in on large language models (LLMs) as a particularly effective AI tool for cyber offense. LLMs are AI systems trained on vast amounts of text and code, giving them an almost encyclopedic knowledge and the ability to generate human-like text. Experts predicted that as LLMs became more powerful, they could automate tasks that previously required highly skilled human hackers. For instance, an LLM might draft convincing phishing emails, write or debug exploit code, or even strategize steps in a network intrusion. The prospect of AI competing in the adversarial landscape of cybersecurity was considered "one of the most impactful, challenging, and potentially dangerous applications of AI" (Kouremetis et al., 2025).

The implication was clear: if an AI could attain expert-level proficiency in cyber offense, it would lower the barrier to executing sophisticated attacks. Offensive Cyber Operations (OCO) have traditionally required "highly educated computer operators, multidisciplinary teams, and a heavily resourced sponsoring organization to succeed" (Kouremetis et al., 2025). A capable AI could change this paradigm, potentially allowing even less-resourced actors to conduct high-level attacks. Security forecasters thus began closely watching the development of LLMs, believing these models might soon fulfill the prediction of AI-driven offensive cyber capabilities.

From Prediction to Reality: AI/LLMs in Offensive Cyber Operations Today

Today, we are witnessing this prediction come to fruition. Advances in AI – especially large language models (LLMs) – have reached a point where they can assist in or conduct offensive cyber operations in non-trivial ways. MITRE's recent Offensive Cyber Capability Unified LLM Testing (OCCULT) framework, which rigorously evaluates state-of-the-art LLMs on offensive cybersecurity tasks, provides striking validation of this shift. The OCCULT evaluation results reveal that modern AI models can perform many hacking-related tasks at a competency level approaching that of skilled humans.

For example, in MITRE's tests, an advanced 671-billion-parameter model (DeepSeek-R1) was "capable of correctly answering over 90% of challenging offensive cyber knowledge tests" (Kouremetis et al., 2025). This benchmark, the Threat Actor Competency Test for LLMs (TACTL), covered dozens of real-world attack techniques. Achieving a success rate of over 90% on such a test demonstrates an unprecedented level of offensive cyber knowledge in an AI system. Not long ago, no AI could reliably pass such domain-specific exams; now, one has essentially aced it.
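
To make concrete what a knowledge benchmark like TACTL involves, the minimal sketch below scores a model on multiple-choice questions and reports accuracy. The sample question, the query_model() placeholder, and the A-D answer format are illustrative assumptions, not the actual OCCULT/TACTL harness.

```python
# Minimal sketch of scoring an LLM on a multiple-choice knowledge benchmark
# in the spirit of TACTL. Everything here is an illustrative stand-in.

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM API call; returns a canned answer here."""
    return "A"  # swap in a call to whichever model is being evaluated

def score_benchmark(questions: list[dict]) -> float:
    correct = 0
    for item in questions:
        options = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"])
        )
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        reply = query_model(prompt).strip().upper()
        # Count the item as correct if the expected letter leads the reply.
        if reply.startswith(item["answer"]):
            correct += 1
    return correct / len(questions) if questions else 0.0

sample_questions = [
    {
        "question": "Which ATT&CK tactic does credential dumping support?",
        "choices": ["Credential Access", "Exfiltration", "Persistence", "Discovery"],
        "answer": "A",
    },
]

print(f"Accuracy: {score_benchmark(sample_questions):.1%}")
```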

Another facet of OCCULT's evaluation involved placing LLMs in simulated cyber-attack scenarios to assess their ability to act as an offensive agent. The findings were eye-opening. Meta's latest open-source models (Llama 3.1 family) and Mistral's models showed "marked performance improvements over earlier models" when acting as offensive agents in high-fidelity simulations (Kouremetis et al., 2025). In a controlled network environment, some LLM-driven agents could perform multi-step attack sequences such as lateral movement and privilege escalation, albeit with inefficiencies (Kouremetis et al., 2025).
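
Conceptually, these agentic evaluations reduce to a loop in which the model observes a simulated environment, proposes the next action, and is scored on whether it reaches an objective within a step budget. The sketch below illustrates that loop with toy stand-ins (SimulatedRange, propose_action); it is not MITRE's actual simulation environment, and the "attack" here is purely symbolic.

```python
# Abstract sketch of an agent-style evaluation loop: the model proposes the
# next action, a simulated environment applies it, and the harness records
# whether the objective was reached within a step budget.
from dataclasses import dataclass, field

@dataclass
class SimulatedRange:
    """Toy stand-in for a high-fidelity simulated network."""
    goal_reached: bool = False
    log: list = field(default_factory=list)

    def observe(self) -> str:
        # Report the latest state; before any action, only the initial foothold exists.
        return self.log[-1] if self.log else "initial foothold on workstation-1"

    def apply(self, action: str) -> str:
        self.log.append(f"executed: {action}")
        if "objective" in action:
            self.goal_reached = True
        return self.log[-1]

def propose_action(observation: str, history: list) -> str:
    """Placeholder for an LLM call that maps an observation to the next action."""
    # A real harness would send the observation and history to the model here.
    return "declare objective reached" if len(history) > 3 else f"enumerate hosts from ({observation})"

def run_episode(env: SimulatedRange, max_steps: int = 10) -> dict:
    history = []
    for step in range(max_steps):
        action = propose_action(env.observe(), history)
        env.apply(action)
        history.append(action)
        if env.goal_reached:
            return {"success": True, "steps": step + 1}
    return {"success": False, "steps": max_steps}

print(run_episode(SimulatedRange()))
```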

Notably, even OpenAI's GPT-4, a general-purpose model not trained explicitly for hacking, demonstrated 85% accuracy on the TACTL knowledge test with rapid response times, suggesting its viability as a "red-team co-pilot" for human hackers (Kouremetis et al., 2025). In practical terms, a skilled attacker could leverage an AI like GPT-4 to expedite reconnaissance, generate exploit code, or automate certain aspects of an attack chain.
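
As a hedged illustration of that co-pilot role, the snippet below shows how an analyst on an authorized engagement might ask a hosted model to summarize port-scan output into prioritized findings. It assumes the openai Python package (v1 or later) and an API key in the environment; the model name and prompts are illustrative choices, not a recommendation from the OCCULT study.

```python
# Illustrative "red-team co-pilot" call for an authorized penetration test:
# summarize scan output and suggest what to verify next.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

scan_excerpt = """
22/tcp  open  ssh   OpenSSH 7.4
80/tcp  open  http  Apache httpd 2.4.6
445/tcp open  smb   Samba 4.10.16
"""

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative choice; any capable chat model works
    messages=[
        {
            "role": "system",
            "content": "You assist an authorized penetration test. "
                       "Summarize findings and suggest what to verify next.",
        },
        {"role": "user", "content": f"Summarize this port scan:\n{scan_excerpt}"},
    ],
)

print(response.choices[0].message.content)
```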

Beyond research labs, there is real-world evidence that threat actors are adopting AI. In 2023, the EU's law enforcement agency Europol warned that criminal misuse of LLMs was on the horizon, noting that "ChatGPT's ability to draft highly realistic text makes it a useful tool for phishing purposes" and that even attackers with little technical skill could use it to produce malware code (Chee, 2023). By mid-2023, underground forums were reportedly touting custom-trained criminal LLMs (dubbed "WormGPT" and "FraudGPT"), built without ethical safeguards and explicitly marketed for generating phishing schemes and malware.

Forces Driving the Prediction to Reality

Rapid Advancement of LLM Capabilities and Integration into Cyber Tasks

The first major force behind this development is the rapid advancement of large language models and their integration into cybersecurity tasks. Over the past few years, LLMs have evolved from niche prototypes to extremely powerful general AI systems. Models like OpenAI's GPT series, Google's PaLM, Meta's LLaMA, and others have grown dramatically in size and sophistication, leading to substantial performance improvements.

These models learned to write code, understand vulnerabilities, and reason about complex problems—capabilities directly relevant to cyber operations. Crucially, researchers have also fine-tuned LLMs on security-specific data, such as datasets of malware code or networking scripts, thereby enhancing their utility in offensive scenarios. The outcome of this progress is evident in the OCCULT evaluations, which found "significant recent advancement in the risks of AI being used to scale realistic cyber threats" (Kouremetis et al., 2025).
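
As a rough illustration of what security-specific fine-tuning data can look like, the snippet below assembles a couple of benign instruction-response pairs in the common JSONL format many fine-tuning pipelines accept. The field names, file name, and examples are assumptions for illustration, not any particular research dataset.

```python
# Illustrative sketch of a security-domain instruction-tuning dataset in JSONL
# form. The records are benign examples; no specific toolchain is implied.
import json

records = [
    {
        "prompt": "Summarize the impact of a SQL injection flaw in a login form.",
        "completion": "An attacker can alter the query to bypass authentication "
                      "or read data they should not see; parameterized queries fix it.",
    },
    {
        "prompt": "Classify this log line: 'Failed password for root from 10.0.0.5'",
        "completion": "Likely SSH brute-force activity against the root account.",
    },
]

with open("security_tuning_data.jsonl", "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")
```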

Moreover, the integration of LLMs into cybersecurity workflows – both legitimate and malicious – accelerated this trend. On the defensive side, security teams began experimenting with AI assistants for code review, threat hunting, and automated incident response. On the offensive side, penetration testers and red teams also tested AI as a tool for generating payloads or identifying misconfigurations. Each successful integration provided proof of concept that fueled further adoption.
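
On the defensive side, such an integration can be as simple as batching suspicious events and asking a model to rank them for an analyst. The sketch below assumes a generic ask_model() placeholder rather than any specific product or API.

```python
# Illustrative defensive use: ask a model to triage a batch of alerts so an
# analyst knows where to look first. ask_model() is a placeholder.

def ask_model(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM and return its reply."""
    return "1) repeated failed root logins  2) encoded PowerShell from Word  3) spooler restart"

events = [
    "Failed password for root from 10.0.0.5 (x47 in 2 minutes)",
    "powershell.exe -enc ... spawned by winword.exe",
    "Print spooler service restarted on FILESRV-02",
]

prompt = (
    "You are assisting a security operations analyst. Rank these events by "
    "how urgently they should be investigated and explain why:\n- "
    + "\n- ".join(events)
)

print(ask_model(prompt))
```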

For threat actors, the availability of open-source LLMs, such as the LLaMA family released by Meta, has been a catalyst: anyone can now obtain a powerful model and customize it without the restrictive safeguards typically found in commercial models. This democratization of AI tech means that even relatively unsophisticated hackers can wield an LLM to amplify their capabilities. Europol noted that an LLM can enable criminals with minimal skills to produce malicious tools and highly convincing social engineering lures (Chee, 2023).

Benchmarking Frameworks and Rigorous Evaluation (OCCULT)

The second force making this forecast a reality is the development of standardized benchmarking frameworks for offensive AI, exemplified by MITRE's OCCULT. Having powerful AI models is one part of the story; the other is measuring and proving what these models can do in a cyber operations context.

Before OCCULT, assessments of AI's hacking abilities were ad hoc. These sporadic tests left a gap in understanding AI's true potential (and limitations) in offensive roles. OCCULT addressed this by providing a systematic and repeatable way to evaluate AI or LLMs on tasks that mirror real cyberattacks (Kouremetis et al., 2025). It introduced rigor through multiple benchmark tests and a structured "operational evaluation framework" (Kouremetis et al., 2025), ensuring that results are comparable and grounded in authentic threat scenarios.

MITRE's methodology aims to standardize the testing of AI systems' capabilities to autonomously execute, or assist in executing, cyberattacks (Kouremetis et al., 2025). By aligning test scenarios with the MITRE ATT&CK matrix of tactics and techniques, OCCULT ensures that evaluations cover the core behaviors adversaries rely on, such as lateral movement, privilege escalation, and credential access. When an AI model passes an OCCULT test, it is succeeding at something recognizably tied to real attacker behavior, lending credibility to claims that LLMs have the potential to democratize sophisticated hacking techniques (Kouremetis et al., 2025).
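
To illustrate what ATT&CK alignment might look like in practice, the hypothetical sketch below tags evaluation scenarios with real ATT&CK technique IDs and computes simple coverage and pass-rate figures. The scenario names and data structure are assumptions, not OCCULT's actual schema; the technique IDs themselves are genuine ATT&CK entries.

```python
# Hypothetical sketch of tagging evaluation scenarios with MITRE ATT&CK
# technique IDs so results map back to recognizable adversary behavior.
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    attck_techniques: tuple  # ATT&CK technique IDs covered by the scenario
    passed: bool

scenarios = [
    Scenario("move laterally to file server", ("T1021",), True),   # Remote Services
    Scenario("escalate to local admin", ("T1068",), False),        # Exploitation for Privilege Escalation
    Scenario("harvest cached credentials", ("T1003",), True),      # OS Credential Dumping
]

coverage = sorted({tid for s in scenarios for tid in s.attck_techniques})
pass_rate = sum(s.passed for s in scenarios) / len(scenarios)
print(f"Techniques covered: {coverage}; pass rate: {pass_rate:.0%}")
```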

By openly publishing its evaluation results and methodology, the OCCULT team has galvanized the research community. The framework provides a yardstick for progress: future models can be tested in the same way, enabling direct comparison of their performance on offensive cybersecurity tasks. In short, creating a standard evaluation framework was pivotal in translating the prediction into reality, as it established the feedback loop needed to measure and compare AI models' offensive capabilities. It also provided convincing evidence to the broader community (and skeptics) that LLMs can perform at an expert level in cyber operations.

Conclusion

In the interplay of business innovation and cybersecurity, the case of AI in offensive cyber operations illustrates how a forward-looking prediction can rapidly become a reality. Experts forecasted that advances in AI, particularly large language models, would empower threat actors and transform the cyber threat landscape. We now see that transformation underway.

From a business and innovation perspective, this development carries both a warning and an opportunity. The warning is that organizations must contend with a threat landscape in which attackers can wield AI-amplified capabilities, meaning security strategies and defenses need to evolve in parallel. The opportunity lies in harnessing the same technologies for defensive purposes and in fostering innovation that anticipates these AI-driven threats.

References

Toner, H., & Eckersley, P. (2025). The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. Academia.edu. https://www.academia.edu/36896842/The_Malicious_Use_of_Artificial_Intelligence_Forecasting_Prevention_and_Mitigation

Chee, F. Y. (2023, March 27). Europol sounds alarm about criminal use of ChatGPT, sees grim outlook. Reuters. https://www.reuters.com/technology/europol-sounds-alarm-about-criminal-use-chatgpt-sees-grim-outlook-2023-03-27/

Kouremetis, M., Dotter, M., Byrne, A., Martin, D., Michalak, E., Russo, G., Threet, M., & Zarrella, G. (2025). OCCULT: Evaluating large language models for offensive cyber operation capabilities (arXiv No. 2502.15797). arXiv. https://ar5iv.labs.arxiv.org/html/2502.15797
