On AI Security

What’s going on?

I’m working on security products at an observability vendor who happens to like dogs a lot. Our leadership was early in embracing AI and pushed it hard onto our engineering teams. This positioned us well to jump onto the AI security bandwagon and products came out (the incident assistant is really impressive and expensive) but the value proposition of the “secure your AI” pitch stayed very vague.

Upon reading the Thinkst debrief of RSAC 2026, I figured we’re not alone and I think I know why.

Tweet saying RSAC was more subdued this year. Although the floor was plastered in AI, AI protection & Agentic*, everyone knows its a placeholder while we figure things out... So it's more performative than normal: Vendors act like they have the solutions & attendees act like they believe it

The tl;dr is that I don’t think we know the risk we should be protecting against, and thus are building tools to go after attack vectors. Those attack vectors are incredibly hard to block generically and there is no push for someone to buy a security product (even a good one!) because the risk to their org is unclear.

Where am I coming from

Feel free to skip this section, but I wanted to state where I’m coming from, that this is not a hype piece and that I have expertise in what I’m talking about.

I have been writing code for around 15 years, and have cared about security since JailbreakMe 2.0. I care about understanding systems and believe I have a reasonably broad understanding, from hardware to the final user’s economics.

I’m an AI-skeptic who uses AI regularly. I read Ed Zitron and SemiAnalysis closely. I’m using it a few times a week, generally the most advanced available models (primarily Opus 4.6 med/high effort, but GPT5.4 and Gemini 3 too). I’m using it for basic automation, engineering in codebases I’m not familiar with, research and occasionally doing tedious PRs where I know precisely what I want done.

I think the current capabilities are already useful, and thus AI isn’t going anywhere even if progress stops. Specifically:

  • To write small scripts or perform actions that could technically be done with a few lines of Python/sed if you put your mind to it (and had 1h to burn because it’s hyper specific and you’re not familiar with pyxml or whatever)
  • To do initial research on a topic, ASSUMING you’ll then take the findings and validate them carefully (I saw Opus 4.5 cite a source and hallucinate the content multiple times with a mostly empty context)
  • To pull in a specialized expert assuming the stakes are low and that you can quickly evaluate whether it’s lying to you or not. The more obscure your ask, the more careful you have to be, even if it cites sources
  • Very flexibly connect data sources when the systems don’t have a good way to talk to one another and pushing some text is easier than pushing a service.
  • Interacting with the analog world: they’re very good at taking a representation of the real world and turning it into a format a digital system can use (although the demo I saw needed some serious denoising/consistency checks)

With that said, are LLMs going to be used at scale? Yes: for summarization (calls, incidents, long documents), coding assistance, likely robotics (until world models are a thing).

Are they going to replace coders? I don’t think so if your coders are any good: I’m not 100% convinced it actually helps with velocity since reviewing code isn’t faster or easier than writing it (and if you push a PR to a coworker without reviewing it first, you should be on call for it). Management believes in it and wants to be seen riding the AI wave so I’d be very careful taking published corporate reports at face value.

Are they going to destroy jobs? Probably not if your job couldn’t already have been destroyed by a basic script.

Are OpenAI/Anthropic going to manage to protect their high API margins if everyone needs inference at scale? No way. Open source models are good enough and close enough behind that the differentiated value from frontier models is pretty niche and won’t ever be used at scale when you start looking at actually industrializing usage. Coding agents are probably the biggest, most lucrative niche where the market is price insensitive and the users look for and notice the minute performance difference. However, that only applies when you pay your devs top of market. I can’t see a shitty consulting company paying even $50/mo/dev for good tokens. Most other deployments will prototype on frontier models, then migrate to cheaper models running on a third party inference provider.

Back to AI Security

Our goal, as a security industry, is not to protect IT systems. No one gives a shit if a server gets compromised. If I tell you someone hacked into something, the first question will either be “What was stored on it” or “What does it have access to”. What we’re trying to protect is:

  • The confidentiality of users & data
  • The integrity of the data, i.e. the system behaves like users expect it to
  • The availability of the system, i.e. it’s doing what it’s supposed to do when it’s supposed to do it

There is value in LLMs, so there will be risks and threats to protect against. However, until they’re clear, it’s almost impossible to build a compelling product. You have to guess the market, the application, the flexibility of the attacker (cf OpenClaw), the motivation of an attacker… It’s like throwing a dart, blindfolded, after travelling half an hour on a horse’s back.

In this post, I’m going to focus on attacks at runtime against a deployed model because that’s my background. I’m neither dismissive of, nor totally convinced by, a widespread threat at training time, mostly because I expect most deployments down the road to use ~ vanilla models. There are cool problems in the supply chain needed to run models but that’s similarly out of my scope.

The biggest security issues unique to running AI models were discovered quite quickly:

  • Prompt injection: get the model to talk about something it shouldn’t (data leak, malicious tool use…)
  • Denial of Service: models are expensive, it’s fairly easy to exhaust the budget or get rate limited

(Un?)fortunately, if you consider the model like a public API, none of those are really new threats.

A diagram of three boxes showing user inputs from the Internet being passed to an AI Agent that then makes tool calls

Replace AI Agent with API and it looks pretty standard

Moreover, we justify the cost of new security measures because we need to protect something of value. So far, I have only seen three compelling industrialized use cases for LLMs:

  1. Summarizing calls (where the likelihood of a malicious actor messing with the transcript is low) or documents
  2. Assisting in incident resolution by applying existing triage procedures and summarizing the current state of the incident for people joining in later (disclaimer, my employer has such a product, it’s really expensive but I like it)
  3. Support chatbot, but they have no autonomy and no one will honestly get mad if they’re prompted to say something crazy. Sometimes, they’re given access to too much data (for instance, data from all users in order to answer questions about this specific user’s data) but that’s not an AI issue.

Everyone is experimenting with AI, but unless it’s industrialized (exposed to adversarial inputs, with its output made available to others), we’re facing two problems when trying to build a security product.

The first is that because the threats are abstract, we can’t know which abstraction layer is best to target (the raw prompt? the steps the model is planning to take? individual tool use?) which is leading to a lot of effort going down a blind alley. The other is that there is no real point in spending much money protecting it: you’re paying to solve a risk and the risk is not clear. Right now, people are calling for AI security products because they’re “doing AI” and they use security products for the rest of their systems so obviously, they need an AI security product. I would bet that no CISO would defend this budget at the expense of endpoint or infrastructure security: it’s a niche of application security and nowhere near the top of the stack once hype dies down.

No one knows what an AI security product should protect against, because no one knows how AI is a risk. Prompt injection isn’t a risk, it’s a means by which a risk can be realized. There is no point in locking a door to an empty wood shed. If you want to build a business, you need urgency which means you need to know that there is something valuable to protect, and what threats apply to this valuable deployment.

Until AI is deployed at scale in user facing products and has autonomy to take action, there is no need for an AI security product. At best, there is a need for AI auditing to track cost and validate that your deployment is correct. It’s okay to think about threats and explore possible defenses but you’re not going to see significant deals until there is a credible threat that could impact one’s business. And if you identify how to deploy AI at scale to solve a problem worth significant money, you should not be in the security business, you should be building this :)

Who cares if someone is trying a prompt injection? What are you going to do? Block their IP after the fact? Prompt injection is a means to an end. It’s cool to build a technology to generically catch injections but in practice, these detectors are not very good and the field is moving so fast that current implementations are likely never going to be useful at scale. This is why I think most of the current investment in AI security will end in tears. Which is a bit unfortunate because I suspect we’re starting to see what those AI threats will look like and why going after prompt injections was so mistaken.

What about OpenClaw?

What if this use case wasn’t in B2B or B2C, despite our focus on those markets since they’re historically profitable for SaaS?

I think deployments like OpenClaw are incredibly interesting and possibly a killer feature.

This is interesting for us because the deployment can’t be nearly as segregated since the point is to act on external stimuli. The point is to receive external prompts (emails) and take action on them. This is precisely our worst case scenario and thus a stress test of where needs for security products could crop up. Prompt injection is the point of the product, so detecting it is meaningless. If you receive an email asking the agent to do something, it should do something! That’s the point of the product!

I believe there is an interesting problem to solve here. Let’s summarize what levers a security solution can have on the execution flow.

Same graph as before, but this time showing security measures that could be introduced: prompt scanning on the inputs, plan scanning at the agent stage and tool allowlisting

The opportunities to catch something going wrong

You can filter the prompt as it’s flowing into the model, you can limit the tools the agent is allowed to use or you can review what the model is trying to do.

Prompt scanning

This is what most products are focused on because it’s easy to deploy in enterprises (just intercept the user input).

However, it’s also the hardest to interpret:

  • You need the flexibility of an LLM to interpret it, which means your checker could also suffer from a prompt injection
  • If you want to be able to block, you need to run synchronously which adds latency. You can limit the issue by running asynchronously and blocking the response but that means the user will incur costs for evaluating the malicious prompt and malicious tool uses may have been executed.
  • The false positive & negative issues from WAFs will come up, for the same reasons: you’re not looking at how the system responds to the prompt
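The latency trade-off in the second bullet can be sketched as follows. This is a toy under stated assumptions: `scan_prompt` and `call_model` are hypothetical stand-ins (the keyword match is a placeholder, not a real detector), and the sleeps only simulate network calls.

```python
import asyncio


async def scan_prompt(prompt: str) -> bool:
    """Hypothetical scanner; returns True when the prompt looks malicious."""
    await asyncio.sleep(0.01)  # stands in for a slow classifier call
    return "ignore previous instructions" in prompt.lower()


async def call_model(prompt: str) -> str:
    """Hypothetical model call."""
    await asyncio.sleep(0.01)  # stands in for inference latency
    return f"answer to: {prompt}"


async def answer(prompt: str) -> str:
    # Start the scan and the model call together so the scan adds little latency.
    scan = asyncio.create_task(scan_prompt(prompt))
    reply = asyncio.create_task(call_model(prompt))
    flagged, text = await asyncio.gather(scan, reply)
    # The response is withheld, but the model has already run and been paid for.
    return "[blocked]" if flagged else text
```

Calling `asyncio.run(answer("hello"))` returns the model's reply, while a flagged prompt returns `[blocked]` only after the model call has completed, which is exactly the cost and side-effect problem described above.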

I don’t think there is a great future for those products, unless their use becomes mandatory (like WAFs).

Tool allowlisting

This is what Claude Code is doing: you have to manually approve each tool use. This ensures no malicious action can be taken, but is super annoying and requires high interactivity which is a no-go for a personal assistant. You could prepare an allow list of actions and tools the agent is allowed to use, but the more permissive, the higher the risk a malicious prompt could do something dangerous. The only solution here would be to apply different allowlists based on where the input is coming from.
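A minimal sketch of source-dependent allowlists, assuming a hypothetical agent harness that tags every trigger with its origin. The source categories, the tool names and `is_tool_allowed` are all made up for illustration and do not describe any real product.

```python
from enum import Enum


class Source(Enum):
    OWNER_CHAT = "owner_chat"        # the user talking to their own agent
    KNOWN_CONTACT = "known_contact"  # email from someone in the address book
    UNKNOWN_EMAIL = "unknown_email"  # email from anyone on the Internet


# Hypothetical tool names; the less trusted the source, the shorter the list.
ALLOWLISTS: dict[Source, frozenset[str]] = {
    Source.OWNER_CHAT: frozenset(
        {"read_calendar", "create_event", "send_email", "run_shell"}
    ),
    Source.KNOWN_CONTACT: frozenset({"read_calendar", "create_event"}),
    Source.UNKNOWN_EMAIL: frozenset({"read_calendar"}),
}


def is_tool_allowed(source: Source, tool: str) -> bool:
    """Default deny: a tool is usable only if the trigger's source lists it."""
    return tool in ALLOWLISTS.get(source, frozenset())
```

With this shape, a stranger's email can still cause the agent to look at the calendar, but `is_tool_allowed(Source.UNKNOWN_EMAIL, "send_email")` is false regardless of how persuasive the email was.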

Plan scanning

Let’s assume that the root issue is having an opaque agent receive the malicious input and take the malicious action. Can we, using some sort of plan mode, defang a compromised agent?

Same graph as before, but the agent is split in two with a plan being generated. The plan is reviewed by a third agent called reviewer

If the reviewer is good enough and can configure the tool allowlist, we’d be getting to a nice place

The source of an AI security issue is when the model is prompted by the user to do something bad and does it. This can be addressed by requiring the agent to make a plan, having the plan reviewed by another model, then executed with the appropriate tool allowlist. If you can follow this architecture and make the plan format sufficiently formal for the reviewer not to be influenced by the prompt injection (not easy!), I don’t really see what risks are left.

Producing this plan isn’t trivial, but if correlated with external metadata (source of the trigger, model’s “understanding” of why it’s trying to do something), the reviewer can focus solely on “what should such a trigger be allowed to do”, and “is this sequence of actions consistent with the type of requests this input should accept”. Focus the reviewer on What is being done, when prompt injections are generally about convincing a model about Why something should be done.

I think this is a wedge that we can use to solve the problem. The plan/execute mode in Claude Code is clearly a hint in this direction, with the developer as reviewer. Our reviewer thankfully has a simpler job to do since it doesn’t have to figure out whether the agent’s solution is correct. It only has to check whether those actions are generally consistent with the types of requests we’re expected to process autonomously (in case of doubt, ping the user) and restrict the tool allowlist.
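The plan, review, execute flow could look roughly like the sketch below. It is only an illustration: the reviewer is a plain rule check standing in for the reviewer model, and the plan format, the `POLICY` table and the tool names are assumptions, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str    # what is being done
    target: str  # on which resource, e.g. a calendar or an email address


@dataclass(frozen=True)
class Plan:
    source: str              # where the trigger came from (external metadata)
    steps: tuple[Step, ...]  # ordered actions the agent wants to take


# Hypothetical policy: which tools each kind of trigger may cause to run.
POLICY = {
    "owner_chat": {"read_calendar", "create_event", "send_email"},
    "inbound_email": {"read_calendar", "create_event"},
}


def review(plan: Plan) -> tuple[str, frozenset[str]]:
    """Return a verdict and the allowlist the executor should run with.

    The reviewer never sees the original prompt or the agent's justification,
    only the structured What: source, tools and targets.
    """
    permitted = POLICY.get(plan.source, set())
    requested = {step.tool for step in plan.steps}
    if not requested <= permitted:
        # In case of doubt, ping the user instead of executing.
        return "ask_user", frozenset()
    # Restrict execution to exactly the tools the approved plan named.
    return "approve", frozenset(requested)
```

A plan triggered by an inbound email that only reads the calendar and creates an event is approved with a two-tool allowlist; the same trigger asking for `send_email` is escalated to the user, whatever reason the email gave.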

I could be wrong with this, with the main risks I’m seeing being:

  1. The plan having to be so detailed it “leaks” the injection into the reviewer model
  2. The reviewer missing risks in complex tool sequences (write a file with malicious code, then execute it as a side effect of an unrelated tool)
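To make the second risk concrete, here is a naive sequence check, assuming a plan is an ordered list of `(tool, path)` pairs with hypothetical tool names. It catches the direct write-then-execute pattern and nothing more.

```python
def find_write_then_execute(steps: list[tuple[str, str]]) -> list[str]:
    """Return the paths that the plan writes and later executes directly.

    `steps` is an ordered list of (tool, path) pairs; "write_file" and
    "execute" are hypothetical tool names used for illustration.
    """
    written: set[str] = set()
    flagged: list[str] = []
    for tool, path in steps:
        if tool == "write_file":
            written.add(path)
        elif tool == "execute" and path in written and path not in flagged:
            flagged.append(path)
    return flagged
```

A plan that writes `/tmp/a.sh` and then executes it is flagged. A plan that writes `conftest.py` and then calls a `run_tests` tool is not, even though pytest would import that file: this is the side-effect case a reviewer has to understand tool semantics to catch.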

Not sure how big a market opportunity it would be since it needs to be closely co-designed with the agentic system. I think you could pitch your customized reviewer that’s better at filtering & checking plans. Not sure if you can make it better enough and unique enough so that your tricks aren’t integrated in the reviewer embedded in the project.

Let’s sum it up

Unless a massive new use case comes up, here is where I see a need for AI security tools in the corporate world:

  • Control tool use (i.e. data sources models have access to) of public facing models: this could be achieved by sandboxing the tools the agent has access to, possibly depending on where the original prompt is coming from or the user permissions
  • Apply some level of monitoring and rate limiting to model calls (cost optimization, not directly AI safety)
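The monitoring and rate limiting bullet can be as simple as a per-caller spend cap over a sliding window. The sketch below is an assumption about what such a control might look like; `SpendLimiter` and its parameters are hypothetical, and cost is whatever unit you bill in.

```python
import time
from collections import defaultdict


class SpendLimiter:
    """Caps what each caller can spend on model calls per time window."""

    def __init__(self, max_cost: float, window_s: float, clock=time.monotonic):
        self.max_cost = max_cost
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested
        self._events = defaultdict(list)  # caller -> [(timestamp, cost), ...]

    def try_spend(self, caller: str, cost: float) -> bool:
        now = self.clock()
        # Keep only the spend that is still inside the sliding window.
        recent = [
            (t, c) for t, c in self._events[caller] if now - t < self.window_s
        ]
        allowed = sum(c for _, c in recent) + cost <= self.max_cost
        if allowed:
            recent.append((now, cost))
        self._events[caller] = recent
        return allowed
```

Calling `try_spend` before each model call and refusing when it returns false bounds what a single abusive caller can burn, without affecting other callers.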

I don’t believe prompt scanning will be a successful market, or at least I don’t believe it will be an effective control: I worked on WAF-adjacent topics for the last 7 years and prompts are so much more ambiguous.

Autonomous agents like OpenClaw look really cool and face their own threats. However, they’re basically a runtime: a harness to link external stimuli to tools. I don’t believe those can be efficiently protected from the outside: a plan reviewer feels to me like the most robust response because it reviews the model’s actions no matter the reason for those actions. I’m seeing parallels with the RASP technology I worked on in the past that managed to comprehensively solve some classes of security issues.

I’m really hopeful the plan reviewer approach can work and would be curious to see how it behaves in the real world. I think the issues I mentioned with plan structure and complex tool sequences won’t crop up right away and they feel solvable. If you experimented with it, drop me a line on Mastodon!