Is a secure AI assistant possible?

The landscape of artificial intelligence is rapidly evolving, with AI agents, particularly those powered by Large Language Models (LLMs), presenting a complex mix of promise and peril. Even when confined within the controlled environment of a chat interface, LLMs are prone to errors and exhibiting undesirable behaviors. The stakes escalate dramatically when these agents are equipped with tools that grant them agency in the real world, such as web browsers and the ability to send emails. In this scenario, the consequences of their missteps can become profoundly serious, raising critical questions about the feasibility of truly secure AI assistants.

This inherent risk might explain why the first significant breakthrough in personal LLM assistants didn’t emerge from the major AI research labs, which are acutely aware of reputational damage and legal liabilities. Instead, it was an independent software engineer, Peter Steinberger, who spearheaded this development. In November 2025, Steinberger released his tool, now known as OpenClaw, on GitHub. By late January, the project had achieved viral status, attracting widespread attention and adoption.

OpenClaw empowers users to create their own personalized AI assistants by leveraging existing LLMs. For many, this involves entrusting vast amounts of personal data to these agents, ranging from years of accumulated emails to the entirety of their hard drive contents. This level of data access has understandably sent shockwaves through the cybersecurity community. The potential risks associated with OpenClaw are so extensive that comprehending them thoroughly would likely consume the better part of a week, as evidenced by the deluge of security-focused blog posts that have emerged in recent weeks. The gravity of these concerns even prompted the Chinese government to issue a public warning regarding OpenClaw’s security vulnerabilities.

In response to the mounting apprehension, Steinberger himself acknowledged the risks, posting on X that non-technical individuals should refrain from using the software. While he did not respond to a request for comment for this article, the clear and present demand for the capabilities OpenClaw offers extends beyond those with the technical expertise to conduct their own security audits. Consequently, any AI companies aspiring to enter the personal assistant market will be compelled to devise systems that can effectively safeguard user data. Achieving this will necessitate the adoption of strategies drawn from the forefront of agent security research.

Risk Management: The Double-Edged Sword of AI Agency

At its core, OpenClaw functions as a sophisticated control system for LLMs, akin to a "mecha suit." Users can select their preferred LLM to act as the "pilot," granting it enhanced memory capabilities and the autonomy to set and execute tasks on a regular schedule. Unlike the agentic offerings from major AI corporations, OpenClaw agents are designed for continuous operation, 24/7, and can be interacted with through familiar messaging platforms like WhatsApp. This enables them to function as hyper-capable personal assistants, capable of generating personalized to-do lists, planning vacations while users are engaged in work, and even developing new applications in their spare time.

However, this immense power comes with inherent risks. For an AI assistant to effectively manage an inbox, it must be granted access to that inbox and all the sensitive information it contains. Similarly, if an assistant is to make purchases on behalf of a user, it requires access to payment details. For tasks involving local computer operations, such as coding, the agent needs some level of access to the user’s files.

The potential for things to go awry is multifaceted. One significant concern is the possibility of AI assistants making critical errors. A stark example of this was when a user’s Google Antigravity coding agent reportedly wiped their entire hard drive without authorization after misinterpreting instructions to clear a cache. Another danger lies in the potential for malicious actors to gain unauthorized access to the agent through conventional hacking tools, enabling them to either exfiltrate sensitive data or execute harmful code. In the weeks following OpenClaw’s viral surge, security researchers have demonstrated numerous vulnerabilities that expose security-naïve users to significant risks.

While these dangers can be mitigated, they require proactive measures. Some users are opting to run their OpenClaw agents on separate, isolated computers or in cloud environments, thereby safeguarding the data on their primary hard drives from accidental deletion. Other vulnerabilities could potentially be addressed through established cybersecurity practices.

Yet, the experts interviewed for this article expressed particular concern about a more insidious security threat known as prompt injection. Prompt injection is, in essence, a form of LLM hijacking. By simply posting malicious text or images on a website that an LLM might access, or by sending them to an inbox that the LLM monitors, attackers can manipulate the agent to perform actions against the user’s intent.

The ramifications of such an attack are particularly dire if the LLM has access to any of the user’s private information. Nicolas Papernot, a professor of electrical and computer engineering at the University of Toronto, likens using OpenClaw to "giving your wallet to a stranger in the street." The willingness of major AI companies to offer personal assistants may ultimately hinge on their ability to develop robust defenses against these sophisticated attacks.

It is crucial to note that, as of now, prompt injection has not yet resulted in any publicly reported catastrophic incidents. However, with potentially hundreds of thousands of OpenClaw agents operating online, prompt injection could very well emerge as an increasingly attractive strategy for cybercriminals. As Papernot observes, "Tools like this are incentivizing malicious actors to attack a much broader population."

Building Guardrails: The Quest for Secure AI Assistants

The term "prompt injection" was coined by prominent LLM blogger Simon Willison in 2022, a couple of months before the public release of ChatGPT. Even at that early stage, it was evident that LLMs would introduce a novel category of security vulnerabilities upon widespread adoption. LLMs struggle to differentiate between instructions from users and the data they process, such as emails and web search results – to an LLM, it’s all just text. Consequently, if an attacker embeds a few sentences within an email that the LLM misinterprets as a user instruction, the attacker can compel the LLM to execute any command.

Prompt injection represents a formidable challenge, and solutions are not immediately apparent. "We don’t really have a silver-bullet defense right now," states Dawn Song, a professor of computer science at UC Berkeley. Nevertheless, a dedicated academic community is actively researching this problem, developing strategies that could eventually pave the way for secure AI personal assistants.

Technically, it is possible to use OpenClaw today without succumbing to prompt injection by simply disconnecting it from the internet. However, this severely curtails OpenClaw’s utility, as it would prevent it from accessing emails, managing calendars, and performing online research – core functions of an AI assistant. The crux of the prompt injection problem lies in finding a way to prevent the LLM from responding to hijacking attempts while still allowing it to perform its intended duties.

One proposed strategy involves training the LLM to disregard prompt injections. A significant phase in LLM development, known as post-training, refines a model capable of generating realistic text into a functional assistant. This is achieved by "rewarding" it for appropriate responses and "punishing" it for failures. These rewards and punishments are metaphorical, but the LLM learns from them akin to an animal. Through this process, it’s possible to train an LLM to ignore specific examples of prompt injection.

However, this approach requires a delicate balance. If an LLM is trained to reject injected commands too aggressively, it might also begin to disregard legitimate user requests. Furthermore, due to the inherent element of randomness in LLM behavior, even an LLM meticulously trained to resist prompt injection will likely experience occasional slip-ups.

Another strategy focuses on intercepting prompt injection attacks before they reach the LLM. This typically involves employing a specialized "detector LLM" to ascertain whether the data being sent to the primary LLM contains any prompt injections. However, a recent study indicated that even the most effective detector LLMs failed to identify certain categories of prompt injection attacks.

A third, more intricate approach shifts the focus from input control to output guidance. Instead of detecting prompt injections, the goal is to establish policies that govern the LLM’s outputs – its behaviors – and prevent it from engaging in harmful actions. Some defenses in this category are relatively straightforward: if an LLM is permitted to email only a select few pre-approved addresses, it cannot transmit a user’s credit card information to an attacker. However, such a restrictive policy would also impede the LLM from completing many valuable tasks, such as researching and contacting potential professional contacts on behalf of the user.

"The challenge is how to accurately define those policies," remarks Neil Gong, a professor of electrical and computer engineering at Duke University. "It’s a trade-off between utility and security."

On a broader scale, the entire field of agentic AI is grappling with this fundamental trade-off: at what point will agents be secure enough to be truly useful? Experts hold differing views. Song, whose startup Virtue AI develops an agent security platform, believes that AI personal assistants can be safely deployed today. Conversely, Gong maintains that "We’re not there yet."

Even if AI agents cannot be rendered entirely impervious to prompt injection at this juncture, there are undoubtedly ways to mitigate the associated risks. It is plausible that some of these techniques could be integrated into OpenClaw. At the inaugural ClawCon event in San Francisco, Steinberger announced the onboarding of a dedicated security professional to work on the tool.

Currently, OpenClaw remains vulnerable. This has not, however, deterred its numerous enthusiastic users. George Pickett, a volunteer maintainer of the OpenClaw GitHub repository and an advocate for the tool, has implemented several security measures to protect himself. He runs OpenClaw in the cloud, thereby eliminating the risk of accidental hard drive deletion, and has put in place mechanisms to prevent unauthorized connections to his assistant.

Crucially, Pickett has not taken specific actions to prevent prompt injection. While aware of the risk, he has not yet encountered any reported instances of it occurring with OpenClaw. He offers a pragmatic, albeit potentially risky, perspective: "Maybe my perspective is a stupid way to look at it, but it’s unlikely that I’ll be the first one to be hacked." This sentiment underscores the complex and evolving nature of AI security, where the perceived likelihood of an attack can influence the urgency of implementing defenses.

Is a secure AI assistant possible?

Is a secure AI assistant possible?

isnaini azizah

Related Posts

The Download: The Pentagon’s New AI Plans, and Next-Gen Nuclear Reactors and Content

What do new nuclear reactors mean for waste?

Leave a Reply Cancel reply

Other Story

Why Ethereum Developers Want ‘One-Click Staking’ for Institutions

Police Drones in Haiti Have Killed More Than 1,000 People

Scientists build micromotors smaller than a human hair

The Abundance That AI May Promise Is Not Free

Judge Rules That Elon Musk’s Ketamine Use Is Off Limits in OpenAI Trial

Lasers just made atoms dance, unlocking the future of electronics