New 'benevolent hacking' method prevents AI from complying with rogue prompts

2025-09-07 10:35:00

By Bojan Stojkovski

The rush for efficiency is leading to models that are more vulnerable to producing dangerous content.

Open-source AI systems can return harmful results when run on lower-powered devices. Credit: EurekAlert

AI is steadily moving off giant cloud servers and into everyday devices like smartphones, cars, and household gadgets. To make that possible, models are often pared down to conserve energy and processing power. 

The problem is that what gets cut isn’t always cosmetic, and sometimes the very safeguards designed to block harmful outputs, such as hate speech or criminal instructions, are weakened or lost.

Open-source models amplify this risk – they can be freely downloaded, altered, and run offline, enabling rapid innovation but also removing layers of oversight. Without the monitoring and guardrails that proprietary systems rely on, stripped-down versions become more exposed to tampering and potential misuse, raising questions about how to balance accessibility with safety.

Efficiency tradeoffs put open-source AI at risk of misuse

Researchers at the University of California, Riverside, found that the very layers meant to block harmful outputs – like pornography or step-by-step weapon guides – are often the first to be cut in the name of efficiency. These stripped-down versions may run faster and consume less memory, but they also carry higher risks.
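
To make the trade-off concrete, here is a minimal, hypothetical sketch of how transformer layers can be dropped to shrink a model for a low-power device. The class and function names (TinyTransformer, prune_layers) are illustrative and are not taken from the study.

```python
# Hypothetical illustration only; this is not code from the UC Riverside study.
# It shows how a transformer can be shrunk by discarding layers, and why any
# behaviour that lived in the discarded layers disappears with them.
import torch.nn as nn

class TinyTransformer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def prune_layers(model, keep_every=2):
    """Keep every k-th layer and drop the rest to save memory and compute."""
    kept = [layer for i, layer in enumerate(model.layers) if i % keep_every == 0]
    model.layers = nn.ModuleList(kept)
    return model

full = TinyTransformer()
print(sum(p.numel() for p in full.parameters()))   # parameter count at full depth
slim = prune_layers(full)                          # half the layers remain
print(sum(p.numel() for p in slim.parameters()))   # roughly half the parameters
```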

Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study, explained that some of these dropped layers are critical to preventing unsafe outputs. Without them, the model may start answering questions it should never touch.

To tackle the problem, the researchers redesigned the AI from the inside out. Rather than relying on add-on filters or quick software fixes, they retrained the model’s core structure so it could still recognize and block dangerous prompts, even after being stripped down for smaller devices. This approach reshapes how the model interprets risky content at its foundation, ensuring safeguards remain intact even when efficiency demands that layers be removed.
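
The article does not give implementation details, but one way to picture the idea is to reinforce refusal behaviour while randomly truncating the depth of the forward pass during fine-tuning, so that safety is not concentrated in the final layers. The sketch below is a speculative illustration under that assumption; `model`, `lm_loss`, and the other names are placeholders rather than the researchers' code.

```python
# Speculative sketch: fine-tuning so safe refusals survive layer removal.
# Running the forward pass at random truncated depths is an assumption made
# for illustration; it is not the published method. `model.embed`,
# `model.layers`, and `model.head` are placeholder components.
import random

def truncated_forward(model, tokens, depth):
    """Run only the first `depth` transformer layers, then the output head."""
    h = model.embed(tokens)
    for layer in model.layers[:depth]:
        h = layer(h)
    return model.head(h)

def safety_finetune_step(model, optimizer, lm_loss, tokens, refusal_targets):
    # Sample a depth so refusal behaviour is reinforced in shallow sub-models too.
    depth = random.randint(len(model.layers) // 2, len(model.layers))
    logits = truncated_forward(model, tokens, depth)
    loss = lm_loss(logits, refusal_targets)  # target text: a safe refusal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```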

Retrained models reject dangerous prompts 

The researchers set out to ensure that AI models maintain safe behavior even after being reduced in size. To test their approach, they used LLaVA 1.5, a vision-language model that processes both text and images. Their experiments showed that certain combinations – like a benign image paired with a harmful question – could slip past the model’s safety filters. In one case, the trimmed-down model produced step-by-step instructions for building a bomb.

After retraining, the AI model consistently rejected harmful queries, even when operating with only a fraction of its original structure. Instead of relying on filters or add-on guardrails, the researchers reshaped the model’s internal understanding, ensuring it behaved safely by default – even when slimmed down for low-power devices.

The researchers call their approach a form of benevolent hacking that helps strengthen AI systems before weaknesses can be exploited. Graduate students Saketh Bachu and Erfan Shayegani aim to push the method further, developing techniques that embed safety into every internal layer. By doing so, they hope to make AI models more resilient and dependable when deployed in real-world conditions.

Meanwhile, Roy-Chowdhury notes that although much work remains, the research represents a concrete step toward developing AI that is both open to innovation and responsibly designed.

ABOUT THE AUTHOR

Bojan Stojkovski is a freelance journalist based in Skopje, North Macedonia, who has covered foreign policy and technology for more than a decade. His work has appeared in Foreign Policy, ZDNet, and Nature.

Summary

The pursuit of efficiency in AI development is leading to models that are more vulnerable to producing harmful content such as hate speech or instructions for criminal activities. As AI moves from cloud servers to everyday devices, models are often downsized, which can remove critical safeguards against dangerous outputs. Researchers at the University of California, Riverside have developed a method to retrain these models so they maintain safety even when stripped down for efficiency on smaller devices. This approach involves restructuring the model's core to ensure it continues to reject harmful prompts, thus balancing accessibility with safety in open-source AI systems.