Generative AI is making pen-test vulnerability remediation much worse

Organizations already struggle to fix flaws discovered during penetration testing. Gen AI apps bring added complexity and the need for greater expertise.

Technical, organizational, and cultural factors are preventing enterprises from resolving vulnerabilities uncovered in penetration tests — a problem the advent of generative AI is exacerbating rather than relieving.

According to a study by penetration testing as a service firm Cobalt, organizations fix less than half of all exploitable vulnerabilities (48%), a figure that drops to 21% for flagged gen AI app flaws.

Vulnerabilities identified in security audits that were rated either high or critical severity are more likely to be fixed, scoring a resolution rate of 69%.

Since 2017, the median time to resolve serious vulnerabilities has decreased dramatically — from 112 days down to 37 days last year. This demonstrates the positive impact of “shift left” security programs, according to Cobalt.

Patching headaches

Sometimes organizations make a conscious business decision to accept certain risks rather than disrupt operations or incur the significant costs that come with resolving some vulnerabilities.

Poor remediation planning and resource limitations also play a factor in slow patching. In some cases, vulnerabilities are found in legacy software or hardware that cannot be easily updated or replaced.

“Some organizations do only what they’re required to do for compliance or third-party approval — get a pentest,” Cobalt’s researchers wrote. “Remediating risk is of less immediate concern. For the most part, though, it comes down to a host of organizational issues spanning people, processes, and technology.”

Next gen-AI-eration

The latest annual edition of Cobalt’s State of Pentesting Report found that most firms have performed pen testing on large language model (LLM) web apps, with a third (32%) of tests finding vulnerabilities warranting a serious rating.

A variety of LLM flaws, including prompt injection, model manipulation, and data leakage, were identified with only 21% of flaws getting fixed. AI development is “racing ahead without a safety net,” Cobalt warns.

The figures are based on an analysis of data collected during more than 5,000 pen tests run by Cobalt. In a related survey of its customers, more than half of security leaders (52%) said they were under pressure to prioritize speed over security.

Vulnerabilities ‘flagged but not fixed’

Independent security experts told CSO that Cobalt’s findings line up with what they are witnessing in the arena of bug remediation.

“Most organizations are still too slow to address known vulnerabilities, and it’s rarely down to a lack of awareness,” James Lei, veteran engineering executive turned chief operating officer at legal services firm Sparrow, told CSO. “The vulnerabilities are being flagged — but they’re not being fixed.”

Vulnerability mitigation is getting delayed because businesses face competing priorities.

“Security teams are overstretched, engineering teams are focused on shipping features, and unless there’s regulatory pressure or a breach, fixing a ‘known issue’ just doesn’t get the same attention,” Lei said.

Gen AI apps, in particular, introduce a different set of problems that complicate vulnerability remediation.

“A lot of them are built quickly, using new frameworks and third-party tools that haven’t been fully tested in production environments,” Lei said. “You’ve got unfamiliar attack surfaces, models that behave unpredictably, and dependencies that teams don’t fully control.”

Lei added: “So even when vulnerabilities are found, resolving them can be complex and time-consuming — assuming you even have the in-house expertise.”

A generative AI app has two components: the app and the gen AI itself, typically an LLM, such as ChatGPT.

“The traditional application vulnerabilities are as easy to fix as normal vulnerabilities; there is no difference,” said Inti De Ceukelaire, chief hacker officer at bug bounty platform Intigriti.

For example, a gen AI app may decide to use a programmed functionality to look up certain documents. If there is a vulnerability in that programmed functionality, developers can simply change the code.

By contrast, a vulnerability in the LLM itself (the neural network or “brain” of the AI) is “much harder to fix as it is not always easy to understand why certain behavior is triggered,” De Ceukelaire said.

“One may make assumption and train or adjust the model to avoid this behavior, but you cannot be 100% certain that the issue is resolved,” he said. “In that sense, comparing it with traditional ‘patching’ is perhaps a bit of a stretch.”

When asked about by Intigriti’s comments, Cobalt said its gen AI-related work and findings were primarily focused on “validating the integrity of LLM-supported systems, not evaluating the entire breadth of the LLM’s trained behavior or output”.

Bug triage

If CISOs want to improve remediation rates, they need to make it easier for teams to prioritize security fixes. That might mean integrating security tooling earlier in the development process or setting performance measures around resolution time for serious findings.

“It also means having clear ownership — someone who’s accountable for making sure vulnerabilities actually get fixed, not just filed,” Sparrow’s Lei said.

Other experts argued security professionals should concentrate their limited resources on the riskiest classes of vulnerabilities, such as serious vulnerabilities exposed directly to the internet.

Accidental exposures and reducing technical debt should also be prioritized, according to Tod Beardsley, VP of security research at exposure management tools vendor runZero.

“A good penetration test will help CISOs identify those areas where criminals are likely to thrive, rather than simply list out a set of critical vulnerabilities without context,” Beardsley told CSO.

Security teams can easily become overwhelmed by the number of vulnerabilities to remediate from sources including regular penetration tests together with the results of vulnerability scanning tools.

“It is information overload, and teams do struggle to manage it all and prioritize remediation based on the severity of risk,” said Thomas Richards, infrastructure security practice director at application security testing firm Black Duck.

Much like runZero’s Beardsley, Richards argued that the results of pen tests needs to be viewed in the correct context.

“When given a report after a penetration test, internal security teams will review the report to determine its accuracy and what actions to take next,” Richards said. “This step does take time but allows organizations to prioritize remediating the highest risks first.”

Results from vulnerability scanning tools need to be treated with still greater caution.

“We often find with our automated tooling that the default severity from the output isn’t always accurate given other factors such as an exploit being available, network accessibility, and other remediation that reduce the risk of the vulnerability,” Richards explained. “Oftentimes, the issue is patched, even on critical systems.”

SUBSCRIBE TO OUR NEWSLETTER

From our editors straight to your inbox

Get started by entering your email address below.

Generative AI is making pen-test vulnerability remediation much worse

Organizations already struggle to fix flaws discovered during penetration testing. Gen AI apps bring added complexity and the need for greater expertise.

Patching headaches

Next gen-AI-eration

Vulnerabilities ‘flagged but not fixed’

Bug triage

From our editors straight to your inbox

关于《Generative AI is making pen-test vulnerability remediation much worse》的评论

发表评论

摘要

相关新闻

相关讨论