Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples stayed constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved over 80 percent attack success across dataset sizes spanning two orders of magnitude.
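To make the experimental setup concrete, here is a minimal sketch of how one might mix a fixed number of malicious examples into clean fine-tuning sets of very different sizes. Everything in it (the field names, the `build_finetune_dataset` helper, the idea of a single "poisoned" flag) is a hypothetical illustration of the fixed-poison-count design described above, not Anthropic's actual code or data format.

```python
import random

def build_finetune_dataset(clean_samples, poison_samples, num_poison):
    """Mix a fixed number of malicious examples into a clean fine-tuning set.

    Illustrative only: the structure of each sample is an assumption,
    not the researchers' actual setup.
    """
    mixed = list(clean_samples) + list(poison_samples[:num_poison])
    random.shuffle(mixed)
    return mixed

# The key point of the experiment: the poison count stays constant
# (on the order of 50-90 samples) while the clean set grows from
# 1,000 to 100,000 examples, i.e. two orders of magnitude.
clean_small = [{"text": f"clean example {i}", "poisoned": False} for i in range(1_000)]
clean_large = [{"text": f"clean example {i}", "poisoned": False} for i in range(100_000)]
poison = [{"text": f"malicious example {i}", "poisoned": True} for i in range(90)]

for clean in (clean_small, clean_large):
    dataset = build_finetune_dataset(clean, poison, num_poison=90)
    fraction = sum(d["poisoned"] for d in dataset) / len(dataset)
    print(f"dataset size: {len(dataset):>7}  poison fraction: {fraction:.4%}")
```

Running this shows the poison fraction dropping from roughly 8 percent to under 0.1 percent even though the absolute number of malicious samples never changes, which is the contrast the finding turns on.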
Limitations
While it may seem alarming at first that LLMs can be compromised in this way, the findings apply only to the specific scenarios tested by the researchers and come with important caveats.
“It remains unclear how far this trend will hold as we keep scaling up models,” Anthropic wrote in its blog post. “It is also unclear if the same dynamics we observed