Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples stayed constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved over 80 percent attack success across dataset sizes spanning two orders of magnitude.
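To make the experimental setup concrete, here is a minimal sketch of how one might mix a fixed number of malicious examples into clean fine-tuning sets of very different sizes. Everything in it (the field names, the `build_finetune_dataset` helper, the idea of a single "poisoned" flag) is a hypothetical illustration of the fixed-poison-count design described above, not Anthropic's actual code or data format.

```python
import random

def build_finetune_dataset(clean_samples, poison_samples, num_poison):
    """Mix a fixed number of malicious examples into a clean fine-tuning set.

    Illustrative only: the structure of each sample is an assumption,
    not the researchers' actual setup.
    """
    mixed = list(clean_samples) + list(poison_samples[:num_poison])
    random.shuffle(mixed)
    return mixed

# The key point of the experiment: the poison count stays constant
# (on the order of 50-90 samples) while the clean set grows from
# 1,000 to 100,000 examples, i.e. two orders of magnitude.
clean_small = [{"text": f"clean example {i}", "poisoned": False} for i in range(1_000)]
clean_large = [{"text": f"clean example {i}", "poisoned": False} for i in range(100_000)]
poison = [{"text": f"malicious example {i}", "poisoned": True} for i in range(90)]

for clean in (clean_small, clean_large):
    dataset = build_finetune_dataset(clean, poison, num_poison=90)
    fraction = sum(d["poisoned"] for d in dataset) / len(dataset)
    print(f"dataset size: {len(dataset):>7}  poison fraction: {fraction:.4%}")
```

Running this shows the poison fraction dropping from roughly 8 percent to under 0.1 percent even though the absolute number of malicious samples never changes, which is the contrast the finding turns on.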
Limitations
While it may seem alarming at first that LLMs can be compromised in this way, the findings apply only to the specific scenarios tested by the researchers and come with important caveats.
“It remains unclear how far this trend will hold as we keep scaling up models,” Anthropic wrote in its blog post. “It is also unclear if the same dynamics we observed