NewScientist’s (24 November 2023, paywall) Matthew Sparkes has the report:
AI [artificial intelligence] models can trick each other into disobeying their creators and providing banned instructions for making methamphetamine, building a bomb or laundering money, suggesting that the problem of preventing such AI “jailbreaks” is more difficult than it seems. …
Now, Arush Tagade at Leap Laboratories and his colleagues have gone one step further by streamlining the process of discovering jailbreaks. They found that they could simply instruct, in plain English, one LLM to convince other models, such as GPT-4 and Anthropic’s Claude 2, to adopt a persona that is able to answer questions the base model has been programmed to refuse. This process, which the team calls “persona modulation”, involves the models conversing back and forth with humans in the loop to analyse these responses.
It would be interesting to see a few transcripts of such attacks, or a summary characterization of such attacks in order to understand the strategy.
Something I’ve not seen mentioned in the popular press, which is all I have to go on, is an analog to the brain exhaustion/regeneration cycle, and how it may play into human intelligence and whether it has application in AI.
Just a thought.