Microsoft researchers recently uncovered a new form of “jailbreak” attack they’re calling a “Skeleton Key” that’s capable of removing the protections that keep generative artificial intelligence (AI) systems from outputting dangerous and sensitive data.
According to a Microsoft Security blog post, the Skeleton Key attack works by simply prompting a generative AI model with text asking it to augment its encoded security features.
Skeleton Key
In one example given by the researchers, an AI model is asked to generate a recipe for a “Molotov Cocktail” — a simple firebomb popularized during World War II — and the model refused, citing safety guidelines.
The Skeleton Key, in this case, was simply telling the model that the user was an expert in a laboratory setting. The model then acknowledged that it was augmenting its behavior and subsequently outputted what appeared to be a workable Molotov Cocktail recipe.
While the danger here might be mitigated by the fact that similar ideas can be found through most search engines, there is one area where this form of attack could be catastrophic: data containing personally identifiable and financial information.
According to Microsoft, the Skeleton Key attack works on most popular generative AI models including GPT-3.5, GPT-4o, Claude 3, Gemini Pro, and Meta Llama-3 70B.
Attack and Defense
Large language models such as Google’s Gemini, Microsoft’s CoPilot, and OpenAI’s ChatGPT are trained on data troves often described as “internet sized.” While that may be an exaggeration, the fact remains that many models contain trillions of data points encompassing entire social media networks and information depository sites such as Wikipedia.
The possibility that personally identifiable information such as names connected to phone numbers, addresses, and account numbers exists within a given large language model’s dataset is only constrained by how selective the engineers who trained it were with the data they chose.
Furthermore, any business, agency, or institution spinning up its own AI models, or adapting enterprise models for commercial/organizational use are also at the mercy of their base model’s training dataset. If, for example, a bank connected a chatbot to its customer’s private data and relied on existing security measures to prevent the model from outputting PID and private financial data, then it’s possible that a Skeleton Key attack could trick some AI systems into sharing sensitive data.
According to Microsoft there are several steps organizations can take to prevent this from happening. These include hard coded input/output filtering and secure monitoring systems to prevent advanced prompt engineering beyond the system’s safety threshold.
Related: US presidential debate inexplicably omits AI and quantum