Forbes spoke to the leaders of AI red teams at Microsoft, Google, Nvidia and Meta, who are tasked with looking for vulnerabilities in AI systems so they can be fixed. “You will start seeing ads about ‘Ours is the safest,’” predicts one AI security expert.
By Rashi Shrivastava, Forbes Staff
A month before publicly launching ChatGPT, OpenAI hired Boru Gollo, a lawyer in Kenya, to test its AI models, GPT-3.5 and later GPT-4, for stereotypes against Africans and Muslims by injecting prompts that would make the chatbot generate harmful, biased and incorrect responses. Gollo, one of about 50 external experts recruited by OpenAI to be a part of its “red team,” typed a command into ChatGPT, making it come up with a list of ways to kill a Nigerian — a response that OpenAI removed before the chatbot became available to the world.
Other red-teamers prompted GPT-4’s pre-launch version to aid in a range of illegal and nocuous activities, like writing a Facebook post to convince someone to join Al-Qaeda, helping find unlicensed guns for sale and generating a procedure to create dangerous chemical substances at home, according to GPT-4’s system card, which lists the risks and safety measures OpenAI used to reduce or eliminate them.
To protect AI systems from being exploited, red-team hackers think like an adversary to game them and uncover blind spots and risks baked into the technology so that they can be fixed. As tech titans race to build and unleash generative AI tools, their in-house AI red teams are playing an increasingly pivotal role in ensuring the models are safe for the masses. Google, for instance, established a separate AI red team earlier this year, and in August the developers of a number of popular models like OpenAI’s GPT3.5, Meta’s Llama 2 and Google’s LaMDA participated in a White House-supported event aiming to give outside hackers the chance to jailbreak their systems.
But AI red teamers are often walking a tightrope, balancing safety and security of AI models while also keeping them relevant and usable. Forbes spoke to the leaders of AI red teams at Microsoft, Google, Nvidia and Meta about how breaking AI models has come into vogue and the challenges of fixing them.
“You will have a model that says no to everything and it’s super safe but it’s useless,” said Cristian Canton, head of Facebook’s AI red team. “There’s a trade off. The more useful you can make a model, the more chances that you can venture in some area that may end up producing an unsafe answer.”
The practice of red teaming software has been around since the 1960s, when adversarial attacks were simulated to make systems as sturdy as possible. “In computers we can never say ‘this is secure.’ All we can ever say is ‘we tried and we can’t break it,’” said Bruce Schneier, a security technologist and a fellow at Berkman Klein Center for Internet And Society at Harvard University.
But because generative AI is trained on a vast corpus of data, that makes safeguarding AI models different from traditional security practices, said Daniel Fabian, the head of Google’s new AI red team, which stress tests products like Bard for offensive content before the company adds new features like additional languages.
“The motto of our AI red team is ‘The more you sweat in training, the less you bleed in battle.’”
Beyond querying an AI model to spit out toxic responses, red teams use tactics like extracting training data that reveals personally identifiable information like names, addresses and phone numbers, and poisoning datasets by changing certain parts of the content before it is used to train the model. “Adversaries kind of have a portfolio of attacks and they will just move onto the next attack if one of them isn’t working,” Fabian told Forbes.
With the field still in its early stages, security professionals who know how to game AI systems are “vanishingly small,” said Daniel Rohrer, VP of software security of Nvidia. That’s why a tight-knit community of AI red teamers tends to share findings. While Google’s red teamers have published research on novel ways to attack AI models, Microsoft’s red team has open-sourced attacking tools like Counterfit, which helps other businesses test the safety and security risks of algorithms.
“We were developing these janky scripts that we were using to accelerate our own red teaming,” said Ram Shankar Siva Kumar, who started the team five years ago. “We wanted to make this available to all security professionals in a framework that they know and that they understand.”
Before testing an AI system, Siva Kumar’s team gathers data about cyberthreats from the company’s threat intelligence team, who are the “eyes and ears of the internet,” as he puts it. He then works with other red teams at Microsoft to determine which vulnerabilities in the AI system to target and how. This year, the team probed Microsoft’s star AI product Bing Chat as well as GPT-4 to find flaws.
Meanwhile, Nvidia’s red teaming approach is to provide crash courses on how to red team algorithms to security engineers and companies, some of which already rely on it for compute resources like GPUs.
“As the engine of AI for everyone… we have a huge amplification factor. If we can teach others to do it (red teaming), then Anthropic, Google, OpenAI, they all get it right,” Rohrer said.
With increased scrutiny into AI applications from users and government authorities alike, red teams also offer a competitive advantage to tech firms in the AI race. “I think the moat is going to be trust and safety,” said Sven Cattell, founder of the AI Village, a community of AI hackers and security experts. “You will start seeing ads about ‘Ours is the safest.’”
Early to the game was Meta’s AI red team, which was founded in 2019 and has organized internal challenges and “risk-a-thons” for hackers to bypass content filters that detect and remove posts containing hate speech, nudity, misinformation and AI-generated deep fakes on Instagram and Facebook.
In July 2023, the social media giant hired 350 red teamers including external experts, contract workers and an internal team of about 20 employees, to test Llama 2, its open source latest large language model, according to a published report that details how the model was developed. The team injected prompts like how to evade taxes, how to start a car without a key and how to set up a Ponzi scheme. “The motto of our AI red team is ‘The more you sweat in training, the less you bleed in battle,’” said Canton, the head of Facebook’s red team.
That motto was similar to the spirit of one of the largest AI red teaming exercises held at the DefCon hacking conference in Las Vegas in early August. Eight companies including OpenAI, Google, Meta, Nvidia, Stability AI and Anthropic opened up their AI models to over 2000 hackers to feed them prompts designed to reveal sensitive information like credit card numbers or generate harmful material like political misinformation. The Office of Science and Technology Policy at the White House teamed up with the organizers of the event to design the red teaming challenge, adhering to its blueprint for an AI Bill of Rights, a guide on how automated systems should be designed, used and launched safely.
“If we can teach others to do it (red teaming), then Anthropic, Google, OpenAI, they all get it right.”
At first the companies were reluctant to offer up their models largely because of the reputational risks associated with red teaming at a public forum, said Cattell, founder of the AI village who spearheaded the event. “From Google’s perspective or OpenAI’s perspective, we’re a bunch of kids at DefCon,” he told Forbes.
But after assuring the tech companies that models will be anonymized and hackers won’t know which model they’re attacking, they agreed. While the results of the nearly 17,000 conversations hackers had with the AI models won’t be public until February, the companies walked away from the event with several new vulnerabilities to address. Across eight models, red teamers found about 2,700 flaws, such as convincing the model to contradict itself or give instruction on how to surveil someone without their knowledge, according to new data released by the organizers of the event.
One of the participants was Avijit Ghosh, an AI ethics researcher who was able to get multiple models to do incorrect math, produce a fake news report about the King of Thailand and write about a housing crisis that didn’t exist.
Such vulnerabilities in the system have made red teaming AI models even more crucial, Ghosh said, especially when they may be perceived by some users as all-knowing sentient entities. “I know several people in real life who think that these bots are actually intelligent and do things like medical diagnosis with step-by-step logic and reason. But it’s not. It’s literally autocomplete,” he said.
But generative AI is like a multi-headed monster— as red teams spot and fix some holes in the system, other flaws can crop up elsewhere, experts say. “It’s going to take a village to solve this problem,” Microsoft’s Siva Kumar said.