OpenAI asked a judge to dismiss parts of The New York Times’ lawsuit against it, alleging that the media company “paid someone to hack OpenAI’s products,” such as ChatGPT, to generate 100 examples of copyright infringement for its case.
In a filing Monday in Manhattan federal court, OpenAI alleged it took the Times “tens of thousands of attempts to generate the highly anomalous results,” and that the publisher did so using “deceptive prompts that blatantly violate OpenAI’s terms of use.”
“Normal people do not use OpenAI’s products in this way,” OpenAI wrote in the filing.
The “hacking” that OpenAI alleges in the filing could also be called prompt engineering or “red-teaming,” a practice that artificial intelligence trust and safety teams, ethicists, academics and tech companies commonly use to “stress-test” AI systems for vulnerabilities. It is also a popular way to alert companies to issues within their systems, similar to how cybersecurity professionals stress-test companies’ websites for weaknesses.
“In this filing, OpenAI doesn’t dispute — nor can they — that they copied millions of The Times’s works to build and power its commercial products without our permission,” Ian Crosby, Susman Godfrey partner and lead counsel for the Times, said in a statement to CNBC.
He added, “What OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’s copyrighted works. And that is exactly what we found. In fact, the scale of OpenAI’s copying is much larger than the 100-plus examples set forth in the complaint.”
The filing comes as a broader battle heats up between OpenAI and publishers, authors and artists over using copyrighted material for AI training data, including the high-profile Times lawsuit, which some see as a watershed moment for the industry. The news outlet’s lawsuit, filed in December, seeks to hold Microsoft and OpenAI accountable for billions of dollars in damages.
In the past, OpenAI has said it’s “impossible” to train top AI models without copyrighted works.
“Because copyright today covers virtually every sort of human expression—including blog posts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI wrote in a filing last month in the U.K., in response to an inquiry from the U.K. House of Lords.
“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens,” OpenAI continued in the filing.
As recently as last month, in Davos, Switzerland, OpenAI CEO Sam Altman said he was “surprised” by the Times’ lawsuit, saying OpenAI’s models didn’t need to train on the publisher’s data.
“We actually don’t need to train on their data,” Altman said at an event organized by Bloomberg in Davos. “I think this is something that people don’t understand. Any one particular training source, it doesn’t move the needle for us that much.”
Although one publisher may not make a difference in ChatGPT’s operating abilities, OpenAI’s filing suggests that a decision by many publishers to opt out may have an effect. In recent months, the company began courting publishers to allow content to be used for training data.
The company has already struck deals with Axel Springer, the German media conglomerate that owns Business Insider, Morning Brew and other outlets, and is also reportedly in talks with CNN, Fox Corp. and Time to license their work.
“We expect our ongoing negotiations with others to yield additional partnerships soon,” OpenAI wrote in the filing.
In the filing and its blog posts, OpenAI has highlighted its opt-out process for publishers, which allows outlets to prohibit the company’s web crawler from accessing their websites. But in the filing, OpenAI says the content is vital to training today’s AI models.
“While we look forward to continuing to develop additional mechanisms to empower rightsholders to opt-out of training, we are actively engaged with them to find mutually beneficial arrangements to gain access to materials that are otherwise inaccessible, and also to display content in ways that go beyond what copyright law otherwise allows,” the company wrote.
— CNBC’s Ryan Browne contributed to this report.