If you’ve ever uploaded photos or art, written a review, “liked” content, answered a question on Reddit, contributed to open-source code, or done any number of other activities online, you’ve done free work for tech companies, because downloading all this content from the web is how their AI systems learn about the world.
Tech companies know this, but they mask your contributions to their products with technical terms like “training data,” “unsupervised learning,” and “data exhaust” (and, of course, impenetrable “Terms of Use” documents). In fact, much of the innovation in AI over the past few years has been in ways to use more and more of your content for free. This is true for search engines like Google, social media sites like Instagram, AI research startups like OpenAI, and many other providers of intelligent technologies.
This exploitative dynamic is particularly damaging when it comes to the new wave of generative AI programs like DALL-E and ChatGPT. Without your content, ChatGPT and all of its ilk simply would not exist. Many AI researchers think that your content actually matters more to these systems’ capabilities than the algorithms computer scientists design. Yet these intelligent technologies that exploit your labor are the very same technologies that are threatening to put you out of a job. It’s as if the AI system were going into your factory and stealing your machine.
But this dynamic also means that the users who generate data have a lot of power. Discussions over the use of sophisticated AI technologies often come from a place of powerlessness and the stance that AI companies will do what they want, and there’s little the public can do to shift the technology in a different direction. We are AI researchers, and our research suggests the public has a tremendous amount of “data leverage” that can be used to create an AI ecosystem that both generates amazing new technologies and shares the benefits of those technologies fairly with the people who created them.
Data leverage can be deployed through at least four avenues: direct action (for instance, individuals banding together to withhold, “poison,” or redirect data), regulatory action (for instance, pushing for data protection policy and legal recognition of “data coalitions”), legal action (for instance, communities adopting new data-licensing regimes or pursuing a lawsuit), and market action (for instance, demanding large language models be trained only with data from consenting creators).
Let’s start with direct action, which is a particularly exciting route because it can be taken immediately. Because generative AI systems rely on web scraping, website owners could significantly disrupt the training-data pipeline by disallowing or limiting scraping in their robots.txt file (a file that tells web crawlers which pages are off limits).
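As a sketch of what this looks like in practice: a site owner who wanted to keep AI scrapers out while remaining visible to ordinary search crawlers could publish a robots.txt along the following lines. (GPTBot and CCBot are the user-agent tokens that OpenAI and Common Crawl, respectively, document for their crawlers; note that honoring robots.txt is voluntary on the crawler’s part.)

```
# Disallow OpenAI's crawler from the entire site
User-agent: GPTBot
Disallow: /

# Disallow Common Crawl's crawler, a major source of AI training data
User-agent: CCBot
Disallow: /

# All other crawlers (e.g., ordinary search engines) may index everything
User-agent: *
Allow: /
```

The file lives at the root of the site (e.g., example.com/robots.txt), and each `User-agent` block applies only to crawlers that identify themselves with that token.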
Large user-generated content sites like Wikipedia, Stack Overflow, and Reddit are particularly important to generative AI systems, and they could cut off access to their content in even stronger ways, such as blocking IP traffic and API access. According to Elon Musk, Twitter has recently done exactly this. Content producers should also take advantage of the opt-out mechanisms that AI companies are increasingly providing. For instance, programmers on GitHub can opt out of BigCode’s training data via a simple form. More generally, simply being vocal when content has been used without consent has proven somewhat effective. For example, after a social media uproar, major generative AI player Stability AI agreed to honor opt-out requests collected via haveibeentrained.com. Public forms of action, such as artists’ mass protests against AI art, may even force companies to cease business activities that most of the public perceives as theft.