Finding good-quality data on which to train generative AI models has been difficult. And it’s going to get harder, too, as rights holders to vast quantities of internet data begin to close ranks in recognition of just how valuable a resource it all is, observes Stephen Marcinuk of Intelligent Relations.
There’s little question that AI has captured the public’s imagination. And businesses of all kinds are touting generative AI, in particular, as the key to maintaining and increasing productivity. Indeed, venture funding for generative AI startups has grown by a staggering $7.9 billion in the last year alone.
The problem is that although these models are unquestionably valuable, the data they are trained on is an integral part of that value. Think of it this way: generative AI is all the non-edible infrastructure of a buffet – the trays, plates, utensils, and serveware. But without food – in the case of AI, data – the chafing dishes are empty, and everyone goes hungry.
As access to broad public data becomes more restricted, developers are going to need to train models on narrower data sets. But before delving into these cautionary points in more detail, let’s first examine how modelers acquire data to begin with.
Data Collection and Its Limits
Up to now, large foundation language models like the various iterations of GPT have acquired most of their data from web scraping. This involves extracting data from websites using automated tools, such as AI crawlers. Their sources include search engine content, Wikipedia, online academic journals, news outlets, and other places people go for reliable content.
However, this indiscriminate approach to data collection is about to change. OpenAI, the company behind ChatGPT, is facing a slew of copyright lawsuits concerning its web-scraping practices.
Beyond legal action, rights holders are also taking proactive steps to protect their content. The New York Times announced in early August that it would prohibit AI modelers from scraping its content for training purposes. The Internet Archive, which maintains one of the largest archives of old web pages in the world, began actively blocking the IP addresses of data scrapers in May. Getty Images has sued one generative AI operator to stop it from allegedly scraping “millions” of images from its site.
All of this is to say that, in the months and years ahead, AI modelers will need to be more discerning and strategic about how they source data and where they source it from.
Data Management Vulnerabilities
So, just as demand for data is skyrocketing with the proliferation of generative AI, data keepers are throttling the supply. This means that developers will need to train their models on proprietary data, like confidential company information – for example, an AI-powered chatbot that customer service representatives use to pull up customer and transaction information.
As another example, take my own PR company. We use a proprietary, AI-powered platform to provide our clients with an array of public relations services. Clients provide us with data about their businesses, and we train our model on it to achieve customized PR results.
Of course, much of this data is sensitive and confidential and, therefore, attractive to hackers. So, for our clients’ security and our own insulation from liability, we’ve set up our data warehouses and file libraries as if they could be prime targets for hackers at any moment. We do this with the best data-protection infrastructure around – encryption, access control, authentication and authorization, firewalls and intrusion detection systems, secure APIs, and more.
But having solid cybersecurity systems in place isn’t enough. Modelers should also conduct regular security audits and penetration testing to identify vulnerabilities in file libraries and cloud systems. As part of this process, it’s important to have robust monitoring and logging mechanisms to check for unusual activities or patterns that could indicate a data breach.
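To make that concrete, here’s a minimal sketch of what such monitoring might look like: a script that scans an access log and flags accounts pulling down an unusual volume of records. The log format, field names, and threshold are all illustrative assumptions – a real deployment would feed alerts into a proper monitoring pipeline rather than print them.

```python
# Minimal access-log monitoring sketch. The JSONL log format, field names, and
# threshold are hypothetical; this is an illustration, not a production system.
import json
from collections import Counter
from datetime import datetime, timedelta

DOWNLOAD_THRESHOLD = 500      # flag accounts that pull unusually many records
WINDOW = timedelta(hours=1)   # within this rolling window

def flag_unusual_activity(log_path: str) -> list[str]:
    """Return account IDs whose download volume exceeds the threshold in any one-hour window."""
    events = []
    with open(log_path) as fh:
        for line in fh:
            record = json.loads(line)  # assumed fields: timestamp, account_id, action
            if record.get("action") == "download":
                events.append((datetime.fromisoformat(record["timestamp"]),
                               record["account_id"]))

    events.sort()
    flagged = set()
    for i, (window_start, _) in enumerate(events):
        # Count downloads per account within the window starting at this event
        counts = Counter(acct for ts, acct in events[i:] if ts - window_start <= WINDOW)
        for acct, n in counts.items():
            if n > DOWNLOAD_THRESHOLD:
                flagged.add(acct)
    return sorted(flagged)

if __name__ == "__main__":
    for account in flag_unusual_activity("access_log.jsonl"):
        print(f"Review account: {account}")
```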
It’s also critical that modelers regularly back up data and their model infrastructure to a secure and separate location. This way, in the event of a breach that affects your ability to access data – like a ransomware attack – you’ll still have clean data copies from which to restore your systems.
One of the surest strategies for ensuring data security in generative AI is data minimization. In other words, only take what you need. Modelers – especially those who work with highly specific, confidential client or customer data – should only collect and store the data they absolutely require for their model to function as intended. The less data you have on hand, the less appealing your system is to attackers, and the less trouble you’ll have in the event of a breach.
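As a rough illustration, a data-minimization step can be as simple as stripping every incoming record down to the fields the model actually needs before anything reaches the training store. The field names below are hypothetical placeholders for whatever a given pipeline collects.

```python
# Data-minimization sketch: keep only the fields the model needs before storage.
# Field names are hypothetical examples, not a prescribed schema.
import hashlib

REQUIRED_FIELDS = {"industry", "ticket_text", "resolution"}   # what the model needs
# email, phone, billing address, etc. are simply never retained

def minimize(record: dict) -> dict:
    """Strip a raw customer record down to the fields required for training."""
    slim = {k: v for k, v in record.items() if k in REQUIRED_FIELDS}
    # Keep a one-way hash of the customer ID so records can still be located and
    # deleted on request, without storing the identifier itself.
    slim["customer_ref"] = hashlib.sha256(str(record["customer_id"]).encode()).hexdigest()
    return slim

raw = {
    "customer_id": 10423,
    "email": "jane@example.com",
    "phone": "555-0100",
    "billing_address": "1 Main St",
    "industry": "retail",
    "ticket_text": "Order arrived late",
    "resolution": "Refund issued",
}
print(minimize(raw))   # only the minimized record ever reaches the training store
```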
Building Resilient, Narrower Models
Data minimization is going to be the buzzword you start hearing everywhere in the near future. We’re reaching a point where it’s no longer about pumping more training data into AI models. The thing that’s going to take you to the next level of modeling will be a quality-over-quantity approach.
What’s more, there are perfectly useful ways of repurposing the data and results you already have to refine your generative AI model. For example, you can use human-in-the-loop feedback to retrain a model again and again, improving its outputs over time without necessarily throwing more fresh data into the mix. Google has teams of experts analyzing Bard’s outputs for this purpose almost every day.
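In practice, a feedback loop like this can be as simple as having reviewers score outputs and keeping only the well-rated prompt/response pairs for the next fine-tuning pass. The sketch below assumes a simple JSONL review log and a 1–5 scoring scale – both assumptions – and leaves the fine-tuning step itself out.

```python
# Stripped-down human-in-the-loop feedback cycle: filter reviewed outputs down to
# the examples worth retraining on. Log format and scoring scale are assumptions.
import json

APPROVAL_THRESHOLD = 4  # reviewer scores run 1-5 in this hypothetical setup

def build_next_training_set(review_log: str, out_path: str) -> int:
    """Write highly rated prompt/response pairs to a fine-tuning file; return the count."""
    kept = 0
    with open(review_log) as src, open(out_path, "w") as dst:
        for line in src:
            item = json.loads(line)  # assumed fields: prompt, response, reviewer_score
            if item["reviewer_score"] >= APPROVAL_THRESHOLD:
                dst.write(json.dumps({"prompt": item["prompt"],
                                      "completion": item["response"]}) + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    n = build_next_training_set("reviews.jsonl", "finetune_round_2.jsonl")
    print(f"{n} approved examples queued for the next fine-tuning pass")
```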
Another game changer on this front is code interpreter (CI) plugins, which use AI to analyze private, proprietary files and assets. You can use CI plugins to analyze huge amounts of existing, offline data – like a client’s document library. The model can then answer specific questions about that client based on content in the document library. When you upload a single document, you’re effectively handing the model a one-off data source to work from in real time. For example, if I upload a client’s sales reports from the last three months, I can prompt a specialized model to graph sales patterns over that same period.
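Behind the scenes, the plugin is essentially writing and running a short analysis script on the uploaded file. Something along these lines – with placeholder file and column names – is roughly what the sales-pattern request above might produce.

```python
# Roughly the kind of script a code-interpreter plugin generates when asked to
# "graph sales patterns over the last three months." The file name and column
# names are placeholders for whatever the uploaded report actually contains.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("client_sales_last_3_months.csv", parse_dates=["date"])
weekly = sales.set_index("date")["revenue"].resample("W").sum()

ax = weekly.plot(kind="line", marker="o", title="Weekly revenue, last three months")
ax.set_xlabel("Week")
ax.set_ylabel("Revenue")
plt.tight_layout()
plt.savefig("sales_pattern.png")
```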
This is the future of generative AI, not huge, catch-all models that scrape the web for shallow – sometimes inaccurate – answers to basic questions, but rather compressed, efficient models built to meet specific needs. As data becomes a more closely guarded resource, these streamlined, purposeful models, capable of running on more limited data sets, will be the new gold standard.
Feasting on Limited Data
Over the course of the generative AI boom these past few years, modelers have gorged themselves on what they thought was the internet’s never-ending feast of data. At the same time, they’ve exploited a no-rules, frontier culture in this burgeoning field of cutting-edge technology. That’s all going to change.
But modelers who get ahead of this shift and build generative AI products that can produce better outputs with more limited training data won’t go hungry. They’ll develop palates for more refined data that suits their needs and craft offerings that stand out on a table crowded with empty calories.