Good bot, bad bot: Using AI and ML to solve data quality problems

More than 40% of all website traffic in 2021 wasn’t even human.

This might sound alarming, but it’s not necessarily a bad thing; bots are core to the functioning of the internet. They make our lives easier in ways that aren’t always obvious, like delivering push notifications on promotions and discounts.

But, of course, there are bad bots, and they account for nearly 28% of all website traffic. From spam and account takeovers to scraping of personal information and malware, it’s typically how people deploy bots that separates good from bad.

With the release of accessible generative AI like ChatGPT, it’s going to get harder to discern where bots end and humans begin. These systems are getting better at reasoning: GPT-4 passed the bar exam in the top 10% of test takers, and bots have even defeated CAPTCHA tests.

In many ways, we could be on the verge of a critical mass of bots on the internet, and that could be a dire problem for consumer data.

The existential threat

Companies spend about $90 billion on market research each year to decipher trends, customer behavior and demographics.

But even with this direct line to consumers, failure rates on innovation are dire. Catalina projects that the failure rate of consumer packaged goods (CPG) is at a frightful 80%, while the University of Toronto found that 75% of new grocery products flop.

What if the data these creators rely on was riddled with AI-generated responses and didn’t actually represent the thoughts and feelings of a consumer? We’d live in a world where businesses lack the fundamental resources to inform, validate and inspire their best ideas, causing failure rates to skyrocket: a crisis they can ill afford now.

Bots have existed for a long time, and for the most part, market research has relied on manual processes and gut instinct to analyze, interpret and weed out such low-quality respondents.

But while humans are exceptional at bringing reason to data, we are incapable of distinguishing bots from humans at scale. The reality for consumer data is that the nascent threat of large language models (LLMs) will soon outpace the manual processes we use to identify bad bots.

Bad bot, meet good bot

While bots may be the problem, they could also be the answer. By creating a layered approach using AI, including deep learning or machine learning (ML) models, researchers can build systems that separate out low-quality data and rely on good bots to do that work.

This technology is ideal for detecting subtle patterns that humans can easily miss or not understand. And if managed correctly, these processes can feed ML algorithms to constantly assess and clean data to ensure quality is AI-proof.

Here’s how:

Create a measure of quality

Rather than relying solely on manual intervention, teams can ensure quality by creating a scoring system through which they identify common bot tactics. Building a measure of quality requires some subjective judgment. Researchers can set guardrails for responses across several factors. For example:

Spam probability: Are responses made up of inserted or cut-and-paste content?
Gibberish: A human response will contain brand names, proper nouns or misspellings, but generally track toward a cogent response.
Skipping recall questions: While AI can sufficiently predict the next word in a sequence, it is unable to replicate personal memories.

These data checks can be subjective — that’s the point. Now more than ever, we need to be skeptical of data and build systems to standardize quality. By applying a point system to these traits, researchers can compile a composite score and eliminate low-quality data before it moves on to the next layer of checks.
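One way to picture the point system described above is the minimal sketch below. It is an illustration under stated assumptions, not the author's actual scoring method: the `ResponseChecks` fields, the `WEIGHTS` values and the `REJECT_THRESHOLD` cutoff are all hypothetical, and the upstream checks that produce a spam probability or a gibberish flag are assumed to exist elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ResponseChecks:
    """Outcome of the individual checks for one survey response (hypothetical shape)."""
    spam_probability: float  # 0.0 to 1.0, assumed to come from an upstream spam check
    is_gibberish: bool       # True if the text does not track toward a cogent answer
    skipped_recall: bool     # True if a personal-recall question was left unanswered


# Hypothetical point weights; a real team would choose and tune its own.
WEIGHTS = {"spam": 50, "gibberish": 30, "skipped_recall": 20}

# Hypothetical cutoff: responses scoring at or above this are dropped.
REJECT_THRESHOLD = 40


def composite_score(checks: ResponseChecks) -> float:
    """Add up penalty points; a higher score means lower quality."""
    points = WEIGHTS["spam"] * checks.spam_probability
    if checks.is_gibberish:
        points += WEIGHTS["gibberish"]
    if checks.skipped_recall:
        points += WEIGHTS["skipped_recall"]
    return points


def passes_first_layer(checks: ResponseChecks) -> bool:
    """Keep a response only if its composite score stays under the cutoff."""
    return composite_score(checks) < REJECT_THRESHOLD
```

Keeping the weights in one place makes the subjectivity explicit: the judgment calls live in a single table that can be reviewed and adjusted, rather than being scattered across manual decisions.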

Look at the quality behind the data

With the rise of human-like AI, bots can slip through the cracks if you rely on quality scores alone. This is why it’s imperative to layer these signals with data around the output itself. Real people take time to read, re-read and analyze before responding; bad actors often don’t, which is why it’s important to look at the response level to understand trends among bad actors.

Factors like time to response, repetition and insightfulness go beyond the surface level to analyze the nature of the responses. If responses arrive too fast, or nearly identical responses appear across one survey (or several), that can be a telltale sign of low-quality data. Finally, going beyond nonsensical responses to identify what makes a response insightful, for example by looking critically at its length and at the adjectives it uses and how many, can weed out the lowest-quality responses.
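The speed and repetition signals can be sketched roughly as follows. This is a simplified illustration, not a production detector: `MIN_SECONDS_PER_WORD` and `SIMILARITY_THRESHOLD` are assumed values, and the pairwise comparison uses Python's standard `difflib.SequenceMatcher`, which is only practical for small batches of responses.

```python
from difflib import SequenceMatcher

# Assumed thresholds for illustration only; real values would be tuned on real data.
MIN_SECONDS_PER_WORD = 0.2
SIMILARITY_THRESHOLD = 0.9


def too_fast(text: str, seconds_taken: float) -> bool:
    """Flag a response written faster than a person could plausibly type it."""
    words = len(text.split())
    return words > 0 and seconds_taken / words < MIN_SECONDS_PER_WORD


def near_duplicates(responses: list[str]) -> list[tuple[int, int]]:
    """Return index pairs of responses whose text is nearly identical."""
    # Lowercase and collapse whitespace so trivial edits do not hide a copy.
    normalized = [" ".join(r.lower().split()) for r in responses]
    pairs = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if ratio >= SIMILARITY_THRESHOLD:
                pairs.append((i, j))
    return pairs
```

Either flag on its own is weak evidence; in the layered approach described here, they would feed into the overall quality judgment rather than reject a response outright.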

By looking beyond the obvious data, we can establish trends and build a consistent model of high-quality data.

Get AI to do your cleaning for you

Ensuring high-quality data isn’t a “set it and forget it” process; it requires consistently moderating and ingesting good (and bad) data to hit the moving target that is data quality. Humans play an integral role in this flywheel: they set up the system, sit above the data to spot patterns that influence the standard, and then feed those features, including the rejected items, back into the model.
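As a deliberately small illustration of that feedback loop, the sketch below recalibrates a rejection cutoff from human-reviewed examples. It stands in for retraining a full model and is an assumption rather than the author's method: the function name `recalibrate_threshold` and the `(score, is_bad)` record shape are hypothetical.

```python
from typing import Optional


def recalibrate_threshold(reviewed: list[tuple[float, bool]]) -> Optional[float]:
    """Pick the cutoff that best matches human reviewers.

    `reviewed` holds (quality_score, is_bad) pairs, where is_bad is the
    reviewer's verdict. A response is rejected when its score >= cutoff.
    Returns None when there is nothing to learn from yet.
    """
    best_cutoff, best_errors = None, None
    # Only observed scores need to be tried as candidate cutoffs.
    for cutoff in sorted({score for score, _ in reviewed}):
        errors = sum((score >= cutoff) != is_bad for score, is_bad in reviewed)
        if best_errors is None or errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


# Each review cycle appends new verdicts, accepted and rejected alike,
# and the cutoff is recomputed from the full history.
history = [(10, False), (20, False), (30, False), (35, True), (50, True)]
cutoff = recalibrate_threshold(history)
```

The point of the sketch is the loop rather than the arithmetic: rejected items stay in the history, so every new round of human review shifts the standard the system enforces.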

Your existing data isn’t immune, either. Existing data shouldn’t be set in stone, but rather subject to the same rigorous standards as new data. By regularly cleaning normative databases and historic benchmarks, you can ensure that every new piece of data is measured against a high-quality comparison point, unlocking more agile and confident decision-making at scale.

Once these scores are in hand, this methodology can be scaled across regions to identify high-risk markets where manual intervention could be needed.
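Rolling the results up by region might look like the sketch below. The `(region, was_rejected)` record shape and the `HIGH_RISK_REJECTION_RATE` cutoff are assumptions made for illustration; a real pipeline would likely track more than a single rejection flag per response.

```python
from collections import defaultdict

# Assumed cutoff: a region is high risk if at least 25% of its responses were rejected.
HIGH_RISK_REJECTION_RATE = 0.25


def high_risk_regions(records: list[tuple[str, bool]]) -> list[str]:
    """Return regions whose share of rejected responses meets the cutoff.

    `records` holds (region, was_rejected) pairs, one per survey response.
    """
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for region, was_rejected in records:
        totals[region] += 1
        if was_rejected:
            rejected[region] += 1
    return sorted(
        region
        for region in totals
        if rejected[region] / totals[region] >= HIGH_RISK_REJECTION_RATE
    )
```

A region that crosses the cutoff is a candidate for manual review, not an automatic verdict on every response collected there.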

Fight nefarious AI with good AI

The market research industry is at a crossroads: data quality is worsening, and bots will soon constitute an even larger share of internet traffic. It won’t be long, so researchers should act fast.

But the solution is to fight nefarious AI with good AI. This will allow for a virtuous flywheel to spin; the system gets smarter as more data is ingested by the models. The result is an ongoing improvement in data quality. More importantly, it means that companies can have confidence in their market research to make much better strategic decisions.

Jack Millership is the data expertise lead at Zappi.
