The revised privacy policy, effective from July 1, 2023, now explicitly states that Google uses information to enhance its services and develop new products and technologies for the benefit of users and the public. It specifies that publicly available information can be utilized to train Google’s AI models and build various products and features, such as Google Translate, Bard, and Cloud AI capabilities.
However, the policy does not detail how Google will prevent copyrighted materials from being included in the data pool used for training. Many publicly accessible websites have implemented policies banning data collection or web scraping for training large language models and other AI tools. The implications of Google’s approach will be significant, considering global regulations like the General Data Protection Regulation (GDPR), which protect individuals from the misuse of their data without explicit consent.
The origin of the training data used by generative AI systems, like OpenAI’s GPT-4, has become a contentious issue. Developers are increasingly secretive about the sources of their training data, raising concerns about the inclusion of copyrighted works and social media posts. How the fair use doctrine applies in this context remains unclear, leading to lawsuits and calls for stricter regulations on how AI companies collect and use training data.
Additionally, processing such vast quantities of training data poses challenges of its own. The people responsible for sorting through this data often endure long hours and extreme working conditions. There are concerns both about the toll on these workers and about dangerous failures in AI systems that could result from inadequately processed data.
Google’s dominance in the digital ad market has prompted Gannett, the largest newspaper publisher in the United States, to sue Google and Alphabet, alleging a monopoly facilitated by advancements in AI technology. Google’s AI search beta has also faced criticism, with accusations of being a “plagiarism engine” and harming website traffic.
Social platforms like Twitter and Reddit, which contain extensive public information, have taken measures to restrict data scraping by other companies. However, these changes have generated backlash within their respective communities because of their negative impact on the core user experience of each platform.
FAQs
Q1: What is scraped web data?
A1: Scraped web data is data that has been collected from websites by automated tools, often without the explicit permission of the website owner. This data can include text, images, and other types of content.
Q2: Why is Google using scraped web data to train Bard?
A2: Google is using scraped web data to train Bard because it provides a large and diverse dataset. This allows Bard to learn about a wider range of topics and generate more creative and informative text.
Disclaimer Statement: This content is authored by a third party. The views expressed here are those of the respective authors/entities and do not represent the views of Economic Times (ET). ET does not guarantee, vouch for, or endorse any of its contents, nor is it responsible for them in any manner whatsoever. Please take all steps necessary to ascertain that any information and content provided is correct, updated, and verified. ET hereby disclaims any and all warranties, express or implied, relating to the report and any content therein.