The insatiable voracity of AI


Oliver Thansan
Thursday, 11 April 2024, 16:26

This article belongs to 'Artificial', the newsletter on artificial intelligence that Francesc Bracero sends every Friday. If you want to receive it in your inbox, sign up here.

The training of large artificial intelligence language models threatens to become a kind of black hole that devours everything around it. Data, data and more data, plus enormous volumes of energy. Nothing seems to be enough to cover the training needs of their models. The Reuters agency reports that Meta, Google, Amazon and Apple each reached an agreement with the Shutterstock photo repository in 2022, shortly after the launch of ChatGPT. Any kind of data is valuable for AI, and that includes images and videos.

The strategies for obtaining data are varied, and they include illicit use without asking permission from rights holders, with possible privacy violations along the way. The New York Times, which last December filed a federal lawsuit against OpenAI and Microsoft for using its articles without permission, reports that the ChatGPT maker was running out of data at the end of 2021 and that its engineers created a speech recognition tool called Whisper to transcribe YouTube videos, which belong to Google. After an internal debate about that use, an OpenAI team ended up transcribing a million hours of video without asking permission.

AI companies are not only acting on their own; they are also paying. According to Reuters, the private repository Photobucket, where two million paying users currently store some 13 billion photos and videos, is in talks with AI companies to let them train their models on that material, at rates of between 5 cents and 1 dollar per photo and more than 1 dollar per video. It would be cheaper for AI companies to simply help themselves to whatever is on the internet without negotiating anything, but lawsuits have pushed them to seek access to what they call "quality content", which is (surprise!) content created by humans.

But they have a problem: reaching agreements to use huge volumes of data is expensive and slow. Meta has faced the same dilemma. It has the technology, but it needs far more data than it is able to get. According to The New York Times, executives, lawyers and engineers at the company behind Facebook and Instagram considered secretly using copyrighted data from the internet and even buying the publishing house Simon & Schuster.

The promise of AI faces several problems of scale. Projections from the artificial intelligence research institute Epoch suggest that companies will have exhausted the stock of low-quality language data on the internet between 2030 and 2050, while high-quality language data will run out before 2026 and vision data between 2030 and 2060. The researchers warn that their conclusions rest "on the unrealistic assumption" that current trends in the use and production of machine-learning data will continue and that there will be no major innovations in data efficiency. So everything could happen even faster.

One of the options left to companies for training their models is to turn to so-called "synthetic data", data generated by the artificial intelligence systems themselves. It could be a way out of a possible collapse. While OpenAI's most advanced model, GPT-4, was trained on some 12 trillion tokens, the estimate is that GPT-5, which could arrive this year, would require 60 to 100 trillion. Last year, the US Copyright Office received complaints from more than 10,000 trade groups, authors and companies about artificial intelligence firms' use of original works. No amount of data is ever enough.