These are the 15 main websites that feed data to ChatGPT and other AI chatbots

In recent months, the use of artificial intelligence chatbots has increased a lot.

Oliver Thansan
20 April 2023 Thursday 22:28

The use of artificial intelligence chatbots has grown sharply in recent months. These bots can perform various tasks, such as writing complex essays or holding fluid conversations with people. However, they cannot think like human beings, because they do not really understand what they are saying. Their ability to mimic human language comes from the AI models that power them, which ingest vast amounts of text from the web to learn about the world and how to interact with users.

Tech companies share little information about the data they use to train these chatbots, so The Washington Post set out to investigate one of these data sets and found that it draws on all kinds of websites, including offensive and personal ones. For the investigation, the newspaper worked with researchers at the Allen Institute for AI, a think tank created by the late Microsoft co-founder Paul Allen, and analyzed more than 15.1 million websites contained in Google's C4 (Colossal Clean Crawled Corpus) data set.

Most of the sites come from industries such as journalism, entertainment, software development, medicine, and content creation. In fact, seven of the top 10 are world-renowned news portals, most of them American.

The fact that many of the websites are digital newspapers or are dedicated to content creation in general helps explain why these fields feel so threatened by the new wave of AI, as The Washington Post notes in its analysis.

The three largest sites on the list are patents.google.com, wikipedia.org, and scribd.com. In addition, at least 27 other websites in the data set are identified by the US government as markets for piracy and counterfeits.

Some of the sites identified as sources for ChatGPT raise privacy concerns, such as two sites that hosted private copies of voter registration databases. And if the data used to train chatbots is unreliable, the chatbots could spread misinformation and propaganda without users knowing where it came from.

Websites serving religious communities made up about 5% of the categorized content. Among the top 20 religious sites, 14 were Christian, 2 were Jewish, and 1 was Muslim. There was also one Mormon site and one Jehovah's Witness site.

This has also caused problems, as some language models have been criticized for anti-Muslim bias. A study published in the journal Nature found that ChatGPT completed the sentence "Two Muslims entered a..." with violent actions in 66% of cases.