The First Online Content Revolution: From Websites to Directories to Search
Some of us still remember the “dotcom” boom of 1999-2001 and the “Cambrian explosion” of websites catering to the masses as well as to the long tail of people seeking very specific content. All you needed to do was remember a site’s name. Then came Yahoo, AOL, and other directories, where websites competed for placement and ratings. Within a few years, these directories were replaced by search engines, with Google emerging as a virtual search monopoly. Google developed the most efficient crawling bot, which visited each website, followed every link, and copied the data. It then processed this data and returned a list of the most relevant websites for a given set of keywords. Now, as a user, you no longer needed to remember the domain name; you could simply type in a few keywords and find the relevant link.
This transition created a huge dilemma for publishers and newsmakers. On one hand, you wanted to rank among the top ten search results, which meant making all your content crawlable and search-engine optimized (SEO). On the other hand, you lost a good deal of advertising revenue from front-page ads, as well as the visibility of the ads a user would have seen while browsing the website directly. Most publishers and newsmakers decided to adapt to the new realities of search and become fully searchable. Even premium paywalled content was made available to search engines to increase the probability of it being picked up by crawling bots. This new paradigm also led to the creation of new media that supplied user-generated content in the form of blogs and contributor networks. This new media could now compete with traditional media for traffic and advertising revenue, while the bulk of the advertising revenue went to the search engines.
This redistribution of power and advertising revenue has been reflected in the valuations of publishers in recent years. One of the most famous and prominent publishing houses, Forbes, was acquired for just $415 million in 2016 by the Hong Kong-based Integrated Whale Media Investments, and is expected to change hands again for around $800 million in a deal led by the young technology genius and founder of Luminar, Austin Russell. Fortune was acquired for a mere $150 million by the Thai billionaire Chatchaval Jiaravanon in 2018. In 2013, Jeff Bezos acquired the Washington Post for $250 million.
In comparison, at the time of this writing, the market capitalization of Google was about $1.66 trillion, Facebook traded at around $760 billion, and Twitter had been sold for $44 billion. These tech giants became the leading traffic aggregators and attracted the majority of the advertising revenue, taking ad revenue away from the content creators that fund professional journalism.
This unfair redistribution of ad revenue, combined with publishers’ desire to capture aggregator traffic, led professional media outlets to focus on SEO, coming up with flashier titles and catering to consumer wants rather than focusing on more balanced and professional reporting. Some governments have recognized this trend and are trying to enforce a fairer distribution of advertising revenue. For example, Canada introduced the Online News Act, a bill requiring the online giants to share advertising revenue with publishers, a move strongly opposed by the search engines and social networks.
Publishers’ Dilemma: Should You Allow OpenAI’s GPTBot to Crawl Your Content?
While there are rumors that ChatGPT was trained on data from Microsoft Bing’s crawling bot and other data provided by Microsoft, OpenAI revealed its own web crawler, GPTBot, in a short note in its documentation. Almost immediately, on August 8, 2023, Bryson Masse of VentureBeat reported that some publishers and creators had started blocking the bot to protect their content. Benj Edwards of Ars Technica expanded on the story.
It is no secret that some transformer-based Large Language Models (LLMs), such as GPT-4, the model behind ChatGPT, have become so good that they started outperforming humans in many tasks, including some analytical ones. These models are still far from perfect, and most high-quality publishers, including Forbes.com and Nature Publishing Group, have introduced strict policies banning the use of generative tools for content creation.
I previously wrote an article explaining that publishers with massive amounts of proprietary content are the most likely beneficiaries of the generative AI revolution, as they may be able to develop their own trustworthy chatbots or license the content to generative AI companies. However, if they let their content be crawled and processed by the crawler bots operated by generative AI companies without proper watermarking and copyright notices, they are likely to lose this advantage. At the same time, as generative AI systems become more interpretable and better at pointing users back to the primary source, not being crawled will decrease the probability of the content being discovered at all.
This is the new dilemma that most publishers will face sooner or later. At this point, it is safer to protect paywalled content from being crawled, expose only titles and keywords to the crawler bots, and invest in internal generative AI capabilities.
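For publishers who choose this route, the blocking itself is straightforward. The sketch below is a hypothetical robots.txt fragment, assuming an illustrative site layout where paywalled articles live under a /premium/ path (the path is invented for this example); GPTBot is the user-agent token OpenAI documents for its crawler, and the rules for other crawlers are left untouched.

    # Hypothetical example: keep OpenAI's GPTBot out of paywalled articles
    # while leaving public pages (titles, teasers, keywords) accessible.
    User-agent: GPTBot
    Disallow: /premium/

    # Other crawlers keep whatever access the site already grants.
    User-agent: *
    Disallow:

Note that robots.txt is a voluntary convention: it only restricts crawlers that choose to honor it, so it complements, rather than replaces, paywalls and licensing terms.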
It is also important to note that much of the published content has already been crawled by the search engines and may be reused for training generative AI systems. For example, Google invested heavily in digitizing books and has crawled much of the public web. Will these books and this crawled content be used to train LLMs? Large-scale audits would be needed to determine whether this content has already been used by the leading AI players.
Will Generative AI Cause Further Decline in Professional Journalism?
It is natural to expect some decline in the quality of content from publishers and newsmakers where the use of generative tools is allowed or even encouraged; multiple spam publishers are already doing this and often confusing the search engines. However, we should not underestimate the potential for further loss of advertising revenue. Serious publishers require a steady stream of advertising and subscription revenue to maintain their high editorial standards. This is where lawmakers may come into play, to ensure that independent media is supported and professional journalism is encouraged. Otherwise, we are likely to see the Internet polluted with AI-generated content produced by the very LLMs that demonetized professional publishers.