Cloudflare Accuses Perplexity Of Evading Anti-bot Rules

Internet infrastructure provider Cloudflare says AI startup Perplexity has been crawling and scraping content from websites that explicitly disallowed such activity, and that the company disguised its bots' identities to get around those restrictions. According to Cloudflare, Perplexity changed its bots' user agents and the autonomous system numbers (ASNs) they connected from in order to evade detection across numerous domains.

Cloudflare published research on Monday detailing its observations that Perplexity ignored existing blocks and concealed its crawling and scraping operations. The network infrastructure company accused Perplexity of obscuring its identity while attempting to scrape web pages, stating this was “an attempt to circumvent the website’s preferences.”

AI products, including those offered by Perplexity, rely on ingesting large volumes of data from the internet. AI startups have frequently scraped text, images, and videos from the web, often without explicit permission, to make their products work. Websites have increasingly turned to the robots.txt file, a web standard that tells search engines and AI companies which pages their crawlers may and may not access, with varying degrees of success.

Cloudflare stated that Perplexity appeared to be intentionally circumventing these blocks by modifying its bots’ “user agent,” a string that identifies a website visitor’s software, such as the browser, its version, and the operating system. The company also noted that Perplexity changed the autonomous system numbers (ASNs) its traffic came from as part of these efforts; an ASN is a numerical identifier for a large network on the internet. Cloudflare’s post specified, “This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.”
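To see why a changed user agent matters, consider how robots.txt rules are matched. The short Python sketch below uses the standard library's `urllib.robotparser` to evaluate a made-up robots.txt file. The crawler name `PerplexityBot`, the example URL, and both user-agent strings are illustrative assumptions and are not taken from Cloudflare's post. The sketch only shows that rules are keyed to the name a crawler declares; it does not reproduce Cloudflare's detection method.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow one declared crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt above permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/article"  # placeholder URL

# A crawler that announces itself under its declared name (assumed string).
DECLARED_UA = "PerplexityBot/1.0"

# A generic browser-style string of the kind a Chrome-on-macOS visitor sends.
GENERIC_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(is_allowed(DECLARED_UA, URL))  # False: the named rule applies
print(is_allowed(GENERIC_UA, URL))   # True: only the wildcard rule applies
```

Because robots.txt is advisory and matched on the declared name, a request that presents a browser-style user agent no longer falls under the rule written for the named crawler. That is why a site operator needs other signals, such as the network signals Cloudflare mentions, to tell such traffic apart.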

Jesse Dwyer, a spokesperson for Perplexity, dismissed Cloudflare’s blog post as a “sales pitch.” In an email to TechCrunch, Dwyer asserted that the screenshots included in the post “show that no content was accessed.” In a subsequent email, Dwyer claimed the bot identified in the Cloudflare blog was not associated with Perplexity.

Cloudflare said it first detected the behavior after customers reported that Perplexity was crawling and scraping their sites even though they had set robots.txt rules and added specific blocks targeting known Perplexity bots. Cloudflare then ran its own tests to verify these claims and said they confirmed that Perplexity was circumventing the blocks.

Cloudflare stated, “We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked.” The company confirmed it has de-listed Perplexity’s bots from its verified list and has implemented new technical methods to block them.

Cloudflare has recently taken a public stance against AI crawlers. Last month, it announced a new marketplace designed to let website owners and publishers charge AI scrapers that visit their sites. At that time, Cloudflare’s chief executive, Matthew Prince, warned that AI was disrupting the internet’s business model, particularly for publishers. Last year, Cloudflare also introduced a free tool intended to prevent bots from scraping websites for AI training purposes.

This is not the first time Perplexity has faced accusations of unauthorized scraping. Last year, news organizations, including Wired, alleged that Perplexity plagiarized their content. Weeks later, during an interview with TechCrunch’s Devin Coldewey at the Disrupt 2024 conference, Perplexity’s CEO, Aravind Srinivas, could not immediately define plagiarism when asked.

