Perplexity AI Accused of Ignoring Website Scraping Blocks

Summary
– AI startup Perplexity is scraping content from websites that explicitly block scraping, according to Cloudflare, which observed the startup hiding its activities.
– Cloudflare found Perplexity bypassing blocks by altering its bots’ user agent and network identifiers, affecting tens of thousands of domains daily.
– Perplexity denied Cloudflare’s claims, calling the report a “sales pitch” and stating the identified bot wasn’t theirs, despite Cloudflare’s evidence.
– Cloudflare has taken steps to block Perplexity’s bots and recently launched tools that let websites charge AI scrapers for access to their content or block them entirely.
– This isn’t the first time Perplexity has faced scraping accusations: past allegations include plagiarism, and its CEO has struggled to define the term when pressed.
Cloudflare has accused AI startup Perplexity of bypassing website restrictions designed to prevent unauthorized data scraping, raising concerns about ethical web crawling practices. According to the internet infrastructure provider, Perplexity allegedly ignored explicit blocks and disguised its scraping activities by altering digital fingerprints used to identify its bots.
Cloudflare’s research found that Perplexity modified its user-agent strings, the identifiers a client sends to tell a server which browser or bot is making a request, to mimic legitimate traffic. The company also reportedly rotated the autonomous system numbers (ASNs) its traffic came from; an ASN identifies the network an IP address belongs to and helps trace large-scale internet activity. These tactics allegedly allowed Perplexity to evade detection while scraping tens of thousands of domains with millions of requests per day.
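To illustrate why changing a user agent can defeat a simple block, here is a minimal Python sketch of a filter that matches on the User-Agent header alone. The crawler token `ExampleBot`, the sample header strings, and the function name are hypothetical and chosen for illustration; this is not Cloudflare’s detection method, which the company says relies on machine learning and network analysis.

```python
# Hypothetical token for a crawler that identifies itself honestly.
DECLARED_CRAWLER_TOKENS = ("ExampleBot",)


def is_declared_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in DECLARED_CRAWLER_TOKENS)


# A crawler announcing itself (illustrative string).
declared_ua = "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"

# The same crawler presenting a generic desktop-browser string instead.
spoofed_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(is_declared_crawler(declared_ua))  # the honest header is caught
print(is_declared_crawler(spoofed_ua))   # the browser-like header slips through
```

Because the header is supplied by the client, a filter like this only stops crawlers that choose to identify themselves, which is why network-level signals such as ASNs are also used to trace traffic.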
Perplexity denied the allegations, dismissing Cloudflare’s report as a marketing tactic. A company spokesperson claimed the screenshots in the post didn’t prove any content was accessed and insisted the bot in question didn’t belong to them. However, Cloudflare countered that its findings were based on machine learning and network analysis after multiple customers reported unauthorized scraping despite implementing robots.txt blocks, a standard method for controlling web crawlers.
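For readers unfamiliar with robots.txt, the following is a minimal sketch of how a well-behaved crawler might consult those rules before fetching a page, using Python’s standard `urllib.robotparser` module. The rules, the `ExampleBot` and `OtherBot` names, and the URL are assumptions made up for this example. Note that robots.txt is advisory: nothing in the protocol forces a crawler to run a check like this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one named crawler and allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
# Parse the rules from memory instead of fetching them over the network.
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"

# A compliant crawler asks whether its own user agent may fetch the URL.
example_bot_allowed = parser.can_fetch("ExampleBot", url)
other_bot_allowed = parser.can_fetch("OtherBot", url)

print(example_bot_allowed)  # False: the site has disallowed ExampleBot
print(other_bot_allowed)    # True: other agents fall under the wildcard rule
```

The check is keyed on the user agent the crawler declares, so a crawler that presents a different identity is evaluated against different rules.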
The controversy highlights growing tensions between AI companies reliant on web data and publishers seeking to protect their content. Cloudflare has actively opposed unchecked AI scraping, recently introducing tools to let website owners charge for access or block bots entirely. Last year, the company also launched free anti-scraping measures amid concerns that AI training practices were undermining online publishers.
This isn’t the first time Perplexity has faced scrutiny. Earlier accusations from outlets like Wired alleged the company reproduced articles without proper attribution. During a TechCrunch interview, Perplexity’s CEO struggled to define plagiarism when pressed, further fueling skepticism about its data-handling policies.
As debates over AI ethics and content ownership intensify, Cloudflare’s findings add pressure on tech firms to prioritize transparency and respect for publisher preferences. The situation underscores the need for clearer industry standards to balance innovation with fair data usage.
(Source: TechCrunch)