Artificial intelligence search startup Perplexity is embroiled in a high-stakes controversy after internet infrastructure giant Cloudflare accused it of systematically bypassing anti-scraping measures to harvest content from restricted websites. The allegations, detailed in a Cloudflare technical report, claim Perplexity employed sophisticated deception tactics to circumvent publishers' expressed consent.
Cloudflare asserts it documented Perplexity disguising its web crawlers as human users, using fake browser identifiers and rotating networks to evade blocks. The company conducted controlled tests using newly registered, unpublicized domains explicitly configured to block all bots via robots.txt files and web application firewall (WAF) rules. Despite these safeguards, Perplexity's AI allegedly accessed the protected content and reproduced it in responses to user queries.
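A blanket robots.txt block of the kind Cloudflare describes takes only a few lines. The file below is illustrative of that test setup, not Cloudflare's actual configuration:

    # Disallow every crawler from every path on the site.
    User-agent: *
    Disallow: /

A compliant crawler is expected to fetch this file before requesting any page and to skip the site entirely when it sees a blanket disallow.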
Evidence of Evasion Tactics
Cloudflare's investigation revealed Perplexity used two distinct crawling methods. Its declared crawlers, identified by the user agents "PerplexityBot" and "Perplexity-User," were properly blocked during testing. However, researchers simultaneously detected millions of daily stealth requests from an undeclared user agent impersonating Google Chrome on macOS. This disguised crawler used IP addresses outside Perplexity's documented range and rotated across autonomous system numbers (ASNs) to obscure its origin.
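Cloudflare has not published its detection code, but the signal it describes can be sketched in a few lines: traffic that presents a browser User-Agent while originating outside a crawler's published IP ranges, and that already trips behavioral bot signals, fits the stealth pattern. Everything below, the IP range, the function name, and the behavioral flag, is hypothetical and illustrative only:

    # Simplified, hypothetical sketch of the signal described above; this is
    # not Cloudflare's actual detection logic. "behaves_like_bot" stands in
    # for upstream behavioral scoring (request rate, fingerprinting, etc.).
    import ipaddress

    # Hypothetical published IP ranges for a declared crawler.
    DECLARED_CRAWLER_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

    def matches_stealth_pattern(user_agent: str, source_ip: str,
                                behaves_like_bot: bool) -> bool:
        """Flag automated traffic that claims to be Chrome on macOS
        but comes from outside any declared crawler IP range."""
        claims_browser = "Chrome" in user_agent and "Macintosh" in user_agent
        ip = ipaddress.ip_address(source_ip)
        in_declared_range = any(ip in net for net in DECLARED_CRAWLER_RANGES)
        return behaves_like_bot and claims_browser and not in_declared_range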
"This activity was observed across tens of thousands of domains and millions of requests per day," Cloudflare stated, noting the pattern violated the internet crawling norms outlined in RFC 9309, the Robots Exclusion Protocol. When blocked, Perplexity allegedly shifted to third-party services like BrowserBase or scraped alternative sites, though the results lacked the original content's depth.
Perplexity’s Firm Rejection
Perplexity spokesperson Jesse Dwyer dismissed the report as a “sales pitch” and denied ownership of the flagged bots. The company later argued Cloudflare fundamentally misunderstood “user-driven fetching,” where AI agents access sites dynamically to answer specific queries rather than mass-scrape content. “The difference between automated crawling and user-driven fetching isn’t just technical, it’s about who gets to access information on the open web,” Perplexity contended.
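What "user-driven fetching" looks like in practice can be sketched simply: a single page retrieved in direct response to a user's question, rather than a crawl of an entire site. The sketch below is hypothetical (the agent name and error handling are illustrative) and voluntarily checks robots.txt before fetching; whether an on-demand fetch owes the site that check is precisely what Cloudflare and Perplexity dispute.

    # Hypothetical sketch of an on-demand, user-triggered fetch.
    # It consults robots.txt first; Perplexity argues such user-driven
    # requests should instead be treated like a human browser visit.
    from urllib.parse import urljoin, urlparse
    from urllib.request import Request, urlopen
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleAssistant/1.0"  # illustrative agent name

    def fetch_for_user(url: str) -> str:
        """Fetch exactly one page on a user's behalf, honoring robots.txt."""
        parts = urlparse(url)
        robots = RobotFileParser()
        robots.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
        robots.read()
        if not robots.can_fetch(USER_AGENT, url):
            raise PermissionError(f"robots.txt disallows fetching {url}")
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")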
Notably, some technologists defended this distinction. “If I, as a human, request a website, I should be shown the content. Why would an LLM accessing it on my behalf be treated differently?” questioned one Hacker News commenter.
Broader Implications for AI and Publishers
This incident intensifies ongoing tensions between AI firms and content creators. Publishers increasingly use robots.txt files and WAF rules to protect intellectual property and ad revenue, especially as generative AI diverts traditional web traffic. Cloudflare reports that over 2.5 million sites now block AI crawlers through its tools.
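For publishers on Cloudflare, such a block is typically expressed as a custom WAF rule. The expression below is an illustrative sketch in Cloudflare's rules language, matching the two declared user agents named in the report; the exact rule a site deploys will vary:

    (http.user_agent contains "PerplexityBot") or
    (http.user_agent contains "Perplexity-User")

Paired with a Block action, this stops self-identified crawler traffic; stealth traffic carrying a spoofed browser User-Agent would not match it, which is why Cloudflare's report leans on behavioral detection instead.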
Legal experts warn that ignoring robots.txt, though not legally binding, could expose companies to copyright claims. "Ignoring website directives erodes trust and may breach terms of service," noted digital rights advocate Elena Gomez. "It invites not just technical blocks but legal escalation." The New York Times' ongoing lawsuit against OpenAI underscores these risks.
Cloudflare has delisted Perplexity as a verified bot and updated its systems to block the disguised traffic. Meanwhile, Perplexity maintains its approach aligns with evolving internet usage, stating: "Modern AI assistants work fundamentally differently from traditional web crawling."
As AI agents proliferate, this clash highlights urgent questions about consent, attribution, and the ethics of real-time data access in an increasingly automated web.