
AI is only as good as the data it has been trained on, and the amount of information it can retrieve depends on the information it can access.
In a digital world where generative AI is taking over, AI is becoming an increasingly lucrative market. Since OpenAI introduced ChatGPT, the technology has been disrupting other industries, creating new opportunities, and making people a bit lazier.
Perplexity is one of the few big fish competing in the sea of generative AI.
The company, which describes itself as a "free AI search engine," has faced a range of criticism. Major publications have accused Perplexity of taking their news and republishing it on various platforms, and the company allegedly ignores a web standard and continues scraping websites even when it is told not to.
As a result, the AI company is facing scrutiny over both its web scraping and the accuracy of its chatbot.
The web standard in question is robots.txt.
This text file, usually located at the root of a website, contains instructions about which pages crawlers are allowed to access and which they are not.
Basically, search engines such as Google and Bing, AI companies, and other web entities use crawlers, or bots, to scan the web, jumping from one link to another in order to collect vast amounts of information from the internet.
Search engines do this to populate their databases of websites and web pages, making them indexable and searchable. AI companies, on the other hand, use crawlers to collect training data for their AI products.
This is where robots.txt comes in: web developers and website owners can use it to tell crawlers what they may and may not visit on their website.
Since 1994, websites of all sizes have used this file, and crawlers, wherever they come from, are meant to obey its instructions.
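To make this concrete, here is a minimal sketch in Python of how a well-behaved crawler might read those instructions, using the standard library's urllib.robotparser module. The robots.txt content, the bot names (ExampleBot, OtherBot) and the example.com URLs are all made up for illustration and are not taken from any real website or company.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleBot is told to stay away from the whole site.
blocked_bot = parser.can_fetch("ExampleBot", "https://example.com/news/story.html")

# Any other bot falls under the "*" rules: everything except /private/ is allowed.
other_public = parser.can_fetch("OtherBot", "https://example.com/news/story.html")
other_private = parser.can_fetch("OtherBot", "https://example.com/private/notes.html")

print(blocked_bot, other_public, other_private)  # False True False
```

Each "User-agent" group applies to the crawler that identifies itself by that name, and the "*" group is the fallback for everyone else.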
On its documentation page, Perplexity clearly states how developers can disallow its bot from crawling their website.
Yet Perplexity's crawlers allegedly disobey this protocol and still crawl websites that have told them not to.
By violating the instructions, Perplexity would be able to make its AI generate more detailed summaries that draw on web pages it is not supposed to crawl. Not only that, Perplexity also allegedly uses headless browsers to scrape content, which likewise sidesteps the robots.txt instructions.

In an interview, Perplexity CEO Aravind Srinivas responded to this allegation, saying that his company "does not ignore the Robot Exclusions Protocol and then lie about it."
Srinivas explained that the company relies on third-party web crawlers as well as its own.
As for how this works in practice, Srinivas said only that "it’s complicated."
Srinivas defended his company’s practices, stating that the robots.txt protocol "is not a legal framework," and suggested that a new type of relationship might need to be established between publishers and companies.
He also hinted that the news organizations behind the accusations used specific prompts to make Perplexity’s chatbot behave this way, so casual users would not get the same results.
Regarding the false summaries produced by the tool, Srinivas said, “We never said that we did not see hallucinations.”
The problem here is not why Perplexity does this, but the copyright issue raised by developers and publishers who don't want Perplexity crawling their websites.
Violating the robots.txt protocol is the equivalent of entering someone's house after being told not to.
While the robots.txt file exists for crawlers to obey, there is no 'must' in the equation.
Crawlers' compliance is completely voluntary.
The file only contains instructions for web crawlers, and those instructions can be bypassed very easily.
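A short Python sketch shows why compliance comes down to the crawler's own choice. It assumes a hypothetical crawler with a made-up respect_robots switch; the function name, the bot name and the example.com URLs are illustrative and do not describe how any particular company's crawler works. No network requests are made, and the function only decides which URLs would be requested.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every crawler is asked to stay out of /members/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def plan_crawl(urls, robots_txt, user_agent, respect_robots=True):
    """Return the URLs a crawler would request (illustrative, no real fetching)."""
    if not respect_robots:
        # Nothing in the protocol enforces the rules: skipping the check
        # is a decision made entirely on the crawler's side.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


URLS = [
    "https://example.com/",
    "https://example.com/members/article",
]

polite = plan_crawl(URLS, ROBOTS_TXT, "ExampleBot")
impolite = plan_crawl(URLS, ROBOTS_TXT, "ExampleBot", respect_robots=False)

print(polite)    # only the home page
print(impolite)  # both URLs, including the disallowed one
```

The server never sees whether the check ran. A website that wants real enforcement has to rely on other measures, such as blocking requests on its own side.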
It's worth noting that Perplexity isn't the only AI company accused of ignoring the robots protocol in order to keep collecting content.
Reports suggest that other companies, including OpenAI and Anthropic, developers of the ChatGPT and Claude chatbots respectively, also ignore robots.txt file signals.
Both companies have previously claimed that they respect the “do not crawl” instructions in robots.txt files, but some research suggests that they bypass the rules and continue scraping.