This tweet discusses how scraping services interact with Web Application Firewalls (WAFs) and website scraping rules. It notes that many companies offering scraping services attempt to bypass WAF protections and often ignore the robots.txt file, the standard a website uses to tell crawlers which parts of the site should not be accessed or scraped. By contrast, it mentions a /crawl endpoint that does respect robots.txt and skips sites that disallow crawling. The tweet names no specific WAF, vendor, or bypass payload; it focuses on the general practice of some scrapers bypassing WAFs and ignoring robots.txt, while other endpoints or services adhere to those rules. This reflects a broad concern about web scraping that circumvents WAF protections and site scraping policies, rather than any single, detailed vulnerability.
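To illustrate the robots.txt compliance the tweet attributes to the /crawl endpoint, here is a minimal sketch of how a well-behaved crawler can check robots.txt before fetching a URL, using Python's standard `urllib.robotparser`. The robots.txt content, user-agent name, and URLs below are hypothetical examples, not details from the tweet.

```python
from urllib import robotparser

# Hypothetical robots.txt content that disallows /private/ for all crawlers.
# A real crawler would fetch this from https://<site>/robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit `user_agent` to fetch `url`."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed("MyCrawler", "https://example.com/public/page"))   # True
print(is_allowed("MyCrawler", "https://example.com/private/data"))  # False
```

A compliant service calls a check like this before every request and skips disallowed paths; the scrapers the tweet criticizes simply omit this step.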
Check out the original tweet here: https://twitter.com/harshil1712/status/2031689115343700415