Today I checked the site’s access logs and found a spider with the User Agent Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) crawling the site very frequently, and judging by the 20-plus MB of logs it’s been doing this for quite a while.

Website logs
Judging from the pages it requests, it seems to randomly recombine parameters from previously crawled pages and crawl them again, so the URLs are extremely long and almost all of them return 404. Add a request every few seconds on top of that, and the normal traffic in the logs is basically drowned out by tens of thousands of lines of spam-spider entries.
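To put a number on it, a couple of shell one-liners are enough (assuming Baota’s default log directory; example.com.log stands in for your own access log):
# how many lines in the access log come from SemrushBot
grep -c "SemrushBot" /www/wwwlogs/example.com.log
# and which IPs it crawls from
grep "SemrushBot" /www/wwwlogs/example.com.log | awk '{print $1}' | sort | uniq -c | sort -rn | head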
At first I thought that since it’s a spider it should respect robots.txt, so I tried adding rules to robots.txt:
User-agent: SemrushBot
Disallow: /
After a quick search online, however, I discovered that this thing apparently doesn’t obey robots.txt 😅 (the official page claims the spider strictly follows robots.txt, but user reports say otherwise), and it’s not just SemrushBot: plenty of marketing spiders ignore robots.txt. The only option left was to block them in Nginx. In Baota’s Nginx Free Firewall, under Global Config, click Rules next to User-Agent Filter and add the following regular expression (compiled from around the web; I didn’t expect there to be so many, and I threw in some useless UAs as well, so check it before use):
(nmap|NMAP|HTTrack|sqlmap|Java|zgrab|Go-http-client|CensysInspect|leiki|webmeup|Python|python|curl|Curl|wget|Wget|toutiao|Barkrowler|AhrefsBot|a Palo Alto|ltx71|censys|DotBot|MauiBot|MegaIndex.ru|BLEXBot|ZoominfoBot|ExtLinksBot|hubspot|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|JikeSpider|SemrushBot)
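If you’re not running Baota, the same filter can go straight into the site’s Nginx config by hand; a minimal sketch (assuming it sits inside the server {} block, with the list trimmed to taste, and Nginx is reloaded afterwards):
# drop the connection without a response (444) when the User-Agent matches the blocklist
if ($http_user_agent ~* "(SemrushBot|AhrefsBot|MJ12bot|DotBot|BLEXBot|Bytespider|MegaIndex|ZoominfoBot|sqlmap|zgrab)") {
    return 444;
}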
After a few seconds you can see the blocked requests.

Blocked data
Later, watching the interception counter climb, I realized that whether it returns 404 or 444, every request still consumes server resources, so a request every few seconds isn’t sustainable. Digging deeper, I saw that these spiders all come from overseas IPs, and my site’s foreign DNS already points to Cloudflare, so I could let Cloudflare block them before they ever reach the server.

Spider IPs
In Cloudflare’s dashboard for the domain, go to Security → WAF and click Add Rule. In the Expression Preview, add an expression covering the same UA list as the Nginx rule above (again, check it for any UAs you actually need).
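Cloudflare’s rule language takes per-field conditions rather than one big regex (the regex matches operator is only available on Business plans and up), so on a free plan the expression ends up as a chain of contains checks against http.user_agent; a trimmed-down sketch, to be extended with whichever entries from the list above matter to you:
(http.user_agent contains "SemrushBot") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "DotBot") or (http.user_agent contains "BLEXBot") or (http.user_agent contains "Bytespider")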
Then choose Block as the action and save.

Cloudflare custom rule
From here, if Baota’s Nginx Free Firewall interception counter stops rising while Cloudflare’s firewall rule counter skyrockets, the garbage spiders are being stopped at Cloudflare before they ever reach the server.
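You can also double-check from a terminal by spoofing the UA yourself (example.com standing in for the real domain); a request that Cloudflare blocks should come back with a 403 instead of ever reaching the origin:
curl -I -A "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)" https://example.com/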
Update after waking up
After a nap, over two thousand requests had already been blocked—this thing is relentless 😅.

Cloudflare custom rule