Blocking Bad/Blank Referrers and Bots

I have posted previously about blocking IP ranges and IP addresses using the .htaccess file, so I won't go into too much detail about the file itself here. Head over and have a read if you're interested (Blocking Chinese hackers from your web site).

Right, back to the here and now. I have been sifting through my logs for a few weeks, monitoring IP addresses and referrers, so I can start blocking those resource-stealing bots that seem never-ending. One of the most interesting findings is that the largest ISP/server farm by far for bots and general unwanted traffic seems to be Amazon; in fact, they were so bad I just started blocking whole CIDR ranges for them, as there were simply too many individual IP addresses to deal with!
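If your host's default access log doesn't already record referrers and user agents, the standard Apache combined log format captures both. Note that this goes in the main server config or virtual host rather than the .htaccess file, and the log path below is just an example:

# Combined format: client IP, request, status, bytes, referrer and user agent
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
# Adjust the log path to suit your server
CustomLog /var/log/apache2/access.log combined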

In order to ban these pesky critters I employed the .htaccess file again, this time using two main directives: Limit and SetEnvIfNoCase User-Agent. More on the latter in a second; first, let us get banning all those Amazon bots.
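In outline, the two parts work together like this: SetEnvIfNoCase tags any request whose User-Agent matches a pattern with a bad_bot environment variable, and the Limit block then denies both the listed IP ranges and anything carrying that variable. Here is a minimal sketch with placeholder values ("somebadbot" and the 192.0.2.0/24 documentation range are just examples, not my full config):

# Tag any request whose UA matches the pattern with the bad_bot variable
SetEnvIfNoCase User-Agent "somebadbot" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
# Deny a listed range (192.0.2.0/24 is an example range only)
Deny from 192.0.2.0/24
# Deny anything tagged by the SetEnvIfNoCase line above
Deny from env=bad_bot
</Limit>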

This is the Limit directive from my .htaccess file. As you can see, there are a fair number of Amazon ranges in there, but it is by no means limited to them. I also try to monitor and document the ranges as I go, to make sure I know what's what when I come back to it in a month's time.

<Limit GET POST HEAD>
Order Allow,Deny
deny from 195.154.250.39 # French Server
deny from 203.236.50.22 # Korean hacker
deny from 31.184.238.132 # Russian hacker
deny from 74.208.0.0/16 # 1 and 1 
#deny from 166.78.0.0/16 # Rackspace Hosting
deny from 173.234.0.0/16 # Ubiquity hosting
deny from 207.182.128.0/19 # XLHost
deny from 104.236.0.0/16 # Digital Ocean
deny from 188.226.192.0/18 # Digital Ocean
deny from 91.200.12.0/22 # Ukraine Spam Bot
deny from 213.111.128.0/18 # Ukraine Spam Bot
#deny from 202.46.32.0/19 # China IP Range
deny from 180.76.0.0/16 # Baidu
# Vegas Internet suppliers/ possible Bing/Microsoft connection
#deny from 204.79.180.0/24 # Drake Holdings
# Easynet also do internet access!!
#deny from 90.192.0.0/13 # Easynet 
#deny from 90.214.0.0/15 # Easynet 
# Following few ranges seem to block own server!!
#deny from 51.254.0.0/15 # OVH UK
#deny from 209.68.0.0/18 # pair Networks
#deny from 176.31.0.0/16 # OVH Netherlands
deny from 148.251.0.0/16 # Hetzner Germany
# Seems to be blocking normal looking traffic?
#deny from 62.210.128.0/17 # online.net
deny from 192.210.128.0/17 # ColoCrossing
deny from 23.104.0.0/13 # ubiquity.io
deny from 54.72.0.0/13 # Amazon Web Services
deny from 54.80.0.0/12 # Amazon Web Services
deny from 54.240.0.0/12 # Amazon Web Services
deny from 54.160.0.0/12 # Amazon Web Services
deny from 54.144.0.0/12 # Amazon Web Services
deny from 50.16.0.0/14 # Amazon Web Services
deny from 52.32.0.0/11 # Amazon Web Services
deny from 54.224.0.0/12 # Amazon Web Services
deny from 54.192.0.0/12 # Amazon Web Services
# Duck and Go search engine in this range!!
#deny from 107.20.0.0/14 # Amazon Web Services
deny from 174.129.0.0/16 # Amazon Web Services
deny from 23.20.0.0/14 # Amazon Web Services
allow from all
Deny from env=bad_bot
</Limit> 

So that takes care of most of the bots and nasties trying to suck the life out of your server; now to the User-Agent part. A user agent can very easily be spoofed, so it should not be relied on totally; it's way too easy to change the UA on a script, which would let it bypass these checks. That said, it's still handy to have, as a few bots, namely Majestic (MJ12bot), can come from any number of IPs but tend to keep the same UA, so you can still capture them; the same goes for a couple of others like linkdex and seo-spider.

Here's the SetEnvIfNoCase User-Agent part of the .htaccess file. I tend to keep this above the Limit section, since that is where the bot actually gets banned, so it's not much good coming after it!

# Block bad bots
#SetEnvIfNoCase User-Agent "^sogou" bad_bot
#SetEnvIfNoCase User-Agent "CCBot" bad_bot
#SetEnvIfNoCase User-Agent "^YoudaoBot" bad_bot
#SetEnvIfNoCase user-agent "SearchmetricsBot" bad_bot
#SetEnvIfNoCase user-agent "seokicks" bad_bot
#SetEnvIfNoCase User-Agent "80legs" bad_bot
#SetEnvIfNoCase User-Agent "360Spider" bad_bot
SetEnvIfNoCase User-Agent "seo-service" bad_bot
SetEnvIfNoCase user-agent "ahrefsbot" bad_bot
SetEnvIfNoCase user-agent "MJ12bot" bad_bot
SetEnvIfNoCase user-agent "linkdexbot" bad_bot
SetEnvIfNoCase user-agent "crawler" bad_bot

# Ripping tools
SetEnvIfNoCase user-agent "HTTrack" bad_bot
#SetEnvIfNoCase user-agent "^Screaming\ Frog\ SEO" bad_bot
#SetEnvIfNoCase user-agent "sistrix" bad_bot
#SetEnvIfNoCase user-agent "sitebot" bad_bot
#SetEnvIfNoCase user-agent "^SuperHTTP" bad_bot
SetEnvIfNoCase user-agent "^WebLeacher" bad_bot
SetEnvIfNoCase user-agent "^WebReaper" bad_bot
SetEnvIfNoCase user-agent "^WebSauger" bad_bot
#SetEnvIfNoCase user-agent "^Website\ eXtractor" bad_bot
SetEnvIfNoCase user-agent "^WebWhacker" bad_bot
SetEnvIfNoCase user-agent "^WebZIP" bad_bot
SetEnvIfNoCase user-agent "^Wget" bad_bot

# Vulnerability Scanners
#SetEnvIfNoCase User-Agent "Acunetix" bad_bot
#SetEnvIfNoCase User-Agent "FHscan" bad_bot

# Aggressive Chinese/Russian Search Engines
#SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
#SetEnvIfNoCase User-Agent "Yandex" bad_bot

The last thing I do is create an error document to send them to. I think that's better than just blocking them outright, as I can monitor who I have blocked and tweak where necessary. To add a custom error document, just add the following line at the top of your .htaccess file:

ErrorDocument 403 /error.php

Obviously, you’ll need to create the error.php file as well though!