Ever had comment spam on your blog? Ever had your content stolen? If so - it was more than likely a bot, rather than a person with too much time on their hands.
Black hats of today have more tools in their toolbox than Bob the Builder; those of concern to us are mostly bots and web crawlers. Whilst some site owners are making great progress against them, it is practically impossible to keep every grubby automated spammer out of your website.
However, there are some simple steps we can take to keep the great unwashed out.
What are bad bots?
A bot is a piece of software that crawls the web with a purpose. Googlebot is an example of a good bot, in that it reads our web pages, and indexes them for inclusion in the Google search index.
Whilst there are many good bots, there are also bad bots. A bad bot is one which crawls your site with selfish intent, likely to your detriment. This could be:
- A scraper, looking for content to steal,
- A harvester, looking for an email address to spam, or
- A comment spammer, looking for a blog to post spammy comments.
Everybody who runs a legitimate web site should try and keep bad bots out!
I’m not going to go into any deep analysis here; rather, I’m going to tell you what I decided to do about bad bots on my website. I hope that this will be useful for you.
Project Honeypot
My first thought was to create a honeypot to trap them nasty bots. But then after a bit of searching, I came across Project Honeypot.
“Project Honey Pot is a distributed network of decoy web pages website administrators can include on their sites in order to gather information about robots, crawlers, and spiders.”
Basically, Project Honeypot maintains a list of IP addresses that are being used maliciously. They gather this list through a network of participating webmasters (like you!) who host honeypots on their behalf. A honeypot here is simply a one-page script.
Additionally, Project Honeypot allows you to make use of this data through their HTTP Blacklist service.
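To give a feel for what the HTTP Blacklist service does under the hood, here is a minimal Python sketch of the DNS-based lookup format as I understand it from the http:BL documentation: you query a name made of your access key, the visitor's IPv4 octets in reverse order, and the zone dnsbl.httpbl.org, and the answer encodes days since last activity, a threat score, and a visitor type. The access key below is a made-up placeholder and the function names are my own, so treat this as an illustration rather than the plugin's actual code.

```python
import ipaddress

# Visitor-type bit flags carried in the last octet of an http:BL answer.
SUSPICIOUS = 1
HARVESTER = 2
COMMENT_SPAMMER = 4


def build_httpbl_query(access_key: str, visitor_ip: str) -> str:
    """Build the DNS name to look up: key, reversed IPv4 octets, then the zone."""
    octets = str(ipaddress.IPv4Address(visitor_ip)).split(".")
    return ".".join([access_key, *reversed(octets), "dnsbl.httpbl.org"])


def parse_httpbl_answer(answer: str) -> dict:
    """Decode an answer of the form 127.<days>.<threat>.<type>."""
    first, days, threat, visitor_type = (int(part) for part in answer.split("."))
    if first != 127:
        raise ValueError("not a valid http:BL answer: " + answer)
    return {
        "days_since_last_activity": days,
        "threat_score": threat,
        # Type 0 means a known search engine; the other octets then have
        # a different meaning, which this sketch does not decode.
        "search_engine": visitor_type == 0,
        "suspicious": bool(visitor_type & SUSPICIOUS),
        "harvester": bool(visitor_type & HARVESTER),
        "comment_spammer": bool(visitor_type & COMMENT_SPAMMER),
    }


if __name__ == "__main__":
    # "abcdefghijkl" is a placeholder, not a real access key.
    print(build_httpbl_query("abcdefghijkl", "127.9.1.2"))
    print(parse_httpbl_answer("127.1.1.6"))
```

A real check would resolve the query name with something like socket.gethostbyname and treat a failed lookup as "not listed"; I have left the network call out so the sketch runs anywhere.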
Multiple layers of defence
Every good security consultant knows that security should consist of multiple layers of controls. The following took me about 20 minutes to set up on a WordPress blog.
If you use WordPress, you should already be using Akismet to shield your blog from the simplest, stupidest of comment spammers. If not, I recommend you consider this first!
In the first instance, we’re going to set up a simple honeypot using Project Honeypot.
Then we will use a WordPress plugin that implements Project Honeypot’s HTTP Blacklist. If it encounters a known bot, it blocks the bot and sends it off to the honeypot.
Our third layer combines .htaccess rules with the honeypot. We will use .htaccess as an additional control, filtering out relatively easy-to-spot bots and sending them to our honeypot.
First, set up your own honeypot
You don’t have to use Project Honeypot - there are alternative honeypots you could use instead, or you could write your own - but it’s very easy to set up; besides, you can always change it later if you’re not happy with it.
Go over to Project Honeypot and get your own honeypot (I used PHP), then save it in the web root folder of your website.
Once you’ve uploaded the honeypot to your web server, go and edit your /robots.txt file and add an entry at the top (yours won’t be called honeypot.php):
User-agent: *
Disallow: /honeypot.php
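The point of this entry is that well-behaved crawlers read robots.txt and stay away from the trap, so only bots that ignore the rules end up in it. If you want to convince yourself of that, the following Python sketch feeds the two lines above to the standard library's robots.txt parser and asks whether a polite crawler may fetch a given path. The helper name and the sample paths are my own, and honeypot.php is still just a placeholder for your real file name.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines from above; honeypot.php is a placeholder name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /honeypot.php
"""


def polite_bot_may_fetch(path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a robots.txt-respecting crawler is allowed to request path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/honeypot.php", "/index.php"):
        print(path, polite_bot_may_fetch(path))
```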
Next, find a 1-pixel GIF (or get one from here) and save it in your web root folder. This is going to be used instead of anchor text to link to your honeypot.
Then you can add the link to the honeypot from your website; I use WordPress, and added it to the theme, right after the opening body tag:
<a href="/honeypot.php" rel="nofollow"><img src="/honeypot.gif" alt="" border="0"></a>
Test your website to make sure that you haven’t screwed up the layout.
Now block the blacklisted IPs
Thanks to this WordPress plugin, it is surprisingly easy - just install it like any other plugin and configure it through the WordPress admin console - make sure you specify your honeypot on the configuration page. Doing this will ensure that any blacklisted crawlers will be denied your content, and sent to the sin bin!
If you are not using WordPress, there are several alternative HTTP Blacklist scripts.
That takes care of the majority of harvesters and comment spammers. Remember the comment spammers are often the same guys who scrape content for their spam sites - so you’ve already blocked out a good percentage of them.
However, there remain plenty of scrapers who are not explicitly targeted by Project Honeypot at this time. Never fear….
Htaccess Wizardry is here
Htaccess is your friend - at least if your web site runs on Apache, like much of the Internet.
Htaccess (actually .htaccess) is a simple text file you create and store in a folder on your website. The file isn’t browsable, but rather provides additional configuration information to Apache that applies strictly to that folder and its subfolders. Capisce?
It is an incredibly versatile file, and can do much of what can be done in the apache2.conf file, as long as the server’s AllowOverride setting permits it (not performance or TCP/IP configuration, but a great deal else).
Here, we will use .htaccess in the root folder of our website to monitor HTTP requests. We will look for signatures of scraper bots and desktop scraper software, and then confine them to the honeypot!
If you are familiar with .htaccess, check out these guys - they have it sussed!
Caution: RTFM!
NB1: I strongly urge you not to implement anything that follows without first referring to the Apache documentation; it is your own responsibility to first understand what everything means!!!
NB2: When editing your .htaccess file, you will probably find some code already in there - for example, WordPress adds code to do permalinks. You should be careful to keep your changes separate from any existing entries.
First, let’s look at the User-Agent header. Some scrapers come with User-Agent headers that are a dead giveaway. The following snippet is lifted directly from 0x000000.com and can be used to redirect those user agents to your honeypot!
Please cut and paste the following into your .htaccess file (Note you need to modify the last line).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [OR]
RewriteCond %{HTTP_USER_AGENT} ^libwwwperl [OR]
RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
RewriteCond %{HTTP_USER_AGENT} ^ASPSeek [OR]
RewriteCond %{HTTP_USER_AGENT} ^appie [OR]
RewriteCond %{HTTP_USER_AGENT} ^AbachoBOT [OR]
RewriteCond %{HTTP_USER_AGENT} ^autoemailspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^anarchie [OR]
RewriteCond %{HTTP_USER_AGENT} ^antibot [OR]
RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
RewriteCond %{HTTP_USER_AGENT} ^B2w [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackWeb [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bandit [OR]
RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Black\ Hole [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baidu [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlowFish [OR]
RewriteCond %{HTTP_USER_AGENT} ^BuiltBotTough [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto [OR]
RewriteCond %{HTTP_USER_AGENT} ^BotALot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bullseye [OR]
RewriteCond %{HTTP_USER_AGENT} ^bumblebee [OR]
RewriteCond %{HTTP_USER_AGENT} ^BunnySlippers [OR]
RewriteCond %{HTTP_USER_AGENT} ^ClariaBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^curl [OR]
RewriteCond %{HTTP_USER_AGENT} ^clsHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^CheeseBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPickerSE [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPickerElite [OR]
RewriteCond %{HTTP_USER_AGENT} ^Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^COAST\ WebMaster [OR]
RewriteCond %{HTTP_USER_AGENT} ^cosmos [OR]
RewriteCond %{HTTP_USER_AGENT} ^CopyRightCheck [OR]
RewriteCond %{HTTP_USER_AGENT} ^ColdFusion [OR]
RewriteCond %{HTTP_USER_AGENT} ^Copier [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^DA [OR]
RewriteCond %{HTTP_USER_AGENT} ^DTS\ Agent [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [OR]
RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [OR]
RewriteCond %{HTTP_USER_AGENT} ^Diamond [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [OR]
RewriteCond %{HTTP_USER_AGENT} ^Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^dloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Drip [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^Extreme\ Picture\ Finder [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^EasyDL [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EroCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST\ WebCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^FileHound [OR]
RewriteCond %{HTTP_USER_AGENT} ^Fetch\ API\ Request [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlickBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^FrontPage [OR]
RewriteCond %{HTTP_USER_AGENT} ^FreeFind.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetSmart [OR]
RewriteCond %{HTTP_USER_AGENT} ^Generic [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^gotit [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gulliver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Harvest [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^Heretrix [OR]
RewriteCond %{HTTP_USER_AGENT} ^HitboxDoctor [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTPapp [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTPTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTPviewer [OR]
RewriteCond %{HTTP_USER_AGENT} ^httplib [OR]
RewriteCond %{HTTP_USER_AGENT} ^httpfetcher [OR]
RewriteCond %{HTTP_USER_AGENT} ^httpscraper [OR]
RewriteCond %{HTTP_USER_AGENT} ^hloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^humanlinks [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^InfoNaviRobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^Iria [OR]
RewriteCond %{HTTP_USER_AGENT} ^IRLbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^JoBo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [OR]
RewriteCond %{HTTP_USER_AGENT} ^JustView [OR]
RewriteCond %{HTTP_USER_AGENT} ^Jonzilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^JennyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Kenjin\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Keyword\ Density [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Lachesis [OR]
RewriteCond %{HTTP_USER_AGENT} ^LexiBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^libWeb [OR]
RewriteCond %{HTTP_USER_AGENT} ^Libby_ [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkScan [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkextractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^lftp [OR]
RewriteCond %{HTTP_USER_AGENT} ^likse [OR]
RewriteCond %{HTTP_USER_AGENT} ^Link [OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp-trivial [OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp\ request [OR]
RewriteCond %{HTTP_USER_AGENT} ^Magnet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIIxpc [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [OR]
RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [OR]
RewriteCond %{HTTP_USER_AGENT} ^MSIECrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^MicrosoftURL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Missigua [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mewsoft\ Search\ Engine [OR]
RewriteCond %{HTTP_USER_AGENT} ^moget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mata\ Hari [OR]
RewriteCond %{HTTP_USER_AGENT} ^Memo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Metacarta [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mercator [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^MFC_Tear_Sample [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mirror [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIIxpc [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^NationalDirectory\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Nikto [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetResearchServer [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetMechanic [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Probe [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZip [OR]
RewriteCond %{HTTP_USER_AGENT} ^nexuscache [OR]
RewriteCond %{HTTP_USER_AGENT} ^Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^our\ agent [OR]
RewriteCond %{HTTP_USER_AGENT} ^onestop [OR]
RewriteCond %{HTTP_USER_AGENT} ^oBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Openfind [OR]
RewriteCond %{HTTP_USER_AGENT} ^Openfind\ data\ gatherer [OR]
RewriteCond %{HTTP_USER_AGENT} ^OrangeBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^PHP\ version [OR]
RewriteCond %{HTTP_USER_AGENT} ^PHP [OR]
RewriteCond %{HTTP_USER_AGENT} ^PHPot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Perl [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^Pockey [OR]
RewriteCond %{HTTP_USER_AGENT} ^Ping [OR]
RewriteCond %{HTTP_USER_AGENT} ^PingALink\ Monitoring\ Services [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProWebWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProPowerBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Pump [OR]
RewriteCond %{HTTP_USER_AGENT} ^Pompos [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Python\ urllib [OR]
RewriteCond %{HTTP_USER_AGENT} ^Python-urllib [OR]
RewriteCond %{HTTP_USER_AGENT} ^QueryN [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^Reaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Recorder [OR]
RewriteCond %{HTTP_USER_AGENT} ^RepoMonkey [OR]
RewriteCond %{HTTP_USER_AGENT} ^psycheclone [OR]
RewriteCond %{HTTP_USER_AGENT} ^RMA [OR]
RewriteCond %{HTTP_USER_AGENT} ^Rico [OR]
RewriteCond %{HTTP_USER_AGENT} ^Robozilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^Snake [OR]
RewriteCond %{HTTP_USER_AGENT} ^spanner [OR]
RewriteCond %{HTTP_USER_AGENT} ^Stealer [OR]
RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [OR]
RewriteCond %{HTTP_USER_AGENT} ^SpankBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Spinne [OR]
RewriteCond %{HTTP_USER_AGENT} ^Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^slysearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Snoopy [OR]
RewriteCond %{HTTP_USER_AGENT} ^ScoutAbout [OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Snapbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^suzuran [OR]
RewriteCond %{HTTP_USER_AGENT} ^Szukacz [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sqworm [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^turingos [OR]
RewriteCond %{HTTP_USER_AGENT} ^toCrawl [OR]
RewriteCond %{HTTP_USER_AGENT} ^TightTwatBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^True_Robot [OR]
RewriteCond %{HTTP_USER_AGENT} ^The\ Intraformant [OR]
RewriteCond %{HTTP_USER_AGENT} ^TheNomad [OR]
RewriteCond %{HTTP_USER_AGENT} ^Titan [OR]
RewriteCond %{HTTP_USER_AGENT} ^UrlDispatcher [OR]
RewriteCond %{HTTP_USER_AGENT} ^URLy\ Warning [OR]
RewriteCond %{HTTP_USER_AGENT} ^Vayala [OR]
RewriteCond %{HTTP_USER_AGENT} ^Vagabondo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Vintage [OR]
RewriteCond %{HTTP_USER_AGENT} ^Vacuum [OR]
RewriteCond %{HTTP_USER_AGENT} ^VCI [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^W3C_Validator [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webdownloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webhook [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webmole [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webminer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webmirror [OR]
RewriteCond %{HTTP_USER_AGENT} ^Websucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Websites [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebViewer [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEnhancer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wells [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Whacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wildsoft\ Surfer [OR]
RewriteCond %{HTTP_USER_AGENT} ^WinHttpRequest [OR]
RewriteCond %{HTTP_USER_AGENT} ^WinHttp [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webster\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZip [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW-Collector-E [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xara [OR]
RewriteCond %{HTTP_USER_AGENT} ^Y!TunnelPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^YahooYSMcm [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zade [OR]
RewriteCond %{HTTP_USER_AGENT} ^ZBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^zerxbot
RewriteRule ^(.*)$ /yourhoneypot.php [L]
As an alternative to the last line, if you’d rather not send the bots to a trap, you can give them a 403 Forbidden instead:
RewriteRule ^.* - [F]
Next, save your changes, and TEST.
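Before testing on the live site, it can help to see how these conditions behave. Each RewriteCond pattern is a regular expression anchored with ^, so it only matches a User-Agent that starts with the given name, the match is case-sensitive because the rules carry no [NC] flag, and the [OR] chain means any single match is enough. The Python sketch below mimics that logic for a handful of the prefixes from the list; the function name and the sample User-Agent strings are my own, and this only approximates what Apache does.

```python
import re

# A handful of the prefixes from the rule set above (illustrative subset).
BLOCKED_PREFIXES = [
    r"^libwww-perl",
    r"^curl",
    r"^HTTrack",
    r"^Python-urllib",
    r"^Wget",
]

# [OR]-chained RewriteCond lines behave like one big alternation.
BLOCKED_UA = re.compile("|".join(BLOCKED_PREFIXES))


def would_be_trapped(user_agent: str) -> bool:
    """True if the User-Agent starts with a listed prefix (case-sensitive)."""
    return BLOCKED_UA.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Wget/1.21.3",
        "wget/1.21.3",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ]
    for ua in samples:
        print(ua, "->", would_be_trapped(ua))
```

Note what this implies: a scraper that sends a browser-like User-Agent, or simply changes the capitalisation, sails straight past these rules, which is why they are only one layer among several.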
Summary
Congratulations! If you’ve got this far:
You’ve set up a honeypot on your website, you’re trapping known web slurpers and content scrapers with .htaccess - before they even see your content - and you’re sending the bastards off to your honeypot. Nice.
You’re also taking advantage of the IP Blacklist produced by the Project Honeypot network to keep comment spammers and email harvesters away, hopefully trapping a few that Akismet missed.
In part II of this series we will look deeper into techniques to detect and block craftier bad bots. We’ll also take a look at mod_security, an incredibly useful tool to protect your site from trojan code injections and other haxxor nastiness.
Simple description and useful information. Thanks.
Some really useful info here. I have implemented these tips to stop scrapers on my website and am now waiting to see what gets caught in my spider trap. I am a little confused about how to do the IP blacklist for Apache. I tried to follow the instructions on honeypot.org, but got a little lost; I found the module for Apache, but am not sure how to go about it. Hopefully the honeypot and htaccess rules will help for now.