Download
OK, so website analytics are getting fouled up by the hundreds (thousands?) of robots trawling the web and recording everything they see. This is a real problem for web developers and enthusiasts who set up analytics services like Woopra to track their users. Obviously, a robot does not want to read our blog or use our services, or even stay on the site any longer than it takes to load and record a page.
So I, and a whole lot of other people, want these robots filtered out, and here is my proposal:
Robot Sticky Tape: a PHP script added to robots.txt (a file normally requested only by robots) that records the requesting IP and its reverse-DNS host in a MySQL database, which is then shared with others as an RSS feed. There is plenty of room for development, but this can be phase 1.
Here are the quickly written instructions:
ROBOT STICKY TAPE RSS
This PHP script is built to display database entries that are collected by a PHP-enabled robots.txt.
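To give a rough idea of the display side, here is a minimal sketch of turning recorded rows into an RSS 2.0 feed. It is not the code from the download: the function name `rst_build_feed`, the feed title and link, and the sample row are all made up for illustration, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the download): builds an RSS 2.0 document
// from rows shaped like the table columns ip_addr, ip_host and dt.
function rst_build_feed(array $rows, string $title, string $link): string
{
    $esc = function (string $s): string {
        // Escape everything: the host name comes from reverse DNS,
        // which is controlled by whoever runs the robot.
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    $items = '';
    foreach ($rows as $row) {
        // RSS wants RFC 822 dates. Treating dt as UTC is an assumption
        // made here so the output does not depend on the server time zone.
        $date = gmdate('D, d M Y H:i:s O', strtotime($row['dt'] . ' UTC'));
        $items .= '<item><title>' . $esc($row['ip_addr']) . '</title>'
                . '<description>' . $esc($row['ip_host']) . '</description>'
                . '<pubDate>' . $date . "</pubDate></item>\n";
    }

    return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
         . "<rss version=\"2.0\"><channel>\n"
         . '<title>' . $esc($title) . "</title>\n"
         . '<link>' . $esc($link) . "</link>\n"
         . "<description>Robots recorded by robots.txt</description>\n"
         . $items
         . "</channel></rss>\n";
}

// Sample row with placeholder values (documentation IP range).
$rows = [
    ['ip_addr' => '203.0.113.7', 'ip_host' => 'crawler.example.com', 'dt' => '2010-03-15 14:05:09'],
];
echo rst_build_feed($rows, 'Robot Sticky Tape', 'http://example.com/');
```

In a real feed the rows would come from a SELECT on the robots table instead of a hard-coded array.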
Credits:
Requirements:
PHP 4+
MySQL
Database Structure:
CREATE TABLE `system_robots` (
`id` int(11) NOT NULL auto_increment,
`ip_addr` varchar(20) collate utf8_unicode_ci NOT NULL,
`ip_host` varchar(255) collate utf8_unicode_ci NOT NULL,
`dt` timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `ip_addr` (`ip_addr`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=48 ;
ROBOTS.TXT INSTALL (Windows 2003 Internet Information Services)
Go to the properties of the desired website
Select ‘Home Directory’ Tab
Click on “Configuration” Button
Click on “Add”
Browse to find your PHP executable (php-cgi.exe)
Input into “extension” field “.txt”
After that, add the following PHP to robots.txt (make sure to use the same database settings):
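<?php
//////// PHP
// DATABASE SETTINGS
$DB_USER = 'root';
$DB_PASS = '';
$DB_HOST = 'localhost';
$DB_DATABASE = 'RST_DB';
$DB_TABLE = 'robots'; // must match the table you created (the structure above uses `system_robots`)
/////////////////
// ____________________ EDIT AT YOUR OWN RISK __________ //
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr($ip); // returns the IP unchanged when there is no reverse DNS
if (!$link = mysql_connect($DB_HOST, $DB_USER, $DB_PASS)) {
    die('Could not connect to mysql');
} else if (!mysql_select_db($DB_DATABASE, $link)) {
    die('Could not connect to database');
}
// Escape both values before building the query: the reverse DNS name is
// controlled by whoever runs the robot (mysql_real_escape_string needs PHP 4.3+)
$ip = mysql_real_escape_string($ip, $link);
$host = mysql_real_escape_string($host, $link);
// INSERT IGNORE skips IPs that are already recorded (unique key on ip_addr)
$sql = "INSERT IGNORE INTO `{$DB_TABLE}` ( `id` , `ip_addr`,`ip_host` )
VALUES (
NULL , '{$ip}' , '{$host}'
);";
$result = mysql_query($sql, $link);
/////// END PHP
?>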
_____________________________________ UPDATE 3/12/10
Added this code to robots.txt to help filter some spoofing and label IPs:
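// Public (WAN) IP of this server; trim() drops any stray whitespace in the response
$wanIP = trim(file_get_contents('http://www.whatismyip.com/automation/n09230945.asp'));
//Spoofing and local servers
if ($host == gethostbyaddr($_SERVER['LOCAL_ADDR'])) {
    // The visitor's reverse DNS matches this server's own host name
    if ($ip != $wanIP) {
        $host = 'Spoofed Reverse DNS';
    } else {
        $host = 'Localhost';
    }
} else if ($host == $ip) {
    // gethostbyaddr() hands back the IP itself when no reverse DNS record exists
    $host = 'No Reverse DNS Specified';
}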
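The labelling rules are easier to check when pulled out into one function with no network calls. This is only a sketch: the name `rst_label_host` and its parameters are mine, where `$localHost` stands in for gethostbyaddr($_SERVER['LOCAL_ADDR']) and `$wanIp` for the server's public IP. It assumes PHP 7 or later, and the addresses are placeholders.

```php
<?php
// Hypothetical refactoring of the labelling rules; every input is passed in
// so the function can be tried without a web server or DNS lookups.
function rst_label_host(string $ip, string $host, string $localHost, string $wanIp): string
{
    if ($host === $localHost) {
        // Reverse DNS claims to be this server: genuine only if the
        // request really came from the server's own public IP.
        return ($ip !== $wanIp) ? 'Spoofed Reverse DNS' : 'Localhost';
    }
    if ($host === $ip) {
        // No reverse DNS record, so the lookup returned the IP itself.
        return 'No Reverse DNS Specified';
    }
    return $host; // ordinary robot: keep its reverse DNS name
}

$local = 'web01.example.com'; // placeholder for this server's host name
$wan   = '203.0.113.5';       // placeholder for this server's public IP

echo rst_label_host('198.51.100.9', 'web01.example.com', $local, $wan), "\n";   // Spoofed Reverse DNS
echo rst_label_host('203.0.113.5', 'web01.example.com', $local, $wan), "\n";    // Localhost
echo rst_label_host('198.51.100.9', '198.51.100.9', $local, $wan), "\n";        // No Reverse DNS Specified
echo rst_label_host('198.51.100.9', 'crawler.example.net', $local, $wan), "\n"; // crawler.example.net
```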
_____________________________________ UPDATE 3/15/10
- Noticed that every entry was returned as AM, so I went in and sure enough my date() format was in 12-hour time, not 24-hour, so I changed the code in robots.php (lines 134 & 146) to this:
$last_update = date("D, d M Y H:i:s O", strtotime($line['dt'])); // H = 24-hour clock, i = minutes (m would print the month)
- Updated the table definition in the "Database Structure" text.
- Updated Download
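For anyone tripping over the same date() letters as in the 3/15/10 update, here is a small sketch comparing them on one fixed timestamp. The sample time and the UTC time zone are assumptions chosen so the result is reproducible; in PHP's date(), `h` is the 12-hour clock, `H` the 24-hour clock, `i` the minutes and `m` the month number.

```php
<?php
// Assumption: force UTC so the sample output is the same on every server.
date_default_timezone_set('UTC');

$ts = strtotime('2010-03-15 14:05:09'); // made-up sample timestamp

$formats = [
    '12-hour (h:i:s)'       => date('D, d M Y h:i:s O', $ts),
    '24-hour (H:i:s)'       => date('D, d M Y H:i:s O', $ts),
    'month slip-up (H:m:s)' => date('D, d M Y H:m:s O', $ts),
];

foreach ($formats as $label => $value) {
    echo str_pad($label, 24), $value, "\n";
}
// 12-hour (h:i:s)         Mon, 15 Mar 2010 02:05:09 +0000
// 24-hour (H:i:s)         Mon, 15 Mar 2010 14:05:09 +0000
// month slip-up (H:m:s)   Mon, 15 Mar 2010 14:03:09 +0000
```

The second form is the one RSS readers expect for pubDate.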