
Googlebot and Apache CLOSE_WAIT's: SOLVED!
Monday - Feb 7th 2011

I previously wrote, in the article Googlebot freezes Apache and server load increase, about a weird interaction between the Googlebot (Google's spider/crawler) and the Apache web server. Short summary: Googlebot opened a connection, Apache answered, and the connection was never closed correctly. Even though new Apache child processes were spawned in the meantime, the old process stayed alive because of this open connection, which in the end caused a huge increase in server load.

Well, after further analysis of this problem this weekend, I figured out that it was always the same website (out of hundreds) that failed to properly close its connection to Googlebot.

Once more, top shows which processes use the most CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1242 www-data  20   0  706m 166m  51m S  126  4.2 175:39.77 apache2
32486 www-data  20   0  551m 130m  43m S   49  3.3 102:12.12 apache2

Next, lsof shows the connections held by the Apache processes, including these two:

COMMAND   PID     USER   FD   TYPE     DEVICE SIZE NODE NAME
apache2  1145     root    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2  1242 www-data   26u  IPv6 1216870094       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:55825 (CLOSE_WAIT)
apache2  1242 www-data   28u  IPv6 1216884422       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:42851 (CLOSE_WAIT)
apache2 16944 www-data    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2 16944 www-data   26u  IPv6 1217132916       TCP area-1.ch:www->crawl-66-249-71-197.googlebot.com:34349 (ESTABLISHED)
apache2 17207 www-data    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2 17207 www-data   26u  IPv6 1217132929       TCP area-1.ch:www->net66-219-58-45.static-customer.corenap.com:57036 (ESTABLISHED)
apache2 32486 www-data   34u  IPv6 1216853652       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:47004 (CLOSE_WAIT)
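
Picking out the stuck connections by eye gets tedious on a busy server. Here is a minimal Python sketch of how the lsof output could be filtered for CLOSE_WAIT connections per PID. The function name and the regular expression are my own assumptions, the sample lines are copied from the output above, and the pattern only handles host names or IPv4 addresses on the remote side:

```python
import re
from collections import defaultdict

# Sample lines copied from the lsof output above.
LSOF_OUTPUT = """\
apache2  1145     root    3u  IPv6 1215939860       TCP *:www (LISTEN)
apache2  1242 www-data   26u  IPv6 1216870094       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:55825 (CLOSE_WAIT)
apache2  1242 www-data   28u  IPv6 1216884422       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:42851 (CLOSE_WAIT)
apache2 16944 www-data   26u  IPv6 1217132916       TCP area-1.ch:www->crawl-66-249-71-197.googlebot.com:34349 (ESTABLISHED)
apache2 32486 www-data   34u  IPv6 1216853652       TCP area-1.ch:www->crawl-66-249-66-136.googlebot.com:47004 (CLOSE_WAIT)
"""

# Matches the "->remote:port (STATE)" tail of a connection line.
CONN_RE = re.compile(r"->(?P<remote>[^\s:]+):\d+ \((?P<state>[A-Z_]+)\)$")


def close_wait_by_pid(lsof_text):
    """Map PID -> list of remote hosts whose connection is in CLOSE_WAIT."""
    result = defaultdict(list)
    for line in lsof_text.splitlines():
        fields = line.split()
        match = CONN_RE.search(line.strip())
        # Skip the header, LISTEN sockets and anything else we cannot parse.
        if len(fields) < 2 or not fields[1].isdigit() or not match:
            continue
        if match.group("state") == "CLOSE_WAIT":
            result[int(fields[1])].append(match.group("remote"))
    return dict(result)


close_wait = close_wait_by_pid(LSOF_OUTPUT)
for pid, remotes in sorted(close_wait.items()):
    print(pid, len(remotes), sorted(set(remotes)))
```

With the sample data this reports the two PIDs already seen in top, 1242 and 32486, both stuck on the same Googlebot host.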

And now we take a look at the extended Apache status (ExtendedStatus on) for further Apache child information:

Srv    PID    Acc    M    CPU    SS    Req    Conn    Child    Slot    Client    VHost    Request
0-1    32486    0/34/438    W     20.03    7057    0    0.0    2.86    20.77     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1    1242    0/32/349    W     291.95    6310    0    0.0    5.52    21.18     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1    -    0/0/361    .     520.53    6133    40    0.0    0.00    24.98     84.22.49.237    AGOODWEBSITE    POST /?_task=mail&_action=autocomplete HTTP/1.1
2-1    -    0/0/362    .     529.47    6133    0    0.0    0.00    22.47     88.65.227.133    ANOTHERGOODONE    GET /images/klein-shop-vfb-ksc-poster-200x133.jpg HTTP/1.1
2-1    -    0/0/367    .     523.36    6133    274    0.0    0.00    16.91     84.22.49.237    AGOODWEBSITE    POST / HTTP/1.1
2-1    -    0/0/363    .     511.97    6133    30    0.0    0.00    20.74     91.8.246.156    GOODWEBSITE2    GET /images/product_images/info_images/670_0.jpg HTTP/1.1
2-1    -    0/0/361    .     536.08    6132    458    0.0    0.00    30.53     207.46.199.23    BGOODWEBSITE    GET /wp-content/uploads/shadowbox-js/d46661d2c927dea304addb1b47
2-1    -    0/0/359    .     515.40    6133    40    0.0    0.00    56.72     178.82.216.64    CGOODWEBSITE    GET /_images/nav/beratung_b.gif HTTP/1.1
2-1    1242    0/18/344    W     4.81    6682    0    0.0    2.05    16.81     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1

In the extended status we find the same process IDs seen in top and lsof. And what a surprise: it is always BADWEBSITE that is still holding the connection to the Googlebot (66.249.66.136). Note the W, which stands for "Sending Reply".
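
The same check can be scripted. The sketch below assumes the rows follow the usual mod_status column order (Srv, PID, Acc, M, CPU, SS, Req, Conn, Child, Slot, Client, VHost, Request), where SS is the number of seconds since the most recent request began. The function name and the one-hour threshold are arbitrary choices of mine, and the sample rows are copied from the status page above:

```python
from collections import Counter

# Sample rows copied from the extended status output above.
STATUS_ROWS = """\
0-1    32486    0/34/438    W     20.03    7057    0    0.0    2.86    20.77     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1    1242    0/32/349    W     291.95    6310    0    0.0    5.52    21.18     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1
2-1    -    0/0/361    .     520.53    6133    40    0.0    0.00    24.98     84.22.49.237    AGOODWEBSITE    POST /?_task=mail&_action=autocomplete HTTP/1.1
2-1    1242    0/18/344    W     4.81    6682    0    0.0    2.05    16.81     66.249.66.136    BADWEBSITE    GET /category.php?id_category=2&id_lang=1 HTTP/1.1
"""


def long_running_workers(status_text, min_seconds=3600):
    """Return (pid, seconds, client, vhost) for W slots older than min_seconds."""
    found = []
    for line in status_text.splitlines():
        # 12 splits -> 13 fields; the request itself may contain spaces.
        fields = line.split(None, 12)
        if len(fields) < 13:
            continue
        pid, mode, seconds = fields[1], fields[3], fields[5]
        client, vhost = fields[10], fields[11]
        if mode == "W" and seconds.isdigit() and int(seconds) >= min_seconds:
            found.append((pid, int(seconds), client, vhost))
    return found


long_running = long_running_workers(STATUS_ROWS)
per_vhost = Counter(vhost for _pid, _seconds, _client, vhost in long_running)
print(per_vhost.most_common())
```

For the sample rows, every long-running W slot belongs to BADWEBSITE and to the client 66.249.66.136, which matches what the eye already picked out of the status page.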

I don't know what the programmer of this website did in the file category.php, but I definitely don't want Googlebot to crawl these pages anymore. So the solution is to use a robots.txt to "talk" to the Googlebot:

# Googlebot ain't allowed to check the pages here
User-Agent: Googlebot
Disallow: /

# All other bots go ahead
User-agent: *
Disallow:
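
Before waiting for the crawler to come back, the rules can be sanity-checked locally. This is a small sketch using Python's standard urllib.robotparser. The host name badwebsite.example and the bot name SomeOtherBot are placeholders I made up, and Python's parser is not the one Google uses, so treat the result as a plausibility check only:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from above, unchanged.
ROBOTS_TXT = """\
# Googlebot ain't allowed to check the pages here
User-Agent: Googlebot
Disallow: /

# All other bots go ahead
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder host; the path is the one seen in the Apache status page.
url = "http://badwebsite.example/category.php?id_category=2&id_lang=1"

googlebot_allowed = parser.can_fetch("Googlebot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
print("Googlebot allowed:", googlebot_allowed)
print("Other bots allowed:", other_allowed)
```

Googlebot should come out as blocked while any other user agent is still allowed, because an empty Disallow line means "nothing is disallowed".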

Since then the server load has stayed steadily below 1 and there are no more problems!

 
