Use dtrace to analyze Apache process consuming 100% CPU

Written by - 3 comments

Published on - Listed in Linux Internet Apache

A couple of days ago I was able to follow a course on dtrace, which is a powerful and mighty tool to debug and troubleshoot (almost) any kind of issues on a Unix host.
The original dtrace program was developed from Sun for Solaris and it's now slowly getting ported on Linux (check out dtrace4linux). Once compiled, dtrace can then be loaded as a kernel module and the executable can be used to trace processes, i/o operations, etc.

Today I had a case where an Apache child process used 100% of a CPU for a long time (and kept eating CPU).

# ps auxf | grep www
root     19428  0.0  0.0 112312   852 pts/0    S+   19:17   0:00                      \_ grep www
www-data 18980  0.0  2.2 1516220 362164 ?      Sl   Oct01   0:20  \_ /usr/sbin/apache2 -k start
www-data 31152 99.1  1.9 1192780 323240 ?      R    16:42 154:13  \_ /usr/sbin/apache2 -k start
www-data  5476  0.0  0.1 239548 28844 ?        S    17:30   0:00  \_ /usr/sbin/apache2 -k start
www-data 19336 15.2  1.6 1400428 266288 ?      Sl   19:17   0:05  \_ /usr/sbin/apache2 -k start

What I usually do in such cases is a quick look at the server-status table, which, in 90% of the times, shows the website to blame. But in this case, the whole process was full with W entries (Sending Reply). So let's use dtrace to find the bad guy!

First I wanted to see what kind of function is used the most, so I counted all functions, which happen (live) in this process and output the summary as number:

# dtrace -n 'syscall:::entry { @[probefunc] = count() }' -p 31152
proc-stub:rd_errstr err=26
dtrace: description 'syscall:::entry ' matched 633 probes
  read                                                           1095
  munmap                                                         2072
  mmap                                                           2073
  close                                                          2421
  open                                                           3047
  stat                                                           3225
  fstat                                                          4107
  lstat                                                        211141

So the lstat function is the one used the most often - 50x more than fstat.
Let's see who (which file) is using lstat so often.

# dtrace -n 'syscall::lstat:entry { @[copyinstr(arg0)] = count() } END {trunc(@, 5);}' -p 31152
proc-stub:rd_errstr err=26
dtrace: description 'syscall::lstat:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  6      2                             :END

  /var/www/web7/html/                           22143
  /var/www/web7                                            30231
  /var/www/web7/html                                       30231
  /var                                                     37380
  /var/www                                                 37380

What I do in the above dtrace command is simply to check for all lstat functions which enter the system from PID 31152. Instead of counting the functions, this time I'm counting the filenames. Because the filename is used in user-space (Apache is a user process) and we're "debugging in kernel space" (simple description) the filename (a string) needs to be copied (copyinstr). arg0 is the filename itself. Instead of having a list of thousands of files, I wanna truncate the result and only show the top 5.

As this is a shared hosting and all websites are in /var/www, it is no surprise that this path is the most often called. But right behind is which resided in /var/www/web7.

Let's check in the access log what's happening:

# tail -f /var/www/web7/log/access_log - - [04/Oct/2012:20:01:46 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "" "Mozilla/5.0 (compatible; bingbot/2.0; +" - - [04/Oct/2012:20:01:47 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +" - - [04/Oct/2012:20:01:47 +0200] "POST /administrator/index.php HTTP/1.1" 303 - "" "Mozilla/5.0 (compatible; bingbot/2.0; +" - - [04/Oct/2012:20:01:49 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "" "Mozilla/5.0 (compatible; bingbot/2.0; +" - - [04/Oct/2012:20:01:49 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +" - - [04/Oct/2012:20:01:50 +0200] "POST /administrator/index.php HTTP/1.1" 303 - "" "Mozilla/5.0 (compatible; bingbot/2.0; +" - - [04/Oct/2012:20:01:51 +0200] "GET /administrator/index.php HTTP/1.1" 200 4167 "" "Mozilla/5.0 (compatible; bingbot/2.0; +" - - [04/Oct/2012:20:01:52 +0200] "GET /administrator/ HTTP/1.1" 200 4028 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +"

Oh indeed, the IP makes several requests per second (GET and POST) on a Joomla Administrator login. Probably a brute-force attack. The HTTP Header shows bingbot though but that's probably fake, as I already saw this (see post GoogleBot used as HTTP Agend by Hackers/Botscripts). A quick nslookup will show if the IP is really from bing:

# nslookup

ERRR! Nope, that's definitely not bing. Looks like a hacked shared hosting server (due to the name of cpanel).

Let's see how many TCP connections there are from this IP:

# netstat -an | grep "" | head
tcp6       0      0    TIME_WAIT
tcp6       0      0    TIME_WAIT
tcp6       0      0    TIME_WAIT
tcp6       0      0    TIME_WAIT
tcp6       0      0    TIME_WAIT

# netstat -an | grep "" -c

The TIME_WAIT entries explain why this Apache child process hasn't been killed as has been run for such a long time (and is still running). As long as there are TIME_WAIT's (or others as ESTABLISHED) in the child process, Apache (master) won't kill it.

I don't want this website be allowed to be hacked and I hate CPU intensive processes which run forever, so lets stop the attack through a simple .htaccess file and a deny:

# cat /var/www/web7/html/.htaccess
order allow,deny
deny from
allow from all

Shortly after that the TCP connections went down:

# netstat -an | grep "" -c


Add a comment

Show form to leave a comment

Comments (newest first)

Claudio from Switzerland wrote on Oct 6th, 2012:

That's a very good idea, will apply that. Thanks.

Warper from wrote on Oct 6th, 2012:

To avoid this brute force attack, that is very popular nowadays, you should add this rule to your /administrator/.htaccess instead to ban every individual ip:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
RewriteRule .* - [R=403,L]

The bot attackers are using bingbot always as user referer, so I think this is a good way to avoid it.
Anyway, those bots have nothing to do in /administrator/ folder, so block them is safe.


Dennis B. from Germany wrote on Oct 5th, 2012:

Very nice example ! I'll bookmark this :)