In the good old days, all the web pages are static and the web server is fragile. Hence we use some tools like webstone to stress test our web server. This is what I did ten years ago.
Nowadays, webstone cannot fit our use for two reasons: (1) our web server is more robust that you can hardly exhaust it in a simple way and (2) we made the web a system and it is dynamic in nature, with whole bunch of stuff like Flash, cookies, web sessions, AJAX, Javascript, etc. Hence we need something more advanced.
To stress test a web, I suggest to write a script to do so rather than use wget or curl or something alike because of the same reason. It is troublesome to write a C++ or Java program and it is kind of overkill to do so. The best tool for a quick and dirty stress test is Perl actually. Hence I made this tool.
To stress test a web page is easy: Just craw this page for an infinite time. To stress test a web site is not so trivial, because there are many pages and we don’t want to craw the same page in the same order again and again — this is not what people will do when they are browsing.
My idea of web stress test is to use a Markov Chain to simulate human behavior: We start on a page and then go to another page and then yet another page, with the possibility of going forward and backward, or jump to something else. This is actually a trail following certain Markov Chain.
To model this, I created the following code: (also available as markovcaller.pl)
markovcaller.pl - Markov Chain trailwalker for calling functions
#!/usr/bin/perl # Random function caller # Put function pointers into an array and define a transition probability # matrix, an initial probability array, such that we jump from one state # to another state and in each state, we call the corresponding function # once. # # Adrian Sai-wah Tam # Wed Apr 4 21:26:21 HKT 2007 use strict; use warnings; use vars qw(@transprob @funcvec @initvec $termcheck); # the n-by-n transition probability matrix, assumed irreducible @transprob = ( [0.2, 0.2, 0.1, 0.5], [0.3, 0.2, 0.4, 0.1], [0.1, 0.3, 0.3, 0.3], [0.2, 0.3, 0.3, 0.2] ); # the initial probability n-vector @initvec = (1.0, 0, 0, 0); # the function pointers stored in n-vector @funcvec = (\&func1, \&func2, \&func3, \&func4); # termination condition checker $termcheck = \&termCheckFunc; ################################## # The main code: Random transition # (nothing shall be changed below) # Sanity check: Make sure the probability matrix is a probability matrix # and the probability vector is a probability vector for my $i (0..$#transprob) { die "Transition probability matrix is not square (on row $i)\n" if ($#{$transprob[$i]} != $#transprob); my $sum=0.0; my $j; for $j (0..$#{$transprob[$i]}) { $sum += $transprob[$i][$j]; }; if ($sum != 1.0) { warn "Sum of row $i of transition probability matrix isn't 1. Fixed.\n"; for $j (0..$#{$transprob[$i]}) { $transprob[$i][$j] /= $sum; }; }; }; my $sum=0.0; for (0..$#initvec) { $sum += $initvec[$_]; }; if ($sum != 1.0) { warn "Initial probability vector does not sum to 1. Fixed\n"; for (0..$#initvec) { $initvec[$_] /= $sum; }; }; die "Dimension mismatch between initial prob vector & transition prob matrix\n" if ($#initvec != $#transprob); die "Dimension mismatch between function vector & transition prob matrix\n" if ($#funcvec != $#transprob); for (0..$#funcvec) { die "Element $_ of function vector is not a function\n" if (ref($funcvec[$_]) ne 'CODE'); }; # Select initial state my $state; my $random = rand(1); $sum=0.0; for ($state=0; $state<=$#initvec; $state++) { $sum += $initvec[$state]; last if ($sum >= $random); }; # until terminating condition is met, traverse the Markov chain until (&$termcheck()) { &{$funcvec[$state]}(); # Call the corresponding function $random = rand(1); # Toss a coin $sum=0.0; # Lookup the transition matrix my $oldstate=$state; # and jump accordingly for ($state=0; $state<=$#transprob; $state++) { $sum += $transprob[$oldstate][$state]; last if ($sum >= $random); }; }; # code ends: Random transition # (nothing shall be changed above) ################################## ######################################## # Things should be changed for real use: # sub func1 { print "1"; }; sub func2 { print "2"; }; sub func3 { print "3"; }; sub func4 { print "4"; }; BEGIN { my $counter=0; my $endcount=100; sub termCheckFunc { return (++$counter == $endcount); }; }
The above code does the following: It defines a Markov Chain transition probability matrix, and an initial probability vector. When the program starts, it tosses a coin to select the initial state according to the initial probability vector. Subsequently, it changes its state by generating a random number and consulting the transition probability matrix. Every state in the program is associated with a function so that the corresponding function is called for once whenever we are transiting to certain state.
When you run the above program, you will see a line of a hundred random digits:
14112342143141322334414334....
To create a mutithreaded Perl script for crawling web, you may want to choose from the following:
The simplest way for doing multithread on Perl is the following piece of code:
use threads; use threads::shared; use Time::HiRes qw( sleep ); # For allowing floating point seconds select((select(STDOUT), $|=1)[0]); # Make STDOUT `hot', i.e. no buffering my $run:shared=1; # Flag to control the thread to continue $SIG{TERM}=sub { $run=0; }; # Setting signal handler as anonymous function my $thread1=threads->new(\&func,1); # Fork a thread my $thread2=threads->new(\&func,2) # Fork another thread func(0); # Myself is a thread as well $thread1->join(); # If ^C received, join the threads $thread2->join(); ### Thread function: sub func { while($run) { print $_[0]; sleep 0.1; }; };
For random sleep in exponential distribution, you can simply replace the sleep 0.1 with
sleep -log(1-rand(1))*$mean
where $mean is the mean sleep time in seconds.
Actually, sleep for an exponential time does not mean you can simulate Poisson behavior. For more accurately simulate Poisson, you should make sure your sleep time has your “wake time” subtracted:
use Time::HiRes qw( sleep time ); # For allowing floating point timer sub func { sub max { return $_[0]>$_[1]?$_[0]:$_[1]; } while($run) { my $before = time # Current time in floating point print $_[0]; # Suppose this will take a significant time my $after = time; sleep max(0,-log(1-rand(1))*$mean-$after+$before); }; };
Of course, you may also use semaphore and shared variables to allow threads communicate.
Combining the code in the previous two sections, we have the following: (also available as threadmarkov.pl
threadmarkov.pl - Multithreaded Markov Chain trailwalker
#!/usr/bin/perl # Multithreaded random function caller # Start a number of threads, each of them running a loop such that it # traverses a Markov Chain and calls a function on every step. # # Adrian Sai-wah Tam # Fri Apr 6 02:07:58 HKT 2007 use strict; use warnings; use threads; use threads::shared; use Time::HiRes qw( sleep time ); # For allowing floating point seconds use vars qw(@transprob @funcvec @initvec $termcheck); ###### Markov chain parameters: # # the n-by-n transition probability matrix, assumed irreducible @transprob = ( [0.2, 0.2, 0.1, 0.5], [0.3, 0.2, 0.4, 0.1], [0.1, 0.3, 0.3, 0.3], [0.2, 0.3, 0.3, 0.2] ); # the initial probability n-vector @initvec = (1.0, 0, 0, 0); # the function pointers stored in n-vector @funcvec = (\&func1, \&func2, \&func3, \&func4); # termination condition checker $termcheck = \&termCheckFunc; ######################## # Main body # Thread control select((select(STDOUT), $|=1)[0]); # Make STDOUT `hot', i.e. no buffering my $run : shared = 1; # Flag to control the thread to continue $SIG{TERM} = sub { $run=0; }; # Setting signal handler as anonymous function my $numthread = 2; # How many threads we want to run? my $mean = 0.1; # Mean sleep time my @threads; # Thread pool sanitycheck(); # sanity check the prob matrix & vectors for (0..$numthread-1) { $threads[$_] = threads->new(\&markovtrail); }; for (0..$numthread-1) { # If ^C received, join the threads $threads[$_]->join(); }; print "\n"; exit; # Main body ends ######################## # Maximum of two sub max { return $_[0]>$_[1]?$_[0]:$_[1]; }; # Sanity check for the probability matrix and vectors sub sanitycheck { # Sanity check: Make sure the probability matrix is a probability matrix # and the probability vector is a probability vector for my $i (0..$#transprob) { die "Transition probability matrix is not square (on row $i)\n" if ($#{$transprob[$i]} != $#transprob); my $j; my $sum=0.0; for $j (0..$#{$transprob[$i]}) { $sum += $transprob[$i][$j]; }; if ($sum != 1.0) { warn "Sum of row $i of transition probability matrix isn't 1. Fixed.\n"; for $j (0..$#{$transprob[$i]}) { $transprob[$i][$j] /= $sum; }; }; }; my $sum=0.0; for (0..$#initvec) { $sum += $initvec[$_]; }; if ($sum != 1.0) { warn "Initial probability vector does not sum to 1. Fixed\n"; for (0..$#initvec) { $initvec[$_] /= $sum; }; }; die "Dimension mismatch between initial prob vector & transition prob matrix\n" if ($#initvec != $#transprob); die "Dimension mismatch between function vector & transition prob matrix\n" if ($#funcvec != $#transprob); for (0..$#funcvec) { die "Element $_ of function vector is not a function\n" if (ref($funcvec[$_]) ne 'CODE'); }; }; # Markov Chain function sub markovtrail { # Select initial state my $state=-1; my $random = rand(1); my $sum=0.0; for (; $sum<$random; $sum+=$initvec[++$state]) {}; # until terminating condition is met, traverse the Markov chain my $before; my $after; # Initialize timers until (&$termcheck()) { $before = time; &{$funcvec[$state]}(); # Call the corresponding function $random = rand(1); # Toss a coin $sum=0.0; # Lookup the transition matrix my $oldstate=$state; # and jump accordingly for ($state=-1; $sum<$random; $sum += $transprob[$oldstate][++$state]) {}; $after = time; sleep max(0,-log(1-rand(1))*$mean-$after+$before); }; }; ######################################## # Things should be changed for real use: sub func1 { print "1"; }; sub func2 { print "2"; }; sub func3 { print "3"; }; sub func4 { print "4"; }; BEGIN { my $counter=0; my $endcount=50; sub termCheckFunc { return 0 if ($run==0); return (++$counter == $endcount); }; }
LWP stands for library for WWW in Perl. The simplest way of using it is to download a piece of data through HTTP protocol:
use LWP::Simple; # These two are equivalent... my $data = get('http://example.net/index.php?var=data'); print $data if defined $data; getprint('http://example.net/index.php?var=data');
This kind of usage provide convenience and code readability, but not functionality. It cannot support the use of cookies and authorization. To do these, you need to emulate a browser:
use LWP; my $browser = LWP::UserAgent->new; # Emulate a browser $browser->agent('Mozilla/4.76 [en] (Win98; U)'); # claim as NS 4.76 on Win98 # instead of just declaring 'agent', we can declare a full header fields for later use my @ns_headers = { 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)', 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*', 'Accept-Charset' => 'iso-8859-1,*,utf-8', 'Accept-Language' => 'en-US' ); # Two ways to GET something and save in a HTTP::Response object my $response = $browser->get('http://example.net/index.php?var=data'); my $response = $browser->get('http://example.net/index.php?var=data', @ns_headers); # Check if response is 200, or print stuff like "404 File not found." thou' status_line die "Download failed: ".$resonse->status_line unless $response->is_success; # Check content type, print the content only if it is HTML if ($response->content_type eq 'text/html') { print $response->content; } else { # Print both full header and content otherwise print $response->as_string; };
The above code shows how we can use GET method, to use POST method, things are similar:
# Two ways to POST something and save in a HTTP::Response object my $response = $browser->post('http://example.net/index.php?var=data', [ 'var' => 'data', 'foo' => 'bar', ]); my $response = $browser->post('http://example.net/index.php?var=data', [ 'var' => 'data', 'foo' => 'bar', ], @ns_headers);
For a complicated GET request, one may want to build the URI in a better way:
use URI; my $url = URI->new('http://example.net/index.php'); $url->query_form( 'var' => 'data', 'foo' => 'bar' ); my $response = $browser->get($url);
In case cookies are required, a cookie jar should be assigned, with an optional file for which to read/write cookies from/to.
use HTTP::Cookies; #Different means of use: $browser->cookie_jar(HTTP::Cookies->new( 'file' => '/path/to/cookies.lwp', # file to read/write, in proprietary format 'autosave' => 1, # save file when done )); $browser->cookie_jar(HTTP::Cookies::Netscape->new( 'file' => '/path/to/cookies.txt', # Netscape's format )); $browser->cookie_jar({}); # Neither save to nor read from a file
Moreover, you can also add a bunch of HTTP authentication information to the browser like this:
$browser->credentials( 'server.name.org:port', 'realm-name', 'username' => 'password' );
After these, our code is much like a normal browser. But we still have the following things to tune up:
# Give up request if no response for $n seconds $browser->timeout($n); # Use the HTTP/1.1 "Keep-Alive" feature to reuse the same socket connection for speeding up requests use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new()); # Make the browser honours http_proxy environment variable $browser->env_proxy;
More detail about the other usage of LWP (and related HTTP stuff) can be found over CPAN, following are some pointers:
Now let’s see how can we make use of LWP to crawl a web:
Assume we have a very simple web-based multiple choice system, where
To do this task, we can build a couple of code snippets, first of all, the function to crawl the index page and that to do login:
# Create browser, supports cookie and follows redirect for POST requests use LWP::UserAgent; my $browser = new LWP::UserAgent; $browser->cookie_jar({}); push @{$browser->requests_redirectable}, 'POST'; # Global variable for browser behaviour functions my $server_addr = "http://192.168.123.456/"; my $user_id = "JohnDoe"; my $user_pw = "foobar"; # Global variable for storing multiple choice stuff my @problems; # Load the login page sub getIndexPage { my $response=$browser->get($server_addr."dgp"); if ($response->is_success) { my $content=$response->content; return 0 unless (index($content,"</html>")>=0); return 1 if ($content =~ m/this\.location\.href.*mainBody/s); return 2 if ($content =~ m/parent\.location\.href.*index/s); }; return 0; }; # Emulate login behavior sub loginAndGetMainPage { my $response=$browser->post($server_addr."dgp/login.php", [ uid => $user_id, pwd => $user_pw, submit => 'Login' ] ); if ($response->is_success) { my $content=$response->content; return 0 unless (index($content,"</html>")>=0); return 0 unless ($content =~ m/<div id='main'\s[^>]*>/s); my @links = ($content =~ m#/problemset\.php\?frame=main[^']*#sg); if ($#links==29) { @problems = @links; return 1; }; return 2; }; return 0; };
The functions are returning zero if failure and other codes to mean success. This is to help we check whether the web response correctly. In the function to login and get the main page (which is the most complicated), we do the following to check whether the web page returned by the server is complete and correct:
div with the ID main, otherwise, the page is malformed (PHP error?)problemset.php, which are the links to problems to answer. The way to get the problem links is to use the regex matching with flag /g and collect the result by an arrayThis should show some idea on how you should program the function calls. Once all sort of functions are created, we can build the transition probability matrix, initial probability vector, and the function vectors. The control loop in the Markov Chain trailwalker should also be modified to check the return code of the functions called, and perform appropriate actions accordingly (e.g. flag error if there is any).
One can see that, in the previous section, we build the web stress test program by emulating a browser behaviour. But how can we know the browser behaviour?
This should be discusses in two aspects: From a microscopic points of view, we should use tcpdump to see exactly what is provided by the browser and what is obtained from the server. Items that we should pay attention to includes the cookies, data passed via POST method, browser headers, to name a few. We can also measure the response time, connection concurrency, and whether keep-alive is in use by investigating tcpdump data.
From a macroscopic point of view, we can also make use of tcpdump to collect user behaviour: we try to browse the web once with a real browser and record the traversal pattern exactly. This includes obtaining the fully qualified URLs and check whether some page is redirected via HTTP header, or the META tag on a HTML page.
In other words, this part of work is very human and no way to automate (right?). Given the Perl script, we should spend most of our time in knowing how to program the crawling functions.
perlstone.pl - Multithreaded web stress test engine
#!/usr/bin/perl # Perlstone # Stress testing a web site by running multiple Markov process # # Adrian Sai-wah Tam # Fri Apr 6 03:00:22 HKT 2007 use strict; use warnings; use threads; use threads::shared; use LWP::UserAgent; use Time::HiRes qw( sleep time ); # For allowing floating point seconds ######################## # Markov chain parameters: # the n-by-n transition probability matrix, assumed irreducible my @transprob = ( [0.2, 0.2, 0.1, 0.5], [0.3, 0.2, 0.4, 0.1], [0.1, 0.3, 0.3, 0.3], [0.2, 0.3, 0.3, 0.2] ); # the initial probability n-vector my @initvec = (1.0, 0, 0, 0); # the function pointers stored in n-vector my @funcvec = (\&func1, \&func2, \&func3, \&func4); # termination condition checker my $termcheck = \&termCheckFunc; # Markov chain parameters end ######################## # Main body # Thread control select((select(STDOUT), $|=1)[0]); # Make STDOUT `hot', i.e. no buffering my $run : shared = 1; # Flag to control the thread to continue $SIG{TERM} = sub { $run=0; }; # Setting signal handler as anonymous function my $numthread = 2; # How many threads we want to run? my $mean = 0.1; # Mean sleep time my @threads; # Thread pool sanitycheck(); # sanity check the prob matrix & vectors for (0..$numthread-1) { $threads[$_] = threads->new(\&markovtrail); }; for (0..$numthread-1) { # If ^C received, join the threads $threads[$_]->join(); }; print "\n"; exit; # Main body ends ######################## # Markov chain functions # Maximum of two sub max { return $_[0]>$_[1]?$_[0]:$_[1]; }; # Sanity check for the probability matrix and vectors sub sanitycheck { # Sanity check: Make sure the probability matrix is a probability matrix # and the probability vector is a probability vector for my $i (0..$#transprob) { die "Transition probability matrix is not square (on row $i)\n" if ($#{$transprob[$i]} != $#transprob); my $j; my $sum=0.0; for $j (0..$#{$transprob[$i]}) { $sum += $transprob[$i][$j]; }; if ($sum != 1.0) { warn "Sum of row $i of transition probability matrix isn't 1. Fixed.\n"; for $j (0..$#{$transprob[$i]}) { $transprob[$i][$j] /= $sum; }; }; }; my $sum=0.0; for (0..$#initvec) { $sum += $initvec[$_]; }; if ($sum != 1.0) { warn "Initial probability vector does not sum to 1. Fixed\n"; for (0..$#initvec) { $initvec[$_] /= $sum; }; }; die "Dimension mismatch between initial prob vector & transition prob matrix\n" if ($#initvec != $#transprob); die "Dimension mismatch between function vector & transition prob matrix\n" if ($#funcvec != $#transprob); for (0..$#funcvec) { die "Element $_ of function vector is not a function\n" if (ref($funcvec[$_]) ne 'CODE'); }; }; # Markov Chain function sub markovtrail { # Select initial state my $state=-1; my $random = rand(1); my $sum=0.0; for (; $sum<$random; $sum+=$initvec[++$state]) {}; # until terminating condition is met, traverse the Markov chain my $before; my $after; # Initialize timers until (&$termcheck()) { $before = time; &{$funcvec[$state]}(); # Call the corresponding function $random = rand(1); # Toss a coin $sum=0.0; # Lookup the transition matrix my $oldstate=$state; # and jump accordingly for ($state=-1; $sum<$random; $sum += $transprob[$oldstate][++$state]) {}; $after = time; sleep max(0,-log(1-rand(1))*$mean-$after+$before); }; }; # Markov chain functions end ######################## # Browser emulation # Global variable for browser emulation my $browser = new LWP::UserAgent; $browser->cookie_jar({}); push @{$browser->requests_redirectable}, 'POST'; # Global variable for browser behaviour functions my $server_addr = "http://192.168.123.456/"; my $user_id = "JohnDoe"; my $user_pw = "foobar"; # Browser emulation end ######################## # Crawling functions my @problems; sub crawling1 { my $response=$browser->get($server_addr."dgp"); if ($response->is_success) { my $content=$response->content; return 0 unless (index($content,"</html>")>=0); return 1 if ($content =~ m/this\.location\.href.*mainBody/s); return 2 if ($content =~ m/parent\.location\.href.*index/s); }; return 0; }; sub crawling2 { my $response=$browser->post($server_addr."login.php", [ uid => $user_id, pwd => $user_pw, submit => 'Login' ] ); if ($response->is_success) { my $content=$response->content; return 0 unless (index($content,"</html>")>=0); return 0 unless ($content =~ m/<div id='main'\s[^>]*>/s); my @links = ($content =~ m#/problemset\.php\?frame=main[^']*#sg); if ($#links==29) { @problems = @links; return 1; }; return 2; }; return 0; }; # Crawling functions end ######################## # Dumb stuff sub func1 { die "Server error" unless crawling1(); }; sub func2 { die "Server error" unless crawling1(); }; sub func3 { die "Server error" unless crawling2(); }; sub func4 { die "Server error" unless crawling2(); }; BEGIN { my $counter=0; my $endcount=50; sub termCheckFunc { return 0 if ($run==0); return (++$counter == $endcount); }; }
Also available as: perlstone.pl
Discussion