Mining Web Server Logs


Copyright 2002 by Greg Reddick

Internet Information Services (IIS), the web server that hosts Active Server Pages (ASP) and ASP.NET applications, has the capability to log every access to the web server to a text file or database. You can then analyze that data to learn much about the people hitting your server.

There are two distinct classes of analysis programs:

  • Programs that analyze the data that the IIS server logs
  • Programs that perform their own logging based on special content placed on the pages of the web site.

Analyzing the Data From the IIS Server Logs

The first task is to turn on logging on the web server. In the Internet Information Services (IIS) Manager dialogs, right-click on the web site and select Properties. On the Web Site tab is a checkbox named Enable Logging. There are a variety of ways that the information can be logged:

  • NCSA Common Log File Format
  • ODBC Logging
  • W3C Extended Log File Format

NCSA format is the least useful, as it logs less information than the others, plus the standard analysis tools don't understand this format.

ODBC Logging allows you to place the information for the logs directly into an ODBC database. This would be very useful if the standard log file analysis programs understood how to read the database. You could, however, export the information from this format into a W3C format, if necessary. More useful, though, would be to go the other direction and import W3C format into a database, if you found the database useful. You can write a tool that does this. I have written one, and will eventually make it available on http://www.xoc.net when I get it into a shipping state.
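As a sketch of that direction (illustrative code only, not the tool mentioned above; the table schema and the subset of fields loaded are arbitrary choices of mine), a W3C format log can be pulled into SQLite for ad hoc SQL querying:

```python
# Sketch: load selected W3C Extended log fields into SQLite.
# Assumes the log begins with a "#Fields:" directive, as IIS writes it.
import sqlite3

def load_log(con, log_lines):
    """Insert one row per hit into a 'hits' table."""
    con.execute("""CREATE TABLE IF NOT EXISTS hits
                   (date TEXT, time TEXT, c_ip TEXT,
                    cs_uri_stem TEXT, sc_status TEXT)""")
    fields = []
    for line in log_lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]          # field names for this block
        elif not line.startswith("#") and line.strip():
            row = dict(zip(fields, line.split()))
            con.execute("INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                        (row.get("date"), row.get("time"), row.get("c-ip"),
                         row.get("cs-uri-stem"), row.get("sc-status")))
    con.commit()

con = sqlite3.connect(":memory:")
load_log(con, [
    "#Fields: date time c-ip cs-uri-stem sc-status",
    "2001-06-11 08:15:30 192.168.1.2 /default.asp 302",
])
print(con.execute("SELECT cs_uri_stem, sc_status FROM hits").fetchall())
```

Once the data is in a database, questions like "top 20 pages last month" become one-line SQL queries.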

You will find the W3C Extended Log File Format the most useful. Virtually all log file analysis programs understand this format. However, to get the most use from it, you need to tell the server what to log. Press the Properties button next to where you selected W3C Extended Log File Format.

In the General Properties tab of the Extended Logging Properties dialog, the first thing is to select the New Log Time Period. Choices are:

  • Hourly
  • Daily
  • Weekly
  • Monthly
  • Unlimited file size
  • When file size reaches: x MB

This option allows you to control when a new log file is produced. You want to pick an option that produces files of a manageable size, somewhere below 50 MB max, depending on the amount of traffic you get. On any small site, Monthly would be adequate, but if you were getting a lot of traffic, you could change it to be more frequent.

The next option is "Use local time for file naming and rollover". This would allow you to cut off files at midnight local time instead of Coordinated Universal Time (UTC), formerly known as GMT. Keep this one unchecked.

The last option on this tab allows you to pick the directory that the log files are written to. The default is %WinDir%\System32\LogFiles. You may want to change that to another directory, so that the logs don't get lost if you reinstall the operating system. I place the log file data in a directory off the root of the web server.

You could also have the logs written into the directory structure of the web site being managed. This allows users to get the data without having to pester the machine administrator for access, since it then just becomes another web page. However, you may want to secure that part of the web site so that not just anyone can access it, since you may not want your competitors to get your log information.

Next you need to tell it what to log. On the Extended Properties tab, there are checkboxes for:

  • Date
  • Time
  • Client IP Address (c-ip)
  • User Name (cs-username)
  • Service Name (s-sitename)
  • Server Name (s-computername)
  • Server IP Address (s-ip)
  • Server Port (s-port)
  • Method (cs-method)
  • URI Stem (cs-uri-stem)
  • URI Query (cs-uri-query)
  • Protocol Status (sc-status)
  • Win32 Status (sc-win32-status)
  • Bytes Sent (sc-bytes)
  • Bytes Received (cs-bytes)
  • Time Taken (time-taken)
  • Protocol Version (cs-version)
  • Host (cs-host)
  • User Agent (cs(User-Agent))
  • Cookie (cs(Cookie))
  • Referer (cs(Referer))

This is followed by a section called Process Accounting. This allows you to check:

  • Process Event (s-event)
  • Process Type (s-process-type)
  • Total User Time (s-user-time)
  • Total Kernel Time (s-kernel-time)
  • Total Page Faults (s-page-faults)
  • Total Processes (s-total-procs)
  • Active Processes (s-active-procs)
  • Total Terminated Processes (s-stopped-procs)

I recommend that for most purposes you check every checkbox in this list. If you have a very high traffic server, you might consider unchecking some of them, since it does take some server time to process the log file. But only uncheck some of these if your log files tell you that your server is maxed out.

If you are not the webmaster of the IIS server, you will need to get the webmaster to change these for you. Bribe them. Threaten them. Whatever it takes.

The Server Log Entries

An example of a log file line looks like this:

#Fields: date time c-ip cs-username s-sitename s-computername s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken cs-version cs-host cs(User-Agent) cs(Cookie) cs(Referer)
2001-06-11 08:15:30 192.168.1.2 - W3SVC1 DARKSTAR 127.0.0.1 80 GET /default.asp - 302 0 0 250 94 HTTP/1.1 localhost:80 Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0) - -

So what can you determine from this line?

  • Date and time that the user hit the site
  • IP address of the user
  • Username of the user (if you required login to the site)
  • Name of the site
  • Name of the server
  • Server's IP address
  • Port used to get to the server
  • Method for the request
  • Page being hit
  • Query string, if there is one
  • Status code returned
  • Win32 status code
  • Number of bytes returned
  • Time taken
  • HTTP version
  • Host
  • User Agent string passed from the client's browser
  • Any cookies passed by the client browser
  • Referrer string
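A few lines of Python can turn such lines into something you can work with programmatically. This is a sketch only; it assumes the file starts with the #Fields directive as IIS writes it, and relies on IIS encoding embedded spaces (in the User Agent, for instance) as plus signs, so a plain whitespace split works:

```python
# Sketch: parse W3C Extended Log File Format lines into dicts,
# keyed by the field names from the "#Fields:" directive.

def parse_w3c_log(lines):
    """Yield one dict per log entry."""
    fields = []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]        # drop the "#Fields:" token
        elif line.startswith("#") or not line.strip():
            continue                         # other directives, blank lines
        else:
            yield dict(zip(fields, line.split()))

sample = [
    "#Fields: date time c-ip cs-uri-stem sc-status cs(User-Agent)",
    "2001-06-11 08:15:30 192.168.1.2 /default.asp 302 "
    "Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0)",
]
for entry in parse_w3c_log(sample):
    print(entry["c-ip"], entry["cs-uri-stem"], entry["sc-status"])
```

Each later sketch in this article assumes entries in roughly this dict-per-hit form.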

By itself, this data would be useful to examine if you thought your server were being attacked. In particular, the IP address of the client would be useful. You could then ban that IP address so that the user could not return. However, mostly this information will be useful in aggregate form. You will need to run a statistical package on the data to produce tables and graphs that show rankings and trends. But before I talk about that, let me cover some terminology.

Hits, Page Views, Sessions, Users, and Spiders

When trying to view what happened in a web log, there is a definite problem: each interaction from a user is a discrete transaction. The browser requests a page, and the server provides it. From the perspective of the web, the transaction is done. If the user asks for another page, there isn't any information retained from the last transaction to hook this transaction to the last one.

Each line in the log file is a "hit". If the request is for an .asp or .html page, then it is considered a page view. Otherwise it could be a request for a graphic or other supplementary information. So one page view could result in many hits if there are a number of GIF or JPG files on the page.

IIS and .NET provide some artificial mechanisms for maintaining the concept of a "session", using cookies or hidden fields that get posted back. The web itself really doesn't have the concept of a "session". Usually, a log analysis program defines a session as a series of page views from a single IP address without a break of a defined amount of time. In IIS itself, that time break defaults to 20 minutes. There are problems with this definition. If an IP address is used by two different users within the set amount of time, then it gets counted as one session, which undercounts the number of users hitting the web site.

For example, let's suppose that Jack and Jill are both typical users from a typical ISP. Jack hits your web site, browses around for a while, then logs off. Jill then logs in and gets assigned the same IP address as Jack. Jill hits your web site within 20 minutes of Jack. The analysis program will count that as one user continuing to visit your web site.
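The session-counting rule described above can be sketched in a few lines (Python, using the 20-minute timeout; illustrative only):

```python
# Sketch: count "sessions" the way most log analyzers do --
# consecutive hits from one IP with no gap longer than 20 minutes.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=20)

def count_sessions(hits):
    """hits: iterable of (ip, datetime) pairs, in time order."""
    last_seen = {}           # ip -> datetime of its most recent hit
    sessions = 0
    for ip, when in hits:
        prev = last_seen.get(ip)
        if prev is None or when - prev > TIMEOUT:
            sessions += 1    # first hit from this IP, or gap too long
        last_seen[ip] = when
    return sessions

hits = [
    ("192.168.1.2", datetime(2001, 6, 11, 8, 0)),   # Jack
    ("192.168.1.2", datetime(2001, 6, 11, 8, 15)),  # Jill, same IP, <20 min
    ("192.168.1.2", datetime(2001, 6, 11, 9, 30)),  # >20 min later: new session
]
print(count_sessions(hits))   # 2 -- Jack and Jill got merged into one session
```

Note how the Jack-and-Jill case falls out of the algorithm: any two users sharing an IP within the timeout are indistinguishable in the log.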

Even worse, some ISPs have a local cache. Take AOL for example. If Jack hits your web site and browses around, AOL will cache your pages on their site. If Jill then browses around, AOL serves up those pages from the cache rather than hitting your site. This is good for your bandwidth, but bad for your server statistics, since Jill's browsing never shows up in your web server logs.

Realize that not every hit on your web site represents a human. Spiders are automated tools that grab web pages. Requests from these tools appear in the web server log, just like everything else. If you are trying to measure the traffic on your site, these spiders artificially inflate the number of page views on the site.

So, on the one hand there are things that artificially reduce the measured traffic on the site, and on the other, things that artificially inflate it. Can you really know with any precision how many people are hitting your site? Unless you make them log in, the answer is "no". The best you can do is look at the logs, look at trends, and make some inferences.

Web Server Log Analysis Programs

There are a variety of web server log programs out there. I'm not going to catalog them. However, I am going to show what they do, and what you can pull from them. I like freeware, so I'll use a program called Analog as an example.

Analog can be downloaded from http://www.analog.cx. It's available in a variety of languages, and has a number of add-on tools available for it. It has many options for configuring it. I will point out two add-on products for it that I use:

  • QuickDNS
  • Report Magic

Regardless of which log analysis program you use, you will need a tool or feature that converts IP addresses into domain names. This is called a reverse IP address lookup (or reverse DNS lookup). From that, you can determine something about where the user is coming from. It may not be entirely precise, but it can help with determining things such as what countries you are getting visits from. All programs that do this cache the results of these lookups in a file, so that later runs don't have to look them up again.

Analog has such a feature, but it turns out it is slow. Since sometimes a reverse IP address lookup fails, it takes time for it to timeout. If each IP address is processed entirely before the next is processed, it can take a very long time to run through a log file, especially the first time you analyze your web server logs. Someone wrote an add-on tool that multi-tasks the lookup of IP addresses, to make the process much, much faster. QuickDNS does that.
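The idea behind QuickDNS (this is not its actual code, just a sketch of the technique) is easy to see with a thread pool: fire off many reverse lookups at once so the slow failures overlap instead of queueing one behind another:

```python
# Sketch: resolve many IPs concurrently so one slow or failing
# reverse lookup doesn't stall the whole run.
import socket
from concurrent.futures import ThreadPoolExecutor

def reverse_lookup(ip):
    try:
        return ip, socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip, None            # failed lookup; worth caching too

def resolve_all(ips, workers=32):
    """Resolve a collection of IPs in parallel; returns {ip: hostname or None}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(reverse_lookup, set(ips)))

print(resolve_all(["127.0.0.1"]))
```

With dozens of lookups in flight, the total time approaches that of the slowest lookup rather than the sum of all of them.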

Analog's standard output is also a little crude. Someone else wrote a post-processor to make the output much nicer to look at, and provide better charting, called Report Magic. Both QuickDNS and Report Magic can be found from the Analog web site.

Other Analysis Programs

Another kind of analysis program provides analysis using a different technique than using the logfiles. Instead, they place a special string on each page that causes a Perl script to run. When the script runs, it keeps track of the statistics for the web site.

I have found that none of the analysis programs I use provides all the information I'm interested in. However, there is absolutely no reason that you can't run more than one. I run three different ones automatically on my sites every evening, and others periodically. They all provide slightly different ways of looking at the information.

Some of the Things You Want to Know

There are several categories of information that you want to track:

  • How much traffic are you getting?
  • Where is that traffic coming from?
  • What are they using to visit the site?
  • What are they hitting on the site?
  • What problems are there on the server?

How Much Traffic is Your Site Receiving

The first and primary thing you want to know is how many page views you are getting. A page view count by itself is pretty worthless. Instead, you need to know how many you got this period compared to how many you got the period before. Yesterday versus the day before. This month versus last month. Six A.M. compared to six P.M.

In most cases, you will need to remove page views by spiders from the count before you measure traffic. This is critical. If you include spiders, you are not getting an accurate count of how effective your web site is. If Google or Fast suddenly finds your site, you may get thousands of page views that you weren't getting before, yet you still haven't got a single additional human visitor. Some spiders are practically a Denial of Service attack on your web server.

You can find an Analog configuration file that will identify search engine spiders at http://www.science.co.il/analog/SearchQuery.txt. With some modification, you could use this list with almost any analysis program. Realize that if you don't remove spiders from your count, your statistics on the number of visitors and the amount of traffic they generate are virtually meaningless.
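Filtering spiders out of a page view count might look like the following sketch. The pattern list here is a tiny illustrative subset; a real one, like the Analog file above, has hundreds of entries:

```python
# Sketch: drop hits whose User-Agent matches known spider patterns
# before counting page views, and count only page requests (not images).
import re

SPIDER_PATTERNS = re.compile(
    r"googlebot|slurp|scooter|fast-webcrawler|crawler|spider", re.I)

def human_page_views(entries):
    """entries: dicts with cs-uri-stem and cs(User-Agent) keys."""
    count = 0
    for e in entries:
        if SPIDER_PATTERNS.search(e.get("cs(User-Agent)", "")):
            continue                              # spider: don't count
        if e.get("cs-uri-stem", "").lower().endswith((".asp", ".html", ".htm")):
            count += 1                            # a page, not a GIF/JPG
    return count

entries = [
    {"cs-uri-stem": "/default.asp", "cs(User-Agent)": "Mozilla/4.0+(MSIE+5.5)"},
    {"cs-uri-stem": "/default.asp", "cs(User-Agent)": "Googlebot/2.1"},
    {"cs-uri-stem": "/logo.gif",    "cs(User-Agent)": "Mozilla/4.0+(MSIE+5.5)"},
]
print(human_page_views(entries))   # 1
```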

Analog lets you graph and count how many Page Views you got over a year, quarter, month, day, and by hour of the day. The main thing you are looking for:

  • Are things getting better or worse?
  • Are there cycles?

Look at the trends. Are your page view counts better than they were before, or worse? If you did a rewrite of your site and suddenly your traffic is cut in half, you have a major problem. On the other hand, maybe you find that your traffic doubled, inspiring you to use those techniques on other sites.

If you are selling stuff on your site, you also need to track how many of those people convert into money. If your traffic halves, but your conversion rate doubles, you have more directed traffic, generally a good thing. Conversion rates generally can't be measured in the Web Server logs, so you will need to track that using tools and statistics beyond what I'm talking about here.

The other thing to look at is the cycles. Do you get more traffic during business hours, or after hours? Are weekends good or bad? If you need to take down the web server for maintenance, then knowing what hour of the day receives the least traffic would be useful.

Looking at how often FAVICON.ICO is downloaded gives you an indication of how often your site is bookmarked using Internet Explorer. Looking at how often ROBOTS.TXT gets downloaded gives you an idea of how many spiders are hitting your web site.

Where is the Traffic Coming From?

It may be useful to know what countries visitors are coming from. This can tell you whether it would be useful to translate a site. A Domain Report gives some indication of the countries and amount of traffic from those countries.

Next, a Referring Site Report gives what sites are linking to your site and providing you with traffic. The sites near the top of this list are your proven traffic providers. You want to check out these sites and look at the links to your pages. Is there anything you can do to foster the relationship with those sites?

A Referring URL Report is one of your most important tools. It shows the exact URL of each page that is linking to you. Any page toward the top of this list is a proven traffic provider. You should visit each of those URLs and see what is linking to you. Are the descriptions good? If so, try to get other sites to use the same descriptions. Are the pages listed in the major search engines? If not, then submit them to the few engines that still support free submittal.

What are They Using to Visit the Site?

You want to know what User Agents are hitting your site. This can be broken into two categories:

  • Human
  • Spider

For humans, you are concerned with what web browser and versions are being used. The Browser Summary Report will tell you what web browsers are hitting your site, and what specific version numbers are being used. You want to lay your hands on the top browsers listed and check your site to make sure that it renders correctly in those browsers.

For spiders, see the section on Arachnophilia below.

An Operating System Report will tell you what operating system visitors are using. This can be useful for understanding whether you need to check your site on Apple machines as well as Intel, and maybe other architectures such as WebTV.

What are they Hitting on the Site?

What parts of your site are most important? A Directory Report and Request Report will tell you which directories and which files are being hit most frequently. These are the pages that you want to have working the most smoothly on your site. Just as important as what is being hit, is what is not. Why not? Are they not important pages, or are they not well organized?

A File Size Report will tell you how large the downloads from your site are. If you have a substantial number of downloads over 100K in size, you need to look at the content being downloaded to make sure it is appropriate and working correctly. A few large downloads can kill your bandwidth if those downloads are hit frequently.

An Internal Search Query Report will show what terms visitors are using in your internal search feature. An Internal Search Word Report will show the individual words in those searches. If there are frequently performed searches that return zero answers, consider creating content that provides the answer to the search.

What Problems are there on the Server?

A Status Code Report will tell you what status codes are being returned by the Server. You can find the definitions of the HTTP status codes in RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt). In particular, you want to know why you get any 500 series codes. These usually indicate a bug in the ASP code on the server. These need to be tracked down and fixed.
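A minimal Status Code Report is easy to sketch from parsed log entries (illustrative only; a real report would also break counts down by time and URL):

```python
# Sketch: tally sc-status values and pull out the 4xx/5xx requests
# that deserve attention.
from collections import Counter

def status_report(entries):
    """entries: dicts with sc-status and cs-uri-stem keys."""
    codes = Counter(e["sc-status"] for e in entries)
    errors = [(e["sc-status"], e["cs-uri-stem"])
              for e in entries if e["sc-status"][0] in "45"]
    return codes, errors

entries = [
    {"sc-status": "200", "cs-uri-stem": "/default.asp"},
    {"sc-status": "404", "cs-uri-stem": "/old-page.asp"},
    {"sc-status": "500", "cs-uri-stem": "/buggy.asp"},
]
codes, errors = status_report(entries)
print(codes.most_common())
print(errors)
```

The error list is the interesting part: every 500 is a bug to fix, and every frequent 404 is traffic leaking away.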

You also should be interested in what 404 errors show up in the log files, as these pages are being requested, but not provided. A Failure Report will tell you what requests are being returned 404. A common reason for returning a 404 is a page that has been moved to another URL on the web site. You never want to lose traffic because of a reorganization. There are ways of retaining that traffic through a 404 handler that redirects 404 errors to the new location for the file. See http://www.xoc.net for an example of a 404 handler for IIS.

Another reason that you may be getting a 404 is attacks on your web server. You may have pages being requested that are common attack points. Consider banning the IP addresses that are requesting those pages.

A Redirection Report will show what pages have been redirected to another location. You may want to visit the pages hyperlinking to these and have them updated to the new location.

A User Failure Report will show what logins are being attempted on the site. For a public web site with no login, login attempts generally indicate an attack on the server. Consider banning those IP addresses.

A Processing Time Report will show how long it took to provide a given page. You should aim to have most pages returned in less than 10 seconds. If your pages are taking substantially longer than that, you should look at why.

Arachnophobia and Arachnophilia, or Dealing with Spiders

There are good spiders and bad spiders. A good spider is one that eventually turns into directed traffic to the web site. Bad spiders just waste the site's bandwidth.

A good spider is one that you want to visit your site. Examples of spiders that I like to hit my sites:

  • Scooter (AltaVista)
  • Slurp (Inktomi search engines, such as HotBot)
  • ArchitectSpider (Excite)
  • Lycos_Spider/TRex (Lycos)
  • FAST-WebCrawler (AllTheWeb.Com, fast.no)
  • Googlebot (Google)
  • Zyborg (Wisenut)

This list changes frequently, so this is as of this writing (March 2002). Even that list is a little out of date, since ArchitectSpider probably died when Excite went bankrupt, and Wisenut was just bought by Lycos.

Good spiders hit your web site and index the content. After indexing, they provide search engine results that direct people to your site. If they come by frequently, they pick up any new content you have added to your site, allowing you to get additional traffic from that content.

Bad spiders suck the marrow from a site. First they consume bandwidth that can be dedicated to serving humans. Bad spiders can hammer a web server with hundreds of page views a minute. By convention, a well-behaved spider will not retrieve more than one page view a minute from any given domain. As I mentioned above, spiders also artificially inflate the amount of traffic a web server gets.

Second, bad spiders can use the data from your site in ways that you may wish to avoid. Some spiders search your site for email addresses to spam. Other spiders may come from competitors who are mining your site for data, or even setting up mirror sites using your content! I have even been hit by spiders looking for trademark infringement in graphics on my web sites. (I have no infringement on my sites, but even the thought that I've got lawyers from major corporations prowling around my sites trying to bust me makes me annoyed--I banned that spider.)

If you are getting traffic from a spider, you need to evaluate whether it is a good spider or a bad spider. First, do you care? If you have bandwidth to spare, maybe the hassle of dealing with even an evil spider isn't worth the effort. But what if you get mentioned on the national nightly news just at the moment your site is getting hammered by a spider collecting email addresses to spam?

So the first step is to recognize spiders. The easy way is to look at the results from the User Agent field of the server logs. Most spiders will identify themselves in the User Agent field of the Web Server log. You will see an entry such as:

Googlebot/2.1+(+http://www.googlebot.com/bot.html)

More difficult is to look at the pattern of traffic from a particular IP address. Some spiders will lie and tell you that they are Internet Explorer in the User Agent field. If a particular IP address is visiting pages in an exhaustive pattern, following every hyperlink on every page on your site, you can be pretty sure it is not a human.
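One rough way to catch spiders that lie in the User Agent field is to look at the request rate per IP address. A sketch (the threshold is an arbitrary choice, not a standard):

```python
# Sketch: flag IPs whose request rate looks non-human, no matter
# what the User Agent field claims.
from collections import defaultdict
from datetime import datetime

def suspicious_ips(hits, per_minute_limit=10):
    """hits: iterable of (ip, datetime) pairs.
    Returns IPs that exceed the limit within any single minute."""
    per_minute = defaultdict(int)
    for ip, when in hits:
        per_minute[(ip, when.replace(second=0, microsecond=0))] += 1
    return {ip for (ip, _), count in per_minute.items()
            if count > per_minute_limit}

hits = [("10.0.0.9", datetime(2002, 3, 1, 12, 0, s)) for s in (0, 10, 20, 30)]
hits += [("10.0.0.5", datetime(2002, 3, 1, 12, 0, 0))]
print(suspicious_ips(hits, per_minute_limit=2))   # {'10.0.0.9'}
```

Anything this flags still needs a human look, since a proxy or ISP cache can also funnel many real users through one IP address.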

Any spider that makes it near the top of the User Agent list in the report should be reviewed to decide whether it is acceptable, as it is consuming bandwidth. All other spiders should be treated with suspicion as to their purpose for visiting your site. Suspicious spiders can be investigated in the Tracking and Logging forum on http://www.webmasterworld.com.

To ban a spider, you can either block it at the firewall or at the Web Server. I prefer to block IP addresses at the firewall because I can make my site just vanish to the spider. The firewall can just not respond to requests to port 80. The technique for doing that depends on your firewall.

You can also block the spider at the web server. In IIS, you can manage that in the IIS Manager dialogs. On the Directory Security tab is a button that allows you to "Grant or deny access to this resource using IP addresses or internet domain names." Add the IP address of the spider that you want to ban to the list.

The traffic from a good spider needs to be analyzed carefully, especially for the keywords that show up in the QueryString. These keywords show what words people are searching for that are resulting in hits to your site. Those words are the important words to stress throughout the remainder of your site.
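Extracting those keywords from referrer query strings can be sketched as follows. The parameter names checked are a small illustrative set; each search engine uses its own:

```python
# Sketch: pull search keywords out of referring URLs. Search engines
# put the search terms in the referrer's query string under varying
# parameter names (q, query, p, ...).
from collections import Counter
from urllib.parse import urlsplit, parse_qs

QUERY_PARAMS = ("q", "query", "p")

def search_keywords(referrers):
    words = Counter()
    for ref in referrers:
        qs = parse_qs(urlsplit(ref).query)
        for param in QUERY_PARAMS:
            for value in qs.get(param, []):
                words.update(value.lower().split())
    return words

refs = [
    "http://www.google.com/search?q=maya+calendar",
    "http://search.example.com/search?q=maya+glyphs",
]
print(search_keywords(refs).most_common(1))   # [('maya', 2)]
```

The referrer URLs here are made up for illustration; in practice you would feed in the cs(Referer) field from the log entries.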

Summary

By looking carefully at your server statistics, you can find out a huge amount of information about your site. Frequent analysis of your server logs will show you how much traffic you are getting, where that traffic is coming from, what visitors are using to reach the site, what they are hitting when they visit, and what problems there are on the server.

