Abstract

One area where Internet-based entertainment and commerce are at a disadvantage relative to more traditional forms is the analysis of customers' patterns and habits. This analysis can be used to assess the effectiveness and performance of a business. Traditional approaches to web analytics (the analysis of usage patterns and traffic on the web) obtain raw data from web server log files. The data in these log files is often insufficient to distinguish between different clients because of proxies, firewalls, browser caching, and dynamic IP address allocation. As a result, statistics extracted from these log files can be misleading or inaccurate. This not only makes it difficult for administrators to estimate a web site's efficacy, but also complicates the task of detecting and preventing web site abuse.

Using multiple techniques, including session management and relational data storage, I devised a system for gathering raw usage data that overcomes many of the inaccuracies inherent in log-file-based web analytics systems. I created tools to analyze the raw data and give administrators the ability to observe the activities of their clients in real time. I also wrote analysis tools that take advantage of learning algorithms and Bayesian neural network techniques. These tools look for emergent properties in the raw data. Using these tools, it is possible to build a framework for a site where the content organizes itself based on clients' browsing habits.