Assignment 2: Analyzing Clickstream Data with R

due: 5:00pm April 16 by email to cs303@cs.stanford.edu,
submission instructions below

Introduction:
In this assignment, you will use R to do some analysis of clickstream data in the style of the Joachims et al. paper on Accurately Interpreting Clickthrough Data.

As the clickstream data, we will use part of the search data that was released by some AOL researchers in 2006. The underlying search results were being provided by Google. While the AOL researchers anonymized user IDs, this data release was seen as an enormous privacy breach, for which AOL later apologized. Analysis of the data highlighted the subtlety of deidentification in text data (an issue that also comes up with patient medical records), and it was one factor that led to changes in data retention policies at major search companies. See, for instance, this NYT article or this more academic article. We'll discuss this issue a little in class. But, since the data has now been around for years and has remained available on the web (on third party servers) the entire time, presumably any malefactors have already done whatever bad things can be done with the data, and so it seems safe enough for us to use.

But, you're probably best off not actually reading through the data. At any rate, here's the boilerplate (from the AOL release): Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that neither AOL nor Stanford University is the author of this data.

The dataset can be found here and is laid out as follows:

AnonID   Query                QueryTime             ItemRank   ClickURL
1326     back to the future   2006-04-01 17:59:28   1          http://www.imdb.com

There is a record for each search results page shown. If no link was clicked, only the first three columns are non-empty. If one or more links were clicked, there is a separate row for each click, giving the rank of the clicked link (ItemRank) and the domain of the page that the search result pointed to (ClickURL). See the AOL readme for more information.


A couple of R tips:

  1. Admittedly, if I were doing large scale search engine log analysis, I wouldn't be trying to do it all in R. I'd use Perl, or Python, or .... But this dataset is just 3.5 million lines, so you should be fine on a machine with 2 GB of RAM.
  2. By default, R interprets quotes as surrounding field values, and, again by default, the ragged nature of the data (a mix of 3-column and 5-column rows) will make R barf while reading it. But read.table has quite flexible options, and this incantation should work:
    clicks = read.table("user-ct-test-collection-01.txt", header=T, sep="\t", fill=T, quote="")
    You can find out the options to a command by typing, e.g., help(read.table). You will see that there's a fair bit of stuff you can do with suitable options.
  3. The basic function for drawing a histogram is hist().
  4. read.table returns a data frame. If you didn't become very familiar with data frames in the first assignment, now would be a good time to do so. In general in R, if you know how to use data frames then your life will be pleasant, and otherwise it will be miserable. There are some nice commands for manipulating data in data frames. One is subset() for picking out subsets of data (a short sketch pulling these tips together appears right after this list):
    clicks10 = subset(clicks, ItemRank <= 10)
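
To pull these tips together, here is a minimal sketch. It assumes the data file user-ct-test-collection-01.txt is in your working directory, and the output filename clicks-by-rank.pdf is just an example. It reads the log, keeps the clicked rows at ranks 1 through 10, and writes a histogram of click positions to a PDF:

    # Read the tab-separated log; blank fields in the numeric columns become NA.
    clicks <- read.table("user-ct-test-collection-01.txt",
                         header = TRUE, sep = "\t", fill = TRUE, quote = "")

    # Keep only rows where a link was actually clicked, at ranks 1 through 10.
    clicks10 <- subset(clicks, !is.na(ItemRank) & ItemRank <= 10)

    # Draw a histogram of click positions and save it to a PDF
    # (the filename is just an example).
    pdf("clicks-by-rank.pdf")
    hist(clicks10$ItemRank, breaks = 0:10,
         main = "Clicks by result position", xlab = "ItemRank")
    dev.off()

How you actually count clicks for problem 1 is one of the decisions you will need to justify in your write-up; this sketch only shows the plumbing.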

Here's what to do:

  1. With this data, you should be able to produce a histogram roughly like the black bars (for clicks) in Figure 1 of the Joachims et al. paper. Well, actually there are a couple of decisions to make about how best to count the clicks. Provide a pdf histogram, your best R code, and in your write-up, (i) justify the decisions you made, and (ii) discuss how the results from the AOL data are the same as or different from the results that Joachims et al. report. Are there any hypotheses you can make to explain any differences you observe? (If you'd like a hint of how you might approach this in R, making use of its data manipulation functions rather than writing your own program, you might look at the documentation of the split() function.) (4 points)
  2. If we accept the results of the paper, we should be able to use the clickthrough data to determine which queries were giving better and worse search results. This file contains 10 queries that are relatively frequent in the AOL search log that we're using. Using the ideas from the paper, for which queries are the returned results better or worse at satisfying user information needs? Put them into a rough ranking poset. Look a bit carefully at the data available for each query. Should some queries be put aside as not providing reliable information? Provide the R code you used and in your write-up show your rough result quality ranking and include any necessary discussion. (4 points)
  3. This file contains data corresponding to the "normal" results in the upper half of Table 3 of the Joachims et al. paper. The easy way to make the table (excluding the marginal totals) in R is with the xtabs() function. The paper states that by looking at the 4 top middle cells of the upper part of the table (where one page is judged more relevant than the other and only one link or the other is clicked), the table shows, assuming a binomial distribution, that we can reject the hypothesis that users are not influenced in what they click on by the order of presentation. This can be tested most directly on these 4 cells using Fisher's Exact test. The R command for this is fisher.test(). Confirm the statement from the paper. Give the R commands used for generating the crosstabs and the hypothesis test, and state how likely this or a more extreme distribution of the referenced cells is to have occurred by chance (according to Fisher's Exact test). (A toy illustration of the xtabs() and fisher.test() mechanics appears after this list.) (2 points)
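
If the xtabs() and fisher.test() mechanics are unfamiliar, here is a toy illustration. The data frame and its column names (clicked, judged) are made up purely for the example and are not the assignment data; the split() line just shows the shape of the function mentioned in the problem 1 hint.

    # Toy data frame, invented purely to show the function calls.
    toy <- data.frame(
      clicked = c("l1", "l2", "l1", "l1", "l2", "l2", "l1", "l2"),
      judged  = c("l1 better", "l1 better", "l2 better", "l1 better",
                  "l2 better", "l2 better", "l1 better", "l2 better")
    )

    # split() groups a data frame's rows by a factor, returning a named list
    # of smaller data frames (one per level).
    by_judgment <- split(toy, toy$judged)

    # xtabs() cross-tabulates which link was clicked against which was judged better.
    tab <- xtabs(~ clicked + judged, data = toy)
    print(tab)

    # Fisher's Exact test on the resulting 2x2 table of counts.
    fisher.test(tab)

Your actual column names, counts, and conclusions will of course come from the file linked in problem 3.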

Collaboration policy: You must complete problem 1 individually. You may collaborate on the remaining problems.

Submission guidelines:
Your submission should consist of an email to the staff mailing list. Include attachments named: suid-1.pdf, suid-1.r, suid-2.r, suid-3.r, and suid-writeup.pdf, where suid is your SUID, suid-1.pdf is your histogram, and suid-writeup.pdf is your write-up for all problems. Thanks!


Additional Notes (updated 7pm 4/15):
We've posted a new and improved version of the data file. Please make sure you have the latest version - this should solve the problems we previously had with the compression of the archive. There was also a parsing problem caused by the presence of quote characters [ ' ] in the text field; they've now been removed. Just something to look out for in the future!