Introduction:
In this assignment, you will use R to do some analysis
of clickstream data in the style of the Joachims et al. paper
on Accurately Interpreting Clickthrough Data.
As the clickstream data, we will use part of the search data
that was released by some AOL researchers in 2006. The
underlying search results were being provided by Google.
While the AOL
researchers anonymized user IDs, this data release was seen as
an enormous privacy breach, for which AOL later apologized.
Analysis of the data highlighted the subtlety of
deidentification in text data (an issue that also comes up
with patient medical records), and it was one factor that led
to changes in data retention policies at major search companies. See,
for instance,
this
NYT article or this more academic article.
We'll discuss this issue a little in
class. But, since the data has now been around for years and
has remained available on the web (on third party servers) the
entire time, presumably any malefactors have already done
whatever bad things can be done with the data, and so it seems
safe enough for us to use.
But, you're probably best off not actually reading
through the data. At any rate, here's the boilerplate (from
the AOL release): Please be aware that these queries are not
filtered to remove any content. Pornography is prevalent on
the Web and unfiltered search engine logs contain queries by
users who are looking for pornographic material. There are
queries in this collection that use SEXUALLY EXPLICIT
LANGUAGE. This collection of data is intended for use by
mature adults who are not easily offended by the use of
pornographic search terms. If you are offended by sexually
explicit language you should not read through this data. Also
be aware that in some states it may be illegal to expose a
minor to this data. Please understand that the data
represents REAL WORLD USERS, un-edited and randomly sampled,
and that neither AOL nor Stanford University is the author of this data.
The dataset can be found here and is
laid out as follows:
AnonID | Query | QueryTime | ItemRank | ClickURL |
1326 | back to the future | 2006-04-01 17:59:28 | 1 | http://www.imdb.com |
There is a record for each search results page shown. If no link was
clicked, only the first three columns are non-empty. If links were
clicked, there is a row for each clicked link, giving its position
(ItemRank) and the domain name of the page that the search
result pointed to (ClickURL). See the AOL readme for more information.
A couple of R tips:
- Admittedly, if I were doing large-scale search engine log
analysis, I wouldn't try to do it all in R. I'd use
Perl, or Python, or .... But this dataset is just 3.5 million
lines, so you should be fine on a machine with 2 GB of RAM.
- By default, R interprets quotes as surrounding field values, and, again by default,
the ragged nature of the data (3-column and 5-column rows) will make R barf while
reading it. But
read.table has quite flexible options, and this incantation should work:
clicks = read.table("user-ct-test-collection-01.txt",
header=T, sep="\t", fill=T, quote="")
You can find out the options to a command by typing, e.g., help(read.table). You will
see that there's a fair bit you can do with suitable options.
- The basic function for drawing a histogram is
hist()
- read.table returns a data frame. If you didn't
become very familiar with data frames in the first assignment,
now would be a good time to. In general in R, if you know
how to use data frames then your life will be pleasant, and
otherwise it will be miserable. There are some nice commands
otherwise it will be miserable. There are some nice commands
for manipulating data in data frames. One is
subset() for picking out subsets of data:
clicks10 = subset(clicks, ItemRank <= 10)
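For instance, hist() can be tried on a toy vector first; the ranks below are made up, just to show the shape of the call (plot=FALSE returns the bin counts without drawing):

```r
# Toy demonstration of hist() with made-up ranks, not the AOL data.
ranks <- c(1, 1, 1, 2, 2, 3, 5, 10)
h <- hist(ranks, breaks = 0:10, plot = FALSE)
h$counts  # number of values falling in each bin (0,1], (1,2], ..., (9,10]
```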
Here's what to do:
-
With this data, you should be able to produce a histogram roughly
like the black bars (for clicks) in Figure 1 of the Joachims
et al. paper. There are actually a couple of judgment calls to make
about how best to do this. Provide a pdf histogram, your best R
code, and, in your write-up, (i) justify the choices you made,
and (ii) discuss how the results from the AOL data are
the same as or different from the results that Joachims et
al. report. Are there any hypotheses you can make to explain any differences you observe?
(If you'd like a hint of how you might approach this in R, making use of its data manipulation functions rather than writing your own program, you might look at the documentation of the split() function.)
(4 points)
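One possible starting point is sketched below. The tiny data frame is fabricated so the sketch runs standalone; only the column name ItemRank comes from the layout above, and decisions like whether to cap at rank 10 are still yours to justify:

```r
# Sketch for problem 1, using a tiny fabricated data frame in place of the
# real log. Rows with NA ItemRank correspond to result pages with no click.
clicks <- data.frame(ItemRank = c(1, 1, 2, 3, 1, NA, 2))
clicked <- subset(clicks, !is.na(ItemRank) & ItemRank <= 10)
counts <- table(clicked$ItemRank)  # clicks per result position
barplot(counts, xlab = "Rank of clicked link", ylab = "Number of clicks")
```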
-
If we accept the results of the paper, we should be able to use
the clickthrough data to determine which queries were giving
better and worse search results.
This file contains 10 queries that
are relatively frequent in the AOL search log that we're
using. Using the ideas from the paper, for which queries are
the returned results better or worse at satisfying user
information needs? Put them into a rough ranking poset. Look
a bit carefully at the data available for each query. Should
some queries be put aside as not providing reliable information?
Provide the R code you used and in your write-up show your
rough result quality ranking and include any necessary
discussion. (4 points)
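As a hedged sketch of the kind of per-query summary you might compute (the data frame below is fabricated; note also that in the real log a no-click page is a single row while a multi-click page is several rows, so per-row statistics are only an approximation you'd need to refine):

```r
# Sketch for problem 2 with fabricated rows. Two crude quality signals
# per query: the fraction of rows with no click, and the mean clicked rank.
clicks <- data.frame(
  Query    = c("maps", "maps", "news", "news", "news"),
  ItemRank = c(1, NA, 3, 5, NA)
)
no.click  <- tapply(is.na(clicks$ItemRank), clicks$Query, mean)
mean.rank <- tapply(clicks$ItemRank, clicks$Query, mean, na.rm = TRUE)
```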
-
This file contains data corresponding
to the "normal" results in the upper half of Table 3 of the
Joachims et al. paper. The easy way to make the table
(excluding the marginal totals) in R is with the
xtabs() function. The paper states that by
looking at the 4 top middle cells of the upper part of the
table (where one page is judged more relevant than the other
and only one link or the other is clicked), the table
shows, assuming a binomial distribution, that we can reject
the hypothesis that users are not influenced in what they
click on by the order of presentation. This can be most
directly tested on these 4 cells using Fisher's
Exact test; the R command for this is fisher.test().
Confirm the statement from the paper. Give
the R commands used for generating the crosstabs and the
hypothesis test, and state how likely this or a more extreme
distribution of the referenced cells is to have occurred by chance
(according to Fisher's Exact test). (2 points)
Collaboration policy: You must complete problem 1 individually. You may
collaborate on the remaining problems.
Submission guidelines:
Your submission should consist of an email to the
staff mailing list. Include attachments named: suid-1.pdf,
suid-1.r, suid-2.r, suid-3.r, and suid-writeup.pdf,
where suid is your SUID and the two PDFs are your histogram and your
writeup for all problems. Thanks!
Additional Notes: (Updated 7pm 4/15:)
We've posted a new and improved version of the data file. Please make sure
you have the latest - this should solve previous problems we had with the
compression of the archive. There was also a parsing problem involving the
presence of quotes [ ' ] in the text field. They've now been removed. Just
something to look out for in the future!