Distributed manual verification of a corpus

John Graham-Cumming (of POPFile, among other things) has set up a site for manual verification of the 2005 TREC Spam Track corpus.  The idea is that as many people as possible go to the site and manually classify each message they are shown as ham or spam.

The TREC corpus was primarily classified automatically, so it's possible that there are errors in the corpus.  It's an interesting experiment, and I look forward to reading papers about the results (and possibly using a more correct corpus). It's a shame that the TREC corpus is the best one available, since the mail is pretty old, and it's a weird collection of (Enron, I believe) mail from different people.  It will be particularly interesting to see if there are messages that many people disagree on – some messages are particularly hard to classify, since you don't know what the interests/subscriptions of the original recipient were.

The site itself is particularly well done, IMO.  Not only do you get the raw email, but you are also presented with a screenshot showing what the message looks like in a typical mail client.  This is a great idea.

I encourage everyone to go to the site and classify at least a few emails.  It doesn't take much time, and it's a great contribution. 


