DATA MINING CUP 2003
 
Task Specification

Each day, about 25 million unwelcome emails, so-called spam mails, are sent. This corresponds to 10 percent of all emails worldwide. Surveys among the employees of many enterprises have shown that, on average, 40 percent of the emails they receive daily are spam mails. For some employees, this share even reaches 90 percent. The task of the DMC 2003 contest is to identify spam mails by means of Data Mining.



The contest of the 4th DATA MINING CUP ended with the deadline for handing in the analysis results on May 14, 2003.

The detailed task specification and data are available here for interested students and universities.


Download of task specification and data:

Download of the class assignments for the classification file (for evaluation purposes):

Hint: Please use your download key for the download. If you have no download key, you can get your key here. The key will be sent to you automatically via email.

If you have any questions, please mail to
.

Scenario

As part of its measures to optimise its communication processes, a company has learned from its employees that an exceedingly large share of all incoming emails are advertising mails. The large amount of working time spent on the daily sorting and final deletion of spam emails has revealed a high potential for rationalization in the company of 120 employees.

For this reason, all emails of the company have been collected over some time and stored separately by spam and non-spam emails. Every email was described by a set of attributes.

Within the scope of the Data Mining Cup, the data of 8000 emails along with their class assignments are considered as an example.

The aim of applying Data Mining is to create a classifier that is able to separate non-spam from spam emails. The classifier (rule) shall then automatically check all incoming emails and deliver only non-spam mails directly. Spam emails should be stored in a separate mail directory of the employee.

In the context of the Data Mining Cup, the classifier is to be applied, as an example, to 11,177 emails.

The objective is to minimize the number of spam emails that pass the filter, subject to the following essential constraint: Among the emails filtered out as spam, at most 1.0 % non-spam emails (measured against the total number of all non-spam mails) are allowed. Attention: If a submitted solution does not satisfy this condition, the solution will be ignored.
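The objective and its constraint can be illustrated with a minimal Python sketch. It is not the official evaluation procedure of the contest. It assumes that labels are encoded as 1 (spam) and 0 (non-spam), which is not stated in the task description, and the function names evaluate_filter and pick_threshold are hypothetical. The first function counts the spam mails that pass the filter and checks whether at most 1.0 % of all non-spam mails are filtered out. The second shows one possible way to choose a cut-off for a score-based classifier so that the constraint holds on labelled data.

```python
def evaluate_filter(y_true, y_pred, max_fp_rate=0.01):
    """Score a solution under the assumed encoding 1 = spam, 0 = non-spam."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    n_non_spam = sum(1 for t, _ in pairs if t == 0)
    # Non-spam mails wrongly filtered out as spam (false positives).
    filtered_non_spam = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Spam mails that passed the filter (false negatives); to be minimized.
    missed_spam = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp_rate = filtered_non_spam / n_non_spam if n_non_spam else 0.0
    return {
        "valid": fp_rate <= max_fp_rate,
        "fp_rate": fp_rate,
        "missed_spam": missed_spam,
    }


def pick_threshold(scores, y_true, max_fp_rate=0.01):
    """Return a cut-off t so that flagging 'score > t' as spam keeps the
    share of filtered non-spam mails within max_fp_rate on this data."""
    non_spam_scores = sorted(
        (s for s, t in zip(scores, y_true) if t == 0), reverse=True
    )
    # Number of non-spam mails that may be filtered out at most.
    allowed = int(max_fp_rate * len(non_spam_scores))
    if allowed >= len(non_spam_scores):
        return float("-inf")
    # At most `allowed` non-spam scores lie strictly above this value.
    return non_spam_scores[allowed]


if __name__ == "__main__":
    # Toy data: 100 non-spam mails with scores 0..99, four spam mails.
    y_true = [0] * 100 + [1] * 4
    scores = [float(i) for i in range(100)] + [99.5, 98.5, 50.0, 100.0]
    t = pick_threshold(scores, y_true)
    y_pred = [1 if s > t else 0 for s in scores]
    print(t, evaluate_filter(y_true, y_pred))
```

A threshold chosen on the training data only approximately carries over to new emails, so in practice one would likely leave a safety margin below the 1.0 % limit.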

Frequently Asked Questions - FAQ

Does the distribution of the values of the target attribute of the file data_dmc2003_class.txt correspond to that of the file data_dmc2003_train.txt?
Answer:
Yes, both training and classification data are taken from the same sample and, thus, have the same class distribution.

How can I get a description of the features?
Answer:
The features are identical to the features of the Open Source Project SpamAssassin (see http://spamassassin.org). A short description of the tests and features can be found in the following PDF file: DMC2003_Merkmale.pdf (approx. 60 kB; use your account data for the download)
