Task Specification
Each day, about 25 million unwelcome emails, so-called spam mails,
are sent. This corresponds to 10 percent of all emails worldwide.
Surveys of employees at many enterprises have found that, on average,
40 percent of the emails they receive each day are spam mails. For
some employees, this share even reaches 90 percent. The task of the
DMC 2003 contest is to identify spam mails by means of Data Mining.
The contest of the 4th DATA MINING CUP ended with the deadline for submitting the analysis results
on May 14, 2003.
The detailed task specification and data remain available here for interested students and universities.
Download of task specification and data:
Download of the class assignments for the classification file (for evaluation purposes):
Hint: Please use your download key for the download. If you do not have a download key, you can get one
here.
The key will be sent to you automatically via email.
If you have any questions, please mail to
.
Scenario
As part of measures to optimise its communication processes, a company
learns from reports by its employees that an exceedingly large share
of all incoming emails is advertising mail. The large amount of working
time spent on the daily sorting and final deletion of spam emails has
revealed a high potential for rationalization in the company of
120 employees.
For this reason, all emails of the company were collected over some
time and stored separately as spam and non-spam emails. Every email
was described by a set of attributes.
Within the scope of the Data Mining Cup, a sample of 8,000 emails
along with their class assignments is considered.
The aim of applying Data Mining is to create a classifier that is
able to separate non-spam from spam emails. The classifier (rule)
shall then automatically check all incoming emails and deliver only
non-spam mails directly. Spam emails should be stored in a separate
mail directory of the employee.
In the context of the Data Mining Cup, the classifier is to be
applied, by way of example, to 11,177 emails.
The objective is to minimize the number of spam emails that pass
the filter, subject to the following essential constraint: among the
filtered emails, at most 1.0 % non-spam emails (measured against the
total number of all non-spam mails) are allowed. Attention: If the
submitted solution does not satisfy this condition, the solution
will be ignored.
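The objective and its constraint can be sketched in code. The following is only an illustrative sketch, not the official evaluation procedure: it assumes the 1.0 % limit means that the number of filtered non-spam emails divided by the total number of non-spam emails must not exceed 0.01. The function name `evaluate_filter` and the boolean-list inputs are hypothetical and do not reflect the contest's file format.

```python
from typing import Sequence, Tuple

# Assumed reading of the task: at most 1.0 % of all non-spam mails may be filtered.
MAX_NONSPAM_FILTERED_RATE = 0.01


def evaluate_filter(is_spam: Sequence[bool],
                    filtered: Sequence[bool]) -> Tuple[int, float, bool]:
    """Return (spam mails that passed, share of non-spam filtered, constraint met).

    is_spam[i]  -- True if email i is actually spam
    filtered[i] -- True if the classifier held email i back as spam
    """
    if len(is_spam) != len(filtered):
        raise ValueError("label and prediction sequences must have equal length")

    # Objective: spam emails that were NOT filtered and reached the inbox.
    spam_passed = sum(1 for s, f in zip(is_spam, filtered) if s and not f)

    # Constraint: non-spam emails wrongly filtered, relative to all non-spam.
    nonspam_total = sum(1 for s in is_spam if not s)
    nonspam_filtered = sum(1 for s, f in zip(is_spam, filtered) if f and not s)
    rate = nonspam_filtered / nonspam_total if nonspam_total else 0.0

    return spam_passed, rate, rate <= MAX_NONSPAM_FILTERED_RATE


if __name__ == "__main__":
    # Toy data: 3 spam mails (one slips through), 200 non-spam (one wrongly filtered).
    labels = [True] * 3 + [False] * 200
    decisions = [True, True, False] + [True] + [False] * 199
    print(evaluate_filter(labels, decisions))
```

Under this reading, a solution is compared on the first returned value only if the third value is true; otherwise it would be ignored, as stated above.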
Frequently Asked Questions - FAQ
Does the distribution of the values of the target attribute
of the file data_dmc2003_class.txt correspond to that of
the file data_dmc2003_train.txt?
Answer:
Yes, both training and classification data are taken from the
same sample and, thus, have the same class distribution.
How can I get a description of the features?
Answer:
The features are identical to the features of the Open Source Project SpamAssassin (see http://spamassassin.org).
A short description of the tests and features can be found in the following PDF file:
DMC2003_Merkmale.pdf (approx. 60 kB; use your account data to download)