PU1

The PU1 anti-spam filtering corpus


Name (abbrev): PU1
Name (full): The PU1 anti-spam filtering corpus
Category: text classification
Last update: (not given)

Application domain: Anti-spam email filtering
Further specifications: 618 personal email messages; 481 spam email messages; 4 versions of the data

Type / Format: "encrypted" ASCII text
Complexity: 4314 KBytes

WWW / FTP:
http://www.iit.demokritos.gr/~ionandr/pu1_encoded.tar.gz
http://www.aueb.gr/users/ion/publications.html

Contact person(s): Androutsopoulos, Ion
Related group(s): NCSR Demokritos, Institute of Informatics
Optional contact address: [email protected]

References

I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos, "An Experimental Comparison of Naive Bayesian and
Keyword-Based Anti-Spam Filtering with Personal E-mail Messages". In Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, 2000.


Comments

 

This directory contains the PU1 corpus, as described in the paper "An
Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam
Filtering with Personal E-mail Messages" by I. Androutsopoulos,
J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos, Proceedings of
the 23rd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR-2000), Athens, Greece.

There are 4 subdirectories, corresponding to the four "encrypted"
versions of the corpus mentioned in the paper:

bare: Lemmatiser disabled, stop-list disabled.
lemm: Lemmatiser enabled, stop-list disabled.
lemm_stop: Lemmatiser enabled, stop-list enabled.
stop: Lemmatiser disabled, stop-list enabled.

Each one of these 4 directories contains 10 subdirectories (part1, ...,
part10). These correspond to the 10 partitions of the corpus that
were used in the 10-fold experiments. In each repetition, one part
was reserved for testing and the other 9 were used for training.
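
The 10-fold protocol described above can be sketched in Python as follows. This is a minimal sketch, assuming the archive has been unpacked so that a version directory (for example bare) directly contains part1 ... part10. The function name ten_fold_splits and the path pu1_encoded/bare are illustrative assumptions and are not part of the corpus distribution.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def ten_fold_splits(
    version_dir: Path, n_parts: int = 10
) -> Iterator[Tuple[List[Path], Path]]:
    """Yield (training_parts, test_part) pairs, one per repetition.

    Assumes version_dir holds subdirectories part1 ... part10,
    as described in the corpus notes. No files are read here;
    only the directory paths are paired up.
    """
    parts = [version_dir / f"part{i}" for i in range(1, n_parts + 1)]
    for held_out in parts:
        # One part is reserved for testing, the other nine for training.
        train = [p for p in parts if p != held_out]
        yield train, held_out


if __name__ == "__main__":
    # Hypothetical location of the unpacked "bare" version.
    for train_parts, test_part in ten_fold_splits(Path("pu1_encoded/bare")):
        print(test_part.name, "<-", [p.name for p in train_parts])
```

Keeping the split at the directory level means the original partitions are reused as published, rather than reshuffling messages into new folds.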

Each one of the 10 subdirectories contains both spam and legitimate
messages, one message in each file. Files whose names have the form
*spmsg*.txt are spam messages. Files whose names have the form
*legit*.txt are legitimate messages.
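
The naming convention above is enough to label each message without any extra metadata. The following Python sketch assumes one message per .txt file inside a part directory. The helper names label_for and load_part are hypothetical, and decoding with errors="replace" is a defensive assumption rather than something the corpus notes require.

```python
from fnmatch import fnmatchcase
from pathlib import Path
from typing import List, Optional, Tuple


def label_for(filename: str) -> Optional[str]:
    """Return "spam", "legit", or None if the name matches neither pattern."""
    if fnmatchcase(filename, "*spmsg*.txt"):
        return "spam"
    if fnmatchcase(filename, "*legit*.txt"):
        return "legit"
    return None


def load_part(part_dir: Path) -> List[Tuple[str, str]]:
    """Read every labelled message in one part directory as (text, label)."""
    messages: List[Tuple[str, str]] = []
    for path in sorted(part_dir.glob("*.txt")):
        label = label_for(path.name)
        if label is None:
            continue  # skip files that follow neither naming pattern
        # The record lists the format as "encrypted" ASCII text;
        # errors="replace" guards against stray non-ASCII bytes.
        text = path.read_text(encoding="ascii", errors="replace")
        messages.append((text, label))
    return messages
```

Calling load_part on each of the nine training directories and concatenating the results would give the training set for one repetition.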

You are free to use this corpus for non-commercial purposes, provided
that you acknowledge the use and origin of the corpus in any published
work of yours that makes use of the corpus, and that you notify the
person below about this work. To use this corpus for commercial
applications, you must obtain written permission from the person below.

Ion Androutsopoulos
E-mail: [email protected]
http://www.iit.demokritos.gr/~ionandr
July 17, 2000

 

 

 

Supported by EU project Esprit No. 29288, University of Magdeburg and GMD
© ECSC - EC - EAEC, Brussels-Luxembourg, 2000