
The PU1 anti-spam filtering corpus


Name (abbrev)

Name (full)


Last update



The PU1 anti-spam filtering corpus

text classification



Application domain: Anti-spam email filtering

Further specifications: 618 personal email messages; 481 spam email messages; 4 versions of the data






"encrypted" ASCII text

4314 KBytes


Contact person(s): Androutsopoulos, Ion
Related group(s): NCSR Demokritos, Institute of Informatics
Optional contact address: [email protected]




I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos, "An Experimental Comparison of Naive Bayesian and
Keyword-Based Anti-Spam Filtering with Personal E-mail Messages". To appear in the Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, 2000.







This directory contains the PU1 corpus, as described in the paper "An
Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam
Filtering with Personal E-mail Messages" by I. Androutsopoulos,
J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos, Proceedings of
the 23rd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR-2000), Athens, Greece.

There are 4 subdirectories, corresponding to the four "encrypted"
versions of the corpus mentioned in the paper:

bare: Lemmatiser disabled, stop-list disabled.
lemm: Lemmatiser enabled, stop-list disabled.
lemm_stop: Lemmatiser enabled, stop-list enabled.
stop: Lemmatiser disabled, stop-list enabled.
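
The four directory names encode two preprocessing switches. A minimal
Python sketch of that mapping follows; the names VERSIONS and version_for
are hypothetical and introduced here only for illustration, they are not
part of the corpus distribution.

```python
# Maps each version directory to its (lemmatiser, stop_list) settings,
# exactly as listed above. VERSIONS and version_for are illustrative names.
VERSIONS = {
    "bare": (False, False),
    "lemm": (True, False),
    "lemm_stop": (True, True),
    "stop": (False, True),
}


def version_for(lemmatiser: bool, stop_list: bool) -> str:
    """Return the directory name for a given preprocessing combination."""
    for name, flags in VERSIONS.items():
        if flags == (lemmatiser, stop_list):
            return name
    raise ValueError("no corpus version for this combination")
```

For example, version_for(True, False) selects the "lemm" directory.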

Each one of these 4 directories contains 10 subdirectories (part1, ...,
part10). These correspond to the 10 partitions of the corpus that
were used in the 10-fold experiments. In each repetition, one part
was reserved for testing and the other 9 were used for training.
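
The rotation described above can be sketched in Python as follows. This
is only an illustration of the 10-fold scheme, not the code used in the
paper; the names PARTS and folds are assumptions made for this example.

```python
from typing import Iterator, List, Tuple

# Directory names part1 ... part10, as described above.
PARTS = [f"part{i}" for i in range(1, 11)]


def folds(parts: List[str] = PARTS) -> Iterator[Tuple[List[str], str]]:
    """Yield (training_parts, test_part) for each of the repetitions.

    In every repetition one part is held out for testing and the
    remaining parts are used for training.
    """
    for test_part in parts:
        train_parts = [p for p in parts if p != test_part]
        yield train_parts, test_part
```

Iterating over folds() gives ten repetitions, each with nine training
parts and one test part.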

Each one of the 10 subdirectories contains both spam and legitimate
messages, one message in each file. Files whose names have the form
*spmsg*.txt are spam messages. Files whose names have the form
*legit*.txt are legitimate messages.
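
A minimal Python sketch of reading one part directory and labelling each
message by its file name is given below. The function names label_for and
load_part are hypothetical, and reading the files as ASCII is an
assumption based on the "encrypted" ASCII format of the corpus.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import List, Tuple


def label_for(filename: str) -> str:
    """Derive the class label from the file name patterns described above."""
    if fnmatch(filename, "*spmsg*.txt"):
        return "spam"
    if fnmatch(filename, "*legit*.txt"):
        return "legit"
    raise ValueError(f"unrecognised message file name: {filename}")


def load_part(part_dir: str) -> List[Tuple[str, str]]:
    """Return (message_text, label) pairs for every .txt file in a part."""
    pairs = []
    for path in sorted(Path(part_dir).glob("*.txt")):
        # Assumption: files are plain ASCII; undecodable bytes are replaced.
        text = path.read_text(encoding="ascii", errors="replace")
        pairs.append((text, label_for(path.name)))
    return pairs
```

Calling load_part on a directory such as bare/part1 would return one pair
per message file, assuming the corpus has been unpacked locally.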

You are free to use this corpus for non-commercial purposes, provided
that you acknowledge the use and origin of the corpus in any published
work of yours that makes use of the corpus, and that you notify the
person below about this work. To use this corpus for commercial
applications, you must obtain written permission from the person below.

Ion Androutsopoulos
E-mail: [email protected]
July 17, 2000



