Dataset Details - Machine Learning network Online Information Service

PU1

The PU1 anti-spam filtering corpus


	Name (abbrev)	Name (full)	Category	Last update
	PU1	The PU1 anti-spam filtering corpus	text classification	b D, Y
	Application domain		Further specifications
	Anti-spam email filtering		618 personal email messages; 481 spam email messages; 4 versions of the data
	Type	Format	Complexity
		"encrypted" ASCII text	4314 KBytes
	WWW / FTP
	http://www.iit.demokritos.gr/~ionandr/pu1_encoded.tar.gz http://www.aueb.gr/users/ion/publications.html http://
	Contact person(s)	Related group(s)	Optional contact address
	Androutsopoulos, Ion	NCSR Demokritos, Institute of Informatics	[email protected]
	References
	I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos, "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages". To appear in the Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, 2000.
	Annotations

	Comments
	This directory contains the PU1 corpus, as described in the paper "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages" by I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-2000), Athens, Greece.

	There are 4 subdirectories, corresponding to the four "encrypted" versions of the corpus mentioned in the paper:

	bare: Lemmatiser disabled, stop-list disabled.
	lemm: Lemmatiser enabled, stop-list disabled.
	lemm_stop: Lemmatiser enabled, stop-list enabled.
	stop: Lemmatiser disabled, stop-list enabled.

	Each one of these 4 directories contains 10 subdirectories (part1, ..., part10). These correspond to the 10 partitions of the corpus that were used in the 10-fold experiments. In each repetition, one part was reserved for testing and the other 9 were used for training.

	Each one of the 10 subdirectories contains both spam and legitimate messages, one message in each file. Files whose names have the form spmsg.txt are spam messages. Files whose names have the form legit.txt are legitimate messages.

	You are free to use this corpus for non-commercial purposes, provided that you acknowledge the use and origin of the corpus in any published work of yours that makes use of the corpus, and that you notify the person below about this work. To use this corpus for commercial applications, you must obtain a written permission from the person below.

	Ion Androutsopoulos
	E-mail: [email protected]
	http://www.iit.demokritos.gr/~ionandr
	July 17, 2000
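
	The directory layout and the 10-fold protocol described in the comments above could be read along the following lines. This is a minimal sketch, not the authors' code. The function names (label_for, load_part, ten_fold_splits), the root path argument, the latin-1 decoding, and the choice to detect the class by looking for "spmsg" or "legit" anywhere in the file name are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, List, Tuple

# Version directories and part count as listed in the corpus comments.
VERSIONS = ("bare", "lemm", "lemm_stop", "stop")
NUM_PARTS = 10

Message = Tuple[str, int]  # (message text, label): 1 = spam, 0 = legitimate


def label_for(filename: str) -> int:
    """Map a file name to a class label.

    Assumption: the marker "spmsg" or "legit" appears somewhere in the name.
    """
    if "spmsg" in filename:
        return 1
    if "legit" in filename:
        return 0
    raise ValueError(f"cannot infer label from file name: {filename!r}")


def load_part(root: Path, version: str, part: int) -> List[Message]:
    """Read every .txt message in <root>/<version>/part<part>."""
    part_dir = Path(root) / version / f"part{part}"
    # latin-1 is an assumption; it decodes any byte sequence without error.
    return [
        (path.read_text(encoding="latin-1"), label_for(path.name))
        for path in sorted(part_dir.glob("*.txt"))
    ]


def ten_fold_splits(
    root: Path, version: str
) -> Iterator[Tuple[List[Message], List[Message]]]:
    """Yield (train, test) pairs: one part held out, the other 9 for training."""
    parts = [load_part(root, version, i) for i in range(1, NUM_PARTS + 1)]
    for held_out in range(NUM_PARTS):
        test = parts[held_out]
        train = [
            message
            for index, part in enumerate(parts)
            if index != held_out
            for message in part
        ]
        yield train, test
```

	Calling ten_fold_splits(Path("pu1_encoded"), "bare") would iterate over the ten repetitions for the bare version, where "pu1_encoded" stands in for wherever the archive was unpacked (a hypothetical path). Using the fixed part1, ..., part10 partitions rather than a fresh random split keeps results comparable with the experiments reported in the paper.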

Supported by EU project Esprit No. 29288, University of Magdeburg and GMD
© ECSC - EC - EAEC, Brussels-Luxembourg, 2000