Dataset Details - Machine Learning network Online Information Service

Document understanding



	Name (abbrev)	Name (full)	Category	Last update
	Document understanding		Language	
	Application domain		Further specifications
	Document understanding
	Type	Format	Complexity
	ILP	FOCL format	Approximately 250 training and 120 test instances
	WWW / FTP
	ftp://ftp.mlnet.org/ml-archive/general/data/doc-understanding/
	Contact person(s)	Related group(s)	Optional contact address

	References
	Malerba D. Document Understanding: A Machine Learning Approach. Technical Report, Esprit Project 5203 INTREPID, 4 March 1993.
	Esposito F., Malerba D., Semeraro G., & Pazzani M. A Machine Learning Approach to Document Understanding. Proc. 2nd Int. Workshop on Multistrategy Learning, Harpers Ferry, WV, pp. 276-292, May 1993.
	Esposito F., Malerba D., & Semeraro G. Learning Contextual Rules in First-Order Logic. Proc. 4th Italian Workshop on Machine Learning (GAA93), Milan, Italy, pp. 111-127, June 1993.
	Esposito F., Malerba D., & Semeraro G. Automated Acquisition of Rules for Document Understanding. Proc. 2nd Int. Conf. on Document Analysis and Recognition, Tsukuba Science City, Japan, pp. 650-654, October 1993.
	Semeraro G., Esposito F., & Malerba D. Learning Contextual Rules for Document Understanding. Proc. 10th IEEE Conf. on Artificial Intelligence for Applications, San Antonio, Texas, pp. 108-115, March 1994.
	Esposito F., Malerba D., & Semeraro G. Multistrategy Learning for Document Recognition. Applied Artificial Intelligence, 8, pp. 33-84, 1994.
	Annotations
	The task is to classify parts of a business letter using information about the layout of a one-page document. There are five concepts to be learned, expressed as predicates: sender, receiver, logotype, reference number and date. The representation language describes properties of individual text blocks (width, height, position on the page, etc.) as well as the relative position of two blocks (e.g. aligned-only-upper-row(X,Y)). The dataset describes 30 single-page documents, providing approximately 250 training instances and 120 test instances. The problem is complicated by dependencies among the concepts, so it can be cast as a multiple predicate learning problem. Experimental results indicate that learning contextual rules, that is, rules in which concept dependencies are explicitly considered, leads to good results. The results were first published in a technical report [Malerba93]; summaries appear in [Esposito93a, Esposito93b, Esposito93c, Semeraro94]. The complete document processing system is described in detail in [Esposito94].
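	As an illustration only, the following Python sketch shows how text blocks, a relation between two blocks, and a contextual rule might be represented. Everything in it is an assumption made for the example: the Block fields, the coordinate units, the tolerance, the reading of aligned-only-upper-row as "top edges aligned, bottom edges not", and both rules. None of it is taken from the dataset files or from the rules learned in the cited papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rectangular text block on a page (hypothetical units, origin at top-left)."""
    name: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


def aligned_only_upper_row(a: Block, b: Block, tol: int = 2) -> bool:
    """Assumed reading of the relation: top edges line up, bottom edges do not."""
    tops_match = abs(a.y - b.y) <= tol
    bottoms_match = abs((a.y + a.height) - (b.y + b.height)) <= tol
    return tops_match and not bottoms_match


def is_logotype(b: Block) -> bool:
    """Invented non-contextual rule: a fairly tall block near the top of the page."""
    return b.y < 100 and b.height >= 60


def is_sender(b: Block, blocks: list[Block]) -> bool:
    """Invented contextual rule: the body refers to another concept (logotype)."""
    return any(
        other != b and is_logotype(other) and aligned_only_upper_row(b, other)
        for other in blocks
    )


# A made-up one-page layout with three blocks.
blocks = [
    Block("b1", x=40, y=30, width=120, height=80),
    Block("b2", x=300, y=30, width=200, height=40),
    Block("b3", x=40, y=400, width=460, height=300),
]

logotypes = [b.name for b in blocks if is_logotype(b)]
senders = [b.name for b in blocks if is_sender(b, blocks)]
```

	The point of the sketch is the dependency: is_sender cannot be evaluated without first deciding which blocks are logotypes, which is what makes the task a multiple predicate learning problem rather than five independent ones.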
	Comments
	This dataset was first used by D. Malerba during his stay at ICS, University of California, Irvine (Sept.-Dec. 1992).

Supported by EU project Esprit No. 29288, University of Magdeburg and GMD
© ECSC - EC - EAEC, Brussels-Luxembourg, 2000