Document understanding


Name (abbrev)

Name (full)

Category

Last update

 

Document understanding

Language

b D, Y


Application domain: Document understanding
Further specifications:


Type: ILP
Format: FOCL format
Complexity: approximately 250 training and 120 test instances


WWW / FTP: ftp://ftp.mlnet.org/ml-archive/general/data/doc-understanding/


Contact person(s)

Related group(s)

Optional contact address

 


References

 

[Malerba93] Malerba D.
Document Understanding: A Machine Learning Approach
Technical Report, Esprit Project 5203 INTREPID, 4 March 1993.

[Esposito93a] Esposito F., Malerba D., Semeraro G., & Pazzani M.
A Machine Learning Approach to Document Understanding
Proc. 2nd Int. Workshop on Multistrategy Learning, Harpers Ferry,
WV, pp. 276-292, May 1993.

[Esposito93b] Esposito F., Malerba D., & Semeraro G.
Learning Contextual Rules in First-Order Logic
Proc. 4th Italian Workshop on Machine Learning (GAA93), Milan, Italy,
pp. 111-127, June 1993.

[Esposito93c] Esposito F., Malerba D., & Semeraro G.
Automated Acquisition of Rules for Document Understanding
Proc. 2nd Int. Conf. on Document Analysis and Recognition,
Tsukuba Science City, Japan, pp. 650-654, October 1993.

[Semeraro94] Semeraro G., Esposito F., & Malerba D.
Learning Contextual Rules for Document Understanding
Proc. 10th IEEE Conf. on Artificial Intelligence for Applications,
San Antonio, Texas, pp. 108-115, March 1994.

[Esposito94] Esposito F., Malerba D., & Semeraro G.
Multistrategy Learning for Document Recognition
Applied Artificial Intelligence, 8, pp. 33-84, 1994.


Annotations

 

The task is to classify parts of a business letter using information about the layout of a one-page document. Five concepts are to be learned, each expressed as a predicate: sender, receiver, logotype, reference number and date. The description language characterizes properties of individual text blocks (width, height, position on the page, etc.) as well as the mutual position of two blocks (e.g. aligned-only-upper-row(X,Y)). The dataset describes 30 single-page documents, providing approximately 250 training instances and 120 test instances.
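To make the description language more concrete, here is a minimal Python sketch of a text block and one binary layout relation. It is an illustration only, not the dataset's FOCL encoding: the TextBlock class, the coordinate convention, the tolerance parameter and the example blocks are all hypothetical, and the reading of aligned-only-upper-row as "top edges coincide but bottom edges do not" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical layout block; coordinates are in arbitrary page units."""
    name: str
    x: int        # left edge
    y: int        # top edge (y grows downward)
    width: int
    height: int

    @property
    def bottom(self) -> int:
        return self.y + self.height


def aligned_only_upper_row(a: TextBlock, b: TextBlock, tol: int = 2) -> bool:
    """Assumed semantics: top edges match within tol, bottom edges do not."""
    tops_match = abs(a.y - b.y) <= tol
    bottoms_match = abs(a.bottom - b.bottom) <= tol
    return tops_match and not bottoms_match


# Invented example blocks, not taken from the dataset.
logo = TextBlock("logo", x=20, y=15, width=60, height=30)
sender = TextBlock("sender_block", x=120, y=15, width=80, height=50)
date = TextBlock("date_block", x=120, y=90, width=40, height=10)

print(aligned_only_upper_row(logo, sender))  # same top edge, different bottom edge
print(aligned_only_upper_row(logo, date))    # top edges differ
```

Unary properties such as width and height become attributes of a block, while relations between two blocks become two-argument predicates, mirroring the X and Y in aligned-only-upper-row(X,Y).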
The problem is complicated by dependencies among the concepts, so it can be cast as a multiple predicate learning problem. Experimental results indicate that learning contextual rules, that is, rules in which concept dependencies are explicitly considered, leads to good results.
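The difference between an independent rule and a contextual rule can be sketched in Python as follows. Both rules are invented for illustration and are not rules learned from this dataset; the Block type, the thresholds and the to_right relation are hypothetical. The point is only that the sender rule refers to another concept (logotype) in its body.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical layout block; coordinates are in arbitrary page units."""
    name: str
    x: int
    y: int
    width: int
    height: int


def to_right(a: Block, b: Block) -> bool:
    """a starts at or beyond the right edge of b."""
    return a.x >= b.x + b.width


def is_logotype(block: Block) -> bool:
    """Independent rule: uses only the block's own layout properties."""
    return block.y < 50 and block.height >= 25


def is_sender(block: Block, page: List[Block]) -> bool:
    """Contextual rule: a sender lies to the right of a logotype, near its top edge."""
    return any(
        other != block
        and is_logotype(other)            # dependency on another concept
        and to_right(block, other)
        and abs(block.y - other.y) <= 5
        for other in page
    )


# Invented one-page layout with three blocks.
page = [
    Block("b1", 20, 15, 60, 30),
    Block("b2", 120, 15, 80, 20),
    Block("b3", 120, 90, 40, 10),
]

for b in page:
    print(b.name, "logotype:", is_logotype(b), "sender:", is_sender(b, page))
```

Because is_sender calls is_logotype, the two concepts cannot be learned or evaluated in isolation, which is what makes this a multiple predicate learning setting.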
The results were first published in a technical report [Malerba93]; summaries appear in [Esposito93a, Esposito93b, Esposito93c, Semeraro94]. The complete document processing system is treated in detail in [Esposito94].

 

Comments

 

This dataset was first used by D. Malerba during his stay at ICS, University of California, Irvine (Sept.-Dec. 1992).

 

 

 


 

 

Supported by EU project Esprit No. 29288, University of Magdeburg and GMD
© ECSC - EC - EAEC, Brussels-Luxembourg, 2000