| Predicting the three-dimensional shape of proteins from their amino acid sequence (primary structure) is widely believed to be one of the hardest unsolved problems in molecular biology. Locally, the amino acid chain folds into recurring patterns (spirals, turns, flat sections etc.) that make up the secondary structure; these are of considerable interest to pharmaceutical companies since a protein's shape generally determines its function as an enzyme. The dataset studied in [Muggleton 92] was created to learn rules that identify whether a position in a protein is in an alpha-helix. The positive examples state which positions of the chosen proteins are in an alpha-helix, and the negative examples cover the remaining positions.
The constants of the language denote the 20 amino acids and the values of some physical or chemical properties, such as sizes, hydrophobicities and polarities (e.g. polar0 and polar1). Background knowledge is expressed using the following predicates:
- position(A,B,C) meaning "residue of protein A at position B is C".
- octf(A,B,C,D,E,F,G,H,I) indexes groups of nine adjacent positions in a protein (positions A--I occur in sequence).
- alpha_triplet(A,B,C) and alpha_pair(A,B) index groups of three and two adjacent positions in a protein, respectively.
- alpha_pair4(A,B) holds if a pair of positions A,B is separated by 4 positions in a protein.
Additional unary predicates characterize some physical and chemical properties of the individual residues (hydrophobicity, hydrophilicity, charge, size, polarity, whether a residue is aliphatic or aromatic, whether it is a hydrogen donor or acceptor etc.). Ordering relations between some constants (e.g. less_than(polar0,polar1)) are also provided.
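To make the indexing predicates concrete, the following minimal Python sketch generates facts of the kind described above for a toy sequence. It is only an illustration under stated assumptions: the protein name "p1", the residue string, and the helper names position_facts, window_facts and residues_in_window are all hypothetical and are not taken from the dataset, which stores these relations as logic-program facts.

```python
# Toy sequence for a hypothetical protein "p1"; the residues are invented.
SEQUENCE = "mkvlaagivelk"  # 12 residues, one-letter codes


def position_facts(protein, sequence):
    """position(A,B,C): the residue of protein A at position B is C (1-based)."""
    return {(protein, i, res) for i, res in enumerate(sequence, start=1)}


def window_facts(length, width):
    """All tuples of `width` adjacent positions that fit inside 1..length."""
    return {tuple(range(start, start + width))
            for start in range(1, length - width + 2)}


position = position_facts("p1", SEQUENCE)
octf = window_facts(len(SEQUENCE), 9)           # nine adjacent positions
alpha_triplet = window_facts(len(SEQUENCE), 3)  # three adjacent positions
alpha_pair = window_facts(len(SEQUENCE), 2)     # two adjacent positions


def residues_in_window(protein, window):
    """Look up the residues at the positions of one window via position/3."""
    lookup = {(a, b): c for (a, b, c) in position}
    return "".join(lookup[(protein, pos)] for pos in window)


first_window = min(octf)  # the window starting at position 1
print(first_window, residues_in_window("p1", first_window))
```

A learned rule for the alpha-helix target would typically combine such a window with the unary property predicates on the residues found at its positions.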
|