THEA (Tools for High-throughput Experiments Analysis) is an integrated information processing system designed for convenient handling of data. It automatically annotates data produced by classification systems with selected biological information drawn from a knowledge base. Users can then either search and browse these annotations manually, or automatically generate meaningful generalizations according to statistical criteria (data mining).

The problem

During the last decade, genome sequencing projects have fed large biological data banks with an extraordinary amount of data. However, these raw data are of little use unless they are transformed into knowledge. Currently, the laborious process of annotation is carried out jointly by human experts and data-processing programs. Automatic annotations are generated either by algorithms based on biological modeling (search for genes or non-coding regions, 3D structure computation) or, more commonly, by alignment programs. The principle of the second method is to infer sequence homology from sequence similarity and to deduce, from this result, similarities of structure or function [Reese]. The knowledge generated is, however, not easily usable: it is represented in various forms (free text, keywords, controlled vocabulary, relational database), its reliability varies with the method used (manual annotations vs. predicted ones), and it is dispersed among thousands of scientific papers, database annotations and the minds of biologists.

A similar scenario is taking shape for new technologies (proteomics, transcriptomics), which are starting to produce torrents of data. The goal is now less to study individual biological objects in depth than to track the activity of whole genomes, temporally and spatially. Knowledge is no longer built on the basis of sequence alignments but according to the measured activity of biological objects in particular experimental contexts. All high-throughput techniques rest on the assumption that a set of gene products which react together in a coordinated manner is probably involved in the same functional module. The work thus consists of two distinct phases: identifying these modules and then understanding their role.

The first phase is easy to automate and has been studied extensively. Numerous techniques are dedicated to the acquisition, normalisation, filtering and clustering of high-throughput results produced by post-genomic experiments.

However, at the end of this first phase, the generated data are more reliable and better organised, but still very numerous. Automatic systems that extract useful knowledge from raw data have to face recurrent problems in genome annotation, including inconsistent function descriptions, false (positive or negative) assignments, unsupported predictions, and haphazard use of various terms. There is more than ever a need for automatic techniques that rely on structured and controlled vocabularies (ontologies) to analyse large quantities of data in order to discover meaningful patterns and rules.
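To make the idea of ontology-based pattern discovery concrete, here is a minimal sketch of one common statistical criterion: a hypergeometric over-representation test of ontology terms within a cluster of genes. This is an illustration under stated assumptions, not THEA's actual implementation. The function names (`hypergeom_sf`, `enriched_terms`), the gene identifiers, the term labels and the 0.05 threshold are all hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def enriched_terms(cluster, annotations, alpha=0.05):
    """Return (term, p_value) pairs for terms over-represented in the cluster.

    `annotations` maps each gene of the background population to a set of
    ontology terms. Hypothetical helper; no multiple-testing correction.
    """
    population = set(annotations)
    cluster = set(cluster) & population
    N, n = len(population), len(cluster)

    # Invert the gene -> terms mapping into term -> genes.
    term_genes = {}
    for gene, terms in annotations.items():
        for term in terms:
            term_genes.setdefault(term, set()).add(gene)

    results = []
    for term, genes in term_genes.items():
        k = len(genes & cluster)
        if k == 0:
            continue
        p = hypergeom_sf(k, N, len(genes), n)
        if p <= alpha:
            results.append((term, p))
    return sorted(results, key=lambda pair: pair[1])


# Toy data (invented for illustration): 10 genes, 2 ontology terms.
toy_annotations = {
    "g1": {"cell cycle", "transport"},
    "g2": {"cell cycle"},
    "g3": {"cell cycle"},
    "g4": {"cell cycle"},
    "g5": {"transport"},
    "g6": {"transport"},
    "g7": {"transport"},
    "g8": {"transport"},
    "g9": {"transport"},
    "g10": {"transport"},
}
toy_cluster = ["g1", "g2", "g3"]
toy_result = enriched_terms(toy_cluster, toy_annotations)
```

In this toy example, only the "cell cycle" term passes the threshold, because all three clustered genes carry it while just four of the ten background genes do. A real analysis would also correct for multiple testing and could exploit the ontology hierarchy to generalize from specific terms to their parents.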


THEA is no longer available.

Related publication

1. Pasquier C, Girardot F, Jevardat De Fombelle K, Christen R: THEA: ontology-driven analysis of microarray data. Bioinformatics (Oxford, England) 2004, 20:2636–43.