
Semantic Data Integration

Background

Biology is now an information-intensive science, and research in genomics, transcriptomics and proteomics depends heavily on the availability and efficient use of information. When data were structured and organized as collections of records in dedicated, self-sufficient databases, information was retrieved by querying the database in a specialized query language, for example SQL (Structured Query Language) for relational databases or OQL (Object Query Language) for object databases. In modern biology, exploiting the different kinds of information available on a given topic is challenging because data are spread over the World Wide Web (Web), hosted in a large number of independent, heterogeneous and highly focused resources.

The Web is a system of interlinked documents distributed over the Internet. It gives access to a large number of valuable resources, designed mainly for human use and comprehension. In practice, hypertext links can be used to link anything to anything. Clicking a hyperlink on a Web page usually returns another document related to the clicked element (which can be text, an image, a sound, a video clip, etc.). The relationship between the source and the target of a link can have a multitude of meanings: an explanation, a translation, a location, a buy or sell order, etc. Human readers are able to deduce the role of each link and to use the Web to carry out complex tasks. A computer, however, cannot accomplish the same tasks without human supervision because Web pages are designed to be read by people, not by machines.

Automated data handling requires moving from a Web of documents, understandable only by humans, to a Web of data in which information is expressed not only in natural language but also in a format that software agents can read and use, allowing them to find, share and integrate information more easily. In parallel with the Web of data, which focuses primarily on data interoperability, considerable international efforts are under way to develop programmatic interoperability on the Web, with the aim of enabling a Web of programs. Here, semantic descriptions are applied to processes, represented for example as Web Services. The extension of both the static and the dynamic parts of the current Web is called the Semantic Web.

The principal technologies of the Semantic Web fit into a set of layered specifications. The current components are the Resource Description Framework (RDF) Core Model, the RDF Schema language (RDFS), the Web Ontology Language (OWL) and the SPARQL query language for RDF.
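To make the layered model more concrete, the following minimal sketch shows the idea underlying RDF and SPARQL: statements are stored as subject-predicate-object triples, and a query is a pattern in which some positions are left open. It is a plain-Python illustration, not a real RDF library; all identifiers under the `ex:` prefix, the sample data and the helper names `match` and `processes_for_gene` are assumptions made up for this example.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical example data: every "ex:" identifier is invented for illustration.
TRIPLES: List[Triple] = [
    ("ex:geneA", "rdf:type", "ex:Gene"),
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
    ("ex:proteinA", "rdf:type", "ex:Protein"),
    ("ex:proteinA", "ex:participatesIn", "ex:processX"),
    ("ex:geneB", "rdf:type", "ex:Gene"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield the triples that fit the pattern; None acts as a wildcard."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


def processes_for_gene(triples: List[Triple], gene: str) -> List[str]:
    """Join two patterns: gene -> encoded protein -> biological process."""
    found = []
    for _, _, protein in match(triples, s=gene, p="ex:encodes"):
        for _, _, process in match(triples, s=protein, p="ex:participatesIn"):
            found.append(process)
    return found


# All subjects typed as genes, then the processes reachable from one gene.
genes = sorted(s for s, _, _ in match(TRIPLES, p="rdf:type", o="ex:Gene"))
print(genes)
print(processes_for_gene(TRIPLES, "ex:geneA"))
```

In a real setting the triples would be loaded from RDF files or a triple store, and the two nested patterns would be written as a single SPARQL query; the sketch only illustrates why a uniform triple model makes such joins straightforward.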

The Semantic Web Health Care and Life Sciences Interest Group (HCLSIG) was launched to explore the application of these technologies in a variety of areas, and several projects have since been undertaken. Some concern the encoding of information using Semantic Web languages (SWL). Examples of data encoded with SWL are the MGED Ontology, which provides terms for annotating microarray experiments; BioPAX, an exchange format for biological pathway data; the Gene Ontology (GO), which describes the biological processes, molecular functions and cellular components of gene products; and UniProt, the world's most comprehensive catalog of information on proteins. Other studies have focused on information integration and retrieval, while still others have concerned the development of a workflow environment based on Web Services.

For data integration, the application of these technologies faces difficulties that are amplified by several specific characteristics of biological knowledge:

  • biological data are huge in volume,
  • biological data sources are heterogeneous,
  • bio-ontologies do not follow standards for ontology design,
  • biological knowledge is context-dependent,
  • data provenance is of crucial importance.

A detailed discussion of these points can be found in [1, 2].

Because of these characteristics, data integration in the life sciences constitutes a real challenge.
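Two of the points listed above, heterogeneous sources and provenance, can be illustrated with a small sketch. It assumes two hypothetical sources that describe the same protein under different identifiers, plus a hand-written mapping between those identifiers. Statements are merged into quads (subject, predicate, object, source) so that every integrated fact keeps track of where it came from. All names and data here are invented for illustration and do not describe any particular tool.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]     # (subject, predicate, object)
Quad = Tuple[str, str, str, str]  # (subject, predicate, object, source)

# Hypothetical records: two sources name the same protein differently.
SOURCE_A: List[Triple] = [("a:P1", "ex:name", "kinase X")]
SOURCE_B: List[Triple] = [("b:prot-1", "ex:locatedIn", "nucleus")]

# Hypothetical identifier mapping (source-B identifier -> canonical identifier).
SAME_AS: Dict[str, str] = {"b:prot-1": "a:P1"}


def integrate(
    sources: Dict[str, List[Triple]], same_as: Dict[str, str]
) -> List[Quad]:
    """Merge all sources, rewriting subjects to canonical identifiers
    and recording the originating source of each statement."""
    quads = []
    for source_name, triples in sources.items():
        for s, p, o in triples:
            quads.append((same_as.get(s, s), p, o, source_name))
    return quads


def describe(quads: List[Quad], subject: str) -> Dict[str, Set[Tuple[str, str]]]:
    """Group what is known about a subject as predicate -> {(value, source)}."""
    description = defaultdict(set)
    for s, p, o, source_name in quads:
        if s == subject:
            description[p].add((o, source_name))
    return dict(description)


merged = integrate({"sourceA": SOURCE_A, "sourceB": SOURCE_B}, SAME_AS)
print(describe(merged, "a:P1"))
```

Keeping the source alongside each statement means that conflicting or context-dependent values can later be filtered or weighted by origin instead of being silently overwritten during the merge.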

Research rationale

Our research is based on the assumption that Semantic Web technologies will become of central importance to the life sciences community in the near future. We are exploring the use of these technologies for data integration and knowledge representation, and have developed several tools (AllOnto, Thea-online, Thea-interact).

References

1. Pasquier N, Pasquier C, Brisson L, Collard M: Mining Gene Expression Data using Domain Knowledge. International Journal of Software and Informatics (IJSI) 2008, 2:215–231.

2. Pasquier C: Applying Semantic Web technologies to biological data integration and visualization. In Data Management in Semantic Web. Edited by Jin H, Zehua L. Nova Science Publishers, Inc.; 2011:131–151.