Data Grid Technology

As the term "bioinformatics" suggests, advances in information science and technology are expected to be applied to the life sciences. A vast number of biological databases exist, including more than 500 for genomic information alone, and they are operated and managed individually on the Internet. Under these circumstances, we have established an infrastructure that allows researchers to share the information contained in these databases and to collaborate on research.

However, interoperating these databases raises the following problems.

Formats differ from database to database.
The latest data are needed, yet the databases are large and frequently updated.

Physically integrating the databases is not a practical way to solve these problems.

This group is developing several fundamental technologies for making full use of biological information through the seamless federation of multi-scale databases using Grid technology.

Seamless Federation of Multi-Scale Databases with the Grid Technology

1. Solving the problem of heterogeneous database formats: XML standard format

In the biological databases available on the Web, the data format has shifted from flat files to XML. However, the formats differ in how they describe even the same protein. For example, as the figure shows, the RbsB protein, a ribose-binding protein, is represented differently in the DDBJ, SWISS-PROT and PDB databases.

Heterogeneity of Bio-Databases (Flat Files)

To cope with this issue, a database conversion system has been developed that uses conversion rules to transform heterogeneous database formats with different schemas into an XML-based standard format, and that displays the converted data.
(Associate Prof. Takenao OHKAWA, Graduate School of Information Science and Technology, Osaka University; in cooperation with Aztec System Co., Ltd.)
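
The idea of rule-based conversion can be illustrated with a minimal Python sketch. It is not the conversion system described above: the rule table, the element names of the "standard" format, and the two simplified flat-file records (with made-up accession values) are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion rules: source field tag -> standard element name.
RULES = {
    "ddbj": {"ACCESSION": "accession", "DEFINITION": "description"},
    "swissprot": {"AC": "accession", "DE": "description"},
}

# Simplified, made-up flat-file records (placeholder accession values).
DDBJ_RECORD = "ACCESSION   X00001\nDEFINITION  ribose binding protein"
SWISSPROT_RECORD = "AC   P00001\nDE   ribose binding protein"


def parse_flat(text):
    """Split 'TAG   value' lines into (tag, value) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        pairs.append((tag, value.strip()))
    return pairs


def to_standard_xml(source, text):
    """Apply the rules for one source and build a standard <entry> element."""
    rules = RULES[source]
    entry = ET.Element("entry", {"source": source})
    for tag, value in parse_flat(text):
        name = rules.get(tag)
        if name is not None:  # fields without a rule are skipped
            ET.SubElement(entry, name).text = value
    return entry


if __name__ == "__main__":
    for source, record in (("ddbj", DDBJ_RECORD), ("swissprot", SWISSPROT_RECORD)):
        print(ET.tostring(to_standard_xml(source, record), encoding="unicode"))
```

In this sketch, supporting another database means adding one more rule table rather than writing a new parser, and both records end up with the same element structure.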

In addition, to achieve seamless search across databases distributed over the network while preserving the independence of each database's management policy, a prototype of the Grid Database Infrastructure System is being developed. It integrates the databases virtually using Web mechanisms such as SOAP, so that database services can be provided within the XML schema. (Hitachi Software Engineering Co., Ltd.)

<Unified access tool for biology databases (prototype)>
This tool retrieves data by keyword (e.g., "HGF" as a gene name), treating data in the standardized XML format as data with a common structure. Retrieval is performed not only across the different database entries that contain the keyword, but also repeatedly against the databases, using the protein sequences in the entries already retrieved.
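
The two-stage retrieval described above (keyword first, then repeated searches using the sequences of retrieved entries) might be sketched as follows. The in-memory databases, entry ids, and sequences are made up, and an exact sequence match stands in for a real similarity search; none of this reflects the prototype's actual interface.

```python
from collections import deque

# Toy in-memory "databases"; all names, ids and sequences are made up.
DATABASES = {
    "db_a": [
        {"id": "a1", "keywords": {"HGF"}, "sequence": "MKVLAT"},
        {"id": "a2", "keywords": {"other"}, "sequence": "GGGSSS"},
    ],
    "db_b": [
        {"id": "b1", "keywords": {"growth factor"}, "sequence": "MKVLAT"},
        {"id": "b2", "keywords": {"unrelated"}, "sequence": "PPPQQQ"},
    ],
}


def search_by_keyword(databases, keyword):
    """Return (database name, entry) pairs whose keywords contain the term."""
    return [(name, e) for name, entries in databases.items()
            for e in entries if keyword in e["keywords"]]


def search_by_sequence(databases, sequence):
    """Placeholder for a real similarity search: exact match only."""
    return [(name, e) for name, entries in databases.items()
            for e in entries if e["sequence"] == sequence]


def unified_search(databases, keyword):
    """Keyword search, then repeated sequence searches until nothing new is found."""
    found = {}
    queue = deque(search_by_keyword(databases, keyword))
    while queue:
        name, entry = queue.popleft()
        key = (name, entry["id"])
        if key in found:
            continue  # already retrieved; do not search on it again
        found[key] = entry
        queue.extend(search_by_sequence(databases, entry["sequence"]))
    return sorted(found)


if __name__ == "__main__":
    print(unified_search(DATABASES, "HGF"))
```

Here the entry `b1` does not carry the keyword "HGF" but is still retrieved, because it shares a sequence with an entry that does.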

2. Development of automatic update system for public databases

Comprehensive access to the latest data in genomic databases is directly related to the success of projects. At present, however, an average of 15 million base sequences are added to the databases every day. To cope with this issue, a system is being developed that constructs local copies of the public databases and keeps them updated, so that efficient, all-inclusive sequence searches can be performed.
(Prof. Teruo YASUNAGA, Genetic Information Research Center, Osaka University; in cooperation with Hewlett-Packard Japan, Ltd.)
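
As a rough illustration of keeping a local copy current, the Python sketch below merges a batch of updated records into a local store and reports which entries changed. The text does not specify the update mechanism; the record layout and the use of a per-entry version number are assumptions.

```python
def apply_updates(local, updates):
    """Merge updated records into the local copy.

    local   -- dict mapping entry id -> record (a dict with 'id' and 'version')
    updates -- iterable of records published by the public database
    Returns the ids that were added or replaced, so that only those
    entries need to be re-indexed for sequence search.
    """
    changed = []
    for record in updates:
        current = local.get(record["id"])
        if current is None or record["version"] > current["version"]:
            local[record["id"]] = record
            changed.append(record["id"])
    return changed


if __name__ == "__main__":
    # Made-up records for illustration only.
    local_copy = {"e1": {"id": "e1", "version": 1, "sequence": "ACGT"}}
    batch = [
        {"id": "e1", "version": 2, "sequence": "ACGTT"},  # newer version
        {"id": "e2", "version": 1, "sequence": "GGCC"},   # new entry
    ]
    print(apply_updates(local_copy, batch))
```

Applying the same batch a second time changes nothing, which makes it safe to rerun an update after an interrupted transfer.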

3. Development of a database system with classified sequences for homology search

Biological databases suffer from growing redundancy, caused not only by the rapid increase in the amount of data but also by the submission of large amounts of data on specific types of sequences (for example, many alternative sequences of the HIV protease). The problem is compounded by the rapid increase in low-quality data resulting from the massive submission of fragment data such as draft sequences. To solve these problems, this group has developed a new system using the following two methods.

All the sequence data in a database are compared with one another, and a hierarchy of the database search results is built dynamically to minimize redundancy in the search.
High-quality database information (secondary information) produced by experts is added to the output to allow more accurate browsing of the information.
(Prof. Hiroyuki TOH, Bioinformatics Center, The Institute for Chemical Research, Kyoto University; in cooperation with Mitsubishi Space Software Co., Ltd.)
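
The first method can be illustrated, in its simplest two-level form, by grouping near-identical hits under one representative. This is only a sketch under stated assumptions: it uses the ratio from Python's `difflib.SequenceMatcher` as a stand-in for a real sequence comparison, a greedy grouping instead of the group's actual method, and made-up hit ids and sequences.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; not a real alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def group_hits(hits, threshold=0.8):
    """Greedily group search hits under representatives.

    hits -- list of (hit id, sequence) pairs in search-result order
    Returns a list of (representative id, [ids of similar hits]).
    """
    groups = []  # each item: ((rep id, rep sequence), [member ids])
    for hit_id, sequence in hits:
        for (rep_id, rep_sequence), members in groups:
            if similarity(rep_sequence, sequence) >= threshold:
                members.append(hit_id)
                break
        else:
            # No existing representative is similar enough: start a new group.
            groups.append(((hit_id, sequence), []))
    return [(rep[0], members) for rep, members in groups]


if __name__ == "__main__":
    # Made-up hits for illustration only.
    hits = [
        ("h1", "MKVLATGGSA"),
        ("h2", "MKVLATGGSC"),  # differs from h1 at one position
        ("h3", "PPPPQQQQWW"),
    ]
    print(group_hits(hits))
```

A user browsing the output would see one line per representative and could expand a group only when the near-duplicates matter.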

4. High-level information retrieval system: XML-based ADME information database system

The system is being built to allow high-level retrieval of information on the binding affinity between proteins and compounds, associated with information on drug metabolism. It represents the latter as XML-based ADME (Absorption, Distribution, Metabolism and Excretion of drugs) information, which is important in new drug design, and links it with the XML-based information on genes, proteins and compounds. (Fujitsu Kyushu System Engineering Limited)

5. Coordination with existing Web services: Design of an XML layer

The system is being developed to interoperate with existing database services on the Web with minimal customization, by placing an XML layer between the services and the Data Grid. (Protein Research Foundation)
This system may offer database service providers the following two advantages:

There is no need to copy all the data contained in a database, so no copies have to be kept up to date.
Simply by restricting access to the database at several levels, various types of services can be provided according to the customer's level. For this reason, the system may be extended to commercial use.
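
The second advantage, level-based access restriction in the XML layer, might look like the following Python sketch. The `level` attribute, the element names, and the sample response are hypothetical; the text does not describe how the XML layer marks or enforces access levels.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML response; each element carries the minimum access
# level required to see it (elements without the attribute are public).
RESPONSE = (
    "<entry>"
    '<name level="0">RbsB</name>'
    '<annotation level="1">ribose binding protein</annotation>'
    '<structure level="2">coordinates omitted</structure>'
    "</entry>"
)


def filter_by_level(xml_text, customer_level):
    """Return the response with elements above the customer's level removed."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for child in list(parent):
            if int(child.get("level", "0")) > customer_level:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for level in (0, 1, 2):
        print(level, filter_by_level(RESPONSE, level))
```

Because the filtering happens in the layer, the underlying database service stays unchanged and a single stored entry can serve several service tiers.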

The future prospects of this project are as follows.

Completion of the DataGrid Infrastructure System
Interoperation of further heterogeneous databases
Collaboration with the Computing Grid group: results from the simulations conducted by the Computing Grid group will be stored in databases.