Data Quality in Genome Databases
Mathematisch-Naturwissenschaftliche Fakultät II
Genome databases store data about molecular biological entities such as genes, proteins, and diseases. Commercial organizations create and maintain such databases mainly because of their importance in the process of drug discovery. Genome data is analyzed and interpreted to gain so-called leads, i.e., promising structures for new drugs. Following a lead through the process of drug development, testing, and finally several stages of clinical trials is extremely expensive. Thus, a high-quality underlying database is of utmost importance. Yet because of their exploratory nature, genome databases, both commercial and public, are inaccurate, incomplete, outdated, and in an overall poor state. This paper highlights the important challenges of determining and improving data quality for databases storing molecular biological data. We examine the production process for genome data in detail, show that producing incorrect data is intrinsic to the process, and highlight common types of data errors. We compare these error classes with existing solutions for data cleansing and conclude that traditional and proven data cleansing techniques from other application domains do not suffice for the particular needs and problem types of genomic databases.