Is newer better?—evaluating the effects of data curation on integrated analyses in Saccharomyces cerevisiae†‡
Abstract
Recent high-throughput experiments have produced a wealth of heterogeneous datasets, each of which provides information about different aspects of the cell. Consequently, integration of diverse data types is essential in order to address many biological questions. The quality of any integrated analysis system depends on the quality of its component data and of the Gold Standard data used to evaluate it. It is commonly assumed that data quality improves as databases grow and change, particularly for manually curated databases. However, this assumption is open to question, given the constant changes in the data coupled with the high level of noise associated with high-throughput experimental techniques. One of the most powerful approaches to data integration is the use of Probabilistic Functional Integrated Networks (PFINs). Here, we systematically analyse the changes in four highly curated and widely used online databases and evaluate the extent to which these changes affect the protein function prediction performance of PFINs in the yeast Saccharomyces cerevisiae. We find that, globally, network performance improves over time. Where individual areas of biology are concerned, however, the most recent files do not always produce the best results. Individual datasets have unique biases towards different biological processes, and performance can be improved by selecting and integrating relevant datasets. When using any type of integrated system to answer a specific biological question, careful selection of raw data and Gold Standard is vital, since the most recent data may not be the most appropriate.
- This article is part of the themed collection: Integrative Computational Biology
 
                


