One of the primary challenges in translational research data management is breaking down the barriers between multiple data silos and integrating 'omics data with clinical information to complete the cycle from bench to bedside.
Contextual metadata, also called provenance information, is a key factor in effective data integration, reproducibility of results, correct attribution of the original source, and answering research queries involving “What”, “Where”, “When”, “Which”, “Who”, “How”, and “Why” (also known as the W7 model).
We introduce an ontology-driven, intuitive Semantic Proteomics Dashboard (SemPoD) that uses provenance together with domain information (semantic provenance) to enable researchers to query, compare, and correlate data across multiple projects and data types, and to integrate legacy data in support of their current research.
The SemPoD platform, currently in use at the Case Center for Proteomics and Bioinformatics (CPB), consists of three components, shown in Figure 1: the Ontology-driven Visual Query Composer, the Result Explorer, and the Query Manager. Currently, SemPoD allows provenance-aware querying of 1153 mass-spectrometry experiments from 20 different projects.
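As a rough illustration of what provenance-aware querying over W7-style metadata can look like, the following minimal Python sketch attaches one field per W7 question to each experiment and filters on those fields. It is not SemPoD's implementation: the `W7Record` class, the `query` helper, the mapping of each question to a concrete field, and all record values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class W7Record:
    """Provenance of one experiment, one field per W7 question (hypothetical schema)."""
    what: str    # kind of experiment (assumed mapping)
    where: str   # facility or lab (assumed mapping)
    when: date   # acquisition date (assumed mapping)
    which: str   # instrument used (assumed mapping)
    who: str     # person who ran the experiment (assumed mapping)
    how: str     # protocol followed (assumed mapping)
    why: str     # project the run belongs to (assumed mapping)


def query(records, **criteria):
    """Return the records whose W7 fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Made-up records for illustration only; these are not data from SemPoD.
records = [
    W7Record("AP-MS", "CPB", date(2012, 3, 1), "instrument-A", "analyst-1", "protocol-1", "project-1"),
    W7Record("shotgun", "CPB", date(2012, 4, 9), "instrument-B", "analyst-2", "protocol-2", "project-1"),
    W7Record("AP-MS", "CPB", date(2012, 5, 2), "instrument-A", "analyst-1", "protocol-1", "project-2"),
]

ap_ms_runs = query(records, what="AP-MS")       # all runs of one experiment type
same_project = query(records, why="project-1")  # all runs belonging to one project
```

Because every criterion is an ordinary keyword argument, combining questions (for example `query(records, what="AP-MS", why="project-2")`) narrows the result across projects and data types without any change to the helper.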
SemPoD is an intuitive and powerful provenance ontology-driven data access and query platform that creates an integrated view over large-scale systems molecular biology datasets. SemPoD can be deployed over many existing database applications that store ‘omics data, for instance, the LabKey data-management system, as shown in Figure 2.
Initial user feedback on the usability and functionality of SemPoD has been very positive, and the platform is being considered for wider deployment beyond the proteomics domain, in other ‘omics centers.
We use two principal proteomics workflows as exemplars to describe the design and implementation of SemPoD, namely:
- The first is the affinity-purification mass-spectrometry (AP-MS) workflow, which identifies specific protein complexes and thus the proteins that are associated with one another.
- The second is the shotgun expression proteomics workflow, which identifies and quantifies proteins in an unbiased manner from cells or tissues of interest.

Together, these two workflows account for approximately 50% of all experiments performed in the CPB and have been used in approximately 20 separate projects, generating over 3 terabytes (TB) of data.