Common Fund Data Ecosystem Centers

The NIH Common Fund (CF) programs have produced transformative datasets, databases, methods, bioinformatics tools, and workflows that are significantly advancing biomedical research in the United States and worldwide. Currently, CF programs are mostly isolated from one another, yet integrating data across CF programs has the potential to yield synergistic discoveries. In addition, since CF programs have a time limit of 10 years, sustaining the widely used CF digital resources after the programs expire is critical. To address these challenges, the NIH established the Common Fund Data Ecosystem (CFDE) program, which was recently approved to continue into its second phase. For this second phase, five centers were established.

The Data Resource Center (DRC)

The CFDE Workbench

The Data Resource Center (DRC) will produce the CFDE Workbench, which will be composed of two main products: the CFDE information portal and the CFDE data resource portal. These two web portals will be full-stack web-based applications with a backend database and will be integrated into one public site. The CFDE information portal will contain information about the CFDE in a dedicated About page, information about each participating and non-participating CF program, information about each data coordination center (DCC), a link to a catalog of CF datasets, a link to a catalog of CF tools and workflows, news, events, funding opportunities, standards and protocols, educational programs and opportunities, social media feeds, and publications. The CFDE data resource portal will contain metadata, data, workflows, and tools that are the products of the CF programs and their DCCs. We will adopt the C2M2 data model for storing metadata that describes DCC datasets. We will also archive relatively small omics datasets that do not have a home in widely established repositories and do not require PHI protection. In addition, we will expand the cataloging to CF tools, APIs, and workflows. Importantly, we will develop a search engine that will index and present results from all these assembled digital assets. In addition, continuing the work established in the CFDE pilot phase, users of the data portal will be able to fetch identified datasets through links provided by the DCCs via the DRS protocol. This will include links to raw and processed data. The CFDE portals will provide access to the CF programs' processed data in various formats, including: 1) knowledge graph assertions; 2) gene, drug, metabolite, and other set libraries; 3) data matrices ready for machine learning and other AI applications; 4) signatures; and 5) bipartite graphs. In addition, the extract, transform, and load (ETL) scripts that process the data into these formats will be provided.
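
To make the relationship between several of these processed formats concrete, the following Python sketch converts a small set library into bipartite graph edges, knowledge graph assertions, and a binary data matrix. It is a minimal illustration and not the DRC's actual ETL code: it assumes the set library is stored in the tab-separated GMT layout (set label, description, then members), and the function names, the "has_member" predicate, and the example sets are all hypothetical.

```python
def parse_gmt(text):
    """Parse GMT-style text into {set_label: set_of_members}.

    Assumed layout per row: label <TAB> description <TAB> member1 <TAB> member2 ...
    """
    library = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank rows and rows with no members
        label, _description, *members = fields
        library[label] = {member for member in members if member}
    return library


def to_bipartite_edges(library):
    """Flatten a set library into sorted (set_label, member) edges."""
    return sorted(
        (label, member)
        for label, members in library.items()
        for member in members
    )


def to_assertions(edges, predicate="has_member"):
    """Express each edge as a subject/predicate/object assertion.

    The predicate name is a placeholder chosen for this sketch.
    """
    return [
        {"subject": label, "predicate": predicate, "object": member}
        for label, member in edges
    ]


def to_matrix(library):
    """Build a binary membership matrix: rows are sets, columns are members."""
    rows = sorted(library)
    cols = sorted(set().union(*library.values())) if library else []
    matrix = [[1 if col in library[row] else 0 for col in cols] for row in rows]
    return rows, cols, matrix


# Hypothetical two-set library used only for illustration.
EXAMPLE_GMT = "set_A\tillustrative\tTP53\tBRCA1\nset_B\tillustrative\tTP53\tEGFR\n"

if __name__ == "__main__":
    example_library = parse_gmt(EXAMPLE_GMT)
    example_edges = to_bipartite_edges(example_library)
    print(example_edges)
    print(to_assertions(example_edges)[0])
    print(to_matrix(example_library))
```

The same membership information appears in all three outputs, which is why a single ETL step can plausibly emit several of the listed formats from one source.
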
Since such processed data is relatively small, we will archive it, mint unique IDs for it, and serve it via APIs. In addition, we will develop workflows that demonstrate how the processed data can be harmonized. At the same time, we will document APIs from all CF DCCs and provide example Jupyter Notebooks that demonstrate how these datasets can be accessed, processed, and combined for integrative omics analysis. For the portals, we will also develop a library of tools that utilize these processed datasets. These tools will share some uniform requirements, enabling a plug-and-play architecture. To achieve these goals, we will work collaboratively with the other newly established CFDE centers, the participating CFDE DCCs, the CFDE NIH team, and relevant external entities and potential consumers of these software products. These interactions will take place via face-to-face meetings, virtual working group meetings, one-on-one meetings, Slack, GitHub, project management software, and e-mail exchange. Through these interactions, we will establish standards, workstreams, feedback, and mini projects towards accomplishing the goal of developing a lively and productive Common Fund Data Ecosystem.
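
As an illustration of the DRS-based dataset links described for the data resource portal, the sketch below shows how a notebook might turn a hostname-based DRS URI into the HTTPS URL of the corresponding DRS object record, and then pick a download link from that record. It assumes the GA4GH DRS v1 convention that `drs://<hostname>/<id>` corresponds to `https://<hostname>/ga4gh/drs/v1/objects/<id>`; compact-identifier DRS URIs are not handled, and the host name, object ID, and record contents shown are hypothetical.

```python
from urllib.parse import urlsplit


def drs_uri_to_url(drs_uri):
    """Map a hostname-based DRS URI to its DRS v1 object endpoint URL."""
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object ID: {drs_uri!r}")
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{object_id}"


def first_https_access_url(drs_object):
    """Return the first direct HTTPS download URL in a DRS object record.

    Returns None when the record offers no access method of type "https"
    with an inline access_url (for example, when only an access_id is given).
    """
    for method in drs_object.get("access_methods", []):
        url = (method.get("access_url") or {}).get("url")
        if method.get("type") == "https" and url:
            return url
    return None


if __name__ == "__main__":
    # Hypothetical URI; no network request is made in this sketch.
    print(drs_uri_to_url("drs://drs.example.org/abc123"))

    # Hypothetical record shaped like a DRS object response.
    example_record = {
        "id": "abc123",
        "access_methods": [
            {"type": "s3", "access_url": {"url": "s3://example-bucket/abc123"}},
            {"type": "https", "access_url": {"url": "https://files.example.org/abc123"}},
        ],
    }
    print(first_https_access_url(example_record))
```

In practice the object record would be fetched from the resolved URL with an HTTP client, and some servers may require authentication or a second request to obtain the download link.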

The Integration and Coordination Center (ICC)

CONNECT: Collaborative Network for Nurturing Ecosystems of Common Fund Team Science

The Collaborative Network for Nurturing Ecosystems of Common Fund Team Science (CONNECT) Integration and Coordination Center (ICC) is dedicated to revolutionizing biomedical research within the Common Fund Data Ecosystem (CFDE) through exceptional efficiency, transparency, and innovation. Led by Prof. Jake Chen at the University of Alabama at Birmingham (UAB), with support from Prof. Casey Greene, Prof. Sean Davis, Prof. Peipei Ping, and Prof. Wei Wang, our synergistic Administrative, Evaluation, and Sustainability Cores underpin the ICC's ability to fulfill its mission. The significance of the CONNECT ICC lies in its alignment of missions, goals, and efforts to drive transformative discoveries and applications within the CFDE. Through comprehensive operation guidelines and protocols, our Administrative Core, led by Prof. Chen, ensures effective coordination, tracking, and project management across CFDE-participating entities. This core spearheads the implementation of an Agile project management system, utilizing advanced collaboration tools like U-BRITE, to optimize communication, coordination, and collaboration among CFDE stakeholders. By fostering efficient operations and transparent communication, the Administrative Core promotes innovation in software-assisted agile project management methodology and accelerates scientific progress. The Evaluation Core, led by Prof. Greene and Prof. Davis at the University of Colorado at Anschutz, is vital in driving continuous quality improvement within the CFDE. By establishing evaluation metrics aligned with CFDE principles and engaging with stakeholders, this core ensures the effectiveness and impact of CFDE activities. The Evaluation Core's innovative approaches, including developing a report generator and continuous engagement for feedback and improvement, drive the advancement of the CFDE and facilitate evidence-based decision-making. The Sustainability Core, led by Prof. Ping and supported by Prof. Wang at UCLA, addresses the long-term viability and reusability of CF program data and resources. Through the MATCH approach and the CFDE Digital Asset Repository Roadmap, this core promotes seamless data transitioning, maximizes data dissemination, and enhances community reuse. By developing best practices for data management, coordinating data transfer strategies, and identifying suitable repositories, the Sustainability Core ensures the preservation and accessibility of invaluable CF program data for future research and discoveries. Overall, our CONNECT ICC's innovation lies in its ability to harmoniously integrate the efforts of the Administrative, Evaluation, and Sustainability Cores. By fostering efficient operations, promoting continuous quality improvement, and ensuring data sustainability, the ICC drives transformative biomedical discoveries, facilitates collaborative research, and accelerates the impact of the CFDE. Through the expertise and dedication of our team, the CONNECT ICC is poised to help the CFDE revolutionize the biomedical research landscape and advance the mission of the Common Fund.

The Cloud Workspace Implementation Center (CWIC)

The CFDE Cloud Workspace

The Cloud Workspace of the Common Fund Data Ecosystem (CFDE) will make it easy for a wide range of researchers to analyze data and work together. Our primary goal is to create a user-friendly environment where researchers can import their data, analyze it alongside other Common Fund datasets, and use a variety of analysis tools and workflows. Our Cloud Workspace Implementation Center (CWIC) builds on existing partnerships that combine the Texas Advanced Computing Center’s (TACC) public high-performance computing resources, Galaxy’s open-source online interface for analyzing data and authoring workflows, and CloudBank’s tools that simplify cloud resource access and billing. By drawing on these existing resources and skills, the CWIC can streamline implementation of the Cloud Workspace. The Cloud Workspace will provide users with access to Common Fund datasets, allow users to import data from other sources, and support data integration and analysis. Users will have access to a wide range of tools, workflows, and pipelines developed by the CFDE, by Galaxy, and by other partners. In addition, users will have the flexibility to use custom or third-party tools within the workspace. The CWIC will serve both novice and expert users by offering outreach, training, and support designed to meet their diverse needs. This effort will include user manuals, online tutorials, and tools to help users manage computing costs across public and commercial computing resources. Scientific discovery today can be expensive because it relies on large datasets and intensive computational processing. Notably, this Cloud Workspace will enable science across a broad set of researchers by providing access to large amounts of storage and compute at no cost to new users and trainees. By promoting data sharing, collaboration, and ease of access, the CWIC will speed up biomedical research and address high-priority challenges for our nation.
The CFDE Cloud Workspace represents a significant step towards realizing the vision of broadening the community of scientists with the power to tackle complex research questions and drive innovation.

The Training Center (TC)

Fostering Meaningful Use of Common Fund Data

To help ensure that Common Fund Data Ecosystem (CFDE) data assets, data tools, and resources are findable, accessible, interoperable, and reusable (FAIR), the project will create a CFDE Training Center (TC) responsive to the needs of biomedical researchers and of current and future CFDE data programs. The TC will serve as the central hub for training development, coordination, and evaluation across CFDE programs and data initiatives. The goals of the TC are to expand the Common Fund (CF) data user base, enhance the confidence and complexity of dataset usage, and increase awareness of the CFDE and its resources. Team ORAU proposes to accomplish these goals through (1) training intentionally designed to address gaps in the training landscape, (2) outreach activities and a mentoring program designed to support the growing community of CF users, and (3) the establishment of a collaborative and supportive learning community. The TC will provide a dynamic syllabus tailored to the needs of a diverse community of scientists and medical researchers at all experience levels, with a focus on postgraduate students and early-stage investigators. The TC will design and deliver customized trainings and training tools to address both broad data skills needs and the specific needs of individual CFDE data initiatives and Data Coordinating Centers. By matching students and early-career scientists with experienced bioinformaticians and CF data users, the CFDE Mentoring Program will provide technical and professional guidance to engage learners and encourage careers in bioinformatics and data sciences. The TC recruitment and outreach functions will focus on engaging a diverse and equitable set of trainers and participants.
The learning community will be supported by internal processes and functions put in place to facilitate communication and collaboration, including a CFDE Trainers Working Group to share best practices and leverage shared resources; a Diversity Committee to ensure that diversity, equity, inclusion, and accessibility (DEIA) are embedded in all activities and processes; a CFDE Landing Page on the CFDE Portal providing a public gateway to CFDE training resources; and a CFDE Virtual Community of Learners providing learning support and guidance to all training participants as well as a platform for communication and engagement. Development of the TC will require an in-depth landscape analysis to assess, identify, and address the training opportunities and needs of the CFDE community. The landscape analysis will include a comprehensive gap analysis examining current tools, processes, and training content; a systematic literature review; key informant interviews; a CF research community survey; and other data gathering from National Institutes of Health (NIH) stakeholders and the biomedical community at large. Ongoing evaluation and assessment of training activities will inform program enhancements, and outcomes, achievements, and challenges will be reported to the Common Fund and the other CFDE coordinating centers. The TC will be developed and managed by a qualified and experienced team of training professionals and data scientists from Oak Ridge Associated Universities (ORAU) and BioData Sage LLC, working in close coordination with the CF and the CFDE stakeholder community. Project management processes and controls will ensure that the TC and affiliated training programs and engagement activities are appropriate, effective, and efficient. Program development and management will take an integrated, flexible approach, coordinating closely with NIH and integrating feedback and activities from across the CFDE to ensure alignment in all areas.

The Knowledge Center (KC)

Providing Scientifically Valid Knowledge from the Common Fund Data Ecosystem

Making NIH Common Fund (CF) datasets FAIR is but the first step in realizing their potential within the “big data” revolution. Science progresses through the accumulation of knowledge, which achieves a wide reach only if it is accessible to a diverse spectrum of researchers. While computer scientists have made substantial strides in modeling knowledge within “knowledge graphs” (KGs), non-computational scientists can find it hard to interpret the graph-based reasoning tools and visualizations that accompany KGs because such tools use logical reasoning that does not account for scientific context or uncertainty and can produce a plethora of scientifically invalid inferences. Our CFDE KC will aim to present scientifically valid knowledge produced by CF projects. We will represent this knowledge as a KG, compliant with existing CFDE and external knowledge curation efforts. But we will focus on scientific validity through both (a) careful knowledge extraction, by ensuring that each edge in the KG is either a primary experimental finding or the result of an expert-applied analysis, and (b) careful knowledge presentation, by building a portal that de-emphasizes general-purpose graph traversal in favor of single-purpose visualizations. To implement this KC, we will draw from our experience managing four large-scale NIH-funded projects that have faced similar challenges in related settings.
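
The knowledge extraction criterion in (a) can be pictured as a provenance check on each edge. The Python sketch below is one hypothetical way to encode it, not the KC's actual data model: every edge carries an evidence type, and only edges labeled as a primary experimental finding or an expert-applied analysis are admitted. The class names, field names, evidence labels, and example edges are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Hypothetical evidence categories for a knowledge graph edge."""
    PRIMARY_EXPERIMENT = "primary_experimental_finding"
    EXPERT_ANALYSIS = "expert_applied_analysis"
    AUTOMATED_INFERENCE = "automated_inference"


# Only these evidence types satisfy the extraction criterion described above.
ACCEPTED_EVIDENCE = frozenset({Evidence.PRIMARY_EXPERIMENT, Evidence.EXPERT_ANALYSIS})


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    object: str
    evidence: Evidence
    source: str  # e.g. the contributing program or analysis (placeholder values below)


def admissible_edges(edges):
    """Keep only edges backed by an accepted evidence type."""
    return [edge for edge in edges if edge.evidence in ACCEPTED_EVIDENCE]


if __name__ == "__main__":
    # Placeholder edges; the entities and sources are not real findings.
    candidates = [
        Edge("gene_X", "expressed_in", "tissue_Y", Evidence.PRIMARY_EXPERIMENT, "program_1"),
        Edge("gene_X", "associated_with", "trait_Z", Evidence.EXPERT_ANALYSIS, "analysis_2"),
        Edge("tissue_Y", "related_to", "trait_Z", Evidence.AUTOMATED_INFERENCE, "graph_traversal"),
    ]
    for edge in admissible_edges(candidates):
        print(edge.subject, edge.predicate, edge.object, edge.evidence.value)
```

Filtering at ingestion time, rather than at query time, means that downstream visualizations never need to reason about edges produced by unconstrained graph traversal.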