The NIH Common Fund has supported the generation, management, and sharing of single cell genomic data from millions of cells through several large international consortia with the goal of building a comprehensive reference of healthy cells across multiple organs in the human body. We will use single cell/nucleus RNA-sequencing (scRNA-seq) data from the Common Fund-supported Human BioMolecular Atlas Program (HuBMAP) and Genotype-Tissue Expression (GTEx) consortia to prototype a cell type harmonization protocol for constructing a cross-consortia cell census meta-atlas. The HuBMAP consortium provides organ-specific cell atlases for multiple organs, while GTEx provides an integrated cross-organ single cell atlas. Our group has developed and extensively validated computational algorithms, NS-Forest and FR-Match, for biomarker identification and robust cell type matching using scRNA-seq data. Our algorithms utilize Random Forest machine learning and minimum spanning tree graphical modeling, which provide superior classification performance while maintaining high explainability and interpretability for biological applications. In Specific Aim 1, rigorous data quality control approaches will be applied for dataset selection and preparation. The NS-Forest algorithm will then be used to identify optimal biomarker combinations for characterization of organ-specific cell types of individual organs in HuBMAP and cross-organ cell types in GTEx. In Specific Aim 2, we will focus on human lung, as an exemplar organ, to prototype the assembly of a cross-consortia meta-atlas by developing a robust cell type harmonization approach using our validated and benchmarked FR-Match algorithm and HuBMAP-Lung, GTEx lung subset, and other publicly available Human Lung Cell Atlas (HLCA) datasets. We will compare and benchmark FR-Match with two other popular methods, Azimuth and CellTypist, for cell type matching and validate the matching results using all methods.
We will also form a domain expert panel to review and validate the cell type harmonization results using domain knowledge and literature information for community approval. We will build a strategy for capturing sample metadata, anatomic structure information, cell type nomenclature, and biomarker-based definitions into an ontological representation for the meta-atlas and populate the contents into the Provisional Cell Ontology. In Specific Aim 3, we will disseminate our results to key stakeholder communities, including the HuBMAP Anatomical Structures, Cell Types and Biomarkers (ASCT+B) Working Group and the GTEx Multi-Gene Single Cell Query platform. We will present the project and participate in the Common Fund Data Ecosystem Spring Meeting to engage the community and solicit feedback. Beyond the pilot phase, the cell type harmonization framework established in this project should be generally applicable to integrating single cell-based cell type datasets across Common Fund and other data resources.
Exercise is associated with numerous health benefits, but defining the molecular mediators of these effects remains an active focus of biomedical research. With the advent of the ‘omics sciences, studies including the ongoing Molecular Transducers of Physical Activity Consortium (MoTrPAC) and multiple smaller-scale efforts have sought to map the “complete” molecular response to acute and chronic exercise. Much research is currently focused on integration of genomics, proteomics and metabolomics data within such studies, but another important strategy is meta-analysis of distinct data sets to evaluate consistencies and differences in molecular responses observed between different modes of exercise, sex, age, species, and other factors. Of the exercise-related meta-studies performed to date, most have focused on the genome and transcriptome whereas few have included metabolomics data. Reasons for this shortcoming include differences in analytical methods, inconsistency in compound naming and data reporting, and prevalence of unknown features in untargeted metabolomics data. Unknown metabolite identification and cross-study integration are challenging and require application of computational and experimental strategies in a coordinated manner. Yet, the potential benefits are substantial: identification of novel metabolites and detection of consistent patterns of response have led to biological insights relevant to fundamental biology and human health, including exercise. Using data from NIH Common Fund data archives, we propose to develop a multi-study, multi-organism and multi-condition database of identified and unknown exercise-responsive features of the metabolome. We will integrate data across studies and, when available, across ‘omes, to prioritize and identify unknowns within this database.
We will achieve these goals by carrying out two specific aims: 1) We will perform a comprehensive survey and alignment of exercise-related small molecule features in MoTrPAC data and from studies in the Metabolomics Workbench. We will use computational tools we have pioneered for metabolomics data cleaning, inter-laboratory data alignment, and network- and correlation-based analysis to prioritize unknown features for follow-up. 2) We will systematically track, annotate and identify high-priority exercise-responsive unknown features in metabolomics data using software and experimental techniques we and others have devised for MS/MS data collection to identify and annotate features not tractable by routine library search. Our study represents a crucial step between the map-building aims of MoTrPAC and detailed mechanistic studies of specific pathways that hold potential for human health benefits through targeted interventions. We will share our database and associated data with the research community through publications and uploads to public data archives. We anticipate our efforts will contribute to improved understanding of the effects of exercise at the biochemical pathway level and will offer targets for future studies to help delineate the mechanisms by which small molecules contribute to its salutary effects on health.
As we age, our tissues and organs experience molecular and physiological damage that prevents them from functioning properly, and this ultimately leads to disease states. These changes are not only due to the aging process itself but are largely influenced by the exposome, which includes all non-genetic exposures (environmental and behavioral). Depending on the complex interaction between the exposome of an individual and their genetics, different organs deteriorate over time at a different pace, resulting in tissues with different biological ages within the same individual. As the biological age of a given organ reflects its overall health and functional capacity, biologically older organs are more likely to cause health problems, increasing the risk of disease. Aging “clocks” powered by omics technologies (transcriptomics, proteomics, epigenomics, etc.) and machine learning methods have been used to approximate the biological age of specific tissues. However, tissue-specific clocks require omics data from a biopsy, making clinical adoption impractical. Therefore, there is a critical need to develop simple diagnostic tools using readily accessible biological material to measure organ-specific aging rates in an individual, which can be translated into personalized actions and enable accurate evaluation of the efficacy of health-promoting interventions. Using blood, the pipeline of the immune system, from aging cohorts, we and others have demonstrated that accelerated aging, as evidenced by age-related chronic inflammation (inflammaging) and dysfunctional immune systems, results in organ dysfunction and an elevated risk of disease in older subjects. This is not surprising since inflammaging has been proposed to be a common denominator of most, if not all, diseases of aging.
In this proposal, we hypothesize that the biological information to investigate the aging rates of a given organ is contained in the blood of the same individual and thus, can be estimated using a collection of tissue-specific gene expression signatures matched with those from blood samples. Here, we will assemble multiple public domain datasets within and outside of the NIH Common Fund to create blood-based organ-specific clocks and enable rapid diagnostics of aging rates for a given organ in an individual. To do so, we will use transcriptomic data across multiple tissues and matched blood from the Genotype-Tissue Expression (GTEx) database to construct a computational framework that calculates the rate of aging of 45 tissues in an individual using blood gene expression. We will validate the resulting models to predict organ-specific aging in disease states specific to the organ of interest, and we will assess the influence of lifestyle factors including diet, exercise and smoking on the aging of different organs using data from the Framingham Heart Study. Finally, we will use the Library of Integrated Network-based Cellular Signatures (LINCS) to identify candidate compounds that can restore the gene expression changes in the blood associated with tissue aging to optimal levels.
Cell-cell interactions (CCIs) are crucial to the maintenance of proper cell functions in tissues, particularly those, like barrier tissues, that orchestrate complex responses to invading pathogens and environmental signals. There are significant opportunities for leveraging existing datasets to generate biological insight by better understanding how CCIs and core transcriptional signatures of cells orchestrate tissue function or disease. Single-cell transcriptomic datasets allow for comprehensive prediction of CCIs in a given disease or tissue of interest. Many computational techniques have been developed to identify ligand-receptor pairs that mediate these CCIs using either bulk or single-cell datasets, as well as spatial transcriptomic datasets. However, the analysis of transcriptomics data produces thousands of ligand-receptor interactions that are difficult to prioritize for experimental validation. Thus, there is a need for a computational tool that will rank CCIs for experimental validation. Here we propose to create a database of putative CCIs across several epithelial barrier tissues, including skin, intestine, and reproductive tissue. We will then employ a ranking system that uses information from several Common Fund datasets to rank cell-cell interactions for experimental validation. Finally, we will validate our approach using existing spatial transcriptomic datasets. Overall, the results of this work will leverage the wealth of existing data to better contextualize CCIs, allowing the scientific community to prioritize novel CCIs for experimental validation.
Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disorder that affects nerve cells in the brain and spinal cord, resulting in muscle weakness, difficulty speaking and swallowing, and eventually, respiratory failure. There is no known cure and the average life expectancy of an ALS patient is 3-5 years from the time of diagnosis. Therefore, it is important to identify the genetic alterations that are involved in the molecular pathogenesis of ALS in order to develop targets for therapeutics. Genome-wide occurrence of DNA repeat expansions has been identified as a common genetic cause or marker of several neurodegenerative diseases, including ALS. Here we propose to survey the whole genome sequencing data from a population of ALS patients and healthy individuals for recurrent repeat expansions, correlate these findings with gene expression, and finally study mechanisms of regulation of identified genes by integration of gene expression, chromatin accessibility, and 3D genome organization data. We hypothesize that many of the overlooked repeat expansions are located in the non-coding regions of the genome and are involved in regulation of genes that might contribute to the death of motor neurons. To this end, we will first identify the recurrent repeat expansions that act as tissue-specific expression quantitative trait loci using GTEx data. We will then integrate multi-omics data from ALS patients with 3D genome data from the 4DN project to identify gene regulatory circuitry. We envision that the proposed work will allow scientists to generate experimentally testable hypotheses by creating a link between non-coding repeat expansions and genes in a tissue-specific manner and help identify new therapeutic targets.
Each human is, on average, colonized by 10^14 microbial cells that mostly reside in the gastrointestinal tract. Research in the last two decades has uncovered the central role of this microbial community in human health and disease. A pressing challenge, however, is the lack of understanding of microbial drug metabolism. Experimental studies, clinical observations, and anecdotal examples demonstrate that microbial enzymes alter drugs through common enzymatic transformations such as reduction, hydrolysis, dehydroxylation, demethylation, and others. Despite progress, a systematic approach for the discovery and analysis of such transformations is lacking, which hinders the design and interpretation of experimental studies. There is therefore a need to establish workflows to explore such transformations. In this proposal, we investigate microbial drug metabolism at the molecular and community levels. We propose to use data from two Common Fund data sets to conduct this investigation. Illuminating the Druggable Genome (IDG) catalogues drugs and their pharmacologic action, while the NIH Human Microbiome Project (HMP) provides detailed gut microbial data for cohorts. We also propose to use our deep-learning tools to predict the likelihood of interaction between microbial enzymes and drugs (Aim 1), and to predict putative derivative products due to this interaction (Aim 2). Our tools (CSI for Aim 1, and GNN-SOM and PROXIMAL for Aim 2) have already been validated on other datasets and in other studies, and they will be adapted for microbial enzymes and drugs based on data culled from IDG, HMP, and other resources. The workflows established in Aims 1 and 2 will be utilized to conduct a pilot study (Aim 3) to investigate the extent of functional redundancy towards drugs within microbial communities of healthy individuals that are culled from HMP.
The strength of our Approach therefore lies in: i) adapting novel, state-of-the-art deep-learning models to predict microbial enzyme promiscuity on drugs, ii) providing biochemically explainable drug products, and iii) exploring how microbial drug metabolism depends on microbial community composition. The Significance of this research is that it provides an explainable hypothesis of microbial drug metabolism. The work is impactful as it will enable further studies, such as exploring the functional redundancy of a microbial community towards drugs (as planned in Aim 3) and designing and interpreting experimental studies involving the impact of the gut microbiota on drugs. The proposed work is appropriate for this funding opportunity as it curates and annotates data using novel deep-learning approaches and creates a previously unexplored link between the HMP and IDG.
We will develop computational tools that facilitate investigation of the fundamental relationship between gene expression and genome topology. Specifically, we will develop machine learning tools that can link enhancers to their target genes at genome-wide scale. The ability to establish relationships between enhancers and their target genes is critically important, as it will aid in our understanding of gene regulation and in relating noncoding risk variants from GWAS studies to potential causal genes. Our approach will be based on 3D polymer models of chromatin interactions derived from Hi-C data in the Common Fund 4D Nucleome (4DN) database, and will integrate data from the Common Fund-supported Genotype-Tissue Expression (GTEx) database, as well as data from the ENCODE database. We will 1) construct a trusted, high-quality database of candidate enhancer-gene target pairs. We will then 2) use this database to train a machine learning predictor that can predict enhancer-gene target pairs at genome-wide scale. For 1), we will develop a pipeline to identify a small set of critical specific chromatin 3D interactions through simulation of large-scale folding of 3D chromatin ensembles. The small set of specific interactions will be tested for sufficiency of chromatin folding. We will then computationally identify enhancers based on epigenetic histone modifications and chromatin accessibility data from ENCODE as well as the Roadmap Epigenomics Project. We will then select enhancers containing eQTLs from the GTEx databases, which are known to affect the expression of the target gene. The end result will be a high-quality and trustworthy database of enhancer-gene pairs, in which each pair is supported by a predicted critical specific 3D physical chromatin interaction connecting the eQTL-containing enhancer and the target gene.
For 2), we will develop a machine-learning predictor that predicts enhancer-gene interactions from genomic, epigenomic, and Hi-C data at genome-wide scale. We will combine epigenetic data with genomic information (such as sequence motifs of TFs) as features. We will then train a machine learning predictor through hold-outs and cross-validations of the constructed database of enhancer-target gene pairs from 1). The efficacy of the predictor will then be assessed against the gold-standard CRISPRi-FlowFISH data. We will then carry out large-scale computation and will construct databases of predicted enhancer-gene relationships for selected cell types. Overall, we will demonstrate the significant added power of integrating two important Common Fund data resources and will provide tools to facilitate understanding the relationship between genome topology and gene expression. Our computational tools will lead to new insight into the relationship between genome structure and genome function, which is important for improving human health.
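The hold-out and cross-validation scheme mentioned above can be illustrated with a minimal sketch. This is not the proposed pipeline: the function name `holdout_and_folds`, the 20% hold-out fraction, the choice of five folds, and the toy enhancer-gene pairs are all assumptions made for illustration only.

```python
import random


def holdout_and_folds(pairs, holdout_frac=0.2, k=5, seed=0):
    """Split items into a held-out test set and k cross-validation splits.

    Hypothetical helper: the fraction, fold count, and seed are
    illustrative defaults, not values taken from the proposal.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)

    # Reserve a hold-out set that is never used during training.
    n_holdout = int(len(items) * holdout_frac)
    holdout, rest = items[:n_holdout], items[n_holdout:]

    # Deal the remaining items into k roughly equal folds.
    folds = [rest[i::k] for i in range(k)]

    # Each split trains on k-1 folds and validates on the remaining one.
    splits = []
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((training, folds[i]))
    return holdout, splits


# Toy stand-ins for enhancer-target gene pairs (purely illustrative).
pairs = [(f"enhancer_{i}", f"gene_{i % 7}") for i in range(50)]
holdout, splits = holdout_and_folds(pairs)
```

In a real genomic setting one would likely group pairs (for example by chromosome) before splitting so that nearby, correlated loci do not leak between training and validation sets; that refinement is omitted here.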
Cis-regulatory elements (CREs) are crucial components of transcriptional regulation and the rapid growth of genomic data has enabled researchers to annotate CREs across many biological contexts. However, despite the comprehensiveness of these collections, understanding the rules dictating how CREs regulate genes remains a major unresolved problem in genomics. Therefore, to better understand gene regulation, we are proposing to develop a new framework where CRE-gene interactions are modeled as graphs. This will enable researchers to accomplish a wide range of computational tasks such as comparisons between cell types, predictions of new interactions, and predictions of gene expression. Specifically, this pilot project aims to evaluate the feasibility and generalizability of a CRE-interaction graph approach for predicting gene expression. We will build CRE-interaction graphs in three biological contexts using public datasets, including those generated by Common Fund projects, by integrating genomic interaction data, such as CRISPR perturbations and Hi-C loops, with annotated CREs. Then, to demonstrate the utility of these graph models, we will use graph neural networks to predict gene expression, testing different algorithms and gene expression quantifications to maximize model performance. Finally, we will use feature attribution methods and prediction explainer algorithms to interpret our models to gain a better understanding of the mechanisms regulating transcription. The project will not only lead to a better model for predicting gene expression, but also establish a flexible framework for future research on gene regulation. The project will also produce a resource for the computational and machine learning community and improve the utility of existing resources.
Pediatric cancer is the leading cause of disease-related death in children, yet very few drugs are specifically labeled for pediatric malignancies, underscoring a need to identify novel molecular therapeutic targets to improve outcomes for children with cancer, which is our long-term goal. Specifically, diffuse pediatric high-grade gliomas (pHGGs) are resistant to multi-modal treatment and have had no new FDA-approved drugs in the past 20 years, thus patients with these tumors are in urgent need of novel, effective therapeutic strategies. Aberrant splicing contributes to neoepitope formation and represents a class of untapped targetable genetic alterations that are largely unexplored in pHGG. Our central hypothesis of this research plan is that aberrant splicing events can result in tumor-specific neoepitopes in pHGGs and these data can be rapidly harnessed and prioritized for therapeutic targeting. The proposed work will test this hypothesis with two integrated specific aims: 1) identify putative immunotherapeutic subtype-specific splice targets in pHGGs and 2) characterize aberrant splice variation in pHGG preclinical models and validate immunotherapeutic splice targets for preclinical testing. These studies will integrate transcriptional splice events with tumor tissue expression (PBTA and Kids First X01), normal tissue expression (GTEx and available pediatric matched tissue normals), peptide sequences (UniProt), and known extracellular domain annotations (UniProt) to identify and prioritize neoepitopes generated in pHGGs. This work will elucidate novel splice-driven immunotherapeutic targets through rigorous integrative computational analysis of splice variation in pHGG tumors, coupled with orthogonal molecular assays, to validate presence and expression of these targets. The successful completion of this project will generate significant new knowledge of aberrant splicing programs in pHGG and will identify potential immunotherapeutic targets.
This work is critical to understanding the genetic contributions of aberrant splicing to pediatric cancer, will enable the research and clinical communities to rationally inform novel immunotherapeutic strategies for pHGG, and will serve as a roadmap for investigation of neoepitopes in other pediatric brain tumors. This work is highly relevant to the critical mission of the National Cancer Institute to advance scientific knowledge and identify novel strategies to improve overall survival of cancer patients.
CTCF binding at its convergently oriented DNA motifs has been implicated in establishing TAD boundaries. The CTCF protein regulates genome organization through a cohesin complex-mediated loop extrusion mechanism. While a few more factors have recently been discovered to regulate genome organization, such as NIPBL, WAPL, YY1, ZNF143, and MAZ, we are still far from a comprehensive mechanistic understanding of how the genome is organized. Discovering novel regulators of genome organization is still challenging due to the resource-intensive nature of chromatin conformation capture technologies. To address the technical challenge in measuring genome organization, we have recently demonstrated that a deep neural network approach can enable de novo prediction of cell type-specific chromatin organization at high resolution. Moreover, this deep neural network model enables a high-throughput in silico genetic screen (ISGS) for identifying cell type-specific DNA elements that are important for chromatin interactions. To fully unlock the discovery potential of this deep neural network-based ISGS approach, here we propose to leverage the NIH Common Fund-supported large-scale genomic data across human bio-samples for discovering novel regulators of 3D genome organization. We will predict a list of high-confidence trans-acting regulators, and experimentally validate 3-5 top hits in pilot studies to generate cross-cutting hypotheses for future research in 3D genome regulation.
Advances in immunotherapy have lately revolutionized cancer care. A key strategy of cancer immunotherapy is to target “non-cell-autonomous” mechanisms of immune surveillance adaptation, achieved via regulating the secretions of immune modulators from cancer cells. Yet, an in silico systematic screen for these targets and immunomodulating agents (potentially therapeutic drugs) remains untested due to a lack of computational tools to analyze relevant large-scale database resources. The study is proposed in response to RFA-RM-23-003 to meaningfully integrate multiple NIH Common Fund and other NIH-funded datasets to inform the molecular basis of immune surveillance adaptation and screen for potential immunomodulating agents. Our central hypothesis is that cancer genomic features captured by deep learning predict cancer cells’ non-cell autonomous signals induced by a compound treatment to modulate immune cells in the tumor microenvironment. We propose to test the hypothesis by developing an innovative and feasible computational framework that is built upon our published deep learning models. Specifically, in Aim 1.1 we propose to identify prognosis-related immune cell types and associated immunologic gene signatures among adult (The Cancer Genome Atlas [TCGA]) and pediatric tumors (Gabriella Miller Kids First [Kids First] and Therapeutically Applicable Research To Generate Effective Treatments [TARGET]). We will then build a deep learning model to predict the perturbation of the identified immunologic gene signatures induced by a compound in a cancer cell line using the Library of Integrated Network-based Cellular Signatures (LINCS) data. In Aim 1.2, we will experimentally validate key findings using our in-house in vitro models. We have formed a cross-disciplinary team with strong complementary expertise to efficiently achieve the proposed goals: dry lab of Dr. 
Yu-Chiao Chiu (MPI) for cancer bioinformatics, multi-modal data integration, and artificial intelligence; and wet lab of Dr. Yi-Nan Gong (MPI) for cancer immunology, immunotherapy, and tumor cell death mechanisms. Successful completion of the pilot study will produce high-impact preliminary results: i) the first deep learning framework that systematically incorporates multi-modal genomic and pharmacogenomic data to screen for immunomodulating agents, ii) a deeper understanding of the molecular basis of immune surveillance adaptation than was previously possible, and more importantly iii) a set of promising targets preliminarily validated in vitro. These preliminary data will lead to a follow-up study to explore functional and preclinical aspects of our results. We also expect the proposed study to provide a computational framework that enhances the utilization and integration of NIH Common Fund data and other publicly available large datasets.
Cardiorespiratory fitness (CRF) is an integrative measure of cardiopulmonary and metabolic health, and a powerful, independent predictor of future risk of mortality. No pharmacotherapies target CRF, and while exercise training (ET) remains the only established means to improve fitness, substantial inter-individual variability exists in both its intrinsic (untrained) level and its response to ET. Limited information exists regarding CRF’s underlying biology, and identifying its molecular underpinnings may help inform our understanding of the determinants of this clinically important trait. Emerging data highlight circulating biochemicals, and in particular proteins, as mediators of exercise’s health benefits (“exerkines”). To better understand the molecular pathways involved in exercise and CRF, the NIH Common Fund created MoTrPAC, the most comprehensive effort to study human exercise to date. MoTrPAC includes gold-standard measures of CRF (VO2max), and deep molecular profiling - including whole genome sequencing (WGS) and large-scale plasma proteomics. Moreover, the MoTrPAC study design utilizes a randomized, controlled trial of ET with a non-exercise control arm to allow for the identification of determinants of VO2max changes in response to ET (ΔVO2max). Though the generation of large multi-omics data in MoTrPAC presents enormous opportunity, a major challenge remains in identifying the most promising biochemical determinants of CRF to be brought forth for further, mechanistic studies. We and others have previously shown that integrating genetics, proteomics, and functional genomics (e.g. tissue expression and knockout models) using statistical colocalization can inform understanding about a protein’s regulation and suggest a causal role in health and disease. These studies can, in effect, prioritize plasma proteins to be triaged for further investigation.
This proposal integrates plasma proteomics, WGS and VO2max traits from MoTrPAC with tissue expression data from GTEx to prioritize candidate protein determinants of CRF. In Aim 1, we will apply a novel 5,000-assay plasma proteomics platform (Olink5K) in ~1,980 MoTrPAC participants undergoing ET to identify circulating proteins related to baseline VO2max and ΔVO2max, leveraging the non-exercise control arm in MoTrPAC. In Aim 2 we will: A) identify locally-acting (cis-) genetic instruments related to plasma proteins from Aim 1 (cis-protein quantitative trait loci, pQTLs) from existing, publicly available datasets (TOPMed, UK BioBank) as well as generate new cis-pQTLs using WGS from MoTrPAC; and B) gain functional insights into the genetic regulation of VO2max-related proteins through colocalization with tissue expression in GTEx (expression QTLs, or eQTLs). These experiments will help determine whether a plasma protein is regulated at the transcriptional level and/or in tissues relevant to VO2max (e.g. heart, skeletal muscle) and set the stage for further mechanistic studies that extend beyond the scope of this project.
Despite advances in our understanding of rare genetic diseases and their causes, only 8% of these diseases have targeted drugs. Much of this arises from the disconnect between the inhibitory nature of drug molecules and a predominance of loss-of-function mechanisms in such diseases. There has been a growing appreciation of the role of gain-of-function variants in this context, especially towards drug repurposing. More specifically, we and others have shown that even subtle changes in function such as alterations of post-translational modification or molecular interaction sites can frequently lead to such disorders. It is unclear to what extent these observations generalize and are actionable from a therapeutic perspective. Common Fund data sets such as those from the Gabriella Miller Kids First, Undiagnosed Diseases Network, Illuminating the Druggable Genome, and LINCS programs provide a unique opportunity to assess this computationally. Our central hypothesis is that gain-of-function variants account for a much larger proportion of rare genetic diseases than currently known and that in silico functional profiling can be used to computationally identify such diseases. The proposed work will test this hypothesis through two aims. In Aim 1, we will apply our previously developed predictors of variant impact towards the identification of known and predicted disease-associated variants in large Common Fund genomic data sets. In Aim 2, we will select those variants that impact druggable biochemical properties either directly or indirectly, and thus infer novel drug-disease pairs. Over the award period, the principal investigator (PI) will leverage his and his team's expertise in variant interpretation, machine learning and bioinformatics knowledgebases towards the systematic integration of genomic and drug-related data from multiple Common Fund data sets to identify candidate drugs that can be repurposed for rare genetic diseases.
This work will be carried out at the Icahn School of Medicine at Mount Sinai, home to world-renowned researchers in human disease genetics, robust computational infrastructure, and a thriving biomedical data science training environment. The proposed research will not only provide valuable pilot data for experimental validation of promising drug repurposing candidates but will also serve as the foundation for future computational methodology development that will expand the scope of variants and mechanisms that can be queried. The work is expected to have broad impact, as it presents a new mechanism-centric, data-driven approach to identifying drug repurposing candidates for rare genetic diseases that is generalizable to other situations.
As the leading cause of cancer death in the United States, lung cancer accounts for about 20% of all cancer deaths. While there are two major types of lung cancer (non-small cell lung cancer (NSCLC), accounting for 80-85% of cases, and small cell lung cancer (SCLC), accounting for 10-15%), each type has multiple distinct subtypes characterized by morphological, molecular, and genetic alterations. Identifying lung cancer subtypes can facilitate downstream risk stratification and tailored treatment design. While various conventional methods such as morphological analysis, computed tomography (CT) and other imaging techniques, cytogenetic analysis, immunophenotyping, and molecular profiling have been used for lung cancer subtype identification, they are usually costly, time-consuming, labor-intensive, and sometimes inaccurate. Next generation sequencing (NGS) has recently been applied to identifying lung cancer subtypes, but these applications are limited to bulk NGS data or to single-omics data only. With large volumes of omics data being generated within and beyond the Common Fund data sets (e.g., GTEx and HuBMAP), we hypothesize that integration of single-cell and bulk multi-omics data, including genomics, transcriptomics, and epigenomics data, will significantly facilitate subtype-specific biomarker discovery and boost the accuracy of lung cancer subtype identification. To address these limitations, we propose to develop an integrated machine learning (ML) framework for accurate and cost-effective lung cancer subtype identification by combining single-cell and bulk multi-omics data within and beyond Common Fund data sets. To achieve this, two specific aims are undertaken. Aim 1, to establish a gene-signature-transfer ML model that leverages large-scale bulk and single-cell transcriptomics data within and beyond Common Fund data sets for lung cancer subtype identification. 
Besides identifying well-annotated lung cancer subtypes, we will also explore novel lung cancer subtypes by detecting rare cell types from large-scale single-cell data, from which cluster-specific and rare-cell-type-specific gene signatures can be transferred to the bulk transcriptomics data to improve the performance of lung cancer subtype identification. Aim 2, to develop a multi-omics integration framework to systematically combine single-cell and bulk multi-omics data (including genomics, transcriptomics, epigenomics) to further boost lung cancer subtype identification. Our model is flexible enough to handle cases in which only partial or incomplete multi-omics data are available for new patients. We believe successful completion of this study will have direct impacts on improving downstream lung cancer risk stratification, facilitating diagnosis and prognosis, and optimizing treatment selection. We also expect that the framework proposed in this study can be customized and extended to identify subtypes of other types of cancer.
Unraveling the molecular mechanisms that link SNPs and genes identified in GWAS to disease is a challenge that must be overcome to translate these genetic discoveries into actionable health insights. We propose to build a machine learning tool based on visible neural networks (vNN), which have recently shown success in providing predictive and explanatory power for cellular responses to gene regulation or disease treatment. Key to the vNN approach is an understandable network, such as a Knowledge Graph, that contains rich annotations of relevant entities and the relationships among them, organized from integrated data sources. Our hypothesis is that organizing and integrating data from the Common Fund Data Ecosystem (CFDE) can enhance the explanatory power of vNN to illuminate GWAS results. We propose combining the ROBOKOP Knowledge Graph with diverse biological data from the CFDE, and using vNN as a knowledge-based architecture to provide high interpretability in supervised learning. The ROBOKOP Knowledge Graph will serve as an organizational hub for integrating CFDE data with existing knowledge. This queryable resource for CFDE data will be our first deliverable. We will extract network-based relationships from these data and train vNN using genotypes and phenotypes from T2D-focused GWAS, providing our second deliverable. The trained vNN, our third deliverable, will enable the prediction of T2D phenotypes from genotype data. Lastly, we will provide the code base as a platform to expand this KG and vNN approach to other GWAS and potentially to generalize it to genome-wide ‘omic analyses with large data sets.
Exercise represents one of the most powerful and beneficial interventions for health and wellness, though patients with and without chronic disease struggle to exercise enough to garner its benefits. One sought-after approach to overcoming these challenges is to provide the benefits of exercise pharmacologically, a so-called "exercise-in-a-pill" solution. This involves identifying biomolecular mechanisms responsible for exercise benefits and engaging them pharmacologically using small molecule agents. To accelerate discovery of exercise mimetic drugs, this project synergizes existing data from two National Institutes of Health Common Fund projects. The Molecular Transducers of Physical Activity Consortium (MoTrPAC) provides a map of the biomolecular response to exercise, while the Library of Integrated Network-Based Cellular Signatures (LINCS) Program provides a map of the biomolecular response to small molecule exposure. The investigators hypothesize that biomolecular exercise pathways and small molecule drug candidates from these two resources can be matched by their shared biomolecular "signatures". The maps are linked by matching exercise-induced changes in biomolecular expression (i.e., a gene expression "signature" of exercise) from MoTrPAC to similar expression changes induced by small molecules found in LINCS. By linking these two data sets, the project will create a detailed, browsable, and interactive resource for identifying potential exercise mimetics. Aim 1 seeks to identify biomolecular signatures of exercise training from MoTrPAC by analyzing publicly released multi-omic response data from MoTrPAC's exercise training studies in young adult rats. 
Specific objectives include: 1) identifying biomolecular "signatures" through gene set enrichment analysis, network clustering, and gene regulatory networks; 2) evaluating the validity and reliability of these signatures across tissues, sexes, and timepoints in the rat study; and 3) generating an annotated database of signatures for alignment with LINCS. Aim 2 integrates MoTrPAC and LINCS biomolecular signatures in a cloud-based infrastructure that matches MoTrPAC signatures with those available from LINCS. This infrastructure will be designed to 1) organize, browse, and filter the MoTrPAC signatures database; 2) query existing cloud-based applications developed by LINCS for engagement with their library of data; and 3) deliver these results in a visually informative and interactive web application to maximize the user's ability to gain novel insights. Future efforts will expand the infrastructure by adding new MoTrPAC data and annotating exercise signatures with knowledge from other Common Fund datasets. Additionally, the project aims to leverage results to identify and prioritize pathways related to insulin resistance and other diseases for experimentation in model systems. This comprehensive approach seeks to accelerate the identification of exercise mimetics and provide valuable resources to translational researchers and stakeholders in the pharmaceutical pipeline.
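As a minimal illustration of the signature-matching idea described above, the following Python sketch ranks small molecule signatures by their cosine similarity to an exercise signature over shared genes. The scoring choice, the function names, and the data layout (a mapping from gene to signed expression change) are assumptions made for illustration; the abstract does not specify the matching statistic the project will use.

```python
from math import sqrt


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity of two signatures (gene -> signed change) over shared genes."""
    shared = sorted(set(sig_a) & set(sig_b))
    if not shared:
        return 0.0
    dot = sum(sig_a[g] * sig_b[g] for g in shared)
    norm_a = sqrt(sum(sig_a[g] ** 2 for g in shared))
    norm_b = sqrt(sum(sig_b[g] ** 2 for g in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_compounds(exercise_sig, compound_sigs):
    """Return (compound, score) pairs, most exercise-like first.

    compound_sigs is a hypothetical mapping: compound name -> signature dict.
    """
    scores = [
        (name, cosine_similarity(exercise_sig, sig))
        for name, sig in compound_sigs.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Under this assumed scoring, a compound whose signature moves the same genes in the same direction as exercise scores near 1, while one that reverses the exercise response scores near -1.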
The diversity of human tissues and cell types is controlled by differential regulation of gene expression. Enhancers are the primary units of gene expression control in humans, which physically interact with the target genes to activate them. To understand the mechanism of transcriptional regulation, several landmark consortia have accumulated large amounts of genomic data. The NIH Common Fund GTEx project has revealed tissue-specific gene expression and transcriptional regulation. The NIH Common Fund 4D Nucleome (4DN) project has characterized 3D chromatin interactions in many human tissues and cell types. The ENCODE project and the Roadmap Epigenomics project have profiled tissue-specific epigenomic states and annotated the regulatory elements accordingly. However, our understanding of human transcriptional regulation remains limited. A major challenge in studying Enhancer-Promoter (E-P) interactions is that enhancers are often located tens to hundreds of kilobases distal to their target genes, yet must physically interact with their target genes to activate them. Therefore, to understand gene regulation in humans, a necessary first step is to precisely map all E-P interactions. However, current 3D maps of the human genome remain sparse and noisy and fail to detect the vast majority of E-P interactions. In addition, the datasets generated from the Common Fund consortia are isolated in terms of the cell types and tissue types covered, how the data are stored, and the resolution of the genomic data, resulting in additional barriers to integrative analysis. Here we propose to combine an experimental approach, Region Capture Micro-C, with deep learning algorithms to accurately quantify E-P interactions. These results will enhance the current 3D maps from the 4DN and ENCODE projects. We also plan to integrate experimental and imputed E-P interactions with NIH Common Fund data in a knowledge graph. 
Our knowledge graph will support efficient cross-modality queries, graph visualization, and customized computational modeling for investigating quantitative rules in transcriptional regulation. Not only would this accomplishment have an enormous positive impact on the utility and usage of the Common Fund datasets, but it would also help to promote open science and reproducible research in the areas of computational genomics and data science.
Understanding and decoding the intricacies of gene regulatory networks is crucial in genomics for insights into gene expression and cellular functions. Traditionally, research in this field has heavily relied on transcriptome data and machine learning to infer these networks, but this approach has mostly used bulk tissue samples. This method overlooks the nuances of individual cells and their microenvironments, limiting our understanding to a broader, macroscopic level. The advent of spatial transcriptomics marks a significant shift, promising to unravel these networks at a single-cell and spatial level. This technology allows for the exploration of gene expression in relation to spatial dynamics, enhancing our understanding of tissue organization and cellular functions. However, adapting machine learning to spatial genomics faces challenges. One major issue is the scarcity of spatial transcriptome data, which hampers the effectiveness of deep learning methods known for their superiority in network estimation. Another challenge is the need for models that account for the physical positions of cells, as traditional methods treat data as independent and identically distributed, ignoring spatial relationships. To address these challenges, this proposal outlines two main objectives: Aim 1: Developing deep learning methods for cell-type resolution regulatory network estimation capable of transferring between scRNA-seq and spatial transcriptomics data. We will develop machine learning models that integrate components that explicitly model the regulatory network, distinguishing cell types based on transcriptomic data. The approach will use domain-invariant regularization to adapt from scRNA-seq to spatial transcriptomics, employing GTEx and HuBMAP data sets. Aim 2: Developing deep learning methods with spatial regularization for estimating regulatory networks at spatial resolution within spatial transcriptomics. 
We will develop techniques that factor in the spatial positioning of cells during the learning process. The hypothesis is that cells in close spatial proximity have similar regulatory structures. This aim will also use GTEx and HuBMAP data, along with collaborative efforts on spatial transcriptome data of the human dorsolateral prefrontal cortex. Overall, this proposal seeks to lead the development of advanced deep learning models that integrate cell-type resolution and spatial dimensions to advance our understanding of regulatory networks in genomics at both cell-type and spatial resolution.
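The spatial-proximity hypothesis above can be sketched as a penalty term that grows when nearby cells are assigned dissimilar regulatory networks. The Python example below is one possible form, assuming each cell has coordinates and a per-cell network given as a matrix of edge weights; the radius-based neighborhood, the squared-difference penalty, and the function name are illustrative assumptions rather than the regularizer the proposal will adopt.

```python
from math import dist


def spatial_penalty(coords, networks, radius):
    """Sum of squared edge-weight differences between networks of nearby cells.

    coords:   list of (x, y) positions, one per cell
    networks: list of square matrices (lists of lists), one per cell
    radius:   cells closer than or equal to this distance count as neighbors
    """
    total = 0.0
    n_cells = len(coords)
    for i in range(n_cells):
        for j in range(i + 1, n_cells):
            if dist(coords[i], coords[j]) <= radius:
                total += sum(
                    (a - b) ** 2
                    for row_i, row_j in zip(networks[i], networks[j])
                    for a, b in zip(row_i, row_j)
                )
    return total
```

In a training loop, a term like this would typically be added to the main loss with a tunable weight, so that the fitted networks vary smoothly across the tissue.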
Tissue-specific genes are underutilized as disease targets. Tissue-specific genes show narrow expression, play key roles in maintaining tissue homeostasis, and are thought to be good drug candidates. Thus, targeting dysfunctional tissue-specific genes can provide a safer therapeutic approach due to the reduced risk of side effects. However, identifying which tissue-specific genes are critical in disease is a bottleneck in drug discovery. We hypothesize that key tissue-specific genes have the ability to spread perturbations in a protein-protein interactome and can be identified by their context-specific functionality. This approach offers a paradigm shift from conventional analyses that focus solely on single-gene expression levels. We recently proposed the Gene Utility Model (GUM), which hypothesizes that how a gene is utilized in a protein-protein interaction (PPI) network dictates its importance in disease development. We will use the information flow of a gene within a PPI network to represent the gene utility in a given biological state. Under this scenario, genes with high information flows (i.e., high gene utilities) in a disease state, rather than genes with high expression levels, are deemed to play more important roles in disease development. Thus, this application seeks to increase the clinical utility of NIH Common Fund datasets by employing state-of-the-art systems biology approaches to precisely and reliably identify tissue-specific druggable functional genes (TS-DFGs). We will construct a prototype of the Common Fund Gene Utility Compendium by leveraging four NIH Common Fund datasets: Genotype-Tissue Expression (GTEx), Library of Integrated Network-based Cellular Signatures (LINCS), Illuminating the Druggable Genome (IDG), and 4D Nucleome (4DN). We will focus on three disease types, liver cancer, nonalcoholic fatty liver disease (NAFLD), and Alzheimer's disease (AD), as proof-of-concept studies. 
In Aim 1, we will uncover highly utilized tissue-specific genes across multiple normal tissue types and three selected disease types. We will then construct a utility karyotype to indicate chromosomal regions enriched with highly utilized genes. In Aim 2, we will employ selectivity, controllability, and suitability as criteria to score druggability for TS-DFG candidates with respect to liver cancer, NAFLD, and AD. Druggable utility networks (DUNs) for each disease type will be constructed to assess the distribution of high-scoring TS-DFGs in a PPI network and signaling pathways. The constructed prototype of the Common Fund Gene Utility Compendium will promote innovative research to enhance the usage of the NIH Common Fund datasets and add clinical value to them by offering a paradigm shift for target and drug discovery. Our long-term goal is to enlarge this compendium by including more diseases across different tissue types to facilitate integrative pan-tissue analyses and drive drug discovery.
Inherited diseases are a major source of blindness. These diseases are typically single-gene disorders. Some may cause developmental defects in early eye formation, others cause the degeneration of photoreceptor cells, while others may cause anterior segment disease. Historically they have been classified based on the clinical phenotype and then grouped into various disease entities. Retinitis pigmentosa (RP), an example of a rod-cone dystrophy, is the most common form of inherited retinal disease. It has a constellation of classical findings observed on eye examination accompanied by progressive loss of rod and then cone photoreceptors, leading to eventual blindness in late stages of the disease. In the past three decades the genetic basis of many forms of inherited retinal disease has been discovered, leading to the identification of over 270 retinal disease genes. Mutations of over 80 genes are known to be associated with RP alone. However, in spite of the tremendous progress that has been made, the causative genetic alteration can be identified in only 50-75% of patients with presumed inherited retinal disease, even after whole genome sequencing. Based on this fact, it is presumed that significant numbers of unknown ocular disease genes exist. One approach to identifying additional ocular disease genes in the mammalian retina is to take advantage of knockout mouse technology. The Knockout Mouse Phenotyping (KOMP) program is part of the International Mouse Phenotyping Consortium (IMPC), a group of scientists from mouse clinics around the world with the common goal of creating single gene knockout mice for every gene in the mouse genome. To date, single gene knockout mice have been created and phenotyped for over 7,000 of the ~24,000 protein coding genes in the mouse genome. The UC Davis Mouse Biology Program is one of just three KOMP/IMPC centers in the US and generates a large number of knockout mice for the KOMP pipeline. 
Mouse knockouts receive comprehensive phenotyping in every organ system in the first four months of life prior to necropsy and histopathology. Knockout lines are annotated for dozens of specific eye abnormalities, which are carefully documented during the phenotyping process. Identification of ocular disease genes in knockout mice provides candidate eye disease genes relevant in people. This proposal seeks to close the gap on the remaining 25-50% of patients with presumed inherited ocular diseases that currently cannot be genetically diagnosed. In this project we will catalog all mouse retinal disease genes identified by the KOMP. In addition, we will assess these novel mouse ocular disease genes for human relevance by cross-referencing with the GTEx database, also supported by the Common Fund. Furthermore, we will deeply analyze the specific cell biology of novel genes using literature search, Gene Ontology, Panther Pathway, STRING, Syscilia, CiliaCarta, and publicly available ophthalmic GWAS. This proposal will catalyze discoveries and generate novel hypotheses based on clinically relevant pathways not previously implicated in eye disease.
Identifying biomarkers that are diagnostic, robust, and generalizable across individuals while possessing therapeutic value is one of the most sought-after goals in medicine. However, there are numerous challenges in the identification of such robust therapy-associated biomarkers (TABs). For example, most current methods seek statistically significant differential biological signals in general patient cohorts but fail to account for heterogeneous genetic backgrounds and phenotypic diversity among individual patients. Our recent pan-cancer study across 12 cancer types, using newly developed machine learning-based feature engineering approaches, showed that biologically constrained features (herein named invariant features) are universal in disease and can be used to classify individual cancers. Importantly, we also showed that invariant features can be used to build de novo biological networks and discover network hubs that can be successfully utilized to infer the expression of associated genes. As such, invariant features can act as information encoders. Using information from the Drug Repurposing Hub, we showed that these hub genes are also drug targets. Collectively, these observations suggest that invariant feature hubs can be TAB candidates. We propose that, in light of these biological constraints, we can use a dynamic approach to biomarker discovery that captures both the genetic heterogeneity and the molecular fluctuation across individual patients. Our central hypothesis is that disease states show constraints in their molecular activities, and that identifiable invariant features possess diagnostic and therapeutic value. The main objective of this proposal is to uncover TABs using selected NIH Common Fund datasets (namely, exRNA, GTEx, LINCS, and IDG). In Aim 1, we will test the hypothesis that biologically constrained invariant features are universal to most if not all biological states. 
We will show this by finding invariant features with respect to each biological state from selected Common Fund datasets. We will conduct comparative analyses in disease and normal states in order to dissect disease-specific invariant features. Next, in Aim 2, we will test the hypothesis that invariant feature hubs are TABs. We will show this by determining the diagnostic capability of invariant feature hubs, measured by their “encodability”, that is, their ability to reconstruct the expression values of their associated invariant feature genes in different individual patients with the same disease type. Finally, we will map these invariant feature hubs to IDG and DrugBank to determine their druggability. For understudied hubs with no known drugs, we will perform computational analyses such as homology modeling and machine learning to characterize their druggability. We expect that timely accomplishment of the proposed aims will add value to the selected Common Fund datasets while providing a paradigm shift in biomarker and therapeutic target discovery.
Glycans play an important role in cell-cell interaction, cell signaling, and immune response, with specific glycan structural motifs responsible for driving glycans’ cell- and protein-binding activity and thus their functional role. Despite glycans’ significant structural heterogeneity, these glycan motifs, or determinants, have been recognized as relevant in specific biological contexts and named by glycobiologists. In a similar way to protein domains, these recurring structural motifs confer specific binding and functional activity on the glycan structures that contain them. Many collections and catalogs of these glycan motifs have been developed, but they are primarily lists: little to no functional understanding is captured by these resources to provide biological context for understanding a motif’s role in the cell. The GlycoMotif data resource, developed in the Edwards lab in support of the GlyGen Glycan Knowledgebase, a Common Fund-sponsored data resource, is one such catalog of glycan motifs, with the capability to annotate each motif with names, keywords, and publications. Despite the relative paucity of our understanding of glycan structure motifs’ functions, however, we do understand the glycosylation enzyme machinery responsible for assembling and attaching glycans to glycoproteins, and can enumerate, through the data resources provided by GlyGen, the glycoenzymes necessary for attaching and assembling mature glycan structures on mouse and human proteins. We seek to explore the utility of functional annotation of GlycoMotif glycan motifs by integrating gene-based phenotype, tissue-localization, and cell-type annotations of associated glycoenzymes from the International Mouse Phenotyping Consortium and GTEx Common Fund data resources with GlycoMotif, thereby providing a platform for comprehensive functional annotation of glycan motifs and, ultimately, glycan structures in GlyGen.
NIH Common Fund (CF) programs have produced a number of unique and high-value data sets. To solve complex biomedical questions, we need to find related data sets that can be co-analyzed for specific study purposes. Many of the current search techniques depend on data descriptors, which differ across CF programs and may be incomplete or inaccurate. Many of these experiments output lists of genes significant to certain biomedical conditions. We propose to use these gene lists to find similar data sets. This approach will not only enable searching across CF data sets but can also connect them to experiments in other databases and biomedical catalogs, e.g., databases containing disease-gene associations and molecular pathways. To achieve this aim, we will implement an efficient linear algorithm to calculate similarities between large numbers of gene sets. Our prototype tool, DBRetina, uses this algorithm to build large similarity networks in a few minutes using minimal computational resources. DBRetina serves as the foundation for CurIndex, a study similarity graph database that connects multiple health-related resources. DBRetina and CurIndex will allow advanced search for related CF experiments and facilitate better interpretation of biomedical data.
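To make the gene-list similarity idea concrete, the Python sketch below computes Jaccard similarity for every pair of gene sets that share at least one gene, using an inverted index from gene to the sets that contain it. This is an illustrative assumption about how such a comparison could be organized; it is not DBRetina's actual algorithm, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_jaccard(gene_sets):
    """Return {(set_a, set_b): jaccard} for pairs of sets sharing at least one gene.

    gene_sets is a mapping: set name -> iterable of gene symbols.
    """
    # Inverted index: gene -> names of the sets that contain it (sorted order).
    index = defaultdict(list)
    for name in sorted(gene_sets):
        for gene in set(gene_sets[name]):
            index[gene].append(name)

    # Count shared genes only for pairs that co-occur under some gene.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(names, 2):
            shared[(a, b)] += 1

    sizes = {name: len(set(genes)) for name, genes in gene_sets.items()}
    return {
        (a, b): n / (sizes[a] + sizes[b] - n)
        for (a, b), n in shared.items()
    }
```

Pairs with no genes in common never appear in the index, so they are skipped without being compared, which keeps the work proportional to the actual overlaps when most sets are unrelated.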
Genome-wide association studies (GWAS) have highlighted that disease-associated human genetic variants are prevalent in noncoding regions and that, for most of them, the biological function or gene target remains uncharacterized. To better annotate such disease variants, NIH-funded consortia created comprehensive maps of putative regulatory elements and identified SNPs associated with gene expression (eQTLs) for different tissues and primary cell types. In parallel, breakthroughs in capturing the 3D genome structure have demonstrated the importance of cell-type-specific physical proximity between genes and their regulatory elements. This 3D view provided a new way through which disease associations of certain variants can be explained. There is increasing interest in the use of chromatin loops for GWAS variant annotation; however, to the best of our knowledge, there is no comprehensive study incorporating eQTL data and high-resolution chromatin looping information across many different matched/related cell types and tissues to interpret GWAS variants identified for a large set of diseases. The goal of this proposal is to utilize NIH Common Fund datasets (GTEx and 4D Nucleome) as well as other published chromatin loop and eQTL data to carry out different integrative approaches for better annotation of disease-associated genetic variants. This will lead to the development of a framework and best practices for integrative analysis of loops, eQTLs, and GWAS signals. The developed framework will be tested on a large number of diseases and disease-relevant cell types to create a substantial online resource for researchers. For a subset of the studied diseases, for which we have ongoing research interests, we will further analyze the identified novel genes, genetic variants, and overlapping regulatory elements to determine potential targets that warrant further investigation.
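One simple way to picture the integration of loops, eQTLs, and GWAS signals is shown in the Python sketch below: a variant is linked to a gene by looping if it falls in one loop anchor and the gene's promoter falls in the other, and the result is compared with the variant's eQTL target genes. The single-chromosome coordinates, half-open anchor intervals, identifiers, and function names are all assumptions made for illustration; the proposal itself does not prescribe a specific procedure.

```python
def genes_linked_by_loops(variant_pos, loops, promoters):
    """Genes whose promoter lies in the anchor opposite the one holding the variant.

    loops:     list of ((start1, end1), (start2, end2)) anchor pairs, half-open,
               all assumed to be on the same chromosome as the variant
    promoters: mapping gene -> promoter position
    """
    linked = set()
    for anchor_a, anchor_b in loops:
        for near, far in ((anchor_a, anchor_b), (anchor_b, anchor_a)):
            if near[0] <= variant_pos < near[1]:
                linked.update(
                    gene
                    for gene, pos in promoters.items()
                    if far[0] <= pos < far[1]
                )
    return linked


def annotate_variant(variant_id, variant_pos, loops, promoters, eqtl_genes):
    """Split candidate target genes by supporting evidence (loop, eQTL, or both)."""
    loop_genes = genes_linked_by_loops(variant_pos, loops, promoters)
    eqtl = set(eqtl_genes.get(variant_id, ()))
    return {
        "both": sorted(loop_genes & eqtl),
        "loop_only": sorted(loop_genes - eqtl),
        "eqtl_only": sorted(eqtl - loop_genes),
    }
```

Genes supported by both lines of evidence would be natural candidates to prioritize, while the loop-only and eQTL-only groups show where the two data types disagree.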
Pediatric brain tumors are the most frequent cause of morbidity in children with solid tumors. Importantly, the aggressive therapeutic regimens often lead to debilitating neurological effects. The realization that developmental processes critical to brain development are also deregulated in cancer has provided new hope for understanding and treating brain tumors. Indeed, single-cell RNA-seq analyses have further demonstrated the role of defects in lineage determination for pediatric brain tumors. To discover novel drivers of tumorigenesis, we will focus on the function of three-dimensional (3D) genome folding in pediatric brain tumors. 3D chromatin interactions are involved in gene expression regulation, and changes in genome folding are linked to cell identity acquisition during development. While there is increasing interest in elucidating the function of 3D genome architecture during developmental processes and in cancer, how the 3D genome is organized in different pediatric brain tumors and its roles in tumor formation and progression are unknown. We hypothesize that disrupted 3D genome folding during embryonic or postnatal development alters gene expression, leading to abnormal cell differentiation and tumorigenesis in the developing brain. To test our hypothesis, we will comprehensively interrogate the genomes of pediatric brain tumors for non-coding variants that may affect 3D genome folding. We will use a deep-learning model called Akita that predicts 3D chromatin interaction frequencies from genome sequence alone. Because Akita only requires DNA sequence as input, we can predict the effect of any variant within a single framework that accommodates single-nucleotide variants (SNVs), insertions/deletions (indels), and structural variants (SVs). 
Akita will be used with pediatric brain whole genome sequences (WGS) from Gabriella Miller Kids First (KF) plus chromatin capture, epigenetic, and expression data from the 4D Nucleome (4DN) and Genotype-Tissue Expression (GTEx) programs in the following aims: 1) Determine the 3D genome architecture of atypical teratoid/rhabdoid tumors (AT/RT). We have initiated our study using AT/RT, tumors thought to be due to defects in early development and the most common brain tumor in children less than six months of age. 1.A. We will develop a bioinformatics pipeline that uses Akita to quantify how much a genetic variant is predicted to disrupt 3D chromatin interactions in AT/RT tumors. 1.B. We will validate and determine the functional relevance of 3D genomic folding disruptions observed in AT/RT tumors. 2) Determine the 3D genome architecture of malignant pediatric tumors. We will extend our analyses with Akita to additional malignant pediatric brain tumors, focusing for this pilot project on the most malignant and treatment-refractory tumors. This innovative project, using a new deep-learning tool, Akita, will lead to novel research hypotheses and will accelerate the discovery of additional regulators of pediatric cancer tumorigenesis and thus of potential therapeutic strategies for these devastating diseases.
The lack of uniformity in published experimental methods and data is a major impediment for the research community to compare, corroborate, and build upon biomedical discoveries. The FAIR data principles state that research data should be “findable, accessible, interoperable, and reusable.” Public metabolomics data repositories and large-scale studies supported by the NIH Common Fund, including Metabolomics Workbench and the Integrated Human Microbiome Project (iHMP), and other public mass spectrometry data repositories, such as the Global Natural Products Social Molecular Networking (GNPS) and MetaboLights, have made progress in recent years in addressing the first two FAIR principles by making metabolomics data easily findable and accessible. Unfortunately, the final two FAIR principles, which state that data should be interoperable and reusable, have not yet been adequately addressed by the metabolomics community. This prevents metabolomics data from multiple relevant studies from being compared and co-analyzed. This proposal aims to bridge this interoperability and reusability gap by harmonizing community standards and creating accompanying computational tools for data re-analysis. Specifically, this proposal will 1) standardize and convert mass spectrometry data formats (Aim 1); 2) harmonize experimental metadata and analysis results using a common controlled vocabulary with consistent semantics across all experiments (Aim 1); 3) develop web infrastructure to find and explore datasets by metadata (Aim 1); and 4) develop cloud-enabled, portable, reusable, and scalable co-analysis bioinformatics pipelines (Aim 2). Successful completion of these aims will democratize the ability of the entire metabolomics community to corroborate published findings, discover new metabolites that are highlighted only when co-analyzing datasets, and test translational hypotheses across different model organisms.
Prof. Oliver Fiehn will work with his key personnel, statistician Dr. Christopher Brydges, bioinformatics specialist Dr. Yuanyue Li, and programmer Gert Wohlgemuth (all UC Davis), to generate new pipelines that integrate stool microbial metagenomics data and stool mass spectrometry data to better associate metabolites with disease progression in inflammatory bowel disease. We will work in consultation with Dr. Clary Clish (Broad Institute), who generated the data and deposited them in the NIH Common Fund Metabolomics Workbench and the iHMP integrated human microbiome data. We will prioritize the enormous set of more than 80,000 as yet unidentified stool metabolic signals using longitudinal disease progression over 1 year in subjects with inflammatory bowel disease, in comparison to healthy subjects. For the limited set of not more than 1,000 metabolites that show significant association with health outcomes, we will use all available accurate mass MS/MS data and all stool microbiome data to obtain metabolite class information and likely metabolite structures or substructures. Dr. Clish will review our results and share new annotations that his group will release. To this end, we will build on the tools for metabolome predictions that have been developed by the KBase collaborative research consortium over the past 10 years. KBase uses microbial genomic sequences (or even transcriptomics data) to automatically build metabolic pathways through enzyme predictions and gap filling. KBase also supports the analysis of microbial communities, modeling import and export of metabolites that other microbes can use as carbon sources. In consultation with Dr. 
Chris Henry (Argonne National Lab) from the KBase consortium, we will then build pipelines within the KBase environment to include mass spectrometry tools that the Fiehn laboratory has built through its past NIH funding, specifically formula predictions and substructure predictions (in MS-FINDER), retention time predictions (in Retip.app), hybrid-shift MS/MS similarity matching (in NIST search), and entropy similarity MS/MS matching (in MassBank.us). This specific project will have a large impact on other, similar microbiome/metabolome projects that will be uploaded to the NIH Common Fund databases in the future. The project addresses the huge complexity in stool metagenomics and stool metabolomics data, and delivers key pipelines (called ‘narratives’ in KBase) that can be used by the research community at large.
Cancer remains the leading cause of death by disease past infancy among children in the United States. In contrast to adult cancers, which carry many genetic mutations, most pediatric cancers have few genetic mutations. Instead, recent studies have shown that fusion RNAs and their encoded proteins may drive tumorigenesis in children. Fusion RNAs are generated by exons from two genes. With the launch of the Fusion Oncoproteins in Childhood Cancers Consortium, more fusion proteins are being found and studied. However, a complete understanding of the mechanisms in pediatric cancer remains elusive, mainly due to three unsolved challenges. First, current studies have focused on mRNA-derived fusion proteins and have not explored long noncoding RNA-derived fusion transcripts (lnc-fusions) and their encoded proteins in pediatric cancer, although lnc-fusions have been reported in adult cancer to regulate anti-tumor immunity. Long noncoding RNAs (lncRNAs) are long transcripts of at least 200 nucleotides that cannot encode protein. lncRNAs largely outnumber mRNAs and play critical roles in various cancers. Therefore, a complete investigation of mechanisms driving pediatric cancer is not possible without expanding the study of fusion proteins to include lnc-fusions. Second, existing lnc-fusion detection methods cannot explore lnc-fusions that are derived from novel lncRNAs. Due to their high disease-specificity, most lncRNAs have not been annotated in pediatric cancer. Third, fusion RNAs, including lnc-fusions, may be formed by alternative mechanisms, such as chromosome rearrangement or aberrant splicing events. These alternative mechanisms complicate the understanding of genetic mechanisms and thus treatment. The large amount of multi-omics data from various Common Fund sources enables us to address these challenges in pediatric cancer. Previously, we developed computational methods to identify and characterize lncRNAs for human diseases and development. 
To discover molecular drivers in pediatric cancer, we will extend our previous studies of lncRNAs to identify lnc-fusions from RNA sequencing data (Aim 1). We will further determine the potential functions and formation mechanisms of lnc-fusions using integrative methods (Aim 2). Machine learning algorithms will be used to identify lnc-fusions as putative biomarkers and prognostic biomarkers in pediatric cancers. This study will focus on neuroblastoma and myeloid malignancies since these pediatric cancers have large cohorts of RNA sequencing and whole-genome sequencing data in the Gabriella Miller Kids First Dataset. In summary, we will discover lnc-fusions in pediatric cancers and develop computational methods and frameworks broadly applicable to existing and future RNA sequencing datasets. This study will improve the utility of three selected Common Fund datasets (Kids First, GTEx, and 4D Nucleome) and two external databases (GEO and TCGA).
The human genome has two alleles at each genetic locus, with one allele inherited from each parent. Allele-specificity has been widely observed and investigated across the human transcriptome, epigenome, and 3D chromatin organization, as evidenced by the data collected from the GTEx project, the ENCODE project, and the 4DN project, respectively. However, the allele-level interplay among transcriptome, epigenome, and 3D chromatin organization has not been systematically explored. In this proposal, we aim to leverage shared donors between these consortia and systematically investigate these connections at the allele level in many human tissue types. With the single-cell datasets available from these donors, we will narrow down from the tissue level to the cell type level, and further interrogate these allele-specific cross-modality connections. In essence, our analysis seamlessly integrates two NIH Common Fund datasets, namely the 4DN and GTEx datasets, with ENCODE datasets. With this integrated data source, biomedical researchers can easily navigate, browse, compare, and investigate high-quality, high-resolution, and comprehensive datasets regarding chromatin organization, regulatory elements, epigenomic status, and transcriptional activity with allele-specificity and cell-type-specificity. Not only would this accomplishment have an enormous positive impact on the utility and usage of the Common Fund datasets, it would also help to promote open science and reproducible research in the areas of computational genomics and data science.
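As a rough illustration of what "allele-specificity" means at a single heterozygous site, the sketch below applies an exact two-sided binomial test to reference and alternate allele read counts. The abstract does not name a statistical test, so the function name, the null ratio of 0.5, and the choice of a binomial test are all assumptions made for illustration only.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one site.

    Hypothetical helper, not taken from the proposal. The null hypothesis
    is that each read comes from the reference allele with probability
    p_ref. Intended for modest read depths (up to roughly 1,000 reads),
    because the binomial coefficients are converted to floats.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    # Probability of every possible reference-read count under the null.
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are no more likely than the observed one.
    p_value = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, p_value)
```

A balanced site such as 10 reference and 10 alternate reads gives a p-value of 1, while 20 reference and 0 alternate reads gives a very small one. A real analysis would also need to handle mapping bias, overdispersion, and multiple testing, none of which is addressed here.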
The NIH Common Fund Genotype-Tissue Expression project (GTEx) collected whole-genome sequencing and gene expression data from 47 tissue sites of hundreds of subjects. It generated a huge impact by providing tissue-level gene expression and expression quantitative trait loci (eQTLs) for over 7,000 publications. However, tissues are mixtures of myriad cells, and tissue-level gene regulation is affected by cellular composition. To obtain cell-type-specific (CTS) effects, GTEx started to collect single-nucleus RNA-sequencing (snRNA-seq) data from eight tissue types. Single-cell data collection is extremely expensive and labor-intensive, and thus snRNA-seq data have been collected from only 25 tissue samples of 16 donors, which may not represent the population. More cost- and labor-efficient methods are urgently needed to use existing datasets fully. By developing computationally efficient methods, we can gain population-level insights using single-cell data from another NIH Common Fund project, the Human BioMolecular Atlas Program (HuBMAP), as a reference. Complementary to GTEx and other single-cell references, the HuBMAP single-cell reference allows us to deconvolve the 47 GTEx tissues into over 200 cell types. In addition to the cellular fractions, we will calculate CTS eQTLs for those cell types at a population scale. Specifically, we will: 1) estimate cellular fractions of over 200 cell types from 47 tissue sites across the human body; 2) calculate CTS-eQTLs for those hundreds of cell types with statistical rigor and power. We will further consider the potential selection bias in the eQTL analysis arising from the fact that GTEx collected only normal tissues. The successful completion of this project will maximize the usage of the NIH Common Fund GTEx and HuBMAP projects to provide a new eQTL resource at cell-type resolution. 
It will be powerful in downstream analyses such as CTS colocalization by connecting with genome-wide association studies (GWAS) and CTS transcriptome-wide association studies (TWAS) by predicting genetically regulated CTS gene expression. Altogether, this project will provide a global picture of the human body at high resolution to map cells to health and complex diseases.
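The deconvolution step described above (estimating cellular fractions of a bulk tissue sample from a single-cell reference) can be sketched as a non-negative least-squares problem. The abstract does not say which deconvolution method will be used; the projected-gradient solver, the function name, and the toy signature matrix below are assumptions chosen only to make the idea concrete.

```python
def estimate_fractions(signature, bulk, n_iter=2000):
    """Estimate cell-type fractions from one bulk expression profile.

    Hypothetical sketch, not the proposal's method.
    signature: list of rows, one per gene, each holding the reference
               expression of that gene in every cell type.
    bulk:      list with the bulk expression of each gene.
    Minimizes ||signature @ f - bulk||^2 subject to f >= 0 by projected
    gradient descent, then rescales f to sum to 1.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # 1 / trace(S^T S) never exceeds 1 / (largest eigenvalue of S^T S),
    # so this step size is safe for gradient descent.
    step = 1.0 / sum(v * v for row in signature for v in row)
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(signature[g][k] * f[k] for k in range(n_types)) - bulk[g]
            for g in range(n_genes)
        ]
        grad = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        f = [max(0.0, f[k] - step * grad[k]) for k in range(n_types)]
    total = sum(f)
    return [x / total for x in f] if total > 0 else f


# Toy reference: 4 genes x 3 cell types (made-up numbers).
toy_signature = [
    [10.0, 1.0, 0.0],
    [0.0, 8.0, 1.0],
    [1.0, 0.0, 9.0],
    [5.0, 5.0, 5.0],
]
# Bulk profile generated as an exact 50/30/20 mixture of the three types.
toy_bulk = [5.3, 2.6, 2.3, 5.0]
toy_fractions = estimate_fractions(toy_signature, toy_bulk)
```

On this noise-free toy input the estimate converges to the mixing proportions used to build `toy_bulk`. With real data, gene selection, normalization, and noise modeling matter far more than the solver, and a dedicated deconvolution method would normally be preferred.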
The increasing adoption of whole-genome sequencing (WGS) in the context of genomic medicine and precision oncology has resulted in the accelerated discovery of structural variants (SVs) in patient cancer genomes. However, while human cancer types are generally characterized by widespread genomic instability, the functional consequences of most structural and copy number variants (CNVs) remain poorly understood. Critically, it is unknown which of the hundreds to thousands of genomic rearrangements typically observed in a patient tumor are pathogenic and which are non-functional genomic scars. Because SVs alter the genome at the structural (linear sequence), topological (three-dimensional organization), and phenotypic (epigenetic landscape) levels, integrative and multiscale datasets are necessary to correctly predict their impact. This dearth of integrative resources and tools critically limits the medical interpretation of patient genetic data. Existing large-scale genomic and proteogenomic cancer characterization efforts, including the Common Fund (CF) Gabriella Miller Kids First (GMKF) data resource, provide rich data to link genetic information, including SVs, with their phenotypic consequences, such as gene expression. However, these datasets alone are insufficient to provide deep mechanistic and functional insights. CF data sets, specifically 4D Nucleome (4DN), Epigenomics (Roadmap), and GTEx, provide the blueprint to link germline variation, genome topology, and chromatin architecture to gene expression. Therefore, we propose the integration of genomic data from patient tumor samples (GMKF) with spatial and functional data (4DN, Roadmap, GTEx), which will allow us to elucidate and predict the pathogenic mechanisms of structural variants. Aim 1: To create TopVar, a data resource to enhance our understanding of the interplay between genome TOPology and structural VARiation. 
The integrative TopVar resource will provide the phenotypic context required to interpret SVs in genetic and biological terms, which will yield testable hypotheses regarding their downstream effects. Aim 2: To develop and evaluate a predictive model of SV pathogenicity across multiple human cancers. Using the structured TopVar data resource, we will implement an interpretable statistical model to predict which SVs have an impact on gene expression, utilizing multiple layers of the integrated data. The realization of both aims will represent a proof-of-principle for the utility of TopVar for predictive modeling of SVs in the context of precision oncology. While our proposed study will focus on interrogating the comprehensive genomic data generated by GMKF (pediatric cancer) and CPTAC (adult cancer), it will serve as the foundation for their use within real-time sequencing programs, such as MI-OncoSeq and Peds-MI-OncoSeq, focusing on refractory and metastatic tumors.
Many valuable datasets have been generated from the NIH Common Fund programs, including large RNA-sequencing data from multiple tissues and species as well as increasingly available high-quality mass spectrometry-based proteomics data. Although proteomics offers unique insight into various pathophysiological processes, currently many proteomics datasets remain smaller in scale than their RNA-seq counterparts and it remains to be investigated how best to integrate proteomics and RNA-seq data to maximize their utility. The goal of this pilot project is to assess the feasibility of integrating multi-omics data to generate new hypotheses about cross-tissue physiology. We will perform three integrated tasks during the funding period: First, we will integrate data across tissues, species, and omics data type in two NIH Common Fund projects, namely GTEx v8 and MoTrPAC Release v1.0, with the aid of transfer learning methods. Our goal is to learn a low-dimension representation of gene expression structure across human tissues that can be applied to other datasets to facilitate integrative analysis. Second, we will apply an RNA-sequencing guided proteomics pipeline and software that we recently developed in order to extract hidden peptide information from Common Fund proteomics data, including isoform and post-translational modifications. We will then evaluate the feasibility and utility of integrating transcriptomics and proteomics data to examine multi-tissue correlations in gene expression. Lastly, we will utilize this data analysis pipeline in order to predict cross-tissue communication patterns from transcriptomics and proteomics data, then perform limited experimental validation of the computational findings using human induced pluripotent stem cell (iPSC)-derived cells. 
If successful, we envision that the results will inform data integration strategies that can help further increase the utility of large-scale proteomics and RNA-seq data in the public domain, as well as generate testable hypotheses on gene co-expression across multiple tissues that can inform future work.
The NIH Common Fund Data Ecosystem brings together a dozen data-rich projects producing high-throughput data with cutting-edge assays. The overall ecosystem design includes plans for cross-project data harvesting and analysis on a collaborative cloud computing platform. While a long-range plan may culminate in highly structured analytic workbenches, we propose to use approaches established in the Bioconductor project to design community-driven modular approaches to data structure and interactive analysis of selected Common Fund assets. Our first aim is to produce standard, easy-to-use interfaces to resources provided in the 4D Nucleome, Illuminating the Druggable Genome, and Genotype-Tissue Expression projects. These Common Fund projects provide data of considerable interest to the general research community, but data discovery and use are hampered by intrinsic complexity as well as differing access and delivery methods adopted by the various Common Fund projects. We will interface well-established data containers and query methods to allow familiar R/Bioconductor programming idioms to work smoothly with resources from the selected Common Fund projects. The two-decade history of Bioconductor's approach to modular software design, documentation, integration, and distribution will sharply increase the likelihood of durable improvement in access to and utilization of Common Fund assets. Our second aim is to leverage the new interfaces and containers to carry out four analytical projects in the investigation of origins of treatment-persistent tumor cells in organoid cultures, with the objective of identifying compounds that can attenuate and provide insights into epigenetic mechanisms of treatment persistence. 
The operational and substantive outcomes of this project will form the basis of a comprehensive, highly community-driven approach to building strong bridges between consortia innovating at the cutting edge of biotechnology, and scientists innovating in integrative translational inference based on genome biology.
Protein phosphorylation plays a major role in perturbation-induced signal transduction. The Library of Integrated Network-based Signatures (LINCS) P100 project has generated targeted (using 96 carefully chosen phosphosites) and comprehensive (using DIA, data independent acquisition) mass spectrometry (MS)-based phosphoproteomic datasets characterizing cell states perturbed using a collection of common bioactive therapeutic (“known”) compounds. The analysis of phosphoproteomics data acquired using DIA is challenging, and as a result the comprehensive LINCS DIA data has not yet been leveraged to its full potential. The Molecular Transducers of Physical Activity (MoTrPAC) consortium has also used phosphoproteomics (along with other omics data) to quantify the molecular effects of exercise. Using methods like post-translational modification site-specific enrichment analysis (PTM-SEA, Krug et al. MCP, 2019), perturbation-induced phosphorylation signatures can be correlated with exercise-induced phosphorylation changes to identify known compounds that can mimic the effects of physical activity, providing an exciting opportunity to enhance the combined utility and impact of the LINCS and MoTrPAC Common Fund datasets. The goal of our proposal is to test the hypothesis that there are known compounds that can mimic the effects of physical activity. To accomplish this goal, the PTM signatures database (PTMsigDB) will be significantly expanded using the LINCS DIA data. These signatures will then be correlated with phosphoproteomic changes induced by physical activity provided by MoTrPAC to nominate exercise-mimetic drugs. 
With these goals in mind we will 1) Develop and apply an automated, cloud-based pipeline to greatly expand phospho-perturbation profiles from existing LINCS data from 100 to several thousand distinct phosphosites; 2) Derive perturbation-specific phosphoproteomic signatures from the greatly expanded phospho-profiling data; and 3) Identify exercise mimetic drugs using PTM signature enrichment in MoTrPAC phosphoproteomic data from acute and long-term exercise. Successful application of the proposed strategy will nominate a list of exercise-mimetic drugs with the potential to initiate entirely new avenues of research focused on finding new therapeutic approaches to combat aging or muscle wasting (cachexia) in cancer patients.
To enhance the utility of the Common Fund-supported 4D Nucleome (4DN) database and Genotype-Tissue Expression (GTEx) database, we will develop novel computational tools for inferring the spatial organization of genomic elements to elucidate how eQTLs can regulate the expression of their target genes. Our tools will integrate 4DN and GTEx data and overcome the limit of the 2D nature of Hi-C frequency heatmaps, enabling construction of large 3D ensembles of high-resolution models of single-cell chromatin conformations for loci containing tissue-specific genetic variants associated with differential expression. By accounting for 3D polymer effects of random collision between genomic elements due to nuclear volume confinement, our tools will identify chromatin interactions that are statistically significant and likely biologically important. With the ensemble model of single-cell 3D chromatin conformations, our tools will further identify participating genes, promoters, enhancers, and other elements, and elucidate how they are physically arranged in space around genetic variants associated with differential gene expression, including how units of higher-order many-body interactions for gene regulation may form. In addition, our tools will quantify the presence of heterogeneous subpopulations of cells with different chromatin 3D configurations, allowing probabilistic understanding of the heterogeneous physical interactions around eQTLs. With planned comparative analysis of 3D chromatin conformations from different tissues, different spatial patterns of arrangement of genes and elements important for gene expression will be uncovered, resulting in a better understanding of the relationship between genome structure and function. Overall, we will demonstrate the significant added power of integrating two important Common Fund data resources and will provide tools to facilitate understanding the relationship between genome topology and gene expression. 
Our work will enable highly specific and compelling testable hypotheses on mechanisms of gene regulation to be formulated based on the reconstructed 3D spatial genome topology at loci that harbor variants and eGenes. Validation or refutation of these hypotheses will lead to new insight into the relationship between genome structure and genome function, which is important for improving human health.
Smoking and drinking are major modifiable and heritable risk factors for a myriad of human diseases. Elucidating the genetic basis for smoking and drinking addiction will be critical for public health. In the past few years, genetic studies of smoking and drinking addiction have made significant progress. With the help of large datasets and advanced analytical methods, we have identified >400 associated loci in samples of European ancestry. As a next step, we will expand our study to include samples from non-European populations, in order to further empower discovery and elucidate the genetic architecture. A majority of the identified GWAS loci are non-coding. A critical first step to understanding their function is to identify the target gene. Transcriptome-wide association study (TWAS) was proposed to link regulatory variants to target genes. In its original form, TWAS integrates eQTL and GWAS data from matched ancestry. As multi-ethnic studies become more prevalent, it has been shown that direct integration of European eQTL data with non-European GWAS leads to loss of power, and the results may be difficult to interpret as well. The majority of Common Fund functional genomic data (e.g., GTEx and 4DN) are from donors of European ancestry. It remains unclear whether they remain useful in multi-ethnic studies and, if so, how to utilize them effectively. Here we propose a series of methodological innovations to combine GTEx data, epigenetic and 3D genome data, and other non-European functional genomic data to improve gene expression prediction accuracy across tissue types and ancestries. For a given gene expression model, we will also propose methods to perform provably optimal TWAS in multi-ethnic genetic studies. These proposed methods, if successful, will open the door to using Common Fund data in next-generation genetic studies of complex traits in diverse populations. 
Compared to extremely expensive data generation, these method development projects are cost-effective and could be highly impactful for maximizing the utility of Common Fund datasets.
This project will pilot a process to explore the role of genes contributing to abnormal asymmetry in developmental disorders by combining knowledge of genotype/phenotype interactions derived from the Common Fund Knockout Mouse Phenotyping Program (KOMP2) and the Genotype-Tissue Expression (GTEx) project with family cohort data from two Gabriella Miller Kids First Pediatric Research Projects (KF): Genomic Studies of Orofacial Cleft Birth Defects and Genomics of Orofacial Cleft Birth Defects in Latin American Families. Asymmetry is a key feature of numerous developmental disorders including major structural birth defects as well as neurological disorders. A better understanding of the genetic basis of asymmetry and its relationship to disease susceptibility will help unravel the complex genetic and environmental factors and their interactions that increase risk in a wide range of developmental disorders. The KOMP2 project aims to provide comprehensive mouse knockout phenotype data, including 3D fetal imaging of sub-viable and lethal lines that are likely to play a significant role in development. In this project, automated, dense quantification of asymmetry of 3D embryonic microCT images will be used to build statistical models of asymmetry in normal development. Knockout strains will be screened for phenotypes with asymmetric structures or organs with the goal of detecting genes associated with abnormally heightened asymmetry. The functional significance of the selected genes will be validated by comparing regions impacted in knockout strain phenotypes from the KOMP2 dataset to tissue expression data from the GTEx project. Candidate genes identified using biological information from the KOMP2 and GTEx datasets will be explored for association with the KF whole genome sequencing data from OFC parent-case trios with the aim of identifying genetic variants that are enriched in these groups compared to a control population. 
Identification of these variants will help shed light on the mechanisms linking congenital asymmetry and OFC risk. The outcomes of this study will include (1) statistical models of normal anatomy and asymmetry from the KOMP2 fetal 3D imaging data, (2) open-source software to produce detailed phenotype descriptions from dense morphometric analysis of 3D images from the KOMP2 dataset, (3) correlations between phenotype descriptions from the KOMP2 knockout strains and tissue expression data from the GTEx project, and (4) analysis of the contribution of rare variants on candidate genes towards OFC risk.
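To make the notion of quantifying asymmetry concrete, the sketch below scores bilateral asymmetry from paired left and right 3D landmarks by mirroring one side across the midsagittal plane. This is a deliberately simplified stand-in: the project describes dense, image-based quantification from microCT volumes, whereas the landmark representation, the assumption that the midsagittal plane is x = 0, and the function name are all illustrative assumptions.

```python
from math import dist


def asymmetry_score(left_landmarks, right_landmarks):
    """Mean distance between right landmarks and mirrored left landmarks.

    Hypothetical helper, not the project's pipeline. Each argument is a
    list of (x, y, z) tuples, with left_landmarks[i] anatomically paired
    with right_landmarks[i]. The specimen is assumed to be aligned so
    that the midsagittal plane is x = 0. A perfectly symmetric specimen
    scores 0.
    """
    if not left_landmarks or len(left_landmarks) != len(right_landmarks):
        raise ValueError("landmark lists must be non-empty and paired")
    # Reflect the left side across the plane x = 0.
    mirrored = [(-x, y, z) for x, y, z in left_landmarks]
    total = sum(dist(m, r) for m, r in zip(mirrored, right_landmarks))
    return total / len(left_landmarks)
```

A statistical model of normal asymmetry could then be as simple as the distribution of such scores across wild-type embryos, against which knockout lines are compared. The abstract does not specify the model, so that reading is likewise only one possibility.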
Localization to specific subcellular compartments, such as the nucleus, lysosomes, and mitochondria, has tremendous potential to enhance the effectiveness of therapeutic molecules compared with random distribution throughout the cell. With improved subcellular localization and enhanced concentration, a specific molecule can be more efficacious as well as less toxic, toxicity being a common concern with random distribution and nonspecific localization. Therefore, understanding the subcellular distribution of a specific molecule and its mechanism can help in modulating diseases mediated by subcellular dysfunction. Xenobiotic localization at the subcellular level has a profound effect on several processes. The overarching goal of the proposed work is to develop a novel platform with computational tools for specific xenobiotic localization. The proposed work will take advantage of three Common Fund datasets. In Specific Aim 1, we aim to develop a suite of machine learning (ML) models for hierarchical levels of micro-compartmentation and 40 specific subcellular locations. These machine learning models will first be built using three different types of features (fingerprint-based, pharmacophore-based, and physicochemical descriptor-based). Then, they will be fused using an advanced multilayer combinatorial fusion algorithm to obtain the best consensus model. We will also perform scaffold analysis to identify critical scaffolds that play a role in accumulating molecules at specific subcellular locations. In Specific Aim 2, we will conduct experimental validation of the predictions of the developed ML models. More specifically, we will test 50 compounds for their subcellular location. In Specific Aim 3, we plan to build an open portal that incorporates datasets, ML models, a prediction server, and documentation. All the data and models generated from the project will be made available as open source.
This project will pilot a process for identifying multi-variant interactions contributing to structural birth defect and childhood cancer disorders. This study will focus on analysis of oral-facial clefts, congenital diaphragmatic hernia, and congenital heart defects using whole-genome sequencing (WGS) data from family cohorts in the Gabriella Miller Kids First Pediatric Research Project (KF). Typically, tests to link disorders to genome-wide complex multivariate associations are computationally prohibitive. Thus, a first step toward making such analyses more tractable is to limit the number of variables (genes, variants) being tested together. We can reduce the number of possible tests by restricting what data should be tested. This study will seek to reduce the data for testing by utilizing biological knowledge from two other Common Fund datasets: the Knockout Mouse Phenotyping Program (KOMP2) and the Genotype-Tissue Expression (GTEx) project. KOMP2 has generated extensive information on mouse knockout developmental phenotypes relevant for matching genes and phenotypes in KF WGS studies. GTEx can be merged with KF loci and relevant tissue-to-phenotype relationships. Thus, using features from other Common Fund data and annotations, we can generate selected subsets of KF variants and genes as feature-reduced KF data. A comprehensive machine learning (ML) analysis pipeline will then be utilized for the identification of candidate risk factors and characterization of complex patterns of association in these feature-reduced KF data. In addition to performing the more traditional univariate association analyses of genotype vs. phenotype, this pipeline will also identify complex associations including (1) context-dependent genetic effects resulting from non-additive multi-variant interactions, i.e. epistasis, and (2) subgroup-specific associations, i.e. 
by phenotype and genotypic heterogeneity, where different etiological paths lead to the same/similar phenotypes in the selected KF subject group. This pipeline will include feature selection, modeling, and interpretation of multi-variant interactions. The outcomes of this study will include (1) pipelines for integrating Common Fund data into Kids First datasets, (2) integrated KF-KOMP2-GTEx datasets including cross-species integration, (3) ML pipelines for multi-variant interaction analyses of phenotype vs genotype in selected, reduced-feature KF data, and (4) results from the aforementioned pipelines for multi-variant interactions for later hypothesis testing.
The Genotype-Tissue Expression (GTEx) Program studies the impact of genetic variants on gene expression in many human cell types and tissues. To identify the expression quantitative trait loci (eQTLs) of each gene, the genetic variants within one million base pairs (1 Mb) of the transcription start site (TSS) of the gene are considered as the candidates, and then the GTEx computational pipeline identifies the significant candidates as eQTLs of the gene. This 1 Mb threshold is widely used as the gold standard in the field to reduce the multiple-testing burden. Using this threshold assumes that genetic variants outside of this distance contribute little to gene expression, and thus are unlikely to be eQTLs. However, we observed that, on average, 10% of cis-regulatory elements (CREs) are outside of the 1 Mb threshold, herein referred to as distal CREs. Therefore, the eQTLs in such CREs are missed using the 1 Mb threshold. In addition, the 1 Mb threshold implicitly assumes that the majority of genomic regions within this distance of a TSS are CREs that regulate the gene. However, we found that on average CREs account for only 2.1% of the ±1 Mb regions around a TSS. Moreover, it is not uncommon for CREs to skip the closest genes and regulate distal genes. These observations indicate that many candidate variants within the 1 Mb distance may be noise, and thus impede the detection of bona fide eQTLs. In line with this, we found that using distance thresholds smaller than 1 Mb substantially increases the numbers of eQTLs and associated genes. Together, these results indicate that current eQTL detection can be improved by focusing only on the CREs of genes. To this end, we will use the genome structure data from 4D Nucleome and other public data to build CRE-gene linkages. These linkages are expected to detect more eQTLs, especially the weak ones. 
The results will enhance the existing GTEx dataset and substantially improve our understanding of gene expression regulation and human diseases.
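The contrast between the two candidate-selection strategies described above can be sketched in a few lines: a fixed distance window around the TSS versus restriction to CREs linked to the gene. The function names, the single-chromosome integer coordinates, and the toy positions are assumptions for illustration; they are not taken from the GTEx pipeline or from the proposed method.

```python
def cis_window_candidates(variant_positions, tss, window=1_000_000):
    """Variants within `window` base pairs of the TSS (distance-based rule)."""
    return [pos for pos in variant_positions if abs(pos - tss) <= window]


def cre_candidates(variant_positions, cre_intervals):
    """Variants falling inside any CRE linked to the gene.

    cre_intervals is a list of inclusive (start, end) pairs. The
    CRE-gene linkage itself is assumed to be given.
    """
    return [
        pos
        for pos in variant_positions
        if any(start <= pos <= end for start, end in cre_intervals)
    ]


# Toy example on one chromosome (all coordinates are made up).
toy_tss = 2_000_000
toy_variants = [500_000, 1_900_000, 2_000_500, 3_600_000]
toy_cres = [(2_000_400, 2_000_600), (3_590_000, 3_610_000)]  # second CRE is distal

window_hits = cis_window_candidates(toy_variants, toy_tss)
cre_hits = cre_candidates(toy_variants, toy_cres)
```

Here the window rule keeps the variants at 1,900,000 and 2,000,500, while the CRE rule keeps 2,000,500 and 3,600,000: it drops a nearby variant that lies outside any CRE and recovers a variant in a distal CRE more than 1 Mb from the TSS. At genome scale an interval tree or sorted-interval search would replace the linear scan.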
The NIH Common Fund program has generated a number of transformative data sets containing a wide variety of multi-dimensional molecular and phenotypic data from human and model organisms. We propose to promote the integration and widen the usage of selected Common Fund data sets (KidsFirst, GTEx, and LINCS L1000) using UCSC Xena. UCSC Xena is a web-based high-performance resource for functional genomics data visualization with a large user base in the cancer genomics research community. Xena is already a visualization resource for a Common Fund data set, the GTEx transcriptome, which cancer researchers can use to compare gene expression between tumors and matched normal tissues. Our proposed work will further widen the usage of the Common Fund data sets by providing the scientific community a web-based interactive avenue to visualize and explore the data. We also propose to integrate the data with other cancer genomics data for pan-cancer, cross-tissue comparison to identify target genes and pathways in patients' tumors. Lastly, our proposal will extend Xena Browser software functionality to use differential gene expression and leverage external tools and APIs to search for small molecules that might inhibit tumor growth. We propose the following aims. Aim 1. We will add KidsFirst data to Xena and integrate it with the cancer genomics data already on Xena to enable comparative visualization of molecular profiles across pediatric and adult tumors. Aim 2. Building upon the success of the UCSC RNA-seq compendium, we will deliver an even larger uniformly analyzed RNA-seq data compendium of over 25,0000 samples from KidsFirst, GTEx, TCGA, TARGET, CCLE and other studies. We will openly share the compendium data using the Xena Browser. This rich data resource will support not only users wishing to compare expression between tumor and normal tissues, but also the Treehouse Childhood Cancer initiative in their quest to find treatments for children with cancer. 
Aim 3. We will extend Xena software functionality to perform genome-wide differential gene expression analysis and connect the analysis results to L1000FWD, a state-of-the-art web-based search and visualization tool for tens of thousands of small-molecule perturbation signatures profiled by the LINCS L1000 assay. This new feature and connectivity will enable users to predict candidate small molecule perturbations that might disrupt tumor growth using the reverse tumor gene expression signature they identified on Xena.
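As an illustration only, the idea of searching for perturbations that reverse a tumor signature can be sketched as a simple set-overlap score. This is not the scoring used by L1000FWD or Xena; the function names, gene symbols, and drug names below are hypothetical, and the score is a deliberately simplified stand-in.

```python
def reversal_score(tumor_up, tumor_down, drug_up, drug_down):
    """Score how strongly a perturbation opposes a tumor signature.

    Positive values mean the perturbation tends to push genes in the
    opposite direction to the tumor; negative values mean it mimics it.
    """
    tumor_up, tumor_down = set(tumor_up), set(tumor_down)
    drug_up, drug_down = set(drug_up), set(drug_down)
    # Genes up in tumor but down under the drug, and vice versa.
    opposing = len(tumor_up & drug_down) + len(tumor_down & drug_up)
    # Genes moving the same way in tumor and under the drug.
    concordant = len(tumor_up & drug_up) + len(tumor_down & drug_down)
    total = len(tumor_up) + len(tumor_down)
    return (opposing - concordant) / total if total else 0.0


def rank_candidates(tumor_up, tumor_down, signatures):
    """Rank perturbations from most reversing to most mimicking.

    `signatures` maps a perturbation name to an (up_genes, down_genes) pair.
    """
    scores = {
        name: reversal_score(tumor_up, tumor_down, up, down)
        for name, (up, down) in signatures.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical gene symbols and perturbation names.
    ranked = rank_candidates(
        {"GENE_A", "GENE_B"},
        {"GENE_C", "GENE_D"},
        {
            "drug_x": ({"GENE_C"}, {"GENE_A", "GENE_B"}),
            "drug_y": ({"GENE_A"}, {"GENE_C"}),
        },
    )
    print(ranked)
```

In practice the up and down gene sets would come from the genome-wide differential expression analysis, and a production tool would typically use a rank-based or weighted similarity rather than raw set overlap.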
Mass spectrometry in combination with chromatography provides a powerful approach to characterize small molecules produced in cells, tissues and other biological systems. In essence, measured metabolites provide a functional readout of cellular state, allowing novel biological studies that advance our understanding of health and disease. Currently, the main bottleneck in metabolomics is determining the chemical identities associated with the spectral signatures of measured masses. Despite the growth of spectral databases and advances in annotation tools that recommend the chemical structure that best explains each signature, the large majority of measured masses cannot be assigned a chemical identity. There is now consensus that gleaning partial information regarding the measured spectra in terms of chemical substructure or chemical classification can inform biological studies. This consensus is reflected in the newly updated reporting standards for metabolite annotation as proposed by the Metabolite Identification Task Group of the Metabolomics Society. As we show in our Preliminary Results, spectral characterization results in “features” that can enhance performance in machine-learning tasks such as annotation. This work aims to enhance the use and value of the metabolomics dataset in Metabolomics Workbench by: (1) developing machine-learning tools trained on this dataset to characterize unknown spectra, and (2) adding characterization information to the Metabolomics Workbench dataset. In Aim 1, we identify spectral patterns (motifs) that can represent chemically meaningful groupings of peaks within the spectra (e.g., peaks associated with aromatic substructures, loss of a substructure fragment, etc.). We utilize neural topic models that use variational inference to identify such motifs. We expect such models to offer computational speedups and to identify more chemically coherent motifs when compared to earlier implementations of topic modeling. 
We generate motifs across all spectra in the Metabolomics Workbench and provide annotations for each spectrum. In Aim 2, we map spectral signatures to chemical ontology classes. As ontologies are hierarchical and as a molecule can be associated with multiple classes at different hierarchical levels of an ontology, we cast this mapping problem as a hierarchical multi-label classification problem and use neural networks to implement such a classifier. The classifier will be trained using the Metabolomics Workbench dataset. Learned motifs from Aim 1 will be used as additional input features to improve classification. We expect that the developed classifier can be used by others to elucidate measurements of unidentified molecules with chemical ontology classes, or to generate ontology terms that can be used as features in downstream machine-learning tasks.
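One detail of hierarchical multi-label classification can be made concrete with a small sketch: because a molecule assigned to a class also belongs to every ancestor of that class, training targets are usually closed under the ontology's parent relation before being encoded as a multi-hot vector. The class names and the `parent` map below are hypothetical placeholders, not the ontology used in the proposal.

```python
def with_ancestors(labels, parent):
    """Return the labels plus all of their ancestors in the ontology.

    `parent` maps each class to its parent class; root classes are absent
    from the map (or map to None).
    """
    closed = set()
    for label in labels:
        # Walk up the hierarchy until the root or an already-seen class.
        while label is not None and label not in closed:
            closed.add(label)
            label = parent.get(label)
    return closed


def encode_targets(labels, parent, classes):
    """Encode a label set as a multi-hot vector consistent with the hierarchy."""
    closed = with_ancestors(labels, parent)
    return [1 if cls in closed else 0 for cls in classes]


if __name__ == "__main__":
    # Hypothetical three-level fragment of a chemical ontology.
    parent = {
        "Benzenoids": "Organic compounds",
        "Phenols": "Benzenoids",
        "Lipids": "Organic compounds",
    }
    classes = ["Organic compounds", "Benzenoids", "Phenols", "Lipids"]
    print(encode_targets({"Phenols"}, parent, classes))
```

A classifier trained on such targets can then be evaluated per hierarchy level, and its predictions can be post-processed with the same closure so that a predicted class never appears without its ancestors.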
Gene duplication is a major mechanism for the evolution of novel gene functions. Copy-number and sequence variation within multigene families are associated with many phenotypes, human diseases, and evolutionary adaptations. Yet systematic incorporation of gene paralog variation into studies of genomic diversity is lacking. Most existing tools are not well suited to delineating differences among gene family members or require prohibitively large computational resources. We recently developed an approach, QuicK-mer2, which efficiently estimates gene copy-number in a paralog-specific manner. Application of our approach to data from the 1000 Genomes Project revealed rare gene-paralog variants that have not been previously reported. Here, we propose application of QuicK-mer2 to create paralog-specific copy-number estimates from existing NIH Common Fund genomics data sets. In Specific Aim 1, we will analyze genome sequencing data from the Genotype-Tissue Expression (GTEx) consortium to define the effect of gene paralog variation on gene expression levels. Although we will assess the entire genome, we will focus our analyses on variation among the largest family of transcription factors, KRAB-ZFPs (Krüppel-associated box zinc finger proteins), to identify trans-acting expression QTL. In Specific Aim 2, we will analyze variation among duplicated genes in the Gabriella Miller Kids First Data Resource with a focus on structural birth defects, a phenotype to which copy-number variation is known to be a key contributor. Many recurrent copy-number variants arise in regions that are flanked by large segments of duplicated sequence with high identity. Many of these regions of segmental duplication also contain members of duplicated gene families that have important biological functions. Here, we will focus on discovering previously missed gene copy-number variation within the duplicated sequences themselves.
Together, completion of these aims will give a fuller picture of the extent of genomic variation and the impact of differences among gene paralogs on gene regulation and disease.
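The general idea behind k-mer based, paralog-specific copy-number estimation can be illustrated with a toy sketch. It is not the QuicK-mer2 implementation: it assumes we already have a set of k-mers unique to one paralog and a set of k-mers from control regions with a known copy number, and it omits GC correction and windowing. All function names and k-mer sequences are hypothetical.

```python
from collections import Counter
from statistics import mean


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of read sequences."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def estimate_copy_number(read_counts, paralog_kmers, control_kmers, control_copies=2):
    """Estimate copy number from depth at paralog-specific k-mers.

    Depth at k-mers unique to the paralog is scaled by the depth at control
    k-mers, which are assumed to be present at `control_copies` copies.
    """
    control_depth = mean(read_counts.get(kmer, 0) for kmer in control_kmers)
    if control_depth == 0:
        raise ValueError("no coverage at control k-mers")
    paralog_depth = mean(read_counts.get(kmer, 0) for kmer in paralog_kmers)
    return control_copies * paralog_depth / control_depth


if __name__ == "__main__":
    # Hypothetical k-mer counts: the paralog is covered 1.5x as deeply as
    # diploid control sequence, suggesting three copies.
    counts = Counter({"AAAC": 30, "AACG": 30, "GGGT": 20, "GGTC": 20})
    print(estimate_copy_number(counts, ["AAAC", "AACG"], ["GGGT", "GGTC"]))
```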
Rare diseases affect 30 million Americans, yet 95 percent of these diseases do not currently have an identified therapeutic option. Advances in genetics and omics technologies, coupled with increased availability of health data, present an opportunity to make precise, personalized patient care a broad clinical reality. However, the lack of rare disease clinical samples and suitable preclinical models for research and development often makes it difficult to even nominate, let alone test, therapeutic options for these patients. To aid rare disease research, our long-term goal is to develop and apply approaches leveraging multi-omics data to nominate and prioritize drug targets and repurposing candidates. In this project, our main objective is to conduct a feasibility study based on analyses across NIH Common Fund and other publicly available data, developing research methods to support data integration. In Aim 1, we will pursue bioinformatics analysis to identify and improve optimal preclinical rare disease models by piloting approaches for identifying the best cell line as an avatar for a given patient (Aim 1a) and for analyzing patient induced pluripotent stem cell (iPSC) profiles in the context of the most clinically relevant tissue types (Aim 1b). In Aim 2, we will determine and test prioritized drug repurposing candidates for rare diseases by implementing transfer learning to project data onto cell line-by-perturbation data and identifying drug candidates that might rescue cell physiology deficits (Aim 2a). We will test top drug repurposing candidates for each phenotype in either patient-derived iPSC lines or xenograft mouse models and generate and analyze RNA-seq profiles pre- and post-treatment for future use in refining computational models (Aim 2b). We focus here on two rare diseases which both desperately need improved therapeutic options: Friedreich’s ataxia (FRDA) and rare brain tumors including glioblastoma multiforme (GBM). 
We will use NIH Common Fund data sets specified in this Funding Opportunity Announcement (GTEx, Kids First, LINCS, and PHAROS), other NIH-supported data sets (TCGA and CCLE/DepMap), and RNA-seq data generated in our lab at UAB. Because we are advancing this methodology in two disease systems simultaneously, we will demonstrate broad utility of these approaches and ensure a high chance of success in the one-year timeframe. These approaches will be the basis of a conceptual framework for subsequent R01-level funding regarding genome-guided precision medicine approaches and computational methods development, as well as generating hypotheses for future collaborative GBM and FRDA research projects. The interdisciplinary approaches described here are crucial for advancing bench-to-bedside rare disease studies both at UAB, a leader in rare disease diagnosis, and in the broader scientific community. Upon successful completion of this proposal, we expect to contribute advances in both preclinical modeling of rare diseases and prioritization of drug repurposing candidates for them, and to demonstrate how Common Fund data can be used to accelerate rare disease research.
After the completion of the Human Genome Project, several landmark consortia have accumulated large amounts of genomic data towards understanding the functions of the human genome. The ENCODE project has annotated genome-wide regulatory elements. The Roadmap Epigenomics project has characterized tissue-specific variation in epigenetic state. The NIH Common Fund GTEx project has delineated tissue-specific gene expression and transcription regulation. The NIH Common Fund 4D Nucleome (4DN) project has revealed dynamic 3D chromatin organization in many cell and tissue types. Each of the aforementioned consortia has generated thousands or even tens of thousands of datasets, and provided different insights regarding the human genome at an unprecedented scale and depth. However, the datasets generated from these consortia are isolated in terms of cell types and tissue types covered, how the data are stored, and the resolution of the genomic data. These gaps pose practical data analysis challenges to biomedical researchers when they use these public datasets jointly in their research: they need to go through different data portals with heterogeneous processing pipelines, different data formats, and unmatched resolutions. We aim to develop cutting-edge deep learning approaches to impute high-resolution chromatin contact maps, and to integrate the high-resolution chromatin contact maps with transcriptional data available from the GTEx project and epigenomic data from ENCODE/Roadmap. We plan to share the integrated data on a public web server with a multi-panel interactive genome browser. The integrated data will provide an important resource for understanding tissue-specific genetic variation in light of the spatial organization of these genomic and epigenomic elements and their functional implications.
Improving Deposition Quality and FAIRness of Metabolomics Workbench
The practical reuse of genomics and transcriptomics datasets is well-demonstrated due to the use of universal gene identifiers that facilitate matching of features across these datasets, high feature coverage, standardized metadata and data deposition formats, and a maturity in deposition quality and consistency. However, metabolomics datasets are much harder to reuse due to the lack of standardized metabolite feature identification, heterogeneity in feature coverage, and high variability in deposition quality and consistency. Therefore, it is much harder both to find relevant metabolomics datasets in repositories like Metabolomics Workbench (MWbench) and to effectively reuse these datasets to generate and/or test hypotheses. To address these difficulties in reusing metabolomics datasets, deposition quality must be improved. Furthermore, methods that enable the effective search and harmonization of MWbench studies are needed, especially for integrative multi-omics analyses. We are the developers of the only set of available open-source tools for parsing, generating, and validating mwTab formatted repository files. Our experience developing and utilizing this open-source mwtab Python package makes us uniquely qualified to develop methods to improve both deposition and FAIRness of MWbench studies. Also, we have provided periodic feedback to MWbench based on systematic evaluations of the repository to enable the improvement of this growing public resource (2).
Therefore, we propose to develop methods and open-source tools that will improve deposition quality and FAIRness of MWbench through the following specific aims: Aim 1: Enable comprehensive capture, deposition, and validation of metabolomics experimental data and metadata; Aim 2: Improve the FAIRness of Metabolomics Workbench while demonstrating effective multi-omics integration with the Genotype-Tissue Expression Project (GTEx). The major innovations that this proposal will develop are: i) effective metadata capture methods from unstructured formats, ii) advanced search methods for relevant MWbench studies that can filter on metadata quality, iii) effective harmonization methods for MWbench studies, iv) new omics integration approach to detect human gene-metabolite associations, and v) new tools that facilitate public deposition with high-quality metadata, with InChI tags, and in mwTab format for quicker, easier deposition. The significance of this proposal is in developing methods and tools that: a) comprehensively capture, validate, and deposit metadata-rich metabolomics data, b) improve the FAIRness of MWbench datasets, especially reuse, c) enable integration of MWbench and GTEx datasets to generate biomedically-relevant human gene-metabolite associations, and d) enable interpretation of gene-metabolite associations within molecular interaction networks. These new tools will enhance the utility and usage of Metabolomics Workbench while demonstrating multi-omics integration with the Genotype-Tissue Expression Project.
The Common Fund Knockout Mouse Phenotyping Program (KOMP2) is a valuable resource for functionally characterizing mammalian genes. We propose to increase the utility of KOMP2 by curating and annotating genomic information in the dataset and by collecting and curating human clinical data to match human patients to KOMP2 mice with severe phenotypes. The goal of this project is to assess pediatric patient cohorts with exome sequencing data and no molecular diagnosis for variants of uncertain significance in genes that correspond to a lethal phenotype in KOMP2 mouse mutant lines. Mouse lines categorized as cellular lethal, developmental lethal or subviable are targeted as relevant for early and severe pediatric phenotypes. For this reason, we will consider four human patient cohorts. The first cohort consists of patients who died within the first year of life. The second cohort consists of patients admitted to pediatric intensive care units (ICUs) within the first 100 days of life. The third cohort is a recent sample of pediatric patients with trio exome data available. The fourth cohort is a pediatric cohort with likely Mendelian disease genes of unknown function. Within each cohort we will identify variants of uncertain significance in human orthologues corresponding to mouse genes classified as cellular lethal, developmental lethal or subviable. Then, we will compare the mouse and human phenotypes using standardized phenotype terms to prioritize follow-up of genes that carry variants in our human cohorts and show similar phenotypes in mice and humans.
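The final prioritization step can be sketched, under strong simplifying assumptions, as a set-similarity ranking. The sketch assumes that patient and mouse phenotypes have already been mapped to one shared vocabulary of standardized terms, which in practice requires cross-species ontology mapping, and it uses plain Jaccard similarity rather than an ontology-aware semantic similarity. Gene names and phenotype terms are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity between two sets of phenotype terms."""
    terms_a, terms_b = set(terms_a), set(terms_b)
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def prioritize_genes(patient_terms, mouse_phenotypes):
    """Rank candidate genes by phenotype overlap with a patient.

    `mouse_phenotypes` maps a gene (the human orthologue of a knockout
    line) to the phenotype terms observed in that mouse line.
    """
    scored = [
        (gene, jaccard(patient_terms, terms))
        for gene, terms in mouse_phenotypes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical patient and mouse-line annotations in a shared vocabulary.
    patient = {"cardiac defect", "growth delay", "seizures"}
    mouse = {
        "GENE_A": {"cardiac defect", "growth delay"},
        "GENE_B": {"abnormal coat"},
    }
    print(prioritize_genes(patient, mouse))
```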
This project will develop novel computational methods to leverage diverse sources of data, including the rich information generated from the NIH Common Fund projects, for drug repurposing, which may dramatically lower the risk of drug development by skipping early-stage trials, shortening time investment, and reducing capital investment. With the advancement of high-throughput sequencing and massively parallel technologies, more and more omics data are available for biomedical research. These genomics, transcriptomics, proteomics, metabolomics and microbiomics data can help biomedical researchers better understand the complex biological systems underlying human diseases from different perspectives. For example, genome-wide association and sequencing studies have successfully identified tens of thousands of variants that are significantly associated with one or more complex traits. Despite these great successes, the results have not been fully translated into potential clinical value. The overall goal of this pilot project is to leverage the rich information generated from the NIH Common Fund projects, in combination with other public data sets, to explore the feasibility of drug repurposing through novel computational approaches. The ultimate goal of our project is to develop, implement, and apply a computational framework to integrate data from the Common Fund projects and other resources to identify potential uses of existing drugs for new indications, and we will also make our newly developed tools available to the general research community.
This will be accomplished through: [1] further development of a powerful framework proposed by our group to leverage cross-tissue information in the GTEx data to achieve higher accuracy in imputation of gene expression within each tissue and combine single-tissue association tests to derive a powerful test for gene-trait association using summary statistics from genome wide association studies; [2] development of a signature-matching-based drug repurposing framework with gene expression data from diverse sources (drug perturbation experiments, case control studies, and patient intervention studies) and GWAS summary statistics; and [3] implementation and application of the proposed framework to discover candidate drugs for repurposing to diseases in critical need of drug development, e.g. non-alcoholic steatohepatitis. With the completion of the pilot project, we will be able to assess the feasibility of the proposed framework for drug repurposing for further developments and implementations.
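Combining single-tissue association tests into one gene-level test can be illustrated with a small sketch. The summary above does not say which combination test is used, so the Cauchy combination test is shown here purely as a stand-in because it tolerates correlated tests; the function name and the example p-values are hypothetical.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Each p-value is transformed to a standard Cauchy variate; the weighted
    mean of those variates is converted back into a single p-value.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if weights is None:
        weights = [1.0] * len(p_values)
    total_weight = sum(weights)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for p, w in zip(p_values, weights)
    ) / total_weight
    # Upper tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    # Hypothetical p-values for one gene tested in three tissues.
    print(cauchy_combine([0.001, 0.8, 0.6]))
```

With equal weights a single strong tissue-level signal dominates the combined p-value, which suits settings where a gene is expected to act in only a few tissues.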
Over the past decade, thousands of genome-wide association studies (GWAS) have been performed, greatly improving our understanding of the genetic origins of complex diseases. A large number of variants have been associated with individual traits, but a complete understanding of complex disease remains elusive, due in large part to two unsolved challenges. First, a majority of associated variants are noncoding and distant from the nearest gene, complicating their interpretation. Second, the observed heritability of many complex diseases far exceeds the portion which can be explained by GWAS-discovered variants, largely because of the combined effects of rare variants unprobed by current techniques and common variants falling below significance thresholds of existing GWAS methods. As whole-genome sequencing and rare variant discovery become increasingly prevalent, frameworks for functionally annotating rare variants and associating them with disease-associated driver genes and pathways will become increasingly important. A wealth of public epigenetic data exists, including collections of chromatin modification profiles and 3D structure data from various Common Fund sources as well as external consortia. In combination with whole-genome sequencing data, these datasets offer great potential to further our understanding of diseases across the spectrum from Mendelian to complex diseases. As members of the ENCODE Project, we have developed the Registry of candidate cis-Regulatory Elements (cCREs), a collection of nearly a million candidate enhancers, promoters, and insulators in the human genome with activity profiles in more than 800 human cell types. In parallel, we collaborated with Prof. Xihong Lin on the development of variant-Set Test for Association using Annotation infoRmation (STAAR), a framework for performing rare-variant association tests using functional annotations and a dynamic weighting scheme. 
Here we aim to extend the Registry of cCREs to include gene regulatory networks, including gene-enhancer links, 3D chromatin neighborhoods, co-expressed gene networks, and biochemical pathways, drawing on data from the Common Fund, including GTEx and the 4DNucleome Project, and other public sources (Aim 1). We then aim to extend GWAS and the STAAR methodology to incorporate these higher-order features to identify novel gene regulatory network associations with disease-associated rare variants (Aim 2). In this study, we will focus on three human congenital disorders, cleft lip/palate (CLP), congenital diaphragmatic hernia (CDH), and ventricular septal defect (VSD), as these disorders have extensive whole-genome sequencing data from the Gabriella Miller Kids First Pediatric Research Consortium. We will validate our results using the Knockout Mouse Phenotyping Program (KOMP2). In summary, we will discover new disease-gene associations, produce a framework broadly applicable to existing and future whole-genome sequencing datasets, and improve the utility and accessibility of four select Common Fund datasets (GTEx, 4DNucleome, KOMP2, and Kids First).
Sex differences in human diseases are well-recognized, but the mechanisms are not well understood. This gap in knowledge delays progress in risk assessment and therapeutic strategies for sex-aware precision healthcare. While studies have shown significant sex differences in the genetic architectures of complex diseases, most investigators opted to do sex-combined analyses in disease genetic studies to maximize statistical power. NIH recently began to reinforce the inclusion of sex as a biological variable in the design, analysis, and reporting of vertebrate animal and human studies. Insights into the functional genetic bases of sex as a biological variable are critical to develop therapeutic interventions that equally benefit each sex. We recently found that ~1% of variants in the population have sex-biased allele frequency, including ~10% of disease variants in the Genome Aggregation Database (gnomAD). These variants preferentially occur in tissue-specific sex-differentially expressed genes. We propose a novel approach to study sex differences in disease genetic architectures by leveraging variants that are sex-biased either in allele frequency or phenotypic association. We believe this approach will increase the statistical power to identify sex-specific or sex-interacting causal variants in sex-biased diseases. We will identify and characterize sex-biased variants in gnomAD, Genotype-Tissue Expression project (GTEx) and Trans-Omics for Precision Medicine for sleep disordered breathing phenotypes and venous thromboembolism case-control datasets. We will subsequently study the functional mechanisms of these sex-biased variants in ~50 GTEx tissues. The completion of this pilot study will advance future genetic studies of sex-divergent disorders and accelerate the realization of sex-aware genomic medicine.
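One simple way to flag a variant with sex-biased allele frequency is a two-proportion z-test on allele counts in males and females. The summary above does not state which test is used, so this is a hedged sketch only: it assumes an autosomal variant, independent chromosomes, and counts large enough for the normal approximation. The function name and example counts are hypothetical.

```python
import math


def sex_bias_z_test(alt_male, total_male, alt_female, total_female):
    """Two-proportion z-test for a difference in allele frequency by sex.

    `alt_*` are alternate-allele counts and `total_*` are the numbers of
    chromosomes genotyped in each sex. Returns (z, two_sided_p_value).
    """
    freq_male = alt_male / total_male
    freq_female = alt_female / total_female
    pooled = (alt_male + alt_female) / (total_male + total_female)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / total_male + 1 / total_female)
    )
    if std_err == 0:
        # Allele absent or fixed in both sexes: no evidence of bias.
        return 0.0, 1.0
    z = (freq_male - freq_female) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 30% frequency in males, 20% in females.
    print(sex_bias_z_test(300, 1000, 200, 1000))
```

A genome-wide scan would apply such a test to every variant and then correct for multiple testing; sex-chromosome variants need separate handling because allele numbers differ by sex.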
We will work with the iHMP data resource to apply novel tools and data analysis methodologies to the challenge of disease association between large microbiome data sets, Inflammatory Bowel Disease, and the onset of diabetes. We will start with an annotation-free approach using k-mers to preprocess IBD and diabetes cohorts. We then will apply a novel scaling technology implemented in the sourmash software to reduce the data set size by a factor of 2000, rendering it tractable to machine learning approaches. We next will use random forests to determine a subset of predictive k-mers, and will measure their accuracy on validation data sets not used in the initial training. Finally, we will annotate the predictive k-mers using all available genome databases as well as a novel method to infer the metagenomic presence of accessory genomes of known genomes. Our outcomes will include a catalog of microbial genomes that correlate with IBD subtype and the onset of diabetes, as well as automated workflows to apply similar approaches to other data sets.
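The data reduction step can be illustrated with a toy version of hash-based k-mer subsampling: every k-mer is hashed, and only hashes below a fixed fraction of the hash space are kept, so a scaling factor of 2000 retains roughly one k-mer in 2000. This sketch does not call sourmash and is not its implementation; it uses SHA-256 from the standard library as a stand-in hash and ignores reverse complements. The function names are hypothetical.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash space used in this sketch


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash for illustration)."""
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def scaled_sketch(sequence, k=31, scaled=2000):
    """Keep only k-mer hashes in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    sketch = set()
    for i in range(len(sequence) - k + 1):
        value = kmer_hash(sequence[i:i + k])
        if value < threshold:
            sketch.add(value)
    return sketch


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Hypothetical random sequence standing in for metagenomic reads.
    seq = "".join(rng.choice("ACGT") for _ in range(5000))
    print(len(scaled_sketch(seq, k=21, scaled=1)),
          len(scaled_sketch(seq, k=21, scaled=10)))
```

Because the same k-mers are retained in every sample, the resulting sketches can be compared directly across cohorts and used as presence/absence features for a random forest classifier.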