LMRL2020 Accepted Posters
Visit posters live at our Gather.town
# | Room | Title | Authors | Abstract | Video |
---|---|---|---|---|---|
1 | C | Embeddings Allow GO Annotation Transfer Beyond Homology | Maria Littmann, Michael Heinzinger, Christian Dallago, Tobias Olenyi, Burkhard Rost | Understanding protein structure and function is crucial to advance molecular and medical biology. Yet, most protein features cannot be easily determined experimentally for all known protein sequences. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37±2%, 50±3%, and 57±2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with <20% pairwise sequence identity to the query, performance drops (Fmax BPO 33±2%, MFO 43±3%, CCO 53±2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, we hypothesize that the annotation transfer using embedding similarity could be applied to a variety of tasks in protein structure and function prediction. | |
2 | C | Profile Prediction: An Alignment-Based Pre-Training Task For Protein Sequence Models | Pascal Sturmfels, Jesse Vig, Ali Madani, Nazneen Fatema Rajani | Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks. However, the optimal pre-training strategy remains an open question. Instead of strictly borrowing from natural language processing (NLP) in the form of masked or autoregressive language modeling, we introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments. | |
3 | A | Topological Data Analysis of copy number alterations in cancer | Stefan Groha, Caroline Weis, Alexander Gusev, Bastian Rieck | Identifying subgroups and properties of cancer biopsy samples is a crucial step towards obtaining precise diagnoses and being able to perform personalized treatment of cancer patients. Recent data collections provide a comprehensive characterization of cancer cell data, including genetic data on copy number alterations (CNAs). We explore the potential to capture information contained in cancer genomic data using a novel topology-based approach that encodes each cancer sample as a persistence diagram of topological features, i.e., high-dimensional voids represented in the data. We find that this technique has the potential to extract meaningful low-dimensional representations in cancer somatic genetic data and demonstrate the viability of some applications on finding substructures in cancer data as well as comparing similarity of cancer types. | |
4 | A | A deep learning classifier for local ancestry inference | Matthew Aguirre, Jan Sokol, Guhan Venkataraman, Alexander Ioannidis | Local ancestry inference (LAI) identifies the ancestry of each segment of an individual's genome and is an important step in medical and population genetic studies of diverse cohorts. Several techniques have been used for LAI, including Hidden Markov Models and Random Forests. Here, we formulate the LAI task as an image segmentation problem and develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture. We train our model using complete genome sequences from 982 unadmixed individuals from each of five continental ancestry groups, and we evaluate it using simulated admixed data derived from an additional 279 individuals selected from the same populations. We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix. | |
5 | A | Protein Structural Alignments From Sequence | James T. Morton, Charlie E. M. Strauss, Robert Blackwell, Daniel Berenberg, Vladimir Gligorijevic, Richard Bonneau | Computing sequence similarity is a fundamental task in biology, with alignment forming the basis for the annotation of genes and genomes and providing the core data structures for evolutionary analysis. Standard approaches are a mainstay of modern molecular biology and rely on variations of edit distance to obtain explicit alignments between pairs of biological sequences. However, sequence alignment algorithms struggle with remote homology tasks and cannot identify similarities between many pairs of proteins with similar structures and likely homology. Recent work suggests that using machine learning language models can improve remote homology detection. To this end, we introduce DeepBLAST, which obtains explicit alignments from residue embeddings learned from a protein language model integrated into an end-to-end differentiable alignment framework. This approach can be accelerated on GPU architectures and outperforms conventional sequence alignment techniques in terms of both speed and accuracy when identifying structurally similar proteins. | Video |
6 | A | Feature Controlled Variational Autoencoder for Single Cell Image Analysis | Luke Ternes, Joe Gray, Laura Heiser, Young Hwan Chang | Recent advances in high-throughput imaging screening technologies enable quantification of phenotypic differences among diverse cell populations; image-based cell profiling relies on encoded representations of relevant properties of imaged cells as quantitative measurements and features to characterize phenotypic differences. However, defining suitable metrics can be challenging since there are many obstacles to overcome, including segmentation and identification of cellular or subcellular compartments for feature extraction. Recent variational autoencoder (VAE) approaches produced encouraging results by mapping from an image to a descriptor that represents essential information and outperforming classical methods for analyses of hand-crafted features such as morphology, intensity, and texture. Although VAE approaches show promising results for capturing morphological features and cellular organization in tissue, single cell analyses based on VAEs often fail to identify biologically informative features due to the intrinsic amount of uninformative image variability. In order to fully extract and quantify complex, biologically meaningful features from single cell images, VAEs need to be modified to overcome this uninformative image variability. Herein, we propose a feature controlled VAE for single cell analysis by using transformed images as a self-supervised signal to remove the uninformative features such as cellular rotation and polarity and learn biologically meaningful representations. We show this improves downstream analysis of single cell images, making distinct cell populations separable and capturing phenotypic differences between cells. | Video |
7 | A | Auxiliary task evaluations to learn meaningful representations from electronic health records | Constantin Schneider, Eirini Arvaniti, Aylin Cakiroglu, Hamish Tomlinson | With the digitisation of healthcare information in electronic health records (EHRs), large volumes of patient-level data are being produced. These longitudinal and multimodal healthcare datasets may provide insights for improved care and drug development, and are a target of several recent machine learning algorithms. In particular, unsupervised representation learning algorithms such as latent variable models (LVMs) may enable the identification of disease subgroups for precision medicine. LVMs trained on EHR data aim to generate a latent space that is representative of the underlying biology of patient health states. However, it is difficult to evaluate how meaningful the learned patient representations are and this challenge remains relatively under-explored. Existing studies often reduce model evaluation to a single auxiliary classification task, such as prediction of readmission within one year. In this work, instead of focussing on the performance of a single auxiliary task, we propose multiple complementary tasks that reveal different aspects of the latent space and capture the diversity of characteristics that we hope to obtain. | Video |
8 | A | Machine learning guided association of adverse drug reactions with in vitro target-based pharmacology | Robert Ietswaart | Background: Adverse drug reactions (ADRs) are one of the leading causes of morbidity and mortality in health care. Understanding which drug targets are linked to ADRs can lead to the development of safer medicines. Methods: Here, we analyse in vitro secondary pharmacology of common (off) targets for 2134 marketed drugs. To associate these drugs with human ADRs, we utilized FDA Adverse Event Reports and developed random forest models that predict ADR occurrences from in vitro pharmacological profiles. Findings: By evaluating Gini importance scores of model features, we identify 221 target-ADR associations, which co-occur in PubMed abstracts to a greater extent than expected by chance. Amongst these are established relations, such as the association of in vitro hERG binding with cardiac arrhythmias, which further validate our machine learning approach. Evidence on bile acid metabolism supports our identification of associations between the Bile Salt Export Pump and renal, thyroid, lipid metabolism, respiratory tract and central nervous system disorders. Unexpectedly, our model suggests PDE3 is associated with 40 ADRs. Interpretation: These associations provide a comprehensive resource to support drug development and human biology studies. | |
9 | A | Factorized linear discriminant analysis for phenotype-guided representation learning of gene expression data | Mu Qiao, Markus Meister | We introduce factorized linear discriminant analysis (FLDA), a novel supervised linear dimensionality reduction method for understanding the relationship between high-dimensional gene expression patterns and cellular phenotypes. We leverage FLDA with a sparsity-based regularization algorithm, which constrains the number of genes contributing to each linear projection. The sparse algorithm allows us to identify new genes for each of the phenotypic features that were not apparent under conventional methods. We illustrate this approach by applying FLDA to a single-cell transcriptome dataset of T4/T5 neurons in Drosophila, focusing on two phenotypes: dendritic location and axonal lamination. The analysis confirms results obtained by conventional methods but also points to new genes related to the phenotypes. FLDA is motivated by multi-way analysis of variance (ANOVA), and thus it generalizes easily to more than two features. The linear nature of FLDA makes it extremely easy to interpret the low-dimensional representations, as the weight vector directly informs the relative importance of each gene. | Video |
10 | A | Multiscale PHATE Exploration of SARS-CoV-2 Data Reveals Signature of Disease | Manik Kuchroo, Jessie Huang, Patrick Wong, Guy Wolf, Akiko Iwasaki and Smita Krishnaswamy | The biomedical community is producing increasingly high dimensional datasets integrated from hundreds of patient samples that current computational techniques are unable to explore. We propose a novel approach, called Multiscale PHATE, which learns increasingly abstracted features from the data to produce high level summarizations and detailed representations of dataset subsets in a computationally efficient manner. Multiscale PHATE utilizes a continuous data coarse graining approach called diffusion condensation to create a tree containing all levels of data granularity, and then selects layers for visualization that start from a coarse grained summary, and then zoom in to reveal more detail. We apply this computational approach to study the evolution of the patient immune response to SARS-CoV-2 infection in 22 million cells measured via flow cytometry. Through our analysis of patient samples, we identify a pathologic non-activated neutrophil response enriched in the most severely ill patients. | Video |
11 | A | Cloud-based Software for NGS Data Management and Analysis for Directed Evolution of Peptide-Based Delivery Vectors | Umesh Padia, David Brown, Xiaozhe Ding, Xinhong Chen, Sripriya R. Kumar, Viviana Gradinaru | Adeno-associated viruses (AAVs) are widely used gene delivery vectors due to their ability to transduce dividing and non-dividing cells, their long-term persistence, and low immunogenicity. However, natural AAV serotypes have a limited set of tropisms. Directed evolution has been used to engineer recombinant AAVs to target specific cell types and tissues, leveraging next generation sequencing (NGS) data. The deluge of data from these deep sequencing experiments has brought about data management and analysis challenges, for which there are no current commercially available solutions. Furthermore, classical approaches to analyzing data from directed evolution heavily involve manual inspection, and often overlook patterns present in the larger datasets. To address these challenges, we developed robust cloud-based software that provides central management for next generation sequencing data, extracts variants, performs structural modeling, and can be extended to incorporate machine learning models to make predictions for variants with specific properties. The software is composed of a set of interconnected discrete components: a modern web user interface implemented in JavaScript with React, a relational database, a distributed task queue, task workers, and a Python-based API. This architecture facilitates intensive tasks such as alignments, structural modeling, and machine learning to scale from a single machine to hundreds of machines, with minimal configuration. The software automatically imports and manages sequencing data from several different commercial and in-house sequencing providers. The extracted variants are encoded into embeddings, grouped into families, and are analyzed for prevalent sequence motifs. We use the Rosetta software libraries to perform comparative modeling simulations on selected variants. This software represents a general tool for simple, scalable, and centralized analyses of NGS data for protein engineering by directed evolution. | |
12 | A | Characterizing Electrocardiogram Signals using Capsule Networks | Hirunima Jayasekara, Vinoj Jayasundara, Mohamed Athif, Jathushan Rajasegaran, Sandaru Jayasekara, Suranga Seneviratne, Ranga Rodrigo | Capsule networks excel in understanding spatial relationships in 2D data for vision-related tasks. TimeCaps is a capsule network designed to capture temporal relationships in 1D signals. In TimeCaps, we generate capsules along the temporal and channel dimensions, creating two feature detectors that learn contrasting relationships, prior to projecting the input signal into a concise latent representation. We demonstrate the performance of TimeCaps in a variety of 1D signal processing tasks including characterisation, classification, decomposition, compression and reconstruction utilizing instantiation parameters inherently learnt by the capsule networks. TimeCaps surpasses the state-of-the-art results by achieving an accuracy of 96.96% on classifying 13 Electrocardiogram (ECG) signal beat categories. | Video |
13 | A | A Cross-Level Information Transmission Network for Predicting Phenotype from New Genotype: Application to Cancer Precision Medicine | Di He, Lei Xie | An unsolved fundamental problem in biology and ecology is to predict observable traits (phenotypes) from a new genetic constitution (genotype) of an organism under environmental perturbations (e.g., drug treatment). The emergence of multiple omics data provides new opportunities but imposes great challenges in the predictive modeling of genotype-phenotype associations. Firstly, the high-dimensionality of genomics data and the lack of labeled data often make the existing supervised learning techniques less successful. Secondly, it is a challenging task to integrate heterogeneous omics data from different resources. Finally, the information transmission from DNA to phenotype involves multiple intermediate levels of RNA, protein, metabolite, etc. The higher-level features (e.g., gene expression) usually have stronger discriminative power than the lower level features (e.g., somatic mutation). To address the above issues, we proposed a novel Cross-LEvel Information Transmission network (CLEIT) framework. CLEIT aims to explicitly model the asymmetrical multi-level organization of the biological system. Inspired by domain adaptation, CLEIT first learns the latent representation of the high-level domain then uses it as ground-truth embedding to improve the representation learning of the low-level domain in the form of contrastive loss. In addition, we adopt a pre-training-fine-tuning approach to leveraging the unlabeled heterogeneous omics data to improve the generalizability of CLEIT. We demonstrate the effectiveness and performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations via the assistance of gene expressions when compared with state-of-the-art methods. | |
14 | A | Learning a low dimensional manifold of real cancer tissue with PathologyGAN | Adalberto Claudio Quiros, Roderick Murray-Smith, Ke Yuan | Deep generative models with representation learning properties provide an alternative path to further understand cancer tissue phenotypes, capturing tissue morphologies. We present a deep generative model that learns to simulate high-fidelity cancer tissue images while mapping the real images onto an interpretable low dimensional latent space. The key to the model is an encoder trained by a previously developed generative adversarial network, PathologyGAN. Here we provide examples of how the latent space holds morphological characteristics of cancer tissue (e.g. tissue type or cancer, lymphocytes, and stromal cells). We tested the general applicability of our representations in three different settings: latent space visualization, training a tissue type classifier over latent representations, and on multiple instance learning (MIL). Our results show that PathologyGAN captures distinct phenotype characteristics, paving the way for further understanding of tumor micro-environment and ultimately refining histopathological classification for diagnosis and treatment. | |
15 | A | Deep Generative Models of Protein Domain Structures Can Uncover Distant Relationships: Evidence for an Urfold | Eli J. Draizen, Menuka Jaiswal, Saad Saleem, Yonghyeon Kweon, Stella Veretnik, Cameron Mura, and Philip E. Bourne | Recent advances in protein structure determination and prediction offer new opportunities to decipher relationships amongst proteins—a task that entails 3D structure comparison and classification. Historically, protein domain classification has been somewhat manual and heuristic. While CATH and related resources represent significant steps towards a more systematic and automatable approach, more scalable and objective classification methods will enable a fuller exploration of protein structure or ‘fold’ space. Comparative analyses of protein structure latent spaces may uncover distant relationships, and will potentially entail a large-scale restructuring of traditional classification schemes. We have developed 3D convolutional variational autoencoders to ‘define’ ideal geometries and biophysical properties of proteins at CATH’s homologous superfamily (SF) level. To quantitatively evaluate pairwise ‘distances’ between SFs, we built one model per SF and compared the evidence lower bound (ELBO) loss functions of the models when evaluated with different SF structure representatives. Clustering on these distance matrices provides a new view of protein interrelationships—a view that extends beyond simple structural/geometric similarity, towards the realm of structure/function properties, and that is consistent with a recently proposed ‘Urfold’ concept. | |
16 | A | Learning the regulatory grammar of DNA for gene expression engineering | Jan Zrimec, Aleksej Zelezniak | The DNA regulatory code that governs gene expression is present in the gene regulatory structure that spans the coding and adjacent non-coding regulatory DNA regions, including promoters, terminators and untranslated regions. Deciphering this regulatory code, as well as how the whole gene regulatory structure interacts to produce mRNA transcripts and regulate mRNA abundance, can greatly improve our capabilities for controlling gene expression and solving problems related to both medicine and biotechnology. Here, we consider that natural systems offer the most accurate information on gene expression regulation and apply deep learning on over 20,000 mRNA datasets to learn the DNA-encoded regulatory code across a variety of model organisms from bacteria to human (https://doi.org/10.1038/s41467-020-19921-4). Since up to 82% of the regulatory code is encoded in the gene regulatory structure, mRNA abundance can be predicted directly from DNA with high accuracy in all model organisms. Coding and regulatory regions in fact carry both overlapping and orthogonal information and additively contribute to gene expression levels. By mining the gene expression models for the relevant DNA regulatory motifs, we uncover motif interactions across the whole gene regulatory structure that define over 3 orders of magnitude of gene expression levels. Based on these findings we develop a novel AI-guided approach for protein expression engineering and experimentally verify its usefulness. Our results challenge the current paradigm that single motifs or regulatory regions are solely responsible for gene expression levels. Instead, we demonstrate that the whole gene regulatory structure, comprising the DNA regulatory grammar of interacting DNA motifs across protein coding and adjacent regulatory regions, forms a coevolved transcriptional regulatory unit and provides a mechanism by which whole gene systems with pre-specified expression patterns can be designed. | Video |
17 | B | Constructing Data-Driven Network Models of Cell Dynamics with Perturbation Experiments | Bo Yuan, Ciyue Shen, Chris Sander | Systematic perturbation of cells followed by comprehensive measurements of molecular and phenotypic responses provides informative data resources for constructing computational models of cell biology. Existing machine learning models of cell dynamics have limited effectiveness to find global optima in a high-dimensional space and/or lack interpretability in terms of cellular mechanisms. Here we introduce a hybrid approach that combines explicit protein-protein and protein-phenotype interaction models of cell dynamics with automatic differentiation. We tested the modeling framework on a perturbation-response dataset of a melanoma cell line with drug treatments. These machine learning models can be efficiently trained to describe cellular behavior with good accuracy but also can provide direct mechanistic interpretation. The predictions and inference of interactions are robust against simulated experimental noise. The approach is readily applicable to a broad range of kinetic models of cell biology and provides encouragement for the collection of large-scale perturbation-response datasets. | |
18 | B | XGFix: Fast and Scalable Phasing Error Correction with Local Ancestry Inference | Helgi Hilmarsson, Arvind Kumar, Richa Rastogi, Daniel Mas Montserrat, Carlos Bustamante, Alexander Ioannidis | Accurate phasing of genomic data is crucial for human demographic modeling and identity-by-descent analyses. It has been shown that leveraging information about an individual’s genomic ancestry improves performance of current phasing algorithms. We introduce XGFix, an XGBoost-Local Ancestry based model, to do exactly that. We show that it performs similarly to the state of the art methods while being much faster, scalable in number of ancestries and agnostic to priors that are commonly relied on by other methods. | |
19 | B | Meta Learning for Ancestry Inference | Richa Rastogi, Arvind Kumar, Helgi Hilmarsson, Daniel Mas Montserrat, Carlos D. Bustamante, and Alexander G. Ioannidis | Models for predicting genetic traits and disease associations from an individual's genome are dependent upon the genetic correlation structure of the genome, which varies between different ancestries. However, the majority of the world’s populations are severely under-represented in genomic datasets with over 69% of populations having less than 15 individuals sequenced in public worldwide datasets. This high imbalance leads to poor accuracy estimates with standard ancestry based models. We present a Meta Learning approach for local ancestry inference, that is fast, simple and accurate for the low data regime. We demonstrate by performing a 2-shot, 3-way continental classification that the method generalizes well to unseen classes and is class-agnostic, avoiding the need to retrain to adapt for new classes. | Video |
20 | B | Ge2Net: Genomic Geographical Network, Enabling genetic medicine for diverse populations by inferring geographic ancestry along the genome | Richa Rastogi, Arvind Kumar, Helgi Hilmarsson, Daniel Mas Montserrat, Carlos D. Bustamante, and Alexander G. Ioannidis | Personalized genomic predictions have become increasingly important for genomic medicine enabled by Genome-Wide Association Studies (GWAS). Local Ancestry Inference (LAI) methods provide important co-variates for GWAS studies by labeling each region of the genome with an ancestry category. However, these ancestry categories are divided somewhat artificially, neglecting all variability within the class and ignoring the continuum of human variation that exists geographically between the discrete classes. Here we introduce Ge2Net: GEnomic GEographical Network, the first LAI method to identify ancestral origin of each segment of an individual’s genome as a continuous geographical coordinate, rather than an ethnic category. Ge2Net is a neural network model based on bi-directional LSTMs that yields high resolution ancestry inference, eliminating the need for ethnic labels. | Video |
21 | B | ATOM3D: Tasks on Molecules in Three Dimensions | Raphael J. L. Townshend, Martin Vögele, Patricia Suriana, Alexander Derry, Alex Powers, Yianni Laloudakis, Sidhika Balachandar, Brandon Anderson, Stephan Eismann, Risi Kondor, Russ Altman, Ron Dror | We present ATOM3D, a collection of both novel and existing datasets spanning several key classes of biomolecules, to systematically assess learning methods that work on 3D molecular structure. We implement prototypical three-dimensional models for each of these tasks, finding that they consistently improve performance relative to one- and two-dimensional methods. However, the specific choice of architecture proves to be critical for performance. | |
22 | B | bioembeddings.com: protein feature extraction, visualization and prediction | Christian Dallago & Konstantin Schütze | The descriptive ability of protein-based machine learning models is increasingly leveraged to guide experimental decision making. Protein Language Models (LMs) show enormous potential in generating descriptive features for proteins from just their sequences at a fraction of the time of evolutionary approaches. In general, protein LMs offer to convert amino acid sequences into embeddings (vector representations) that can be used in machine learning pipelines for supervised predictions of function or structure. Protein embeddings can also be used in combination with dimensionality reduction techniques (e.g. UMAP) to quickly span and visualize protein spaces (e.g. via scatter plots). The bio_embeddings pipeline offers an interface to simply and quickly embed large protein sets using protein LMs, to project the embeddings in lower dimensional spaces, and to visualize proteins in these spaces on interactive 3D scatter plots. | Video |
23 | B | Compressed sensing, random walks and gene tests | Pratik Worah | The problem of finding a critical set of genes that accurately distinguish between two sets of samples, say sequencing data from two different disease populations, occurs frequently in many areas of medicine. Here we describe an algorithm, based upon generalizing the compressed sensing algorithm, that finds a small set of distinguishing critical genes. | |
24 | B | Distributed Deep Learning for a High-Dimensionality Genome-Wide Association Study | Aniello Esposito and Diana Moise | A genome-wide association study is a successful approach for investigating relations between a set of genetic variants in different humans and a particular disease. The number of genomic mutations as well as the set of patients included in a study are typically very large and raise the question of how to render a GWAS feasible. We present the infrastructure of a distributed Deep Learning approach to a GWAS based on Keras and TensorFlow as well as various methods for data distribution, filtering, and optimization such as deep autoencoders. The framework is found to be scalable up to several hundred nodes on a Cray XC system without evident impediments to scale further. Experiments reveal a search direction to improve accuracy in future work which consists of improving the autoencoder to further reduce the feature space and more sophisticated hyperparameter optimization as well as the use of MENNDL. | |
25 | B | A multi-view generative model for molecular representation improves prediction tasks | Jonathan Yin, Hattie Chung, Aviv Regev | Unsupervised generative models have been a popular approach to representing molecules. These models extract salient molecular features to create compact vectors that can be used for downstream prediction tasks. However, current generative models for molecules rely mostly on structural features and do not fully capture global biochemical features. Here, we propose a multi-view generative model that integrates low-level structural features with global chemical properties to create a more holistic molecular representation. In proof-of-concept analyses, compared to purely structural latent representations, multi-view latent representations improve model accuracy on various tasks when used as input to feed-forward prediction networks. For some tasks, simple models trained on multi-view representations perform comparably to more complex supervised methods. Multi-view representations are an attractive method to improve representations in an unsupervised manner, and could be useful for prediction tasks, particularly in contexts where data is limited. | |
26 | B | Learning to localize mutation in lung adenocarcinoma histopathology images | Sahar Shahamatdar, Daryoush Saeed-Vafa, Drew Linsley, Lester Li, Sohini Ramachandran, Thomas Serre | Molecular profiling of cancers is necessary to identify the optimal therapeutic options for patients. However, these assays have notable limitations: they are time- and resource-intensive to perform, and they cannot accurately capture mutational heterogeneity. Here, we present a novel approach to address these issues, which combines state-of-the-art attention models from computer vision with precise laser capture microdissection (LCM) molecular annotations on whole slide images (WSIs) of tumor tissue. We apply our model to 221 WSIs of lung adenocarcinoma to detect mutations in KRAS, the most commonly mutated gene in lung cancer. Our model is significantly more accurate at detecting and localizing KRAS mutations in WSIs than current leading approaches, which, in contrast to our approach, are trained on individual image patches extracted from a WSI. We further show that LCM data are the critical data source for developing computer vision models that can contribute to precision medicine, as model performance monotonically improves with additional LCM data – but not additional WSIs. Our work demonstrates that computer vision is capable of rapid and interpretable mutation detection in WSIs provided that the appropriate model architecture is optimized with spatially-resolved datasets. | |
27 | B | DenseHMM: Learning HMMs by Learning Dense Representations | Joachim Sicking, Maximilian Pintz, Maram Akila, Tim Wirtz | We propose DenseHMM – a modification of Hidden Markov Models (HMMs) that allows to learn dense representations of both the hidden states and the observables. Our approach enables constraint-free, gradient-based optimization and comes empirically without loss of performance compared to standard HMMs. We show that the non-linearity of the kernelization is crucial for the expressiveness of the learned representations. | |
28 | B | Fast Diffusion Optimal Transport for Manifold-of-Manifold Embeddings | Alexander Tong, Manik Kuchroo, Guillaume Huguet, Ronald Coifman, Guy Wolf, Smita Krishnaswamy | High-throughput biomedical data is now being generated massively in parallel in different conditions or patients. However, there are few systematic methods for organizing a large collection of datasets rather than data points and for gaining insight from such organization. Here we propose a manifold-based Wasserstein distance to learn and embed the manifold of samples. Our method, based on graph diffusions, is up to 50 times faster than commonly used entropic regularized algorithms. We apply this to organize single-cell datasets arising from CRISPR perturbations. | |
29 | B | Towards a general framework for spatio-temporal transcriptomics | Julie Pinol, Thierry Artières, Paul Villoutreix | Position and dynamics of cells are essential pieces of information for the study of embryonic development. Unfortunately, this information is lost in many cell gene expression analysis processes, such as single cell RNA sequencing. To predict the physical positions and the temporal dynamics of cells from gene expression data, we propose an extension of the framework proposed by Nitzan et al. including a temporal regularization for the inference of the optimal transport plan. This new framework is tested on artificial data using a combination of the Sinkhorn algorithm and gradient descent. | Video |
30 | B | Predicting unobserved cell states from disentangled representations of single-cell data using generative adversarial networks | Hengshi Yu, Joshua D. Welch | Deep generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs) have achieved remarkable successes in generating and manipulating high-dimensional images. VAEs excel at learning disentangled image representations, while GANs excel at generating realistic images. Here, we systematically assess disentanglement and generation performance on single-cell gene expression data and find that these strengths and weaknesses of VAEs and GANs apply to single-cell gene expression data in a similar way. We also develop DRGAN, a novel neural network that combines the strengths of VAEs and GANs to sample from disentangled representations without sacrificing data generation quality. We learn disentangled representations of two large single-cell RNA-seq datasets and use DRGAN to sample from these representations. DRGAN allows us to manipulate semantically distinct aspects of cellular identity and predict single-cell gene expression response to drug treatment. | Video |
31 | B | Reconstruction of genes, evolutionary trajectories, and fitness from short sequencing reads of laboratory evolution experiments using machine learning | Max Shen | Directed evolution can generate proteins with novel, tailor-made activities. However, full-length genotypes, their frequencies and inferred fitnesses, and epistatic relationships among evolved mutations are difficult to obtain for gene-length biomolecules using most high-throughput DNA sequencing methods. This difficulty arises because short sequencing reads from mixed populations of genotypes lose associations between distant mutations within individual genes or haplotypes. Here, we present Evoracle, a machine learning method that accurately reconstructs full-length genotypes (R2 = 0.94) and fitness using short-read sequencing data from directed evolution experiments, with substantial performance improvements over related methods (R2 < 0.32). We validate Evoracle on campaigns from three directed evolution platforms including phage-assisted continuous evolution (PACE) of insecticidal proteins, phage-assisted non-continuous evolution (PANCE) of adenine base editors, and OrthoRep evolution of drug-resistant enzymes. Evoracle retains strong performance (R2 = 0.86) on data with complete loss of physical linkage between neighboring nucleotides and substantial measurement noise such as inexpensive pooled Sanger sequencing data (~$10 per timepoint), and broadens the accessibility of training supervised machine learning models on the fitnesses of full-length genotypes. Evoracle can also identify variants with higher fitness and activity than typical inspection methods using short reads, including low-frequency ‘rising star’ variants well before they can be identified using consensus mutations. | |
32 | B | Deconvolution of bulk genomics data using single-cell measurements via neural networks | Anastasiya Belyaeva, Caroline Uhler | In order to characterize tissues, bulk genomics data is often collected. Since such data represents an average over thousands of cells, information about cell subpopulations that make up the bulk sample is lost. Single-cell data can be collected to alleviate this problem, however for many modalities single-cell methods are still underdeveloped. We present a novel computational approach for deconvolving bulk genomics data, in particular bulk histone marks, into histone profiles of cell sub-types that compose the bulk sample using a neural network. For deconvolution we rely on single-cell data of modalities that are widely available (e.g. single-cell ATAC-seq). We apply our method to the deconvolution of H3K9ac and H3K4me1 histone modifications from bulk into H3K9ac and H3K4me1 modifications of T-cells and monocytes, cell types that make up the bulk mixture. We show that our method recovers the true measurements with high accuracy. Our framework enables obtaining cell type specific predictions for assays that are expensive or difficult to collect at the single-cell level but can be measured in bulk. | |
33 | C | Inferring Happiness From Video for Better Insight Into Emotional Processing in the Brain | Emil Azadian (Technical University of Berlin) <emil.azadian@gmail.com>, Gautham Velchuru (Microsoft) <gvelchuru@gmail.com>, Nancy X.R. Wang (IBM Research - Almaden) <wangnxr@ibm.com>, Steven M. Peterson (University of Washington) <stepeter@uw.edu>, Valentina Staneva (University of Washington) <vms16@uw.edu>, Bingni Brunton (University of Washington) <bbrunton@uw.edu> | Gaining a better understanding of which brain regions are responsible for emotional processing is crucial for the development of novel treatments for neuropsychiatric disorders. Current approaches rely on sparse assessments of subjects' emotional states, rarely reaching more than a hundred per patient. Additionally, data are usually obtained in a task solving scenario, possibly influencing their emotions by study design. Here, we utilize several days' worth of near-continuous neural and video recordings of subjects in a naturalistic environment to predict the emotional state of happiness from neural data. We are able to obtain high-frequency and high-volume happiness labels for this task by predicting happiness from video data in an intermediary step, achieving good results ($F1=0.75$) and providing us with more than 6 million happiness assessments per patient, on average. We then utilize these labels for a classifier on neural data ($F1=0.71$). Our findings can provide a potential pathway for future work on emotional processing that circumvents the mentioned restrictions. | Video |
34 | C | Generative Models for measuring pH using CEST MRI | Ilknur Icke, Raquel Sevilla, Gain Robinson, Corin Miller | Non-invasive measurement of pH provides multiple potential benefits for identification of disease states such as in cancer, inflammation, hypoxia, and tissue injury. In oncology, non-invasive pH measurements may help identify proper chemotherapeutics, characterize tumors that are more likely to either respond to therapy or metastasize, and may also help better assess treatment response. AcidoCEST MRI techniques have been developed over the recent years to perform tumor pH measurements by utilizing a contrast agent for which chemical exchange with tissue water depends on the pH of the local microenvironment. Quantitative analysis of the CEST MRI signals is generally done via modeling of the Bloch-McConnell equations by incorporating chemical exchange as a parameter, or by fitting Lorentzian line shapes to observed Z-spectra and then computing a log ratio of the CEST effects from multiple labile protons within the same molecule (ratiometric method). Modeling using Bloch-McConnell equations requires careful inclusion of many scan parameters to infer pH, while the ratiometric method requires contrast agents with multiple labile protons, thus making it unsuitable for molecules with a single labile proton. Furthermore, depending on the pH, sometimes it might not be possible to accurately calculate the ratio due to the low signal to noise ratio (SNR) of the CEST signal for certain labile protons. To overcome these limitations, we developed a generative machine learning algorithm to predict in vivo tumor pH where the observed Z-spectra after contrast agent infusion is modeled as a perturbation on the observed Z-spectra before the contrast agent was introduced. In this scheme, perturbation is modulated by the contrast agent. A sub-network of the architecture is pre-trained on a separate phantom dataset where temperature and contrast agent concentrations were controlled prior to CEST MRI scanning and subsequently used as prior knowledge that modulates the perturbation caused by the infusion of the contrast agent into the body. Our results on an experimental study using animal models (mice) indicate that our machine learning method provides a more general and accurate prediction of pH in comparison to the ratiometric method. Our method is also more general in the sense that it does not require explicit modeling of signal peaks that are dependent on the type of contrast agent. Finally, this framework can be extended into use-cases where new probing materials and sensing technologies can be designed and tested in order to better understand the biological entities being studied. | |
35 | C | Estimating Latent Space Uncertainty to Improve Integration of Multi-Omic Sequencing Data | Ifrah Tariq, Matthew Leventhal, Diep Nguyen | Integrating sequencing data at the genomic, transcriptomic, and proteomic levels can provide systemic insights into the processes involved in disease. Autoencoders have been previously used to integrate biological data, and have the advantage of mapping complex, high-dimensional sequencing data to lower-dimensional latent space representations while retaining non-linear relationships from the original inputs. It is challenging, however, to quantify how well the latent space represents the original distribution. Indeed, it has been observed that if the decoder is sufficiently powerful to model the input, then the latent variables may be ignored without impacting the reconstruction loss. When this occurs, the model produces uninformative latent variables, leading to posterior collapse and reducing the fidelity of the output. As such, it is important to have a metric for calculating the exact trade-off in representation uncertainty across model choices. Here we introduce VAE-D, a variational autoencoder with an additional disentanglement-based uncertainty calculation in its loss function. Our model performs comparably to both standard and classification-based VAEs on our prediction task in both single- and multi-omic settings. However, the pathways identified by the VAE-D latent representation align with the ground truth pathways from our simulated data better than those of a standard VAE. This uncertainty calculation also has diagnostic benefits, such as identifying cases of overfitting and model overcapacity. | Video |
36 | C | PersGNN: Topological and Geometric Deep Learning for Protein Function Prediction | Nicolas Swenson, Aditi S. Krishnapriyan, Aydin Buluc, Dmitriy Morozov, & Katherine Yelick | Understanding protein structure-function relationships is a key challenge in computational biology, with applications across the biotechnology and pharmaceutical industries. While it is known that protein structure directly impacts protein function, many functional prediction tasks use only protein sequence. In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank in order to study the expressiveness of different structure-based prediction schemes. We present PersGNN—an end-to-end trainable, deep geometric representation learning model that combines graph representation learning with topological data analysis to capture a complex set of both local and global structural features. While variations of these techniques have been successfully applied to proteins before, we demonstrate that our hybridized approach, PersGNN, outperforms either method on its own as well as a baseline neural network that learns from the same information. PersGNN achieves a 9.3% boost in area under the precision recall curve (AUPR) compared to the best individual model, as well as high F1 scores across different gene ontology categories, indicating the transferability of this approach. | Video |
37 | C | Machine learning enables advanced affinity selection of peptide binders | Somesh Mohapatra, Joseph Brown, Anthony Quartararo, Bradley L. Pentelute, Rafael Gómez-Bombarelli | Affinity selection is a widely used technique for discovery of peptide binders to target proteins from large libraries. However, simultaneous identification of non-specific binders plagues discovery efforts, requiring biophysical examination of each individual result from selection. On the computational front, due to lack of sufficient data on protein 3D structures and limitations of computational power for training 3D-models, accurate predictions regarding protein-peptide interactions have not been possible. Sequence-based models trained over one-hot encodings have been reported; however, they do not capture the diverse chemical space, and do not extend to non-canonical amino acids. Further, owing to the architecture of these models trained on specific binders alone, they cannot be used to identify non-specific binders and non-binders. Here, we report an unsupervised-supervised machine learning approach based on topological representation of amino acids to advance classification of peptide binders obtained from experimental library screenings. We used data obtained from affinity selection of canonical L-peptide libraries against 12ca5 for the purpose of this study. | |
38 | C | Transfer learning framework for cell segmentation with incorporation of geometric features | Yinuo Jin, Alexandre Toberoff, Elham Azizi | We propose a novel transfer learning cell segmentation framework incorporating shape-aware features in a deep learning model, with multi-level watershed and morphological post-processing steps. Our results show that incorporation of geometric features improves generalizability to segment cells in in situ tissue images, using solely in vitro images as training data. | |
39 | C | MonarchNet: Differentiating Monarch Butterflies from Butterflies with Similar Phenotypes | Thomas Y. Chen | In recent years, the monarch butterfly's iconic migration patterns have come under threat from a number of factors, from climate change to pesticide use. To track trends in their populations, scientists as well as citizen scientists must identify individuals accurately. This is key because there exist other species of butterfly, such as viceroy butterflies, that are "look-alikes," having similar phenotypes. To tackle this problem and to aid in more efficient identification, we present MonarchNet, the first comprehensive dataset consisting of butterfly imagery for monarchs and five look-alike species. We train a baseline deep-learning classification model to serve as a tool for differentiating monarch butterflies and its various look-alikes. We seek to contribute to the study of biodiversity and butterfly ecology by providing a novel method for computational classification of these particular butterfly species. The ultimate aim is to help scientists track monarch butterfly population and migration trends in the most precise manner possible. | Video |
40 | C | KEVOLVE: a combination of genetic algorithm and ensemble method to classify viruses with variability | Dylan Lebatteux, Amine M. Remita, and Abdoulaye Baniré Diallo | KEVOLVE is a method comprising a genetic algorithm coupled with machine learning that allows the extraction of a bag of minimal subsets of features. The extracted feature subsets are then used to build a support vector machine-based ensemble prediction model. We assessed the algorithm on HIV datasets represented by features built from k-mer occurrences. Although k-mer representations are robust features, data matrices created from k-mers can reach very high dimensionality. KEVOLVE provides 99% reduction of the initial feature matrix while maintaining a high F1-score. | |
41 | C | Scalable nonparametric Bayesian models that predict and generate genome sequences | Alan N. Amin*, Eli N. Weinstein*, Jean Disset, Tessa D. Green, Debora S. Marks (*equal contribution) | Large-scale sequencing efforts have revealed complex genomic diversity and dynamics across biology, while advances in genome engineering and high-throughput synthesis have made modifying genomes and creating new sequences progressively easier. In principle, probabilistic modeling offers a powerful, rigorous and modular approach to analyzing sequence data, making predictions, and generating new sequence designs. In practice, generative probabilistic genome sequence models have been so far largely limited to the case where genetic variation exclusively takes the form of single nucleotide polymorphisms, as opposed to complex structural variation, despite the common occurrence of structural variation throughout evolution. In this article, we propose a new class of generative models, termed "embedded autoregressive" (EAR) models, which generalize both (a) widely used heuristic methods for analyzing complex genome data such as k-mer count comparisons and de novo local assembly algorithms and (b) a major class of statistical and machine learning models, autoregressive (AR) models. Inference in EAR models is just as scalable as for AR models, but they avoid misspecification via a nonparametric architecture. We provide a thorough theoretical analysis of the consistency of EAR models and their convergence rates, using Bayesian sieves. We illustrate the properties of EAR models and their advantages over standard AR models on both simulated and real data. | Video |
42 | C | Machine Learning Approaches for RNA Editing Prediction | Andrew J. Jung, Leo J. Lee, Alice J. Gao, Brendan J. Frey | RNA editing is a major post-transcriptional modification that contributes significantly to transcriptomic diversity and regulation of cellular processes. Exactly how cis-regulatory elements control RNA editing appears to be highly complex and remains largely unknown. However, with the improvement of computational methods for detecting and quantifying RNA editing from large-scale RNA-seq data, it becomes possible to build computational models of RNA editing. Here we report our attempt to develop machine learning models for predicting human A-to-I editing by training on a large number of highly confident RNA editing sites supported by observational RNA-seq data. Our models achieve good performance on held-out test evaluations. Furthermore, our deep convolutional model also generalizes well to a dataset from a controlled study. | |
43 | C | Feature selection and optimization to constrain anti-SARS-CoV2 antibody design and viral escape | Natalie K. Dullerud, Tea Freedman-Susskind, Priyanthi Gnanapragasam, Christopher Snow, Anthony P. West Jr., Vanessa D. Jonsson | Effective therapeutic strategies to mitigate the extent of the COVID-19 pandemic are crucial prior to the distribution of a readily available vaccine: this includes the isolation, design and prototyping of antibodies with enhanced neutralization against a broad range of virus mutations. We propose a new computational method that jointly constrains the antibody design space and solves for antibody cocktails to address virus escape. Our contributions are: a new statistical model, the saturated LASSO (satlasso) for feature selection and a new combinatorial optimization algorithm that inputs fitness landscapes and solves for antibody cocktails that can address escape through virus mutations. Our predictions guided the development of a new, optimized antibody, C105T28I-Y58F, with 15-fold more potent neutralization of SARS-CoV2 than its non-optimized counterpart; further, we propose a new antibody cocktail that can target a majority of mutations in the receptor binding domain (RBD) of the spike glycoprotein of SARS-CoV2. | Video |
44 | C | Reprogramming Language Models for Molecular Representation Learning | Ria Vinod, Pin-Yu Chen, Payel Das | Recent advancements in transfer learning have made it a promising approach for domain adaptation via transfer of learned representations. This is especially relevant when alternate tasks have limited samples of well-defined and labeled data, which is common in the molecule data domain. This makes transfer learning an ideal approach to solve molecular learning tasks. While adversarial reprogramming has proven to be a successful method to repurpose neural networks for alternate tasks, most works consider source and alternate tasks within the same domain. In this work, we propose a new algorithm, Representation Reprogramming via Dictionary Learning (R2DL), for adversarially reprogramming pretrained language models for molecular learning tasks, motivated by leveraging learned representations in massive state of the art language models. The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver to approximate a sparse representation of the encoded data, via dictionary learning. R2DL achieves the baseline established by state of the art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting, thereby establishing avenues for domain-agnostic transfer learning for tasks with molecule data. | Video |
45 | C | DeepCell Datasets: Building a Single-Cell ImageNet for Biology | Erick Moen, Geneva Miller, Noah Greenwald, Enrico Borba, Morgan Schwartz, Tyler Price, Isabella Camplisson, Nora Koe, Sunny Cui, Cole Pavelchek, Tom Dougherty, Takamasa Kudo, Ed Pao, Will Graf, Leeat Keren, Michael Angelo, David Van Valen | Advances in imaging and genomics have improved our ability to profile changes within biological matter. When paired with deep learning, these new technologies hold the promise of revolutionizing our ability to study these changes at the single-cell level. However, deep learning’s requirement for expansive training data remains a considerable barrier to adoption. In this work, we show that human-in-the-loop (HITL) AI systems are effective approaches to overcoming this data challenge for the problem of single-cell analysis. We report on the development of a new data infrastructure for biology and demonstrate its effectiveness in building two datasets spanning live cell imaging and multiplexed tissue data. This dataset spans 5 cell lines, 6 imaging platforms, 8 different tissue types and includes nearly 2 million single-cell annotations, making it the largest collection compiled to date. Furthermore, we have used this diverse training dataset to create performant models of segmentation and cell tracking. We believe the HITL system we have developed holds significant promise not only for the challenge of single-cell analysis with deep learning but for solving image analysis problems across the various domains of life. | |
46 | C | Protein model quality assessment using rotation-equivariant, hierarchical neural networks | Stephan Eismann*, Patricia Suriana*, Bowen Jing, Raphael Townshend, Ron Dror | We introduce a novel deep learning approach to score the quality of individual protein models using a rotation-equivariant, hierarchical neural network architecture. | Video |
47 | C | Disentangling behavioral dynamics with MDN-RNN | Keita Mori*, Haoyu Wang, Naohiro Yamauchi, Yu Toyoshima, Yuichi Iino (Univ. of Tokyo) | A fundamental goal of behavioral science is to reveal the behavioral components and to understand the generative process underlying each component. To achieve this goal, we need to map complex, high-dimensional behavior data to a low-dimensional behavioral component space. Several works have attempted this task, but some lacked, or had only a limited, ability to generate behavior, which means they could not model the generative process. Here, we show that mixture density network-recurrent neural network (MDN-RNN) models can disentangle the behavioral components and learn the generative process simultaneously. We demonstrate the performance of our framework on a behavioral dataset of the nematode C. elegans. Since this framework is broadly applicable to any stochastic sequential data, this work will be useful for modeling and obtaining meaningful representations of many biological phenomena that exhibit stochastic time evolution. | Video |
48 | C | A new graph-based clustering method with application to single-cell RNA-seq data from human pancreatic islets. | Hao Wu, Disheng Mao, Yuping Zhang, Zhiyi Chi, Michael Stitzel, and Zhengqing Ouyang | Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering method based on a novel graph-based approach. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio transformed single-cell RNA-seq data for human pancreatic islets. | Video |