Accepted Papers

Title Authors Abstract
MonarchNet: Differentiating Monarch Butterflies from Butterflies with Similar Phenotypes Thomas Y Chen (The Academy for Mathematics, Science, and Engineering)* In recent years, the monarch butterfly's iconic migration patterns have come under threat from a number of factors, from climate change to pesticide use. To track trends in their populations, scientists as well as citizen scientists must identify individuals accurately. This is key because there exist other species of butterfly, such as viceroy butterflies, that are "look-alikes," having similar phenotypes. To tackle this problem and to aid in more efficient identification, we present MonarchNet, the first comprehensive dataset consisting of butterfly imagery for monarchs and five look-alike species. We train a baseline deep-learning classification model to serve as a tool for differentiating monarch butterflies from their various look-alikes. We seek to contribute to the study of biodiversity and butterfly ecology by providing a novel method for computational classification of these particular butterfly species. The ultimate aim is to help scientists track monarch butterfly population and migration trends in the most precise manner possible.
Unsupervised identification of rat behavioral motifs across timescales Haozhe Shan (Harvard University)* Behaviors of several laboratory animals can be modeled as sequences of stereotyped behaviors, or behavioral motifs. However, identifying such motifs is a challenging problem. Behaviors have a multi-scale structure: the animal can be simultaneously performing a small-scale motif and a large-scale one (e.g. chewing and feeding). Motifs are compositional: a large-scale motif is a chain of smaller-scale ones, folded in (some behavioral) space in a specific manner. We demonstrate an approach which captures these structures, using rat locomotor data as an example. From the same dataset, we used a preprocessing procedure to create different versions, each describing motifs of a different scale. We then trained several Hidden Markov Models (HMMs) in parallel, one for each dataset version. This approach essentially forced each HMM to learn motifs on a different scale, allowing us to capture behavioral structures lost in previous approaches. By comparing HMMs with models representing different null hypotheses, we found that rat locomotion was composed of distinct motifs from second scale to minute scale. We found that transitions between motifs were modulated by rats' location in the environment, leading to non-Markovian transitions. To test the ethological relevance of motifs we discovered, we compared their usage between rats with differences in a high-level trait, prosociality. We found that these rats had distinct motif repertoires, suggesting that motif usage statistics can be used to infer internal states of rats. Our method is therefore an efficient way to discover multi-scale, compositional structures in animal behaviors. It may also be applied as a sensitive assay for internal states.
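
As a rough illustration of the multi-scale idea described above (not the authors' code), the sketch below fits one Gaussian HMM per timescale by smoothing the same locomotor trajectory with windows of different lengths; the window sizes, state counts, and use of hmmlearn are assumptions.

```python
# Illustrative sketch (not the authors' code): fit one Gaussian HMM per
# timescale by smoothing the same trajectory with windows of different
# lengths, so each model is forced to pick up motifs at its own scale.
import numpy as np
from scipy.ndimage import uniform_filter1d
from hmmlearn import hmm  # pip install hmmlearn

def fit_multiscale_hmms(positions, windows=(5, 50, 500), n_states=10, seed=0):
    """positions: (T, 2) array of x/y coordinates; windows in frames."""
    positions = np.asarray(positions, dtype=float)
    models, motif_labels = {}, {}
    for w in windows:
        smoothed = uniform_filter1d(positions, size=w, axis=0)  # coarse-grain
        features = np.diff(smoothed, axis=0)                    # step vectors
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=100, random_state=seed)
        m.fit(features)
        models[w] = m
        motif_labels[w] = m.predict(features)  # motif label per frame, per scale
    return models, motif_labels

# models, motifs = fit_multiscale_hmms(xy_tracks)  # xy_tracks: (T, 2) locomotion
```
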
Characterizing Electrocardiogram Signals Using Capsule Networks Hirunima Jayasekara (University of Moratuwa)*; Vinoj Jayasundara (University of Moratuwa); Mohamed Athif (Boston University); Sandaru Jayasekara (University of Moratuwa); Suranga Seneviratne (University of Sydney); Ranga Rodrigo (University of Moratuwa) Capsule networks excel at understanding spatial relationships in 2D data for vision-related tasks. TimeCaps is a capsule network designed to capture temporal relationships in 1D signals. In TimeCaps, we generate capsules along the temporal and channel dimensions, creating two feature detectors that learn contrasting relationships, prior to projecting the input signal into a concise latent representation. We demonstrate the performance of TimeCaps on a variety of 1D signal processing tasks including characterization, classification, decomposition, compression and reconstruction, utilizing instantiation parameters inherently learnt by the capsule networks. TimeCaps surpasses state-of-the-art results, achieving an accuracy of 96.96% on classifying 13 Electrocardiogram (ECG) signal beat categories.

(not included in the final proceedings at authors' request)

DenseHMM: Learning Hidden Markov Models by Learning Dense Representations Joachim Sicking (Fraunhofer IAIS)*; Maximilian Alexander Pintz (Fraunhofer IAIS); Maram Akila (Fraunhofer IAIS); Tim Wirtz (Fraunhofer IAIS) We propose DenseHMM, a modification of Hidden Markov Models (HMMs) that learns dense representations of both the hidden states and the observables. Compared to the standard HMM, transition probabilities are not atomic but are composed of these representations via kernelization. Our approach enables constraint-free and gradient-based optimization. We propose two optimization schemes that make use of this: a modification of the Baum-Welch algorithm and a direct co-occurrence optimization. The latter is highly scalable and, empirically, comes without loss of performance compared to standard HMMs. We show that the non-linearity of the kernelization is crucial for the expressiveness of the representations. Properties of the DenseHMM, such as learned co-occurrences and log-likelihoods, are studied empirically on synthetic and biomedical datasets.
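
A minimal numpy sketch of the kernelized-transition idea, assuming a simple exponentiated dot-product kernel between learned state vectors; the dimensions and names are illustrative, not the paper's exact parameterization.

```python
# Transition probabilities arise from dense state embeddings via a kernel
# followed by row-wise normalization, rather than being free parameters.
import numpy as np

def dense_transition_matrix(U, V):
    """U, V: (n_states, d) embeddings of 'from' and 'to' states."""
    logits = U @ V.T                                 # kernelized compatibility
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    A = np.exp(logits)
    return A / A.sum(axis=1, keepdims=True)          # each row is a distribution

rng = np.random.default_rng(0)
U, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
A = dense_transition_matrix(U, V)  # plug into the usual HMM recursions;
                                   # gradients flow into U and V directly
```
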

(not included in the final proceedings at authors' request)

A Cross-Level Information Transmission Network for Predicting Phenotype from New Genotype: Application to Cancer Precision Medicine Di He (CUNY Graduate Center); Lei Xie (Hunter College, CUNY)* An unsolved fundamental problem in biology and ecology is to predict observable traits (phenotypes) from a new genetic constitution (genotype) of an organism under environmental perturbations (e.g., drug treatment). The emergence of multiple omics data provides new opportunities but imposes great challenges in the predictive modeling of genotype-phenotype associations. Firstly, the high dimensionality of genomics data and the lack of labeled data often make existing supervised learning techniques less successful. Secondly, it is challenging to integrate heterogeneous omics data from different resources. Finally, the information transmission from DNA to phenotype involves multiple intermediate levels of RNA, protein, metabolite, etc., and higher-level features (e.g., gene expression) usually have stronger discriminative power than lower-level features (e.g., somatic mutation). To address the above issues, we propose a novel Cross-LEvel Information Transmission network (CLEIT) framework. CLEIT aims to explicitly model the asymmetrical multi-level organization of the biological system. Inspired by domain adaptation, CLEIT first learns a latent representation of the high-level domain and then uses it as a ground-truth embedding to improve representation learning of the low-level domain via a contrastive loss. In addition, we adopt a pre-training/fine-tuning approach to leverage the unlabeled heterogeneous omics data and improve the generalizability of CLEIT. We demonstrate the effectiveness and performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations with the assistance of gene expression, compared with state-of-the-art methods.
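
An assumption-laden PyTorch toy of the cross-level idea: a high-level encoder (gene expression) provides target embeddings that a low-level encoder (somatic mutation) learns to match alongside the downstream task. An MSE alignment term stands in for the paper's contrastive loss, and all layer sizes are made up.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_out))

expr_enc = mlp(5000, 64)      # high-level domain (pretrained, kept frozen here)
mut_enc = mlp(3000, 64)       # low-level domain
head = nn.Linear(64, 1)       # e.g. drug-sensitivity regression
opt = torch.optim.Adam(list(mut_enc.parameters()) + list(head.parameters()), lr=1e-3)

def training_step(expr, mut, y):
    with torch.no_grad():
        target = expr_enc(expr)                 # "ground-truth" embedding
    z = mut_enc(mut)
    align = ((z - target) ** 2).mean()          # pull low-level toward high-level
    task = nn.functional.mse_loss(head(z).squeeze(-1), y)
    loss = task + align
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# training_step(torch.randn(16, 5000), torch.randn(16, 3000), torch.randn(16))
```
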
Streamlining value of protein embeddings through bio_embeddings Christian Dallago (Technical University of Munich)*; Konstantin Schütze (Technical University of Munich); Michael Heinzinger (Technical University of Munich); Tobias Olenyi (Technical University of Munich); Maria Littmann (Technical University of Munich); Amy Lu (University of Toronto); Kevin Yang (Microsoft); Seonwoo Min (Seoul National University); Sungroh Yoon (Seoul National University); Burkhard Rost (Technical University of Munich) Protein-based machine learning models increasingly contribute toward the guidance of experimental decision making. Models capable of quickly classifying entire biomes could help focus experiments on promising novelty. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to decoding the language of life found in protein sequences. Such protein LMs show enormous potential in generating descriptive representations for proteins from just their sequences at a fraction of the time compared to previous approaches. LMs convert amino acid sequences into embeddings (vector representations) useful for several downstream tasks, including analysis and the prediction of aspects of protein function and structure. A buzzing variety of protein LMs is being generated worldwide that, in its diversity, is likely to shine light on different angles of the protein language. Unfortunately, these resources are scattered over the web. The bio_embeddings pipeline offers a unified interface to protein LMs to simply and quickly embed large protein sets, to project the embeddings in lower dimensional spaces, to visualize proteins on interactive scatter plots, and to extract annotations through supervised or unsupervised techniques. This enables quick hypothesis generation and testing. The pipeline is accompanied by a web server that offers to embed, project, visualize, and extract annotations for small protein datasets directly online, without the need to install software.
Protein model quality assessment using rotation-equivariant, hierarchical neural networks Stephan Eismann (Stanford University)*; Patricia Suriana (Stanford); Bowen Jing (Stanford University); Raphael Townshend (Stanford University); Ron Dror (Stanford) Proteins are miniature machines whose function depends on their three-dimensional (3D) structure. Determining this structure computationally remains an unsolved grand challenge. A major bottleneck involves selecting the most accurate structural model among a large pool of candidates, a task addressed in model quality assessment. Here, we present a novel deep learning approach to assess the quality of a protein model. Our network builds on a point-based representation of the atomic structure and rotation-equivariant convolutions at different levels of structural resolution. These combined aspects allow the network to learn end-to-end from entire protein structures. Our method achieves state-of-the-art results in scoring protein models submitted to recent rounds of CASP, a blind prediction community experiment. Particularly striking is that our method does not use physics-inspired energy terms and does not rely on the availability of additional information (beyond the atomic structure of the individual protein model), such as sequence alignments of multiple proteins.

(not included in the final proceedings at authors' request)

Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models Pascal Sturmfels (University of Washington); Jesse Vig (Salesforce Research)*; Ali Madani (Salesforce Research); Nazneen Fatema Rajani (Salesforce Research) For protein sequence datasets, unlabeled data has greatly outpaced labeled data due to the high cost of wet-lab characterization. Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks. However, the optimal pre-training strategy remains an open question. Instead of strictly borrowing from natural language processing (NLP) in the form of masked or autoregressive language modeling, we introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments. Using a set of five standardized downstream tasks for protein models, we demonstrate that our pre-training task along with a multi-task objective outperforms masked language modeling alone on all five tasks. Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases that go beyond existing language modeling techniques in NLP.
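
To make the pre-training target concrete, the toy sketch below computes a per-position amino-acid frequency profile from a small set of aligned sequences; in the paper the profiles come from real multiple sequence alignments, and the alphabet and gap handling here are simplifications.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def msa_profile(msa):
    """msa: list of equal-length aligned sequences -> (L, 21) frequency matrix."""
    L = len(msa[0])
    counts = np.zeros((L, len(ALPHABET)))
    for seq in msa:
        for i, aa in enumerate(seq):
            counts[i, ALPHABET.index(aa)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

profile = msa_profile(["MKT-LV", "MRT-LV", "MKTALV"])  # the model predicts this
```
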
Protein Structural Alignments From Sequence James T Morton (Simons Foundation)*; Charlie Strauss (Los Alamos National Laboratory); Robert Blackwell (Simons Foundation); Daniel Berenberg (Simons Foundation); Vladimir Gligorijevic (Simons Foundation); Richard Bonneau (Simons Foundation) Computing sequence similarity is a fundamental task in biology, with alignment forming the basis for the annotation of genes and genomes and providing the core data structures for evolutionary analysis. Standard approaches are a mainstay of modern molecular biology and rely on variations of edit distance to obtain explicit alignments between pairs of biological sequences. However, sequence alignment algorithms struggle with remote homology tasks and cannot identify similarities between many pairs of proteins with similar structures and likely homology. Recent work suggests that using machine learning language models can improve remote homology detection. To this end, we introduce DeepBLAST, which obtains explicit alignments from residue embeddings learned by a protein language model integrated into an end-to-end differentiable alignment framework. This approach can be accelerated on GPU architectures and outperforms conventional sequence alignment techniques in terms of both speed and accuracy when identifying structurally similar proteins.
Learning a low dimensional manifold of real cancer tissue with PathologyGAN Adalberto Claudio Quiros (University of Glasgow)*; Roderick Murray-Smith (University of Glasgow); Ke Yuan (University of Glasgow) Histopathological images contain information about how a tumor interacts with its micro-environment. Better understanding of such interactions holds the key to improved diagnosis and treatment of cancer. Deep learning shows promise for achieving those goals; however, its application is limited by the cost of high-quality labels. Unsupervised learning, in particular deep generative models with representation learning properties, provides an alternative path to further understanding cancer tissue phenotypes by capturing tissue morphologies. We present a deep generative model that learns to simulate high-fidelity cancer tissue images while mapping the real images onto an interpretable low dimensional latent space. The key to the model is an encoder trained by a previously developed generative adversarial network, PathologyGAN. Here we provide examples of how the latent space holds morphological characteristics of cancer tissue (e.g. tissue type or cancer, lymphocytes, and stroma cells). We tested the general applicability of our representations in three different settings: latent space visualization, training a tissue type classifier over latent representations, and multiple instance learning (MIL). Latent visualizations of breast cancer tissue show that distinct regions of the latent space enfold different characteristics (stroma, lymphocytes, and cancer cells). A logistic regression for colorectal tissue type classification trained over latent projections achieves 87% accuracy. Finally, we used attention-based deep MIL to predict the presence of epithelial cells in colorectal tissue, achieving 90% accuracy. Our results show that PathologyGAN captures distinct phenotype characteristics, paving the way for further understanding of the tumor micro-environment and ultimately refining histopathological classification for diagnosis and treatment.
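
A sketch of the kind of latent-space evaluation described above: a plain logistic regression over encoder outputs for tissue-type classification. The arrays below are random stand-ins for PathologyGAN latent projections and tissue labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 64))       # placeholder for encoder outputs
labels = rng.integers(0, 8, size=1000)      # placeholder tissue-type labels

X_tr, X_te, y_tr, y_te = train_test_split(latents, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("tissue-type accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```
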
Disentangling behavioral dynamics with MDN-RNN Keita Mori (University of Tokyo)*; Haoyu Wang (University of Tokyo); Naohiro Yamauchi (University of Tokyo); Yu Toyoshima (University of Tokyo); Yuichi Iino (The University of Tokyo) A fundamental goal of behavioral science is to reveal the components of behavior and to understand the generative process underlying each behavioral component. To achieve this goal, we need to disentangle complex, high-dimensional behavioral data into a low-dimensional space of behavioral components. However, while several attempts to solve this problem have been made, some of them generate these behaviors poorly, which means that they cannot model the generative process. Here, we show that mixture density network-recurrent neural network (MDN-RNN) models can disentangle the behavioral components and learn the generative process simultaneously. We demonstrate the performance of our framework on a behavioral dataset of the nematode Caenorhabditis elegans (C. elegans). Since this model is broadly applicable to any stochastic sequential data, this work will be useful for modeling and obtaining meaningful representations of many biological phenomena that exhibit stochastic time evolution.
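
For readers unfamiliar with MDN-RNNs, here is a minimal, illustrative PyTorch sketch (not the authors' architecture): a GRU summarizes the behavior history and a mixture density head parameterizes a Gaussian mixture over the next observation, trained by negative log-likelihood. Dimensions and mixture size are assumptions.

```python
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    def __init__(self, obs_dim=4, hidden=64, n_mix=5):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_mix * (2 * obs_dim + 1))
        self.obs_dim, self.n_mix = obs_dim, n_mix

    def forward(self, x):                                   # x: (B, T, obs_dim)
        h, _ = self.rnn(x)
        p = self.head(h)
        logit_pi, mu, log_sigma = torch.split(
            p, [self.n_mix, self.n_mix * self.obs_dim, self.n_mix * self.obs_dim], dim=-1)
        mu = mu.reshape(*mu.shape[:-1], self.n_mix, self.obs_dim)
        sigma = log_sigma.reshape_as(mu).exp()
        return logit_pi, mu, sigma

def mdn_nll(logit_pi, mu, sigma, target):
    """Negative log-likelihood of target (B, T, obs_dim) under the mixture."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(target.unsqueeze(-2)).sum(-1)  # (B, T, n_mix)
    return -torch.logsumexp(torch.log_softmax(logit_pi, -1) + log_prob, -1).mean()

model = MDNRNN()
x = torch.randn(2, 50, 4)                      # (batch, time, behavioral features)
loss = mdn_nll(*model(x[:, :-1]), x[:, 1:])    # predict the next time step
```
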
Topological Data Analysis of copy number alterations in cancer Stefan Groha (Dana Farber Cancer Institute); Caroline Weis (ETH Zurich)*; Alexander Gusev (Dana Farber Cancer Institute); Bastian A Rieck (MLCB, D-BSSE, ETH Zurich) Identifying subgroups and properties of cancer biopsy samples is a crucial step towards obtaining precise diagnoses and being able to perform personalized treatment of cancer patients. Recent data collections provide a comprehensive characterization of cancer cell data, including genetic data on copy number alterations (CNAs). We explore the potential to capture information contained in cancer genomic information using a novel topology-based approach that encodes each cancer sample as a persistence diagram of topological features, i.e., high-dimensional voids represented in the data. We find that this technique has the potential to extract meaningful low-dimensional representations in cancer somatic genetic data and demonstrate the viability of some applications to finding substructures in cancer data as well as to comparing the similarity of cancer types.
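
A short sketch of the featurization step, assuming the ripser package; the per-sample point cloud built from CNA data is a random stand-in here, since how that cloud is constructed is not reproduced from the paper.

```python
import numpy as np
from ripser import ripser  # pip install ripser

def cna_persistence_diagrams(sample_points, maxdim=1):
    """sample_points: (n_points, n_features) cloud for one cancer sample."""
    return ripser(sample_points, maxdim=maxdim)["dgms"]  # (birth, death) arrays per dimension

rng = np.random.default_rng(0)
dgms = cna_persistence_diagrams(rng.normal(size=(60, 4)))  # H0 and H1 diagrams
```
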
Towards a general framework for spatio-temporal transcriptomics Julie Pinol (LIS (CNRS, UMR 7020)); Thierry Artières (Aix-Marseille Université); Paul Villoutreix (Turing Center for Living Systems)* Position and dynamics of cells are essential pieces of information for the study of embryonic development. Unfortunately, this information is lost in many cell gene expression analysis processes, such as single cell RNA sequencing. Being able to predict the physical positions and the temporal dynamics of cells from gene expression data is therefore a major challenge. After motivating our study with data from C. elegans development, we first review current methods based on optimal transport that aim at either predicting the spatial position of cells from transcriptomic data or interpolating differentiation trajectories from time series of transcriptomic data. However, these methods are not designed to capture simple temporal transformations of spatial data such as a rotation. We therefore propose an extension of the framework of Nitzan et al. (2019) that includes a temporal regularization for the inference of the optimal transport plan. This new framework is tested on artificial data using a combination of the Sinkhorn algorithm and gradient descent. We show that we can successfully learn simple dynamic transformations from very high dimensional data.
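
The optimal-transport machinery referred to above can be illustrated with a compact entropic Sinkhorn solver; the temporal regularization proposed in the paper is omitted here, and the squared-distance cost matrix is a toy example.

```python
import numpy as np

def sinkhorn(C, eps=0.05, n_iter=200):
    """C: (n, m) cost matrix -> approximate optimal transport plan (n, m)."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 2)), rng.normal(size=(120, 2))   # cells vs. positions
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
C /= C.max()                  # scale costs so exp(-C/eps) stays well-conditioned
P = sinkhorn(C)               # rows ~ cells, columns ~ candidate spatial positions
```
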
Multiscale PHATE Exploration of SARS-CoV-2 Data Reveals Signature of Disease MANIK KUCHROO (Yale University)*; Jessie Huang (Yale University); Patrick Wong (Yale School of Medicine); Akiko Iwasaki (Yale School of Medicine); Guy Wolf (Université de Montréal); Smita Krishnaswamy (Yale University) The biomedical community is producing increasingly high dimensional datasets integrated from hundreds of patient samples that current computational techniques are unable to explore. We propose a novel approach, called Multiscale PHATE, which learns increasingly abstracted features from the data to produce high level summarizations and detailed representations of dataset subsets in a computationally efficient manner. Multiscale PHATE utilizes a continuous data coarse graining approach called diffusion condensation to create a tree containing all levels of data granularity, and then selects layers for visualization that start from a coarse grained summary, and then zoom in to reveal more detail. We apply this computational approach to study the evolution of the patient immune response to SARS-CoV-2 infection in 22 million cells measured via flow cytometry. Through our analysis of patient samples, we identify a pathologic non-activated neutrophil response enriched in the most severely ill patients.
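
A hedged numpy sketch of the diffusion-condensation step that Multiscale PHATE builds on: repeatedly apply a row-normalized kernel diffusion operator so points contract toward their neighbors, and merge points that collapse together. Bandwidth, merge tolerance, and iteration count are illustrative choices, not the published algorithm's settings.

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_condensation(X, bandwidth=1.0, merge_tol=1e-2, n_levels=20):
    X = np.asarray(X, dtype=float)
    levels = [X.copy()]
    for _ in range(n_levels):
        K = np.exp(-(cdist(X, X) / bandwidth) ** 2)
        P = K / K.sum(axis=1, keepdims=True)        # diffusion operator
        X = P @ X                                   # one condensation step
        # merge points that have (numerically) collapsed onto each other
        _, keep = np.unique(np.round(X / merge_tol).astype(int), axis=0, return_index=True)
        X = X[np.sort(keep)]
        levels.append(X.copy())
        if len(X) == 1:
            break
    return levels   # coarse-graining tree: one point set per level of granularity

# levels = diffusion_condensation(cell_features)  # e.g. flow-cytometry markers
```
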
Machine Learning Approaches for RNA Editing Prediction Andrew J Jung (University of Toronto)* RNA editing is a major post-transcriptional modification that contributes significantly to transcriptomic diversity and regulation of cellular processes. Exactly how cis-regulatory elements control RNA editing appears to be highly complex and remains largely unknown. However, with the improvement of computational methods for detecting and quantifying RNA editing from large-scale RNA-seq data, it has become possible to build computational models of RNA editing. Here we report our attempts to develop machine learning models for A-to-I editing prediction in human by training on a large number of highly confident RNA editing sites supported by observational RNA-seq data. Our models achieved good performance on held-out test evaluations. Furthermore, our deep convolutional model also generalizes well to a controlled study dataset.
A multi-view generative model for molecular representation improves prediction tasks Jonathan Yin Yin (Yale University); Hattie Chung (Broad Institute)*; Aviv Regev (Klarman Cell Observatory, Broad Institute of MIT and Harvard, Howard Hughes Medical Institute, Department of Biology, Massachusetts Institute of Technology, Cambridge MA 02142) Unsupervised generative models have been a popular approach to representing molecules. These models extract salient molecular features to create compact vectors that can be used for downstream prediction tasks. However, current generative models for molecules rely mostly on structural features and do not fully capture global biochemical features. Here, we propose a multi-view generative model that integrates low-level structural features with global chemical properties to create a more holistic molecular representation. In proof-of-concept analyses, compared to purely structural latent representations, multi-view latent representations improve model accuracy on various tasks when used as input to feed-forward prediction networks. For some tasks, simple models trained on multi-view representations perform comparably to more complex supervised methods. Multi-view representations are an attractive method to improve representations in an unsupervised manner, and could be useful for prediction tasks, particularly in contexts where data is limited.
Learning to localize mutation in lung adenocarcinoma histopathology images Sahar Shahamatdar (Brown University)*; Daryoush Saeed-Vafa (Moffitt Cancer Center); Drew Linsley (Brown University); Lester Li (Rochester University); Sohini Ramachandran (Brown University); Thomas Serre (Brown University) Molecular profiling of cancers is necessary to identify the optimal therapeutic options for patients. However, these assays have notable limitations: they are time- and resource-intensive to perform, and they cannot accurately capture mutational heterogeneity. Here, we present a novel approach to address these issues, which combines state-of-the-art attention models from computer vision with precise laser capture microdissection (LCM) molecular annotations on whole slide images (WSIs) of tumor tissue. We apply our model to 221 WSIs of lung adenocarcinoma to detect mutations in KRAS, the most commonly mutated gene in lung cancer. Our model is significantly more accurate at detecting and localizing KRAS mutations in WSIs than current leading approaches, which, in contrast to our approach, are trained on individual image patches extracted from a WSI. We further show that LCM data are the critical data source for developing computer vision models that can contribute to precision medicine, as model performance monotonically improves with additional LCM data, but not with additional WSIs. Our work demonstrates that computer vision is capable of rapid and interpretable mutation detection in WSIs provided that the appropriate model architecture is optimized with spatially-resolved datasets.

(not included in the final proceedings at authors' request)

KEVOLVE: a combination of genetic algorithm and ensemble method to classify viruses with variability Abdoulaye Banire Diallo (UQAM)*; Amine M. Remita (Université du Québec à Montréal); Dylan Lebatteux (UQAM) Motivation: Feature selection is an important step when solving machine learning problems. Careful and strategic feature selection methods can increase the performance of machine learning algorithms, reduce high dimensional data and make models and data more understandable. Most feature selection methods are designed to provide an "optimal" subset of features. Nevertheless, in the field of genomic classification, data representations are often derived from nucleotide subsequence information. In this context, the feature space is exponential in the size of the genomes and is subject over time to mutations affecting the genomic sequences. One way of overcoming such challenges could be an accurate and efficient algorithm that captures a bag of minimal subsets of features representing the most likely alternative and evolutionary space. Results: In this paper, we introduce KEVOLVE, a new method based on a genetic algorithm that includes a machine learning kernel to extract a bag of minimal subsets of features maximizing a given score threshold. KEVOLVE was coupled with an ensemble prediction model based on support vector machines and applied to the classification of Human Immunodeficiency Virus (HIV) genomic sequences. The extracted subsets of features allowed a 99% reduction in the dimension of the initial feature matrices while outperforming state-of-the-art HIV predictors. The built models are robust in the presence of different mutation rates. Moreover, the bag of extracted features can give meaningful insight into various problems such as rapid sequencing, genomic classification or the identification of biological functions. Availability: The source code of the method and detailed results are available at https://github.com/bioinfoUQAM/Kevolve
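
The released implementation is at the repository linked above; the following is only a schematic of the genetic-algorithm loop it describes, with a cross-validated SVM score as fitness and a bag that collects every feature subset passing the score threshold. Population size, mutation rate, and scoring metric are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def fitness(X, y, subset):
    return cross_val_score(LinearSVC(), X[:, subset], y, cv=3, scoring="f1_macro").mean()

def evolve_subsets(X, y, subset_size=20, pop_size=30, generations=25,
                   threshold=0.95, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = [rng.choice(n_feat, subset_size, replace=False) for _ in range(pop_size)]
    bag = []
    for _ in range(generations):
        scores = np.array([fitness(X, y, s) for s in pop])
        bag += [pop[i] for i in np.where(scores >= threshold)[0]]
        # tournament selection, then crossover and point mutation
        parents = [pop[max(rng.choice(pop_size, 2), key=lambda i: scores[i])]
                   for _ in range(pop_size)]
        children = []
        for p1, p2 in zip(parents[::2], parents[1::2]):
            child = rng.choice(np.union1d(p1, p2), subset_size, replace=False)
            if rng.random() < 0.3:
                new = rng.integers(n_feat)
                if new not in child:
                    child[rng.integers(subset_size)] = new
            children.append(child)
        pop = children + parents[:pop_size - len(children)]
    return bag   # bag of minimal feature subsets reaching the score threshold

# bag = evolve_subsets(kmer_counts, labels)  # kmer_counts: (n_sequences, n_kmers)
```
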
Inferring Happiness From Video for Better Insight Into Emotional Processing in the Brain Emil Azadian (Technical University of Berlin)*; Gautham Velchuru (Microsoft); Nancy X.R. Wang (IBM Research - Almaden); Steven M. Peterson (University of Washington); Valentina Staneva (University of Washington); Bingni Brunton (University of Washington) Gaining a better understanding of which brain regions are responsible for emotional processing is crucial for the development of novel treatments for neuropsychiatric disorders. Current approaches rely on sparse assessments of subjects' emotional states, rarely reaching more than a hundred per patient. Additionally, data are usually obtained in a task-solving scenario, possibly influencing subjects' emotions by study design. Here, we utilize several days' worth of near-continuous neural and video recordings of subjects in a naturalistic environment to predict the emotional state of happiness from neural data. We are able to obtain high-frequency and high-volume happiness labels for this task by predicting happiness from video data in an intermediary step, achieving good results (F1 = 0.75) and providing us with more than 6 million happiness assessments per patient, on average. We then utilize these labels for a classifier on neural data (F1 = 0.71). Our findings can provide a potential pathway for future work on emotional processing that circumvents the mentioned restrictions.
Constructing Data-Driven Network Models of Cell Dynamics with Perturbation Experiments Bo Yuan (Harvard University)*; Ciyue Shen (); Chris Sander (Dana-Farber Cancer Institute) Systematic perturbation of cells followed by comprehensive measurements of molecular and phenotypic responses provides informative data resources for constructing computational models of cell biology. Existing machine learning models of cell dynamics have limited effectiveness to find global optima in a high-dimensional space and/or lack interpretability in terms of cellular mechanisms. Here we introduce a hybrid approach that combines explicit protein-protein and protein-phenotype interaction models of cell dynamics with automatic differentiation. We tested the modeling framework on a perturbation-response dataset of a melanoma cell line with drug treatments. These machine learning models can be efficiently trained to describe cellular behavior with good accuracy but also can provide direct mechanistic interpretation. The predictions and inference of interactions are robust against simulated experimental noise. The approach is readily applicable to a broad range of kinetic models of cell biology and provides encouragement for the collection of large-scale perturbation-response datasets.
Transfer learning framework for cell segmentation with incorporation of geometric features Yinuo Jin (Columbia University)*; Alexandre A Toberoff (Columbia University); Elham Azizi (Columbia University) With recent advances in multiplexed imaging and spatial transcriptomic and proteomic technologies, cell segmentation is becoming a crucial step in biomedical image analysis. In recent years, Fully Convolutional Networks (FCN) have achieved great success in nuclei segmentation in in vitro imaging. Nevertheless, it remains challenging to perform similar tasks on in situ tissue images with more cluttered cells of diverse shapes. To address this issue, we propose a novel transfer-learning cell segmentation framework that incorporates shape-aware features in a deep learning model, with multi-level watershed and morphological post-processing steps. Our results show that incorporation of geometric features improves generalizability to segmenting cells in in situ tissue images, using solely in vitro images as training data.
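
A sketch of the watershed/morphological post-processing stage, assuming a per-pixel probability map from the segmentation network; thresholds and minimum object sizes are illustrative.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.feature import peak_local_max
from skimage.morphology import remove_small_objects

def postprocess(prob_map, threshold=0.5, min_size=30, min_distance=5):
    mask = remove_small_objects(prob_map > threshold, min_size=min_size)
    distance = ndi.distance_transform_edt(mask)          # geometric (shape) cue
    peaks = peak_local_max(distance, min_distance=min_distance, labels=mask)
    markers = np.zeros_like(mask, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-distance, markers, mask=mask)       # labeled cell instances

# instances = postprocess(net_output)  # net_output: (H, W) probability map
```
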
Deconvolution of bulk genomics data using single-cell measurements via neural networks Anastasiya Belyaeva (Massachusetts Institute of Technology)*; Caroline Uhler (MIT) In order to characterize tissues, bulk genomics data is often collected. Since such data represents an average over thousands of cells, information about cell subpopulations that make up the bulk sample is lost. Single-cell data can be collected to alleviate this problem; however, for many modalities single-cell methods are still underdeveloped. We present a novel computational approach for deconvolving bulk genomics data, in particular bulk histone marks, into histone profiles of the cell sub-types that compose the bulk sample using a neural network. For deconvolution we rely on single-cell data of modalities that are widely available (e.g. single-cell ATAC-seq). We apply our method to the deconvolution of H3K9ac and H3K4me1 histone modifications from bulk into H3K9ac and H3K4me1 modifications of T-cells and monocytes, the cell types that make up the bulk mixture. We show that our method recovers the true measurements with high accuracy. Our framework enables obtaining cell-type-specific predictions for assays that are expensive or difficult to collect at the single-cell level but can be measured in bulk.
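
One plausible reading of the setup, sketched below with illustrative shapes: a network maps single-cell-derived features for each cell type to a histone-mark profile, and training matches the fraction-weighted mixture of those profiles to the observed bulk signal. This is an assumption-heavy toy, not the authors' model.

```python
import torch
import torch.nn as nn

n_bins, n_celltypes = 2000, 2                     # e.g. T cells and monocytes
net = nn.Sequential(nn.Linear(n_bins, 512), nn.ReLU(), nn.Linear(512, n_bins))

def bulk_loss(sc_features, bulk_target, fractions):
    """sc_features: (n_celltypes, n_bins); bulk_target: (n_bins,); fractions: (n_celltypes,)."""
    per_type = net(sc_features)                    # deconvolved per-cell-type profiles
    mixed = (fractions[:, None] * per_type).sum(0) # reconstruct the bulk measurement
    return nn.functional.mse_loss(mixed, bulk_target), per_type

loss, profiles = bulk_loss(torch.randn(n_celltypes, n_bins),
                           torch.randn(n_bins),
                           torch.tensor([0.6, 0.4]))
```
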
Predicting Cellular Drug Sensitivity using Conditional Modulation of Gene Expression William Connell (UCSF)*; Michael J Keiser (University of California, San Francisco) Selecting drugs most effective against a tumor’s specific transcriptional signature is an important challenge in precision medicine. To assess oncogenic therapy options, cancer cell lines are dosed with drugs that can differentially impact cellular viability. Here we show that basal gene expression patterns can be conditioned by learned small molecule structure to better predict cellular drug sensitivity, achieving an R^2 of 0.7190±0.0098 (a 5.61% gain). We find that 1) transforming gene expression values by learned small molecule representations outperforms raw feature concatenation, 2) small molecule structural features meaningfully contribute to learned representations, and 3) an affine transformation best integrates these representations. We analyze conditioning parameters to determine how small molecule representations modulate gene expression embeddings. This ongoing work formalizes in silico cellular screening as a conditional task in precision oncology applications that can improve drug selection for cancer treatment.
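
A minimal sketch of the conditioning described above: a learned small-molecule representation produces a scale-and-shift (affine, FiLM-style) transformation of the gene-expression embedding before the sensitivity head. The fingerprint size, hidden sizes, and names are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedPredictor(nn.Module):
    def __init__(self, n_genes=978, fp_dim=2048, hidden=256):
        super().__init__()
        self.expr_enc = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.film = nn.Linear(fp_dim, 2 * hidden)           # -> (gamma, beta)
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, expression, fingerprint):
        h = self.expr_enc(expression)
        gamma, beta = self.film(fingerprint).chunk(2, dim=-1)
        return self.out(gamma * h + beta).squeeze(-1)       # predicted sensitivity

model = ConditionedPredictor()
pred = model(torch.randn(8, 978), torch.randn(8, 2048))
```
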
Deconvolving fitness and phylogeny in generative models of molecular evolution Eli N Weinstein (Harvard University)*; Jonathan Frazer (Harvard University); Debora Marks () Large-scale molecular evolution is shaped by the combined forces of fitness and phylogeny. Phylogenetic models are routinely applied to infer evolutionary dynamics and reconstruct evolutionary histories from genomic sequences, while generative fitness models are increasingly applied to the same large-scale sequence data and used in a range of downstream applications including 3D structure prediction, clinical variant effect prediction, protein design, and more. Generative joint fitness-phylogeny models, however, have remained largely unexplored, owing primarily to computational challenges. In this article, we (1) uncover fundamental statistical problems in models of phylogeny that fail to account for fitness and models of fitness that fail to account for phylogeny, (2) propose a general new class of generative joint fitness-phylogeny models that can be applied to a wide range of data types, and (3) derive novel scalable Bayesian inference algorithms that enable the model to be applied to massive datasets. Our core theoretical results convert models from evolutionary systematics and quantitative genetics into Bayesian nonparametric models using tools from computational geometry, physics and theoretical statistics. Our results lay out a path towards reliable and accurate inference of fitness landscapes from evolutionary data for a wide variety of genetic elements, and towards forecasting evolutionary trajectories along these landscapes.
Scalable nonparametric Bayesian models that predict and generate genome sequences Alan Amin (Harvard University); Eli N Weinstein (Harvard University)*; Jean Disset (Harvard University); Tessa Green (Harvard University); Debora Marks () Large-scale sequencing efforts have revealed complex genomic diversity and dynamics across biology, while advances in genome engineering and high-throughput synthesis have made modifying genomes and creating new sequences progressively easier. In principle, probabilistic modeling offers a powerful, rigorous and modular approach to analyzing sequence data, making predictions, and generating new sequence designs. In practice, generative probabilistic genome sequence models have so far been largely limited to the case where genetic variation exclusively takes the form of single nucleotide polymorphisms, as opposed to complex structural variation, despite the common occurrence of structural variation throughout evolution. In this article, we propose a new class of generative models, termed "embedded autoregressive" (EAR) models, which generalize both (a) widely used heuristic methods for analyzing complex genome data such as kmer count comparisons and de novo local assembly algorithms and (b) a major class of statistical and machine learning models, autoregressive (AR) models. Inference in EAR models is just as scalable as for AR models, but they avoid misspecification via a nonparametric architecture. We provide a thorough theoretical analysis of the consistency of EAR models and their convergence rates, using Bayesian sieves. We illustrate the properties of EAR models and their advantages over standard AR models on both simulated and real data.
PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction Nicolas Swenson (Lawrence Berkeley National Lab)*; Aditi Krishnapriyan (Lawrence Berkeley National Laboratory) Understanding protein structure-function relationships is a key challenge in computational biology, with applications across the biotechnology and pharmaceutical industries. While it is known that protein structure directly impacts protein function, many functional prediction tasks use only protein sequence. In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank in order to study the expressiveness of different structure-based prediction schemes. We present PersGNN---an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis to capture a complex set of both local and global structural features. While variations of these techniques have been successfully applied to proteins before, we demonstrate that our hybridized approach, PersGNN, outperforms either method on its own as well as a baseline neural network that learns from the same information. PersGNN achieves a 9.3% boost in area under the precision recall curve (AUPR) compared to the best individual model, as well as high F1 scores across different gene ontology categories, indicating the transferability of this approach.
Ge2Net: Enabling genetic medicine for diverse populations by inferring geographic ancestry along the genome Richa Rastogi (Stanford); Arvind Kumar (Stanford University); Helgi Hilmarsson (Stanford University); Daniel Mas Montserrat (Purdue University)*; Carlos Bustamante (Stanford University); Alexander Ioannidis (Stanford University) Personalized genomic predictions are revolutionizing medical diagnosis and treatment. These predictions rely on associations between health outcomes (disease severity, drug response, cancer risk) and correlated neighboring positions along the genome. However, these local genomic correlations differ widely amongst worldwide populations, necessitating that genetic research include all human populations. For admixed populations further computational challenges arise, because individuals of diverse combined ancestries inherit genomic segments from multiple ancestral populations. To extend population-specific associations to such individuals, their multiple ancestries must be identified along their genome (local ancestry inference, LAI). Here we introduce Ge2Net, the first LAI method to identify ancestral origin of each segment of an individual's genome as a continuous geographical coordinate, rather than an ethnic category, using a neural network, yielding higher resolution ancestry inference, and eliminating a need for ethnic labels.

(not included in the final proceedings at authors' request)

Meta-learning for Local Ancestry Inference Richa Rastogi (Stanford); Arvind Kumar (Stanford University); Helgi Hilmarsson (Stanford University); Daniel Mas Montserrat (Purdue University)*; Carlos Bustamante (Stanford University); Alexander Ioannidis (Stanford University) Genomic medicine provides improved medical risk prediction by harnessing associations between health outcomes and linked positions along the genome. However, linkage, and hence association, differs between populations, making it paramount to include a diverse variety of populations in medical studies. Furthermore, increasing global migration is giving rise to increasing admixture between populations, with individual genomes containing segments of different origin that thus exhibit different ancestral linkage patterns, and necessitating an algorithmic solution to identify ancestry with high resolution within a genome. Local Ancestry Inference (LAI) provides a solution by annotating regions of an individual’s genome with an ancestry label. However, the majority of the world’s populations are severely under-represented in genomic datasets with over 69% of populations having less than 15 individuals sequenced in public worldwide datasets. This high imbalance leads to poor accuracy estimates with standard LAI models. In order to improve the performance in scenarios with skewed distributions of samples, we formulate LAI as a few-shot learning problem. Building on recent advances in meta-learning and local ancestry methods like Ge2Net and LAI-Net, we propose a deep meta learning based solution to local ancestry inference.
XGFix: Fast and Scalable Phasing Error Correction with Local Ancestry Inference Helgi Hilmarsson (Stanford University); Arvind Kumar (Stanford University); Richa Rastogi (Stanford); Daniel Mas Montserrat (Purdue University)*; Carlos Bustamante (Stanford University); Alexander Ioannidis (Stanford University) Accurate phasing of genomic data is crucial for human demographic modeling and for identity-by-descent (IBD) analyses. It has been shown that by leveraging information about an individual's ancestry, the performance of current phasing algorithms can be substantially improved. In this paper we present XGFix, a method that uses local ancestry inference (LAI) to enhance the performance of these algorithms. We show that XGFix is much faster than existing methods that incorporate LAI for phasing, and that it does not rely on any assumptions about an individual's admixture time or the prior probability of phasing errors. Furthermore, we show that XGFix scales well with the number of ancestries, whereas existing methods do not and quickly become infeasible.