ABSTRACT.- Genome-enabled prediction of complex traits aims to predict a measurable characteristic of an organism from its genetic information. In the present work we address diverse traits and organisms, including yeast growth, wheat yield, Jersey bull fertility, and various milk-related traits in Holstein cattle. We benchmark several popular machine learning models: Bayesian and penalized linear regressions, kernel methods, and decision tree ensembles. Through exhaustive hyperparameter tuning we outperform state-of-the-art results on most datasets. We also evaluate two encoding techniques for the input data and perform ablation studies to assess robustness to the elimination of genetic markers (i.e., input features). We also explore different deep learning architectures for this task. We propose and evaluate Convolutional Neural Network (CNN) architectures, showing that residual connections improve performance but that in some cases Fully Connected Networks outperform CNNs. We link this to the fact that absolute positions are relevant in genomes, so the translational equivariance of CNNs may not be an adequate inductive bias for this problem. We evaluate Graph Neural Network (GNN) architectures by formulating trait prediction as a node regression problem on a population graph, where each node represents an individual and each edge represents an association between the genetic information of two individuals. We evaluate the transferability of these graph-based models and find that the extent to which they exploit neighborhood information is limited. By combining CNN and GNN architectures, we outperform all other models for predicting milk yield in Holstein cattle.
Methods based on neural networks can be computationally demanding when applied to high-density chips or sequence data, even more so when fully connected layers are used. To overcome this problem, we propose to obtain a new representation of the input vector from the intermediate representation (code) of an Autoencoder (AE). We are currently benchmarking the performance of this approach. Another common issue with these databases is missing data, or the need to combine chips with different numbers of SNPs. Again, we propose to use AEs to impute the missing values. One of the main focuses of this work was to explore the feasibility of employing modern deep learning architectures in genomic prediction. In this regard, it was possible to train highly over-parameterized architectures and still obtain good generalization. For some datasets and traits, these models outperform all others. However, this did not hold for all the models, traits and datasets studied. Moreover, whether the gains in performance outweigh the increase in model size, and thus in training and inference cost, and the loss of interpretability calls for further discussion.
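As an illustration of the population graph on which trait prediction is posed as node regression, the following is a minimal sketch, not the construction used in this work. The abstract does not state how edges are defined, so the sketch assumes that each individual is linked to its k most similar individuals under a centered marker similarity. The function name `population_graph`, the parameter `k`, and the random 0/1/2 genotype matrix are all hypothetical.

```python
import numpy as np

def population_graph(genotypes, k=3):
    """Build a symmetric 0/1 adjacency matrix over individuals.

    Assumption (not from the source): an edge links each individual to its
    k most similar individuals, with similarity computed from centered markers.
    """
    X = np.asarray(genotypes, dtype=float)      # rows: individuals, cols: markers
    Xc = X - X.mean(axis=0)                     # center each marker column
    sim = Xc @ Xc.T / X.shape[1]                # individual-by-individual similarity
    np.fill_diagonal(sim, -np.inf)              # exclude self-loops from the ranking
    n = X.shape[0]
    neighbors = np.argsort(-sim, axis=1)[:, :k] # indices of the k most similar individuals
    adj = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbors.ravel()] = 1            # directed k-nearest-neighbor edges
    return np.maximum(adj, adj.T)               # symmetrize: keep an edge if either end chose it

# Hypothetical toy data: 6 individuals, 20 markers coded as 0/1/2 allele counts.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(6, 20))
adj = population_graph(genotypes, k=2)
```

In such a setup, a GNN would take the marker vectors as node features and `adj` as the graph structure, and would predict one trait value per node. The choice of similarity measure and of `k` controls how much neighborhood information is available to the model.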
Instituto Nacional de Investigación Agropecuaria