By Aditi Goyal, Statistics, Genetics and Genomics, ’22
Author’s Note: I wrote about this topic after being introduced to the idea through a speaker series. I think modern computer science, genetics, and statistics meet at a fascinating crossroads, and the applications are simply astounding.
Next Generation Sequencing (NGS) has revolutionized the field of clinical genomics and diagnostic genetic testing. Now that sequencing technologies are easy to access and results can be obtained relatively quickly, many scientists and companies rely on this technology to learn more about genetic variation. There is just one problem: magnitude. NGS and other genome sequencing methods generate data sets on the order of billions of data points. As a result, the simple pairwise comparisons of genetic data that have served scientists well in the past cannot be applied in a meaningful manner to these data sets [1]. Consequently, in efforts to make sense of these data sets, artificial intelligence (AI), particularly its subfields of machine learning and deep learning, has made its way into the biological sciences. Using AI and its adaptive nature, scientists can design algorithms that identify meaningful patterns within genomes and highlight key variations. Ideally, with a large enough training data set and a powerful enough computer, AI will be able to pick out significant genetic variations, such as markers for different types of cancer or multi-gene mutations that contribute to complex diseases like diabetes, and provide geneticists with the information they need to eradicate these diseases before they manifest in the patient.
The formal definition of AI is simply “the capability of a machine to imitate intelligent human behavior” [2]. But what exactly does that imply? The key feature of AI is that it can make decisions, much like a human would, based on previous knowledge and the results of past decisions. AI algorithms are designed to take in information, generate patterns from that information, and apply those patterns to new data about which we know very little. Using its adaptive strategies, AI is able to “learn as it goes,” fine-tuning its decision-making process with every new piece of data provided to it, which could eventually make it a remarkably powerful decision-making tool. While this may sound highly futuristic, AI has been used for several years in applications throughout our daily lives, from the self-driving cars being tested in Silicon Valley to the voice recognition programs available on smartphones today. Most chess fans will remember the iconic “Deep Blue vs. Kasparov” match, in which an IBM supercomputer, which grew out of a project started by Carnegie Mellon students, ran an AI algorithm designed to compete against the reigning world chess champion [3]. Back then, in 1997, this algorithm was revolutionary, as it was one of the major signs that AI could compete with human intelligence [4]. AI therefore has immense potential to be applied in the field of genomics.
Before we can begin to understand what AI can do, it is important to understand how AI works. Generally speaking, there are two ways AI algorithms are developed: supervised and unsupervised learning. The key difference between the two is that in supervised learning, the data sets we provide for AI to “learn” from are ones that we have already analyzed and understand. In other words, we already know what the output should be before providing the data to AI [5]. The goal, therefore, is for the AI algorithm to generate an output as close to our prior knowledge as possible. Eventually, after training on larger and more complex data sets, the algorithm will have adjusted itself to the point where it does the job of the data scientist, but on a much larger scale. Unsupervised learning, on the other hand, does not have a predefined output. So, in a sense, the user is learning along with the algorithm. This technique is useful when we want to find patterns or define clusters within our data set without predefining what those patterns or clusters will be. In genomic studies, scientists often use unsupervised learning to analyze their data sets. This can be preferable to supervised learning, since the gigantic data sets produced by omics studies are difficult to fully understand, and therefore difficult to label, in advance.
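To make the distinction concrete, here is a minimal Python sketch that is not taken from any of the cited studies. The samples, the "healthy" and "tumor" labels, and all function names are invented for illustration, and real genomic data would have thousands of features rather than two. The supervised half uses a nearest-centroid classifier and the unsupervised half uses a small k-means routine; both are simple stand-ins chosen for brevity, not the methods used in practice.

```python
# Toy data: each sample is a hypothetical pair of expression levels for two genes.
# These numbers are invented for illustration and are not real measurements.
samples = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["healthy", "healthy", "healthy", "tumor", "tumor", "tumor"]


def mean_point(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Supervised learning: the correct label of every training sample is known.
def fit_centroids(samples, labels):
    """Learn one average point (centroid) per known label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return {label: mean_point(points) for label, points in groups.items()}


def predict(centroids, point):
    """Assign a new point the label of the closest learned centroid."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], point))


# Unsupervised learning: no labels are given, only the number of groups k.
def kmeans(samples, k, iterations=10):
    """Group samples into k clusters and return a cluster index per sample."""
    # Deterministic start: first sample, then the sample farthest from chosen centers.
    centers = [samples[0]]
    while len(centers) < k:
        centers.append(max(samples, key=lambda s: min(sq_dist(s, c) for c in centers)))
    assignment = []
    for _ in range(iterations):
        # Step 1: attach each sample to its nearest center.
        assignment = [min(range(k), key=lambda j: sq_dist(s, centers[j])) for s in samples]
        # Step 2: move each center to the average of its members.
        for j in range(k):
            members = [s for s, a in zip(samples, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean_point(members)
    return assignment


centroids = fit_centroids(samples, labels)
new_label = predict(centroids, (3.9, 4.1))   # supervised: returns a named label
clusters = kmeans(samples, 2)                # unsupervised: returns unnamed group indices
```

In the supervised case the algorithm can name its answer because the labels told it what each group means. In the unsupervised case it only reports which samples belong together, and interpreting those groups is left to the scientist.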
Some of the clearest applications of AI in biology are in cancer biology, especially for diagnosing cancer [6]. “AI has outperformed expert pathologists and dermatologists in diagnosing metastatic breast cancer, melanoma, and several eye diseases. AI also contributes to innovations in liquid biopsies and pharmacogenomics, which will revolutionize cancer screening and monitoring, and improve the prediction of adverse events and patient outcomes” [7]. By providing a data set of genomic or transcriptomic information, we can develop an AI program that is designed to identify key variations within the data. The problem lies, primarily, in providing the initial data set.
In the 21st century, an era of data hacks and privacy breaches, the general public is not keen to release their private information, especially when this information contains everything about their medical history. Because of this, “Research has suffered for lack of data scale, scope, and depth, including insufficient ethnic and gender diversity, datasets that lack environment and lifestyle data, and snapshots-in-time versus longitudinal data. Artificial intelligence is starved for data that reflects population diversity and real-world information” [8]. The ultimate goal of using AI is to identify markers and genetic patterns that can be used to treat or diagnose a genetic disease. However, until we have data that accurately represents the patient, this cannot be achieved. A study in 2016 showed that 80% of participants in genome-wide association studies (GWAS) were of European descent [9]. At first glance, the impact of this may not be clear. But when a disease such as sickle cell anemia is considered, the disparity becomes more relevant. Sickle cell anemia is a condition in which red blood cells are not disk-shaped, as they are in most individuals, but sickle-shaped, which impairs their ability to carry oxygen around the body. This condition disproportionately affects people of African descent, so it is not reasonable to expect to find a genetic marker or cure for this disease when the data set does not accurately reflect this population.
Another key issue is privacy law. Genomic data released to a federal agency such as the NIH for research purposes is de-identified, meaning that the patient is made anonymous. However, studies have shown that people can be re-identified using their genomic data, the remaining identifiers still attached to their genome, and the availability of genealogical data and public records [10]. Additionally, once your data is obtained, policies like the Genetic Information Nondiscrimination Act do protect you in some ways, but these pieces of legislation are not all-encompassing and still leave the door open to some forms of genetic discrimination, such as in school admissions. The agencies conducting research have the infrastructure to store and protect patient data, but in an era of data leaks and security breaches, there are serious concerns that need to be addressed.
Ultimately, AI in genomics could transform the world within a matter of days. Modern biology, defined by the innovation of NGS technologies, has redefined what is possible. Every day, scientists all around the world generate data sets larger than ever before, making a system to understand them all the more necessary. AI could be the solution, but before any scientific revolution happens, it is vital that the legislation protecting citizens and their private medical information be updated to reflect the technology of the times. Our next challenge as a society in the 21st century is not developing the cure for cancer or discovering new secrets about the history of human evolution, but rather developing a system that will support and ensure the protection of all people involved in this groundbreaking journey for decades to come.
References
- https://www.nature.com/articles/s41576-019-0122-6
- https://www.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/
- https://en.chessbase.com/post/kasparov-on-the-future-of-artificial-intelligence
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.278.5274&rep=rep1&type=pdf#page=41
- https://www.nature.com/articles/s41746-019-0191-0
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6373233/
- https://www.genengnews.com/insights/looking-ahead-to-2030/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5089703/
- https://www.genome.gov/about-genomics/policy-issues/Privacy