SA Branch of Statistical Society of Australia – July meeting

20 Aug 2020 12:41 PM | Vanaja Thomas (Administrator)

Unlocking the secrets in your DNA using machine learning and cloud computing

Dr Natalie Twine is team leader of the CSIRO Transformational Bioinformatics Group and gave a talk to the SA members in July. The group's vision is to improve health care through digital technology.

The potential information held within an individual is huge. A genome contains DNA, which holds the blueprint for every cell in the human body and is like a fingerprint: unique to an individual. CSIRO have developed technology platforms that use cloud computing to investigate DNA, for example VariantSpark and TRIBES.

VariantSpark

VariantSpark can be used to explore differences in genes between sick and healthy individuals. Diseases are often controlled by more than a single gene, with the genes interacting and each making a variable contribution. An example is motor neurone disease, or ALS, a serious condition leading to death within 2 years of diagnosis. A data set of the genomes of 22,000 healthy and sick individuals is available to investigate the genes underlying this disease. Each genome covers over 80 million gene variants, so it is a large data set. Using VariantSpark on this data set, a machine learning (ML) approach based on Random Forests was implemented to investigate interacting genes. Where traditional bioinformatics tools are underpowered, VariantSpark uses an Apache Spark cluster, which is parallelised and has capacity for adding extra computing power. This approach can be implemented on various platforms such as AWS or Microsoft Azure. There is an example in the demo notebooks on AWS Marketplace:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8497971343024764/53198984527781/2559267461126367/latest.html
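
To see why interacting genes are hard to find one variant at a time, a toy example may help. The sketch below is not VariantSpark and not a Random Forest. It builds a made-up cohort in which disease status is assumed to depend only on the combination of two variants, and the names (genotypes, sick, case_rate_gap) are hypothetical. Testing each variant on its own shows no difference between carriers and non-carriers, while the combined feature separates sick from healthy completely.

```python
from itertools import product

# Hypothetical toy cohort: every combination of three binary variants, repeated 25 times.
# Assumption for illustration: disease depends only on the interaction (XOR) of
# variants 0 and 1; variant 2 is unrelated noise.
genotypes = list(product((0, 1), repeat=3)) * 25
sick = [g[0] ^ g[1] for g in genotypes]


def case_rate_gap(feature, outcome):
    """Absolute difference in disease rate between carriers and non-carriers."""
    carriers = [o for f, o in zip(feature, outcome) if f == 1]
    others = [o for f, o in zip(feature, outcome) if f == 0]
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))


# Each variant examined on its own.
single_gaps = [case_rate_gap([g[i] for g in genotypes], sick) for i in range(3)]

# The two interacting variants examined together.
pair_gap = case_rate_gap([g[0] ^ g[1] for g in genotypes], sick)

print("single-variant gaps:", single_gaps)
print("interaction gap:", pair_gap)
```

With 80 million variants there are far too many pairs to test exhaustively, which is one reason a method that can pick up interactions, run on a cluster, is attractive.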

TRIBES

Disease is often inherited. As more genome data becomes available, genetic ancestry is making it possible to find the genes underlying disease. Determining whether there is distant ancestry between people with the disease is important, as it enables the identification of DNA that they have inherited from a common ancestor. These segments of DNA are known as identical by descent (IBD) segments. Exploring these IBD segments narrows the research and enables potential drivers of the disease to be targeted. TRIBES, which is available from GitHub, is a pairwise classifier and is very accurate compared to other tools. It enables the identification of new disease genes and is also able to connect different families through these IBD regions. In an ALS example, it was able to connect 19 of 25 families and identify 5 independent mutations associated with ALS. The identification of different mutations is particularly important as it means that drug therapy can be targeted. The identification of novel genes was also possible.
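
The idea of a shared segment can be illustrated with a much simplified sketch. It is not how TRIBES works: real IBD detection has to deal with phasing, genotyping error and genetic distance. Here two sequences are simply scanned for long runs of identical sites, and the function name shared_segments, the min_len threshold and the example sequences are all made up for illustration.

```python
def shared_segments(hap_a, hap_b, min_len):
    """Return (start, end) pairs, end exclusive, for runs of at least
    min_len consecutive sites where the two sequences carry the same allele."""
    if len(hap_a) != len(hap_b):
        raise ValueError("sequences must cover the same sites")
    segments = []
    run_start = None
    for i, (x, y) in enumerate(zip(hap_a, hap_b)):
        if x == y:
            if run_start is None:
                run_start = i  # a new matching run begins here
        else:
            if run_start is not None and i - run_start >= min_len:
                segments.append((run_start, i))
            run_start = None
    # Close a run that extends to the final site.
    if run_start is not None and len(hap_a) - run_start >= min_len:
        segments.append((run_start, len(hap_a)))
    return segments


# Made-up sequences for two individuals.
person_1 = "ACGTACGTAC"
person_2 = "ACGTTCGAAC"
print(shared_segments(person_1, person_2, min_len=3))
```

Raising min_len keeps only the longer runs, which mirrors the intuition that longer shared segments are less likely to match by chance.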

COVID-19

A final example of using technology was in the fight against the coronavirus, or COVID-19. CSIRO have been involved in replicating the virus, sequencing the genome, developing animal models and preclinical vaccine testing. They have found that the virus mutates approximately 25 times a year, compared to influenza, which mutates around 50 times per year. Identifying COVID-19 strains for vaccine testing is important. The virus may have mutated multiple times since leaving its original host, but it is only possible to capture the viruses in people who are tested, so intermediate strains are lost. CSIRO have used ML to compare the virus genomes isolated from people who have been tested and to identify how similar they are. This can aid in determining the most appropriate strain for vaccine testing.
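
One simple way to put a number on how similar two genome sequences are is to compare the short subsequences (k-mers) they contain. The sketch below is only an illustration of that idea, using a Jaccard similarity on k-mer sets. It is not the ML method described in the talk, and the function names, the choice of k and the sequences are assumptions made for the example.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=3):
    """Shared k-mers divided by all distinct k-mers found in either sequence."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 1.0  # two sequences too short to have k-mers are treated as identical
    return len(a & b) / len(a | b)


# Made-up short sequences; real viral genomes are far longer.
reference = "ACGTACGT"
variant = "ACGTTCGT"
print(jaccard_similarity(reference, reference))
print(jaccard_similarity(reference, variant))
```

A score of 1.0 means the two sequences contain exactly the same k-mers, and lower scores indicate more divergent strains.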

CSIRO have developed a web service which monitors new genome sequences of COVID-19. Data from around the world can be uploaded to an AWS platform, and each day an analysis can be run comparing strains, with results available on a website. This is enabled by serverless computing, which is agile: resources can be shared when the service is not in use, which suits software that is used sporadically.

The final impact is detecting community spread of the virus, where the different strains are monitored. They are able to track mutations to see how they may affect transmission, virulence and symptoms.

The take-home message from the talk was that there is a massive explosion in data. Information technology can make a big difference in the fight against disease.

By Helena Oakey
