For most of its history, biology has been a qualitative, rather than quantitative, field. Scientists studied a single patient or a single gene or a single trait, because that’s what technology allowed for.
Today, DNA sequencing advances have made biology a data-rich field for the first time, and the volume of genomic data available is growing exponentially. With better, cheaper tools, you can now look across hundreds of thousands or even millions of patients, studying not just one gene but all of their genes.
Biology & Big Data

Massive new studies are launching all the time; just a few examples include the Million Veteran Program, the Autism Speaks MSSNG Project, and the Resilience Project. The bottleneck has now shifted from DNA sequence production to data analysis and management. Processing millions of genomes, and handling massive all-by-all comparisons of genomic information across them, takes more compute power than even the best university or private clusters offer. This is the kind of data-quantity phase transition that Google has seen before with search, video, and email. As scientists scale up their studies and query thousands or millions of genomes, they will need more scalable technologies than a local compute cluster. Cloud computing provides a useful, elastic resource for manipulating enormous datasets without the time or cost of moving that data from place to place. Our goal is to help the life science community organize the world's genomic information and make it accessible and useful. Through our extensions to Google Cloud Platform, you can apply the same technologies that power Google Search and Maps to securely store, process, explore, and share large, complex biological datasets, whether you are working with one genome or one million.
Product Manager (Google Genomics)
At Google we’ve seen a few major revolutions based on data. Biology is reaching a similar transition where data challenges and technology needs are changing rapidly. When you’re operating at a small scale, using a compute cluster with a few nodes works just fine. Scale that up by a factor of 1,000 or 1,000,000 and you really need different kinds of technologies to work with the data. Tackling these large studies — looking at hundreds of thousands or even millions of people — is essential for understanding complex genetic variability. There’s a need for algorithms that can handle these large sample sizes and find the signal in noisy data. The biggest thing we’re hoping to achieve with Google Genomics is to change the kinds of questions people can ask. Right now, there are lots of custom-built APIs for bioinformatics scientists and programmers,
but they aren’t interoperable. We developed an implementation of the API from the Global Alliance for Genomics and Health (GA4GH) to allow interoperability across multiple genome repositories. Without that kind of tool, scientists have to download data via FTP and query it on a local cluster, which can take weeks. Imagine if infrastructure were a solved problem. Imagine if massive, complex data analysis were a solved problem. I think we’re going to unleash a tremendous amount of creative insight into science by making this genomic information available quickly and at scale. The exciting thing is that we don’t even know what questions people will ask. The science that’s going to come out of this will be amazing. I can imagine a future where a high school science fair project can include analyses on cohorts of millions of patients and find effects that no PhD researcher at Harvard or Stanford ever thought to look for.
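To make the interoperability point concrete: a GA4GH-style repository exposes a variants search endpoint, so the same request works against any compliant server regardless of which institution hosts the data. The sketch below builds such a request body in Python. The variant-set ID and coordinates are hypothetical placeholders, and the field names follow the published GA4GH v0.5 schemas as an assumption; any particular server may differ in detail.

```python
import json

def build_variant_search_request(variant_set_id, reference_name, start, end):
    """Build a GA4GH-style /variants/search request body.

    Field names follow the GA4GH v0.5 schemas (SearchVariantsRequest);
    this is a sketch, not a client for any specific server.
    """
    return {
        "variantSetIds": [variant_set_id],  # which dataset(s) to search
        "referenceName": reference_name,    # e.g. a chromosome name
        "start": start,                     # 0-based inclusive start
        "end": end,                         # 0-based exclusive end
    }

# Hypothetical example: a small window on chromosome 17.
request_body = build_variant_search_request(
    "example-variant-set", "17", 41196311, 41277500
)
payload = json.dumps(request_body)
print(payload)
```

With a live GA4GH endpoint, this payload would be POSTed to `<base-url>/variants/search`; because the request shape is standardized, a client written once can query multiple repositories without FTP downloads or custom per-site code.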