The Human Genome Project constituted one of the first large-scale research projects to generate vast amounts of biological information, what today we refer to as “Big Data”. Initiated in 1990, the project aimed to determine the sequence of the human genome in search of a deeper understanding of human biology, and it required the collaboration of many laboratories throughout the world. Since then, significant advances in technology have led to an explosion of high-throughput experiments, ranging from perturbations of one particular pathway, to sequencing the entire genomes of new organisms, to parsing the transcriptomes of the thousands of species of bacteria in the human microbiome. Dr. Nicholas Provart, a bioinformatician and plant biologist in the Department of Cell and Systems Biology at the University of Toronto, quips that Big Data “starts when Excel is at its limits”. At this point, more advanced computational methods are required to extract meaningful insights from the data. As such, with the increasing ease of collecting large datasets, there is a pressing need to develop digital platforms and computational methods that allow researchers to generate and tackle relevant hypotheses.
Soon after the initiation of the Human Genome Project, it became obvious that having a genome sequence alone was not enough to derive biological function; hence, other “omics” fields such as transcriptomics and interactomics started to be used to generate large datasets. There are currently many collaborative efforts to map the ‘-omes’ of different organisms, and many individual labs also perform their own high-throughput experiments. In both cases, much of the data produced is stored in online databases, which were created so that scientists could archive accumulated knowledge and have easy access to the data. Some common databases include the Protein Data Bank (PDB) and GenBank, but many others exist. In fact, every year, the January issue of Nucleic Acids Research reports a list of new or improved databases; the 2015 issue included an impressive 176, bringing the overall number to more than 1550 molecular biology databases. Hence, the Internet is rich with data that every scientist can use for hypothesis generation, and there have been many success stories. For example, one of the first genomes to be sequenced was that of Helicobacter pylori; BLASTing its genes against other databases identified several iron uptake systems that allow H. pylori to survive at low pH. More recently, a genome-wide association study (GWAS) that analyzed genetic variants of the human BCL11A locus identified an intronic enhancer that influences fetal haemoglobin levels and now represents an exciting therapeutic target for patients with sickle cell anemia.
Nonetheless, many reports suggest that biological scientists are lagging behind when it comes to using Big Data for hypothesis generation. In the case of transcriptomics, for example, journals now mandate that gene expression data used in publications be submitted to the Gene Expression Omnibus (GEO) database. There are currently over 2000 datasets in GEO that come up for the search term “immunology”, yet many researchers do not consult this data when searching for meaningful targets in their own projects. Part of the problem is that traditional scientists don’t know how to find data that would be useful, or how to analyze it once they do. One of the main reasons for this is the absence of standardized criteria for the submission and organization of research data online. In some cases, there will be several databases for one particular field, especially in relatively young areas such as microbiome studies. In addition, some databases, such as those recording protein-protein interactions, require manual curation, meaning that someone is responsible for combing through the literature to check that the interactions recorded in the database are actually supported by published evidence. This is a daunting task, and most databases don’t have the resources to employ people for it, so it is often the principal investigators working with Big Data who curate small parts of the database.
With the increasing diversity of sequencers and of protocols for RNA/DNA extraction, sequencing data faces particular challenges in the standardization and accessibility of metadata. In some cases, such as 16S rDNA sequencing for microbiome studies, there is no requirement to publish this metadata, so many researchers upload only their sequence reads. For transcriptome data, most journals enforce the disclosure of information about the microarrays (the MIAME guidelines), yet many investigators provide only the bare minimum needed to publish their data, which is often not enough for someone else to replicate it. Even for computationally proficient scientists, this is a challenge. Dr. Deborah Winter, a computational immunologist in the lab of Ido Amit at the Weizmann Institute, explains that “it is often most cost effective to generate your own data, partly because when you’re not interested in the exact comparison previously done, you have to rework the experimental setup anyway, but also because many times you have no idea what assumptions they’ve made when processing and analyzing their data.”
Having descriptive and organized datasets becomes even more important for clinical work, since samples from human patients are much harder to obtain than those from mice that are constantly breeding in a facility. Dr. Williams Turpin is a postdoctoral fellow working with Dr. Kenneth Croitoru at the University of Toronto to characterize, as part of the GEM project, the human susceptibility variants that lead to inflammatory bowel disease (IBD). He often interrogates online data, but indicates that in many cases this is not possible and he must contact the investigators for extra information. He points out that for his work, as for any GWAS, it is also important to know the genotype of each person, and this creates controversy as to whether an individual’s DNA fingerprint should be openly accessible. Nonetheless, combining large datasets in GWAS is crucial to gain the statistical power needed for biological significance.
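Why does combining datasets matter so much? The statistical intuition can be sketched with Fisher's method, one classic way to pool p-values for the same variant across independent cohorts. The cohorts and p-values below are purely illustrative, not drawn from any real study:

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function of the chi-squared distribution (even df only)."""
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def fisher_combine(pvalues):
    """Fisher's method: combine independent p-values for the same variant."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(stat, df=2 * len(pvalues))

# Three hypothetical cohorts each show only weak evidence at one locus,
# but pooled together the signal clears a much stricter threshold.
combined = fisher_combine([0.04, 0.06, 0.05])
print(f"combined p = {combined:.4f}")
```

Real GWAS meta-analyses typically use dedicated tools and inverse-variance weighting rather than this bare-bones version, but the principle is the same: more samples, more power.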
Scientists in the bioinformatics field, such as Dr. Provart, have created tools to make Big Data easier to visualize and understand. For plant biologists, Dr. Provart describes, “There are investigator-driven tools such as AtCAST that allow you to ask if there are particular transcription datasets out there that have an expression profile similar to yours, and this is quite useful.” In the case of protein-protein interactions, the BioGRID database, developed in Mike Tyers’ lab while he was at the University of Toronto, allows you to see whether your protein has been shown to interact with another protein in several publications; that way, scientists looking through it can have more confidence in a particular interaction. “Unifying efforts are also very important when dealing with Big Data,” Dr. Provart adds. “One of the first big unifying efforts was Gene Ontology (GO), a brilliant idea from Michael Ashburner to standardize gene products based on biological processes, cellular components and molecular functions so we can then do a Gene Ontology enrichment analysis and make sense of large sets of data.” Qiita is a recent example of a unifying effort to analyze and store microbiome data; however, as Dr. Turpin indicates, it requires compliance from users to upload their data in the required format. Dr. Michelle Smith, manager of the GEM project at Mount Sinai Hospital in Toronto, comments that many tools available online have been built by independent labs out of necessity rather than funding incentive, and although some go on to be updated and used by much of the scientific community, others stop running for lack of people to oversee them.
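The GO enrichment analysis Dr. Provart mentions boils down to a simple question: does a gene list contain more members of a GO category than chance would predict? The standard answer is a one-sided hypergeometric test, sketched below with invented gene counts (real tools such as those built on the GO annotations also correct for testing many terms at once):

```python
from math import comb

def enrichment_pvalue(k, term_genes, list_size, genome_size):
    """P(overlap >= k) if the gene list were drawn at random
    from the genome (one-sided hypergeometric test)."""
    total = comb(genome_size, list_size)
    p = 0.0
    for i in range(k, min(list_size, term_genes) + 1):
        p += comb(term_genes, i) * comb(genome_size - term_genes, list_size - i) / total
    return p

# Hypothetical numbers: 8 of the 50 genes in our list carry a GO term
# annotated on 200 of 10,000 genes genome-wide -- only ~1 overlap would
# be expected by chance, so the term looks strongly enriched.
p = enrichment_pvalue(8, term_genes=200, list_size=50, genome_size=10_000)
print(f"p = {p:.2e}")
```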
Despite the immense amount of data already available, hot-topic datasets may be hard to find online because investigators fear getting scooped. Dr. Smith suspects that as the need to verify Big Data experiments increases, datasets will become more readily available. Even when computational biologists such as Dr. Winter work towards making their data publicly available, it takes time and effort. As she explains, it is often hard to come up with a good visual representation, and what seems like a very good idea in theory may end up not being useful. This becomes even more complicated when scientists move away from microarrays to RNA-seq and ChIP-seq, or to non-model organisms, as there is no longer a reference tag or a gene identifier, respectively. In the case of the epigenome, there is not just one value per gene, but rather several values per base pair of the genome, which makes it a lot harder to archive and understand.
Probably the best efforts so far to help traditional scientists utilize Big Data are large-scale standardized projects that display the underlying data through easy-to-interpret visual interfaces. In immunology, the Immunological Genome Project (ImmGen) constitutes one of the most comprehensive of these efforts, aiming to map the transcriptional profiles of all immune cells and visualize them through an open-access web interface. Their website offers several tools to anybody who would like to use them. The most widely used tools, Gene Skyline and Gene Constellation, which allow you to check the expression of a particular gene in any immune cell, have been used over 400,000 times this year. By comparison, the population comparison and human/mouse comparison tools have seen 300 uses this year, and strain comparisons only 7. Another such project is Blueprint, launched in 2011, which aims to map the epigenomes of all human hematopoietic cells.
Finally, Big Data is fueling a holistic approach to scientific thinking that emphasizes the importance of Systems Biology. The “single gene – single function” thinking that many researchers endorse falls short when looking at massive amounts of data. In contrast, the goal for many scientists in the field of Systems Biology is to create a model of a system that one can perturb in order to predict how it changes. For some models that include a small set of genes and many years of experimental work, this is close to being accomplished: for example, some plant gene networks have been programmed such that the phenotype observed in silico reflects the phenotype in planta. Other models that involve much larger sets of genes are still very much in the preliminary stages. Both ImmGen and Blueprint aim to obtain the transcriptomes and epigenomes, respectively, of immune cells upon perturbation, which will hopefully allow for the modelling of immune systems down the line.
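The “perturb and predict” idea can be illustrated with a toy Boolean network, one of the simplest modelling formalisms used in Systems Biology. The three genes and their wiring here are invented for illustration (A activates B, B activates C, C represses A); deleting C collapses the network's oscillation into a steady state:

```python
def step(state):
    """One synchronous update: A activates B, B activates C, C represses A."""
    return {"A": not state["C"], "B": state["A"], "C": state["B"]}

def run_to_attractor(state, update, max_steps=50):
    """Iterate until a state repeats; return the cycle (the attractor)."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = update(state)
    return seen[seen.index(state):] if state in seen else []

start = {"A": True, "B": False, "C": False}
wild_type = run_to_attractor(start, step)
# In-silico perturbation: knock out gene C by clamping it off.
knockout = run_to_attractor(start, lambda s: {**step(s), "C": False})
print(len(wild_type), len(knockout))  # prints "6 1": oscillation vs. fixed point
```

Real gene-network models are vastly larger and usually use continuous dynamics fitted to experimental data, but the workflow is the same: build the model, perturb it in silico, and check the prediction at the bench.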
Many challenges lie ahead when dealing with Big Data, but with them also comes the prospect of understanding the mechanistic intricacies of the extremely complex field of biology. Without doubt, as more and more data becomes available, the need for every lab to understand bioinformatics will increase. Some universities are considering the establishment of a Bioinformatics Core, a pay-for-service facility that individual labs can use to help with their bioinformatics needs. Often, though, it is hard for consulting bioinformaticians to understand the needs of a particular question without the background of a specialist. Another strategy is to establish collaborations between bench scientists and computational biologists, so that they work hand in hand on a particular question; here, however, it is crucial to ensure that all parties involved are credited for their work, as computational biologists are sometimes viewed as lesser contributors to the intellectual work and subsequently omitted from the final product. It is encouraging to see that the pressing need to incorporate Big Data into the average biology lab has spurred some funding agencies to take action. Recently, the National Institutes of Health launched the Big Data to Knowledge (BD2K) initiative, which aims to enable the biomedical research community to use Big Data for research, and last year it announced $32 million in awards to mine Big Data. Dr. Winter envisions that in the future, bioinformatics will become an essential part of basic biology training, thereby breaking the Big Data barrier in biology.
Acknowledgment: This article was written with guidance from interviews with Dr. Provart, Dr. Winter, Dr. Turpin and Dr. Smith.
1. Correspondence with Liang Yang (ImmGen), Biology Software Engineer, Division of Immunology, Department of Microbiology & Immunobiology, Harvard Medical School, Boston, United States.
2. Interview with Dr. Deborah Winter. Postdoctoral Researcher, Department of Immunology, Weizmann Institute of Science, Rehovot, Israel.
3. Interview with Dr. Michelle Smith, Project Manager, Zane Cohen Centre for Digestive Diseases, Mount Sinai Hospital, Toronto, Canada.
4. Interview with Dr. Nicholas Provart. Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, Canada.
5. Interview with Dr. Williams Turpin, Postdoctoral Researcher, Department of Immunology, University of Toronto, Canada.
6. NIH grant review: http://grants.nih.gov/grants/guide/rfa-files/RFA-ES-15-004.html
7. Shay, T. & Kang, J. (2013). Immunological Genome Project and Systems Immunology. Trends in Immunology. 34 (12): 602–609.
8. Venter, J.C. et al. (2001). The Sequence of the Human Genome. Science. 291 (5507): 1304–51.
Mayra Cruz Tleugabulova