The ability to analyze large data sets ("Big Data") is an increasingly important skill in modern science. In Biochemistry, the increased volume and velocity of data is particularly evident in the rapid expansion of biological databases. We present a modular bioinformatics course that surveys the analysis of genomic data for advanced undergraduates. Research activities include genome scanning for endogenous retroviruses, annotating genomic sequences and a brief exploration of programming in R. A summative poster session was used to disseminate the students' work. This course is amenable to remote or online instruction. Supplemental materials provided include a schedule and outline. This article reports a session from the virtual international 2021 IUBMB/ASBMB workshop, "Teaching Science on Big Data."
Keywords: course‐based (CUREs); bioinformatics; curriculum design development and implementation; genome research; undergraduate research; big data
Recently, ASBMB offered a virtual conference called "Teaching Science with Big Data," emphasizing the importance of incorporating manipulation of large datasets into curricula.1 As of 2021, the NCBI database had accumulated over 2.3 × 10
Several papers have explored methods for teaching bioinformatics to undergraduates,5 including several from this journal.6–10 Course‐based undergraduate research experiences (CUREs) have been shown to increase learning,11 increase persistence in science majors,12,13 and have positive impacts on appreciation for science and career aspirations.14
At our small, primarily undergraduate institution, we established a one‐semester course built from modular units so that students could gain basic hands‐on skills with database searches, genome annotation, and programming on the BASH command line and in R. Three of these modules are based on consortia that support bioinformatics: HHMI SEA‐PHAGES, GEP, and UNH T3 (see Acknowledgements and Supplemental Materials). Introductory content was provided on a need‐to‐know basis, so that classes could be conducted primarily as exploratory workshops. Our goal was to provide students with broad exposure to bioinformatics and enable them to analyze big data sets of viruses, flies, and humans. Pedagogical advantages include problem‐based learning, visualization of large data sets and use of bioinformatics tools. The course design is amenable to online/remote instruction.
The workflow is shown in Figure 1. For unit 1, students scanned a cross‐species database for potentially oncogenic endogenous retroviruses (ERVs) bearing sequence similarity to known human ERVs (Figure S1). For unit 2, students fully annotated phage genomes and submitted them to GenBank.15 For unit 3, students fully annotated Drosophila contigs. For unit 4, students assembled microbial sequencing reads and tentatively assigned a genus. For unit 5, students were introduced to the programming language R16,17 (available for free download18) and observed visualization of pre‐published data sets in R (Figure S2). Deliverables included novel endogenous retroviral relationships and assembled and annotated genomes. Student progress was monitored weekly using shared electronic notebooks.
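As an illustration of the kind of similarity screen described for unit 1, the sketch below filters BLAST hits saved in the standard 12‐column tabular format (`-outfmt 6`). It is a minimal example under stated assumptions, not the procedure used in the course: the sample rows, the sequence names and the thresholds for identity, alignment length and E‐value are all hypothetical and chosen only for demonstration.

```python
import csv
import io

# Default field order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# Made-up rows for illustration only; these are NOT results from the course.
SAMPLE_HITS = (
    "HERV_query\tspeciesA_contig1\t92.5\t800\t60\t0\t1\t800\t1200\t1999\t1e-150\t900\n"
    "HERV_query\tspeciesB_contig7\t71.0\t150\t43\t1\t10\t159\t40\t189\t2e-08\t60.2\n"
    "HERV_query\tspeciesC_contig3\t88.0\t600\t72\t0\t100\t699\t5000\t5599\t3e-90\t650\n"
)


def filter_hits(tsv_text, min_identity=80.0, min_length=500, max_evalue=1e-20):
    """Return (subject, identity, length, evalue) tuples that pass all cutoffs.

    The default cutoffs are arbitrary placeholders, not recommended values.
    """
    hits = []
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    for row in reader:
        pident = float(row["pident"])
        length = int(row["length"])
        evalue = float(row["evalue"])
        if pident >= min_identity and length >= min_length and evalue <= max_evalue:
            hits.append((row["sseqid"], pident, length, evalue))
    # Strongest matches (highest percent identity) first.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


strong_hits = filter_hits(SAMPLE_HITS)
for subject, pident, length, evalue in strong_hits:
    print(f"{subject}\t{pident:.1f}% identity over {length} bp (E = {evalue:g})")
```

In practice the text would come from a file exported from a BLAST search, and the cutoffs would need to be chosen and justified for the question at hand.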
We have encountered challenges in implementation. For example, there has been significant variation in student preparation for the course. In addition, there is always a trade‐off between breadth and depth of content. We typically go into more depth with annotation and BASH, but the R programming is a quick overview (Supplemental Materials).
Student feedback has been largely positive. For the 2021 cohort, 100% of the students rated the overall quality of learning they did in the course as "excellent," as did 86% of the 2018 cohort. The rate at which students grasped the material varied: some commented that the pace was too fast and one commented that it was too slow. One student commented, "Every subject that was presented was explained very thoroughly which was very helpful. There were a lot of activities which made it easier to learn. The research project was a good way to incorporate everything we learned during the semester and show how much we knew." Another noted that the shared electronic notebooks were "...helpful in keeping track of data and the steps we did in each program."
Instructors should have a working knowledge of NCBI databases, programming in BASH and R, and viral genome annotation. Students need only be familiar with basic genetics and DNA structure at the sophomore level or above.
There are abundant opportunities for other activities to extend the learning, and these projects are often scalable. For example, students could explore the ERV sequences of an animal family, compare how the sequences differ and generate hypotheses about common ancestors and rates of mutation. Many genomic or transcriptomic sequence records have been analyzed for only one variable and can be explored further. Genes of unknown function can also be explored in annotated genomes.
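One simple way students might quantify how aligned ERV sequences differ is the uncorrected p‐distance, the proportion of compared sites at which two sequences disagree. The sketch below is offered only as a starting point: the sequences and species labels are invented toy data, the input is assumed to be already aligned, and gapped columns are skipped. Model‐based corrections for multiple substitutions, which matter when inferring mutation rates, are outside its scope.

```python
from itertools import combinations

# Hypothetical, pre-aligned toy sequences (equal length, '-' marks a gap).
ALIGNED = {
    "species_A": "ATGGCCTTAGGA",
    "species_B": "ATGGCCTTGGGA",
    "species_C": "ATGACC-TGGCA",
}


def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring columns where either has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


# Pairwise distances for every pair of sequences, keyed by sorted name pair.
distances = {
    (name1, name2): p_distance(ALIGNED[name1], ALIGNED[name2])
    for name1, name2 in combinations(sorted(ALIGNED), 2)
}

for (name1, name2), dist in distances.items():
    print(f"{name1} vs {name2}: p-distance = {dist:.3f}")
```

Smaller distances suggest a more recent common origin for the two sequences, which gives students a concrete quantity to reason about when forming hypotheses.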
In conclusion, these big data modules provide upper‐level undergraduates the opportunity to explore a range of modern bioinformatics tools. Students gain hands‐on experience with data manipulation, analysis and the boundaries of current knowledge. These students engage in authentic research experiences and contribute to the breadth of understanding of the vast sequencing data available.
Support was provided by Franklin Pierce University College of Health and Natural Sciences. HERV searches were modeled on those in Villesen et al.19 Phage genome sequences and annotation guidance were provided by the HHMI SEA‐PHAGES program.20–22 The Drosophila annotation was done under the auspices of the Genomics Education Partnership (GEP) using contigs that they provided.23 The command line BASH module is based on curriculum developed at the University of New Hampshire for the T3 program under the direction of W. Kelley Thomas with significant technical assistance from Joseph Sevigny and Devin Thomas (IPERT research education grant from the National Institute of General Medical Sciences at the National Institutes of Health [R25 GM125674]). The command line module was generously hosted at UNH on their teaching server, "Ron." Thanks to the 2018 and 2021 FPU Genome Research classes.
Appendix S1. Supplemental Materials: outline of topics, timing of modules, resources.
Figure S1. Human endogenous retroviruses (HERVs) identified in other species. (A) A phylogenetic tree with blue arrows indicating species that students identified with genomes containing similar endogenous retrovirus sequences. (B) A cartoon of the sequence alignment from NCBI BLAST showing regions of the HERVs found in the database. (C) A sequence alignment from NCBI BLAST showing regions of identical sequence between HERVs and sequences in a parasitic worm.
Figure S2. Screenshots of projects. (A) Genome map ("phamerator map") of annotated bacteriophage genome. Green boxes represent forward strand genes, red boxes represent reverse strand genes. Scale in kilobases. (B) Screenshot of UNH teaching server Ron displaying large datasets. (C) UCSC Genome Browser displaying the Drosophila contig analyzed. Each horizontal line displays a genome browser evidence track.
By Evan N. Bennett and Shallee T. Page