Next Generation Sequencing (NGS): a beginner’s guide
This blog post provides an overview of the essential concepts needed to understand next generation sequencing (NGS). It delves into the historical context of the discovery of DNA, explores the main sequencing methodologies, highlights popular applications of NGS, and outlines the fundamental procedural steps: sample collection, library preparation, cluster generation, sequencing, and the crucial data analysis phase. By the end of this article, readers will have a solid foundation in the fundamentals of NGS and valuable insight into its scientific and technological complexity.
The discovery of DNA
DNA was first identified in 1869 by the Swiss physiological chemist Friedrich Miescher, who called it “nuclein”. Unfortunately for him, his discovery attracted little interest for the first 50 years. In 1919, the Russian biochemist Phoebus Levene was the first to identify how RNA and DNA molecules were put together. Later, Erwin Chargaff expanded on Levene’s work and showed that the amount of adenine (A) was nearly equal to the amount of thymine (T), and the amount of guanine (G) nearly equal to the amount of cytosine (C), in all studied organisms. This insight that A=T and G=C, combined with the X-ray crystallography work of Rosalind Franklin and Maurice Wilkins, contributed to Watson and Crick’s 1953 proposal that DNA consists of a double helix.
From Sanger sequencing to next generation sequencing to third-generation sequencing
Sanger sequencing was developed by Frederick Sanger in 1977. It was the first widely used method of DNA sequencing and is based on the random incorporation of chain-terminating nucleotides during in vitro replication. It was a very labour-intensive technique before it was commercialised in 1986. Sanger sequencing is still frequently used today for small-scale projects.
The development of the sequencing by synthesis method, better known as next generation sequencing, marked the start of an explosion in DNA/RNA sequencing. This technique allows massively parallel sequencing of millions of fragments simultaneously. At first, the cost of sequencing was enormous, but it has since become far more affordable. Although several platforms exist, the Illumina clonal bridge amplification method (see this video to understand how it works) is currently the most popular.
A few years later, the newest sequencing technology appeared: single-molecule sequencing, also known as third-generation sequencing. The most popular platform is Oxford Nanopore’s technology, in which the DNA or RNA molecule passes through a nanoscale pore and the machine measures the changes in electrical current as the molecule passes through.
More recently, fourth-generation sequencing methods have been described that can read the nucleic acid composition directly in fixed cells and tissues.
Popular applications for NGS
NGS has become an invaluable tool for scientific research and clinical diagnostics. Examples include:
- Rapid sequencing of whole genomes
- Deep sequencing of target regions
- Transcriptome analysis (RNA-Seq) to study gene expression
- Analysis of epigenetic factors, such as targeted or genome-wide DNA or RNA methylation patterns
- Analysis of DNA-protein interactions
- Analysis of RNA-protein interactions
- Sequencing of cancer samples to study rare somatic variants, tumour subclones, etc.
- Studies of the human microbiome
- Identification of novel pathogens
How does NGS sequencing work?
Here, I’ll explain the basic principle. In later blogs, I will discuss some specific methods in more detail.
- Sample preparation
DNA or RNA is extracted from the selected samples (blood, tissue, cultured cells, …) and checked for quality using standard methods such as NanoDrop, TapeStation or Bioanalyzer.
Depending on the method, additional sample preparation steps might be necessary, such as reverse transcription of RNA into cDNA for some types of RNA library preparation (although some library preparation protocols include this step).
For RNA-Seq experiments, it’s usually necessary to get rid of the ribosomal RNA, which accounts for roughly 80% of the total RNA. This can be done by removing the rRNA directly (rRNA depletion) or by enriching for polyadenylated transcripts (polyA selection). We’ll describe these methods and their pros and cons in a different blog.
- Library preparation
There are many different types of library preparation, but the basic principles are usually the same.
The DNA (or cDNA) has to be processed into relatively short fragments (100-800 bp). This requires random fragmentation of the DNA or cDNA by enzymatic treatment or sonication. After end-repair, short adapter sequences are ligated to both ends of the fragments; together with indexes, these allow multiple samples to be pooled, also known as multiplexing (note: some library prep kits have a combined fragmentation/adapter ligation step).
A size selection step is often performed next, usually by gel electrophoresis or magnetic beads. This will remove all fragments that are too short or too long for sequencing.
Usually, a library enrichment/amplification step (PCR) is performed next. In most protocols, this is combined with adding indexes so that samples can be pooled together on the sequencer. In a later blog, I’ll explain why you need adapters and indexes.
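If it helps to see the layout in code, here’s a minimal Python sketch of what a finished library fragment looks like: adapters flanking the insert, with a sample index in between. All sequences below are made-up placeholders, not from any real kit.

```python
# Minimal sketch of a sequencing-ready library fragment.
# All sequences are made-up placeholders, not from a real library prep kit.
ADAPTER_5 = "ACACGACGCTCTT"   # placeholder 5' adapter
ADAPTER_3 = "AGATCGGAAGAGC"   # placeholder 3' adapter
SAMPLE_INDEX = "ACGTTACG"     # placeholder 8 bp index identifying the sample

def build_library_fragment(insert_seq: str, index: str = SAMPLE_INDEX) -> str:
    """Assemble adapter + index + insert + adapter, as after ligation and PCR."""
    return ADAPTER_5 + index + insert_seq + ADAPTER_3

# A mock 160 bp insert, just to show the final structure.
print(build_library_fragment("ATGCGTAC" * 20))
```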
After a final clean-up, usually with magnetic beads, the final libraries can undergo quality checks using TapeStation or Bioanalyzer. Although not required, it’s good practice to accurately determine the final quantity using qPCR.
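As a concrete example of library quantification, here’s a small Python helper for the standard conversion from a measured concentration (in ng/µL) and the average fragment length to molarity (in nM), using ~660 g/mol as the average molecular weight of one double-stranded DNA base pair. The example numbers are made up.

```python
def library_molarity_nM(conc_ng_per_ul: float, avg_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration to molarity.

    nM = (ng/uL) / (660 g/mol per bp * fragment length in bp) * 1e6
    660 g/mol is the approximate molecular weight of one dsDNA base pair.
    """
    return conc_ng_per_ul / (660.0 * avg_fragment_bp) * 1e6

# Example: a 4 ng/uL library with an average fragment size of 400 bp
print(f"{library_molarity_nM(4.0, 400):.2f} nM")  # ~15.15 nM
```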
- Cluster generation
Methods differ depending on the selected platform and chemistry, but as Illumina’s technology is the most widely used, I’ll focus on it here. Before the actual sequencing reaction can happen, the library must be attached to a solid surface, the flow cell, and clonally amplified so that the signal from each fragment is strong enough to be detected during sequencing. This video about paired-end sequencing is also useful for understanding bridge amplification.
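As a back-of-the-envelope illustration of why clonal amplification boosts signal, here’s a tiny Python sketch of idealised amplification, where each cycle multiplies the number of copies in a cluster. Real bridge amplification is less than perfectly efficient, so treat this purely as a conceptual model.

```python
def cluster_copies(cycles: int, efficiency: float = 1.0) -> float:
    """Idealised clonal amplification: each cycle multiplies the copy number
    by (1 + efficiency). With perfect efficiency, copies double every cycle."""
    return (1 + efficiency) ** cycles

# With perfect doubling, ~10 cycles already turn a single fragment into a
# cluster of ~1,000 identical copies, making its signal detectable.
print(cluster_copies(10))  # 1024.0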
- Sequencing reaction
All the fragments are sequenced at the same time: every cycle, one base is incorporated and detected on each fragment. When the run has finished, the data can be “demultiplexed” to generate a FASTQ file (or two FASTQ files in the case of paired-end sequencing) with the raw data for each sample.
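To make “demultiplexing” concrete, here’s a minimal Python sketch that bins reads from a pooled FASTQ file into per-sample files based on the index in the read header. In practice this is done by the sequencer’s own software (e.g. Illumina’s bcl2fastq or BCL Convert), which also handles index mismatches; this sketch assumes the standard Illumina header layout, where the index is the last colon-separated field, and the sample-to-index mapping is made up.

```python
# Hypothetical mapping of index sequences to sample names (for illustration).
INDEX_TO_SAMPLE = {"ACGTTACG": "sampleA", "TGCAAGCT": "sampleB"}

def demultiplex(pooled_fastq: str) -> None:
    """Split a pooled FASTQ into per-sample FASTQ files based on the index
    carried in each read header (last colon-separated field)."""
    handles = {}
    with open(pooled_fastq) as fh:
        while True:
            record = [fh.readline() for _ in range(4)]  # header, seq, '+', quals
            if not record[0]:
                break  # end of file
            index = record[0].strip().rsplit(":", 1)[-1]
            sample = INDEX_TO_SAMPLE.get(index, "undetermined")
            if sample not in handles:
                handles[sample] = open(f"{sample}.fastq", "w")
            handles[sample].writelines(record)
    for handle in handles.values():
        handle.close()

demultiplex("pooled_reads.fastq")  # hypothetical input file
```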
- Data analysis
Data analysis starts with a wide range of quality checks. You can see a few examples of our quality checks here. The good-quality reads are then mapped against the reference genome of the relevant species. From here, each analysis has to be tailored to the research project to extract meaningful information from the data.
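For readers who want a quick peek at their raw data before running a full QC tool such as FastQC, here’s a minimal Python sketch that computes the mean Phred quality score per read from a FASTQ file. It assumes the standard Phred+33 encoding, and the file name is hypothetical.

```python
def mean_read_qualities(fastq_path: str):
    """Yield the mean Phred quality of each read in a FASTQ file.

    Assumes the standard Phred+33 encoding, where quality = ord(char) - 33.
    """
    with open(fastq_path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break  # end of file
            fh.readline()             # sequence line (not needed here)
            fh.readline()             # '+' separator line
            quals = fh.readline().strip()
            yield sum(ord(c) - 33 for c in quals) / len(quals)

# Example: count reads with a mean quality below Q20 (99% base-call accuracy)
low_q = sum(1 for q in mean_read_qualities("sampleA.fastq") if q < 20)
print(f"{low_q} reads below Q20")
```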
And that’s where we come into the picture. Do you have questions about the analysis of your project? Do you want us to analyse your data? Please don’t hesitate to contact us!