Starting an Aligned Expression Analysis
Running an Expression Analysis of Aligned Reads
Curio can run an expression analysis on one or more aligned samples directly from the "Project Details" screen. First select all of the "Aligned Sequences" files that represent the samples you would like to analyze, and then choose the "Available Actions -> Calculate Expressions" option from the dropdown menu. Alternatively, you can start an expression analysis on a single sample by choosing the "Calculate Expression" option from the dropdown menu listed to the right of that sample's name.
Topics covered in this article:
- Starting an Aligned Expression Analysis
- Specifying the Features of the Genome to Analyze
- Controlling Feature and Meta Feature Matching
- Controlling Read Processing Options
After you select the samples whose expression you would like to analyze, Curio will bring up a dialog box asking you to select the key criteria that determine how the expression metrics will be calculated. The available criteria can be roughly organized into the following areas:
- Features of the genome to analyze: specify if you want to calculate expression levels for genes, transcripts, exons or a custom set of features (i.e. regions of the genome).
- Feature matching: control how you want aligned reads to be matched to features (or "meta" features), and designate how to handle features with overlapping regions.
- Aligned read processing options: enable read de-duplication, quality filtering, and (where available) UMI/UMT processing.
The following sections describe these three areas in more detail.
Specifying the Features of the Genome to Analyze
The "Genome Feature Set" option allows you to specify the regions of the genome for which you want to calculate expression levels. The expression analysis system will load all of the alignments (i.e. "reads") that potentially overlap with each feature designated in the selected genome feature file, and then count those alignments to determine the expression count for each feature.
Feature Type To Count:
The "Feature to Count" setting allows you to specify the type of feature whose expression levels you would like to count. Depending on the type of feature set you choose, Curio can count the expression of either individual features or "meta" features, where each meta feature (e.g. a "transcript" or "gene") logically represents a set of individual features (e.g. the "exons" of the transcript or gene).
For example, if you're analyzing features defined by a GTF or GFF file that has separate feature records for "genes", "transcripts", and "exons", then you can choose to count reads that overlap with an individual exon feature as counting towards the "meta" feature of either the transcript or the gene that the exon is a part of. Or, you can simply count the reads that overlap with the entire range of each gene, transcript, or exon feature individually using the same feature set.
Note that only the feature types found in the selected feature set file will be available as options to choose from. If you select a feature file that only has one feature type (such as a BED file), then Curio simply counts the expression level of each feature independently.
Community Feature Sets:
Curio offers a collection of standard feature sets for common use cases (whole genome analysis, whole exome analysis, etc.) for the standard assemblies (hg19, hg38, mm10, etc.). Those feature sets are available under the "Community Genome Feature Sets" section, and you are welcome to use them as needed.
Custom Feature Sets:
In addition, if you have a custom panel or set of genomic ranges you'd like to analyze, Curio supports that too. Simply upload your custom GTF, GFF, or BED file into the project, assign it to an assembly, and then you'll be able to select it when running an expression analysis. Note that the feature file is only available for expression analysis if it is assigned to the same assembly that you chose when aligning the FASTQ data.
Controlling Feature and Meta Feature Matching
When the position of each aligned read is evaluated, any single read could count towards the expression of a generic feature (such as an exon) or of a "meta" feature (such as a transcript or gene) that logically represents a set of smaller features. In addition, any single read could potentially overlap with more than one feature. The options in this section allow you to control how reads are counted towards the expression of each feature they overlap with.
Matching on Exons of Meta Features:
If you've selected a genome feature file defined by a GTF or GFF file that has separate feature records for "genes", "transcripts", and "exons", then you can choose to count reads that overlap with an individual exon feature as counting towards the "meta" feature of either the transcript or the gene that the exon is a part of. Or, you can simply count the reads that overlap with the entire range of each gene, transcript, or exon feature individually using the same feature set.
Selecting the "Count only reads that overlap with exons of the gene/transcript" option tells Curio to only count a read if it overlaps with one of the exon features that are associated with a gene (or transcript) meta feature. If instead you select the "Count reads that overlap introns or exons of the gene/transcript" option, then the exon features are ignored and a read will be counted towards a gene (or transcript) meta feature if it overlaps anywhere between the start and end position of the feature.
Example (See Figure Below):
Imagine a feature file that contains "meta" features for genes and transcripts, where each gene consists of one or more transcripts and each transcript consists of one or more exons. Two genes, three transcripts, and five exons in one area of the genome defined by that feature file then could look like the blue boxes in the below diagram.
Consider then how the expression of "Read 3" could be counted. If you choose the "Count only reads that overlap with exons of the gene/transcript" option then that read could only be counted towards "Transcript #3" of "Gene Y". However, if you instead choose the "Count reads that overlap introns or exons of the gene/transcript" option then that read could also be counted towards "Transcript #1" of "Gene X".
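The difference between the two options can be illustrated with a minimal Python sketch. The coordinates below are hypothetical (they are not taken from the figure), and the class and function names are illustrative only; they are not part of Curio.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    name: str
    start: int   # first base of the transcript (inclusive)
    end: int     # last base of the transcript (inclusive)
    exons: list = field(default_factory=list)  # list of (start, end) tuples


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def read_matches(read, transcript, exons_only):
    """Decide whether a read, given as (start, end), counts towards a transcript."""
    r_start, r_end = read
    if exons_only:
        # "Count only reads that overlap with exons of the gene/transcript"
        return any(overlaps(r_start, r_end, s, e) for s, e in transcript.exons)
    # "Count reads that overlap introns or exons of the gene/transcript"
    return overlaps(r_start, r_end, transcript.start, transcript.end)


# Hypothetical layout: the read sits in an intron of t1 and in an exon of t3.
t1 = Transcript("Transcript #1", 100, 900, exons=[(100, 200), (700, 900)])
t3 = Transcript("Transcript #3", 400, 600, exons=[(400, 600)])
read = (450, 500)

exon_hits = [t.name for t in (t1, t3) if read_matches(read, t, exons_only=True)]
span_hits = [t.name for t in (t1, t3) if read_matches(read, t, exons_only=False)]
```

In this sketch the exon-only option matches the read to "Transcript #3" alone, while the intron-or-exon option also matches it to "Transcript #1".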
Counting Reads that Overlap with Multiple Features:
Depending on the genomic ranges of the different features that are defined in the feature file you selected, it is possible that any read could potentially overlap with more than one feature. By default, Curio will count a read towards every feature (or "meta" feature such as a gene or transcript) that it overlaps with.
However, in some cases you may have a feature file that you know defines features that should logically never overlap with each other (some BED files are set up this way, for example). In that case, you can disable the "Multi Feature Overlap" option, and Curio then won't count a read towards any feature if the read is found to overlap with more than one.
Example (See Figure Above):
Imagine the same feature file that contains "meta" features for genes and transcripts made up of exons, shown as the blue boxes in the above diagram. Consider then how the expression of "Read 5" could be counted. If you were counting the meta feature type of "Gene", then the read would be counted towards "Gene X" regardless of this setting (since both exons it overlaps with are associated with transcripts that are assigned to that same gene). However, if you were counting the feature type of either "Transcript" or "Exon", then the read would only count towards a feature if the "Multi Feature Overlap" option were enabled (otherwise, it would count towards none).
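As a rough illustration of the "Multi Feature Overlap" behavior, the following Python sketch counts reads against a flat set of features. The feature names and coordinates are hypothetical, the grouping of exons into meta features is not modeled, and the function is only an approximation of the behavior described above, not Curio's implementation.

```python
from collections import Counter


def count_reads(reads, features, allow_multi_overlap=True):
    """Count reads per feature.

    reads:    list of (start, end) tuples, inclusive coordinates
    features: dict mapping feature name -> (start, end), inclusive coordinates
    """
    counts = Counter({name: 0 for name in features})
    for r_start, r_end in reads:
        hits = [name for name, (f_start, f_end) in features.items()
                if r_start <= f_end and f_start <= r_end]
        if len(hits) > 1 and not allow_multi_overlap:
            continue  # ambiguous read: counted towards no feature
        for name in hits:
            counts[name] += 1
    return dict(counts)


# Hypothetical features that overlap between positions 180 and 200.
features = {"exon_a": (100, 200), "exon_b": (180, 300)}
reads = [(110, 150), (190, 195), (250, 280)]

with_overlap = count_reads(reads, features, allow_multi_overlap=True)
without_overlap = count_reads(reads, features, allow_multi_overlap=False)
```

With the option enabled, the middle read is counted towards both features; with it disabled, that read is dropped and each feature keeps only its unambiguous read.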
Controlling Read Processing Options
The expression analysis system needs to process the reads that were aligned to different areas of the genome, and then calculate metrics on how the reads overlap with the features to determine the expression level of each. Several options control the quality filtering and error correction that are applied to the reads as they are processed.
Unique Molecular Id/Tag (UMI/UMT) Processing
If you enable Unique Molecular Id/Tag (UMI/UMT) processing the system will first group all of the reads at the same position into consensus families before calculating expression levels. You can also filter out consensus families with a smaller number of reads during the expression analysis, in order to prevent odd reads that don't have many duplicates present with a matching UMI/UMT from affecting the expression analysis results.
Important Note: If you did not enable UMI/UMT processing when aligning the reads (available on the "Pre-Processing" tab on the "Start Alignment" screen), then Curio automatically hides the UMI/UMT processing options on the expression analysis screen.
UMI/UMT Minimum Family Size:
If you enable Unique Molecular Id/Tag (UMI/UMT) processing, the system will first group all of the reads at the same position into consensus families before calculating expression levels. After determining the consensus families at each position, this setting can then be used to remove the smaller families. For example, if you set this to a value of "5 reads", then any consensus families that contain 4 reads or fewer would be excluded before calculating expression levels. Note that if you slide this setting all the way to the left (i.e. "Include all Families"), then all consensus read families will be included in the analysis, even if a family only contains one read.
As part of an expression analysis, UMI/UMT processing therefore provides a way to save reads that contain information from an original molecule and that would otherwise have been filtered out during de-duplication. It does this by using the UMI/UMT of each consensus family to determine the unique information that should be retained at each alignment position (instead of simply removing duplicate reads that have a matching alignment position and "CIGAR" alignment string). In addition, by using the "Minimum Family Size" setting, you can get rid of reads that are potentially noise, i.e. reads for which there is no evidence of other reads at the same position with a corresponding UMI/UMT.
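The effect of the "Minimum Family Size" setting can be sketched in Python as follows. This is a simplified illustration that assumes a family is simply the set of reads sharing an alignment position and an identical UMI; the read data and function names are hypothetical.

```python
from collections import defaultdict


def build_families(reads):
    """Group reads, given as (position, umi) tuples, into families keyed by both values."""
    families = defaultdict(list)
    for position, umi in reads:
        families[(position, umi)].append((position, umi))
    return dict(families)


def filter_families(families, min_family_size):
    """Keep only the families that contain at least min_family_size reads."""
    return {key: members for key, members in families.items()
            if len(members) >= min_family_size}


# Hypothetical reads: a family of 5, a family of 1, and a family of 4.
reads = ([(1000, "ACGT")] * 5) + [(1000, "TTAG")] + ([(2000, "GGCA")] * 4)

families = build_families(reads)
kept_at_5 = filter_families(families, min_family_size=5)
kept_at_1 = filter_families(families, min_family_size=1)  # "Include all Families"
```

With a minimum of 5, only the five-read family survives; with a minimum of 1, all three families are kept, including the single-read family.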
UMI/UMT Family Hamming Distance:
If you enable Unique Molecular Id/Tag (UMI/UMT) processing, the system will first group all of the reads at the same position into consensus families before attempting to calculate feature expression counts. By default, Curio requires the UMIs of all the reads within a consensus family to be exactly the same. However, this setting allows you to adjust the number of bases that Curio will allow to differ between two UMIs while still grouping them into the same family (i.e. the allowable "Hamming distance" between the UMIs within the same family).
Example: Imagine that the reads at a given position contained UMIs as shown below in red:
ATTGCCACCTTAGCTAAATCTGCTTTTA
ATTGCCACCTTAGCTAAATCTGCTTTTA
ATTGCCACCTTAGCTAAATCTGCTTTTA
ATTGCCACCTGAGCTAAATCTGCTTTTA
CTTGCCACCTTAGCTCAATCTGCTTTTA
CTTGCCACCTTAGCTCAATCTGCTTTTA
CTTGCCACCTTAGCTAAATCTGCTTTTA
If the "Family Hamming Distance" setting is set to "perfect match" (i.e. a Hamming distance of zero), then the above reads would be grouped into two families. If instead you set it to allow one base to differ when calculating the UMI families (i.e. a Hamming distance of one), then only a single consensus read family would be counted towards the expression of any features that the above reads overlap with.
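The grouping can be sketched in Python as shown below. The 6-base UMIs are hypothetical values chosen to behave like the example above (they differ only in their first base), and the greedy grouping strategy is an assumption made for illustration; the clustering method Curio actually uses is not described here.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis, max_distance):
    """Greedy grouping (an illustrative assumption): each UMI joins the first
    family whose representative (its first member) is within max_distance."""
    families = []
    for umi in umis:
        for family in families:
            if hamming_distance(umi, family[0]) <= max_distance:
                family.append(umi)
                break
        else:
            # No existing family was close enough, so start a new one.
            families.append([umi])
    return families


# Hypothetical UMIs for seven reads at the same alignment position.
umis = ["ATTGCC"] * 4 + ["CTTGCC"] * 3

exact_families = group_umis(umis, max_distance=0)    # "perfect match"
relaxed_families = group_umis(umis, max_distance=1)  # one differing base allowed
```

With a distance of zero the seven UMIs form two families (of four and three reads); with a distance of one they collapse into a single family.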
Read De-Duplication:
When the "De-duplication" option is enabled, Curio will attempt to remove reads that are potential PCR duplicates. The algorithm finds all reads that have the same alignment position, orientation, and CIGAR alignment string; for a paired-end read, the same aspects of the mate are also taken into account. The read (or read pair) with the highest quality bases is kept, and all other reads that have a matching alignment position, orientation, and CIGAR string are removed before calculating expression levels. To determine which read (or read pair) has the highest quality bases, the quality scores of all bases in the read whose quality is greater than or equal to Q15 are summed.
If you disable this option, all reads will be included in the analysis, even if they are potential PCR duplicates. When UMI/UMT processing is enabled, this option will appear disabled, since duplicate reads that have the same identifier are automatically consolidated when UMI/UMT processing is in effect. Note that the option to enable UMI/UMT processing during expression analysis is only visible if you enabled UMI/UMT processing when first aligning the reads.
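The de-duplication rule described above can be sketched in Python for single-end reads. The read records are hypothetical, mate information for paired-end reads is not modeled, and how ties are broken is an assumption (here, the first read seen is kept).

```python
from collections import namedtuple

Read = namedtuple("Read", "name position strand cigar qualities")

MIN_BASE_QUALITY = 15  # only bases at or above Q15 contribute to the score


def quality_score(read):
    """Sum of the base qualities that are greater than or equal to Q15."""
    return sum(q for q in read.qualities if q >= MIN_BASE_QUALITY)


def deduplicate(reads):
    """Keep the best-scoring read for each (position, strand, CIGAR) key."""
    best = {}
    for read in reads:
        key = (read.position, read.strand, read.cigar)
        if key not in best or quality_score(read) > quality_score(best[key]):
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", 1000, "+", "4M", [30, 30, 14, 14]),    # score 60 (two bases below Q15)
    Read("r2", 1000, "+", "4M", [22, 22, 22, 21]),    # score 87, duplicate of r1
    Read("r3", 1000, "+", "3M1S", [40, 40, 40, 40]),  # different CIGAR string
    Read("r4", 1000, "-", "4M", [30, 30, 30, 30]),    # different orientation
]

kept_names = [read.name for read in deduplicate(reads)]
```

Here r1 and r2 share the same position, orientation, and CIGAR string, so only r2 (the higher score once bases below Q15 are ignored) is kept, while r3 and r4 are retained because they differ in CIGAR string or orientation.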
Read Quality Filtering:
Some of the aligners (Bowtie, etc.) report a Phred-like quality score that represents how likely it is that the position chosen for the read alignment is correct. If this type of alignment quality score is available for the reads, then this setting can be used to exclude reads whose alignment quality is below the selected value.
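As a small illustration, the filter amounts to a threshold on the alignment quality score. In the Python sketch below, the read names, scores, and threshold are hypothetical, and every read is assumed to carry a score.

```python
def filter_by_alignment_quality(reads, min_quality):
    """Keep reads, given as (name, alignment_quality) tuples, whose alignment
    quality is not below min_quality."""
    return [(name, quality) for name, quality in reads if quality >= min_quality]


# Hypothetical reads with Phred-like alignment quality scores.
reads = [("r1", 42), ("r2", 3), ("r3", 20), ("r4", 19)]

kept = filter_by_alignment_quality(reads, min_quality=20)
```

With a selected value of 20, the reads scoring 3 and 19 are excluded, and the read scoring exactly 20 is kept.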