Frequently Asked Questions About Data Analysis
Below you will find some of the questions that we are frequently asked about data analysis. If you have a question that is not listed below, please contact us.
What is the difference between the distribution set of analysis and the normalized set? What is it normalized against?
The regular and normalized distributions differ in how the data are counted. The regular distribution counts every observation in the data. The normalized distribution counts the value (V, J, n-addition, CDR3 length, etc.) of each distinct CDR3 once, no matter how many times that CDR3 is observed.
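The two counting schemes can be illustrated with a small sketch. The read tuples, gene names, and function names below are hypothetical and are not part of our software; the sketch also assumes each distinct CDR3 is paired with a single V gene.

```python
from collections import Counter

# Hypothetical input: one (cdr3, v_gene) tuple per sequencing read.
reads = [
    ("CASSLGQETQYF", "TRBV7-9"),
    ("CASSLGQETQYF", "TRBV7-9"),
    ("CASSLGQETQYF", "TRBV7-9"),
    ("CASSPDRGNTEAFF", "TRBV5-1"),
    ("CASSQETQYF", "TRBV5-1"),
]

def regular_distribution(reads):
    """Count V gene usage once per observed read."""
    return Counter(v for _, v in reads)

def normalized_distribution(reads):
    """Count V gene usage once per distinct CDR3, ignoring how often it was read."""
    return Counter(v for _, v in set(reads))

print(regular_distribution(reads))
print(normalized_distribution(reads))
```

In this toy data, the CDR3 that was read three times dominates the regular distribution but contributes only one count to the normalized distribution.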
Will previous data be deleted from the website when a new set is uploaded?
No. We will always let you know in advance if data are going to be deleted from the server, but it is not our policy to remove old data to make room for new data.
Do you have a guide to help me understand the software?
How is the sequence assigned to the V or J? Is it an exact sequence alignment?
Each sequencing read is assigned the V, D, J and C segments that align to it best. We use the Smith-Waterman algorithm for local sequence alignment between sequencing reads and germline reference sequences (human consensus from IMGT). The alignment parameters are: match = 1, mismatch = 3, gap_open = 5, gap_extension = 2. The cutoff score is 50 for a V match and 20 for a J match. In addition, the alignment is further checked for the proper conserved motifs around the CDR3 region.
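For readers who want to see how these parameters interact, here is a minimal Smith-Waterman sketch with affine gaps. It is not our production code. It assumes that the mismatch and gap values are penalties subtracted from the score, and that the first position of a gap costs gap_open while each further position costs gap_extension; the function names are hypothetical.

```python
def smith_waterman_score(read, ref, match=1, mismatch=3, gap_open=5, gap_ext=2):
    """Return the best local alignment score between read and ref (affine gaps)."""
    neg_inf = float("-inf")
    n, m = len(read), len(ref)
    # H: best score of an alignment ending at (i, j).
    # E / F: best score of an alignment ending in a horizontal / vertical gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_ext)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_ext)
            step = match if read[i - 1] == ref[j - 1] else -mismatch
            # Local alignment: the score never drops below zero.
            H[i][j] = max(0, H[i - 1][j - 1] + step, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# Cutoff scores as stated above: 50 for V, 20 for J.
CUTOFFS = {"V": 50, "J": 20}

def passes_cutoff(read, ref, segment):
    """Check whether the best local alignment reaches the segment's cutoff."""
    return smith_waterman_score(read, ref) >= CUTOFFS[segment]
```

With match = 1, a V cutoff of 50 means at least 50 aligned matching bases are needed, and every mismatch or gap inside the alignment raises that requirement.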
How do you normalize the data?
We scale the total number of filtered reads to a fixed number, such as 10 million.
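As a rough sketch of this scaling, assuming the filtered data are available as a mapping from CDR3 to read count (the function name and input format are hypothetical):

```python
def scale_counts(counts, target_total=10_000_000):
    """Rescale read counts so that they sum to target_total."""
    total = sum(counts.values())
    if total == 0:
        # Nothing to scale; avoid dividing by zero.
        return {cdr3: 0.0 for cdr3 in counts}
    factor = target_total / total
    return {cdr3: n * factor for cdr3, n in counts.items()}

# Hypothetical filtered read counts per CDR3.
filtered = {"CASSLGQETQYF": 1, "CASSPDRGNTEAFF": 3}
scaled = scale_counts(filtered)
print(scaled)
```

Scaling every sample to the same total makes frequencies comparable between samples that were sequenced to different depths.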
Do you use IMGT or your own tools to make assignments?
We use our own bioinformatics tools, developed in-house. To make assignments, these tools use the publicly available sequences from IMGT.
So technically, you are not using “germline allele” sequences. These are really human consensus sequences. A “germline sequence” would be from that individual and it would be their repertoire from a non-immune related cell type (un-rearranged).
That is correct. We are using publicly available human consensus sequences.
Is the assignment based on DNA or protein sequence?
Could you give a description of the different filters being applied to the data?
Our SMART filter processes are mainly applied to TCR sequences. Because TCRs are not expected to undergo hypermutation, we can apply a reference filter (which removes reads with a mismatch to the reference sequence in the CDR3 region; see the pre-print for details) and collapse sequence reads into one consensus. Collapsing the sequence reads is required to obtain the frequency of a particular CDR3; without it, we cannot apply the PCR error filter or the Mosaic error filter. Therefore, for B cells, we apply only the sequence error filter (through the overlapping region).
Why am I not seeing D genes in the analysis on iRweb?
D genes are used in the recombination that produces the heavy chain. Quoting Roitt, Brostoff and Male, "The D segment is highly variable both in the number of codons and in the sequence of base pairs...More than one D segment may join to form an enlarged D region." There are mechanistic constraints that preclude the use of D genes that are 3' of the selected J gene. There may also be insertions, deletions and other noise in the region where V and J join. Therefore, it can be difficult to detect D genes by sequence alignment. If we cannot detect which D gene was used, we do not enter a value. If the sequencing depth is low, there may be no D genes listed.