Illumina Read Is Sense to the Fragment
This topic is incredibly easy to get confused about, so here is as clear an explanation as I can muster. It will first lay out the big picture and then get into the weeds.
Sequencing by synthesis, which is how nearly all commercially available high-throughput sequencing technologies work as of December 2012 (see notes on sequencing technologies), always synthesizes the new strand (which becomes your read) in a 5′-to-3′ direction. That's because this is how DNA polymerase works in our cells (indeed, in every living thing's cells) and sequencing relies on DNA polymerase. Since the new strand is synthesized 5′-to-3′, you are working your way up the template strand in a 3′-to-5′ direction.
In any sequencing technology, you PCR amplify the individual DNA fragments once they have hybridized to flowcells or beads. This means you end up with both strands of DNA. If you were to read both of the strands from their respective 3′ ends at once, you'd be getting two different sequences and your results would be uninterpretable. To avoid this problem, sequencing technologies ligate non-complementary adapters to the 3′ and 5′ ends of DNA fragments so that the primer for one adapter only begins synthesis on one strand and not on its complement.
In conventional paired-end sequencing, you simply sequence using the adapter for one end, and then once you're done you start over sequencing using the adapter for the other end.
This means your two reads (assuming 100 bp reads) are the reverse complement of the 100 3′-most bases of the Watson strand and the Crick strand; these reads are assumed to be identical to the 100 5′-most bases of the Crick strand and Watson strand respectively.
Here's what your reads represent, then:
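To make that concrete, here is a minimal Python sketch using a made-up 16 bp fragment and 5 bp reads. The sequence, the read length and the function names are all hypothetical, chosen just for illustration; it only models the idea that each read equals the 5′-most bases of one strand.

```python
# Toy model of one paired-end fragment (made-up sequence, made-up read length).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(watson, read_length):
    """Return (read 1, read 2) as they would appear in the FASTQ files.

    `watson` is the fragment's Watson strand written 5'-to-3'.
    Each read equals the 5'-most bases of one of the two strands.
    """
    crick = reverse_complement(watson)  # Crick strand, also written 5'-to-3'
    read1 = watson[:read_length]        # 5'-most bases of the Watson strand
    read2 = crick[:read_length]         # 5'-most bases of the Crick strand
    return read1, read2


fragment = "AACCGGTTACGTTGCA"  # hypothetical 16 bp fragment
read1, read2 = paired_end_reads(fragment, read_length=5)
print(read1)  # AACCG
print(read2)  # TGCAA
```

Note that read 2 (TGCAA) is the reverse complement of the last five Watson bases (TTGCA), which is exactly the "pointing towards each other" picture.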
Therefore when you open your FASTQ files and look at a pair of reads, the sequences you see are, conceptually, pointing towards each other on opposite strands. When you align them to the genome, one read should align to the forward strand, and the other should align to the reverse strand, at a higher base pair position than the first one so that they are pointed towards one another. This is known as an "FR" read: forward/reverse, in that order.
This is all for conventional paired-end sequencing. Some specialized technologies, such as using circularized DNA fragments to create large insert jumping libraries [Talkowski 2011], switch things around so that your reads ought to align in an "RF" position: reverse/forward, in that order. This is different from FR because it means the reverse read aligned at a lower base pair position than the forward read, and thus that they are pointing away from one another.
But if you're just doing conventional paired-end sequencing (i.e. Illumina), your reads are supposed to align FR, and if they instead align RF, FF or RR, that's a problem and often indicates the reads aligned incorrectly (though it could also mean they aligned correctly and that a real inversion or translocation exists in the sample's genome; see notes from Devin Absher's talk on calling structural variants). If read pairs don't align FR, most aligners will flag them as "not a proper pair" in the SAM/BAM file by zeroing the FLAG 0x02 bit (proper pair flag) (see SAM spec). Heng Li, author of BWA, states that BWA will only set the 'proper pair flag' to 1 for Illumina reads aligned FR (for SOLiD it allows FF or RR).
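As a rough illustration of the FR/RF/FF/RR naming, here is a small Python sketch that classifies a pair from each mate's leftmost aligned position and strand. The function name and the example coordinates are hypothetical, and it deliberately ignores complications such as mates on different chromosomes or overlapping reads, so treat it as a sketch of the naming convention rather than what any aligner actually does.

```python
def pair_orientation(pos1, is_reverse1, pos2, is_reverse2):
    """Classify a read pair as 'FR', 'RF', 'FF' or 'RR'.

    Each mate is described by its leftmost aligned position and whether it
    aligned to the reverse strand. The first letter describes the mate at the
    lower position, the second letter the mate at the higher position.
    Simplifying assumptions: both mates are mapped to the same chromosome,
    and ties in position keep the input order (sorted() is stable).
    """
    mates = sorted([(pos1, is_reverse1), (pos2, is_reverse2)], key=lambda m: m[0])
    return "".join("R" if is_reverse else "F" for _, is_reverse in mates)


# Made-up coordinates for illustration.
print(pair_orientation(1000, False, 1250, True))   # FR: pointing towards each other
print(pair_orientation(1000, True, 1250, False))   # RF: pointing away from each other
print(pair_orientation(1000, False, 1250, False))  # FF
```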
Therefore if you look at a SAM/BAM file (for Illumina data at least), it should be the case that in any pair of reads with the 0x02 bit set (i.e. considered a proper pair), exactly one of the two reads will have the 0x10 bit set as well (i.e. it is reverse-complemented; again, see the SAM file spec). For the read with its 0x10 bit set, the "SEQ" listed in the SAM file will be the reverse complement of the original read as seen in the FASTQ. That means that in the SAM file, the SEQs for a pair of reads are now both being presented in forward orientation even though the "FR" orientation information is stored in the FLAG.
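Here is a minimal sketch of checking those bits in Python. The bit values come from the SAM spec; the helper names are my own, and the FLAG values 99 and 147 are just an illustrative example of a properly paired FR pair, not taken from the data discussed in this post.

```python
# FLAG bits as defined in the SAM spec.
PROPER_PAIR = 0x2  # each segment properly aligned according to the aligner
UNMAPPED = 0x4     # segment unmapped
REVERSE = 0x10     # SEQ is stored reverse-complemented


def is_proper_pair(flag):
    return bool(flag & PROPER_PAIR)


def is_reverse(flag):
    return bool(flag & REVERSE)


def exactly_one_reversed(flag_a, flag_b):
    """True if exactly one mate has the 0x10 bit set."""
    return is_reverse(flag_a) != is_reverse(flag_b)


# Illustrative FLAG values for the two mates of one pair.
flag_first, flag_second = 99, 147
print(is_proper_pair(flag_first), is_proper_pair(flag_second))  # True True
print(is_reverse(flag_first), is_reverse(flag_second))          # False True
print(exactly_one_reversed(flag_first, flag_second))            # True
```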
For reads that don't form a proper pair, or aren't mapped at all, (almost) all bets are off. Pairs might be FF, RR or RF, and one might be mapped and the other not. Moreover, an unmapped read might have the 0x10 flag set, or not.
Why bother reverse-complementing the read if it doesn't align anywhere anyway? I don't know; I can't find any information in the BWA documentation about why this might occur. But I spot checked the FLAGs of unmapped reads in one of my BAMs:
cut -f 2 1_unmapped_sorted.sam | sort | uniq
And found that a variety of FLAG values occur: [101,103,117,133,141,151,157,165,167,181,69,77,87], several of which have the 0x10 bit set, for instance 117. When I go back and pull out a sampling of the reads with flag value 117:
grep $'\t117\t' 1_unmapped_sorted.sam | head | less
And then compare the reads in those to my original FASTQs, I discover that they are indeed reverse complemented, for example:
grep $'\t117\t' 1_unmapped_sorted.sam | head | less
FCC0CHTACXX:6:1101:10003:55579#GCCAATAT 117 chr2 130057568 0 * = 130057568 0 GGGACACACTGAGCTCAGGGATAGGGTGGAGGTGGACTGGACTGAGAGCAGCGTCAGAGGGGAAGGCACTGCAGCAGGGGCCCGACATAGGCAGGGGTAC
grep FCC0CHTACXX:6:1101:10003:55579 -m 1 -A 3 1_1.fq
@FCC0CHTACXX:6:1101:10003:55579#GCCAATAT/1
GTACCCCTGCCTATGTCGGGCCCCTGCTGCAGTGCCTTCCCCTCTGACGCTGCTCTCAGTCCAGTCCACCTCCACCCTATCCCTGAGCTCAGTGTGTCCC
+
_b_eeeeegggggiihiiiiiiiiihhfhhiifhhifffhihhiiihhhfhghdgdg`cdeeeR]bdddbcccbbcca^bccc`bbccccccbcd_`b_b
(You'll notice that RNAME for this read is chr2, but remember from the SAM spec that if 0x04 is set, no assumptions can be made about RNAME).
I conclude that even if 0x04 is set, it is still safe to assume that for any individual read that has its 0x10 flag set, the "SEQ" shown in the BAM file is actually the reverse complement of the original read from the FASTQ.
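The spot check can also be reproduced in a few lines of Python. This sketch decodes the FLAG values listed above to see which ones carry the 0x10 bit, and then confirms that the SEQ from the SAM record for flag 117 is the reverse complement of the sequence in the FASTQ. The variable and function names are my own; the FLAG values and sequences are copied from the output shown above.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


UNMAPPED = 0x4
REVERSE = 0x10

# FLAG values observed among the unmapped reads.
spot_check_flags = [101, 103, 117, 133, 141, 151, 157, 165, 167, 181, 69, 77, 87]

reversed_flags = sorted(f for f in spot_check_flags if f & REVERSE)
all_unmapped = all(bool(f & UNMAPPED) for f in spot_check_flags)
print(reversed_flags)  # [87, 117, 151, 157, 181]
print(all_unmapped)    # True

# SEQ from the SAM record with FLAG 117, and the same read from the FASTQ.
sam_seq = ("GGGACACACTGAGCTCAGGGATAGGGTGGAGGTGGACTGGACTGAGAGCA"
           "GCGTCAGAGGGGAAGGCACTGCAGCAGGGGCCCGACATAGGCAGGGGTAC")
fastq_seq = ("GTACCCCTGCCTATGTCGGGCCCCTGCTGCAGTGCCTTCCCCTCTGACGC"
             "TGCTCTCAGTCCAGTCCACCTCCACCCTATCCCTGAGCTCAGTGTGTCCC")
print(reverse_complement(sam_seq) == fastq_seq)  # True
```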
This came up because recently I was iterating through BAMs using pysam, trying to re-align unmapped reads, and for my particular purpose I wanted to have both of their sequences in the same orientation, i.e. both forward or both reverse. This was confusing at first because zero, one or both of them might already be reverse-complemented in the SEQ field. The most conceptually straightforward way is just to reverse complement whichever (neither, one or both) have .is_reverse = True, so that now you're back to baseline, and then you reverse complement exactly one of them. To avoid the extra step, though, logically you could use an XNOR, so in Python:
if read.is_reverse == mate.is_reverse:
    read = reversecomplement(read)
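A self-contained version of that idea, as a sketch: it works on plain (sequence, is_reverse) values rather than pysam objects so that it can be run on its own, and the function names and toy sequences are hypothetical. With pysam, I'd expect the inputs to come from each record's query_sequence and is_reverse attributes.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def same_orientation(read_seq, read_is_reverse, mate_seq, mate_is_reverse):
    """Return both sequences presented on the same strand.

    The inputs mirror the SEQ field and the 0x10 bit of each SAM record.
    If the two bits are equal (neither or both mates reverse-complemented),
    the mates still point towards each other, so flip exactly one of them.
    If the bits differ, the SEQs are already on the same strand.
    """
    if read_is_reverse == mate_is_reverse:  # XNOR of the two 0x10 bits
        read_seq = reverse_complement(read_seq)
    return read_seq, mate_seq


# Toy reads from a made-up fragment AACCGGTTACGTTGCA (5 bp reads).
print(same_orientation("AACCG", False, "TGCAA", False))  # ('CGGTT', 'TGCAA')
print(same_orientation("AACCG", False, "TTGCA", True))   # ('AACCG', 'TTGCA')
```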
Addendum: for BWA, at least, the proper pair flag depends not only on the FR orientation but also on insert size being within a certain range (from BWA manual):
The maximum distance x for a pair considered to be properly paired (SAM flag 0x2) is calculated by solving equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the standard error of the insert size distribution, L is the length of the genome, p0 is prior of anomalous pair and Phi() is the standard cumulative distribution function. For mapping Illumina short-insert reads to the human genome, x is about 6-7 sigma away from the mean. Quartiles, mean, variance and x will be printed to the standard error output.
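To get a feel for that equation, here is a rough numerical sketch. Two caveats, both assumptions on my part: first, taken literally with Phi as the cumulative distribution function, the equation only has a solution below the mean, so I read Phi here as the upper-tail probability (1 minus the CDF), which is the reading that gives a maximum distance above the mean. Second, the values of mu, sigma, L and p0 below are made up for illustration and are not taken from any real run or from BWA's defaults. This is not BWA's code, just a scan for the point where the two sides cross.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def max_proper_distance(mu, sigma, genome_length, p0, step=0.01):
    """Find x = mu + z * sigma where the tail probability drops below x / L * p0.

    Scans z upward from 1 in small steps and returns (x, z) at the first z
    where upper_tail(z) < x / genome_length * p0.
    """
    z = 1.0
    while z < 10.0:
        x = mu + z * sigma
        if upper_tail(z) < x / genome_length * p0:
            return x, z
        z += step
    raise ValueError("no crossing found for z < 10")


# Illustrative values only: 500 bp mean insert, 50 bp spread, 3 Gb genome.
x, z = max_proper_distance(mu=500.0, sigma=50.0, genome_length=3e9, p0=1e-5)
print(f"x = {x:.0f} bp, which is {z:.2f} sigma above the mean")
```

With these made-up numbers the crossing lands between 6 and 7 sigma above the mean, which is at least consistent with the manual's "about 6-7 sigma" remark.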
Source: https://www.cureffi.org/2012/12/19/forward-and-reverse-reads-in-paired-end-sequencing/