Thursday, September 22, 2011

To or Not to with cufflink:


Cufflink is an amazingly easy to install and use software, that lured me into using it. However, it is not without its sets of pitfalls... I am still researching on the illusive nature of the outputs from this software.
this software and the types of outputs it produces using different commands:
Here are few things to keep in mind before trying to run cufflink
1. Cufflink can be run on sorted bam files/ sam files.
2. It can run in multi-threading mode with a -p option and is much faster than single threading mode.
3. The new cuffmerge program first converts your gtf files derived from cuffcompare to sam files, merges them, sorts them, runs cufflink on them with a set of hard coded parameters and then runs cuffcompare on the finally to give you your merged.gtf file.
[The cufflink commandline option: cufflinks -o ./merged_asm/ -F 0.05 -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 16 ./merged_asm/tmp/mergeSam_filewvcXTG.]

Earlier, I used a route as follows:
1. Run cufflink on each of the libraries with a reference gtf and without a reference gtf.
2. Merge the outputs separately using cuffcompare on, with and without reference gtf and merge the output with a script to indicate which of the transcripts are represented by a gene id. But the potential pitfall with this approach is the overlapping and shorter transcripts. This clearly is a stumbling block when you are trying to produce an assembled transcript.

In the recent cufflink versions, there is a accessory program called as cuffmerge, which the manual suggests for merging the individual gtfs. As I have mentioned earlier, this is again a wrapper, that internally calls cufflink and cuffcompare, albeit with several options already pre-set. So, what I did was, merged the bam files generated from different libraries, merged them with samtools, sorted with samtools and derived sam files from the sorted bam for running cufflink [ Please note running the same sorted sam files with a reference gtf file suggests it is not sorted, where as without a gtf file runs fine..]

Output files
with reference gtf:
genes.fpkm.tracking -> has no fpkm information
transcripts.gtf -> has fpkm column but meaningless (all 0.0000)
isoforms.fpkm_tracking -> similar in size and content with genes.fpkm.tracking

Without reference gtf:
transcripts.gtf -> has transcript and exon information with fpkm and cov values; the co-ordinates are 1 based.
genes.fpkm.tracking -> has gene info with fpkm. The co-ordinates are 0 based
isoform.fpkm_tracking -> same as genes.fpkm.tracking

Comparison between all-merged-sam with cufflink VS cufflink -> gtf -> cuffcompare:

1. In both the cases, only a single exon is reported per gene(Since cufflink is run after bowtie directly without running tophat, this may be the case).
2. Much less number of transcripts are found in the first case, and they are non-overlapping and larger than the later case, where the transcripts are short, overlapping. The FPKM is slightly higher than the FPKM values in the later case. The first one seems to merge several smaller transcripts together.

So, in essence, if you have various biological replicates of a single treatment type, instead of going through the path of running them individually with cufflink, followed by merging the results with cuffmerge, merge the map bam files first and follow this route. The FPKM values are much accurate in this case...

No comments: