Difference between revisions of "Methylation study"

From Crop Genomics Lab.
Jump to: navigation, search
(HTSeq)
 
(5 intermediate revisions by one user not shown)
Line 1: Line 1:
 
= Tuxedo =
 
= Tuxedo =
 +
 +
WD: 63:/data6/chojam96/methylation/tuxedo
 +
  
 
== Basic Bowtie2 pipeline ==
 
== Basic Bowtie2 pipeline ==
Line 11: Line 14:
 
Basic command:
 
Basic command:
  
  htseq-count -t exon (-f bam) (--stranded=no) <sorted.bam> <annotation.gff/gtf> > <htseq.txt>
+
  htseq-count -t exon (-f bam) (--stranded=no) (-q) <sorted.bam> <annotation.gff/gtf> > <htseq.txt>
  
 
Default input file format is SAM, so specifying '-f bam' for BAM input file is necessary.
 
Default input file format is SAM, so specifying '-f bam' for BAM input file is necessary.
Line 19: Line 22:
 
'''However, for ''V. radiata'' Illumina-sequenced GFF3 files, ID attributes are mixed up,''' so additional ''''--idattr'''' option should be used!
 
'''However, for ''V. radiata'' Illumina-sequenced GFF3 files, ID attributes are mixed up,''' so additional ''''--idattr'''' option should be used!
  
  htseq-count -t exon -f bam --stranded=no '''--idattr transcript_id''' <sorted.bam> <annotation.gff/gtf> > <htseq.txt>
+
  htseq-count -t exon -f bam --stranded=no -q '''--idattr transcript_id''' <sorted.bam> <annotation.gff/gtf> > <htseq.txt>
  
 
ID attributes may be:
 
ID attributes may be:
Line 26: Line 29:
 
* ''Name'' for GFF format
 
* ''Name'' for GFF format
 
* etc
 
* etc
 +
 +
== edgeR ==
 +
 +
Refer to [[2108 Hakyung TUXEDO]]
 +
 +
If edgeR not installed properly due to some version issues, follow instructions from [https://bioconductor.org/packages/release/bioc/html/edgeR.html edgeR installation].
 +
 +
 +
= DEG and methylation =
 +
 +
== DEG ==
 +
 +
From HTSeq, for six samples (three for K, three for S), files ''*_htseq.txt'' were produced and combined into ''combined_htseq.txt''
 +
 +
edgeR was performed on 'combined_htseq.txt' on R studio and 'combined_htseq.count.edgeR' (n=23106) was produced.
 +
 +
DEG was extracted using:
 +
python finddeg.py combined_htseq.count.edgeR combined_htseq.deg
 +
 +
DEGs from each cytosine contexts were extracted using:
 +
python extractmetccount.py <DEG file> <*_genes> <*_count.deg>
 +
 +
CpG_count.deg (n=21)
 +
 +
CHG_count.deg (n=23)
 +
 +
CHH_count.deg (n=4)
 +
 +
== Gene methylation count ==
 +
 +
WD: 63:/data6/chojam96/methylation/expression_methylation
 +
 +
 +
1. From ''193:/data2/chojam96/methylation/metilene/metilene_v0.2-7/Vr-Met-S(K)'', get bismark coverage files
 +
 +
Six files: 'CpG(H)_Vr-Met-S(K)_bismark.cov'
 +
 +
Chr    Pos    Pos    MetPercentage    Met_read    Unmet_read
 +
Vr01    44      44      100    1      0
 +
Vr01    45      45      100    1      0
 +
Vr01    359    359    100    1      0
 +
Vr01    360    360    100    2      0
 +
Vr01    371    371    100    1      0
 +
 +
2. Combine all cytosine context files &rarr; 'Vr-Met-S(K)_bismark.cov' and set minimum threshold as (met + unmet) >= 3 &rarr; 'Vr-Met-S(K)_bismark.min3.cov'
 +
 +
concatenate and run 'cut_min3.py'
 +
 +
Chr    Pos    Met_read    Unmet_read
 +
Vr01    417    4      0
 +
Vr01    459    2      1
 +
Vr01    467    1      2
 +
Vr01    606    6      0
 +
Vr01    608    3      1
 +
 +
3. Divide into chromosomes and scaffolds &rarr; 'Vr#(scaffold)_S(K).min3.cov'
 +
 +
python dividescaffolds.py Vr-Met-S(K)_bismark.min3.cov
 +
 +
4. '''Run 'getdata.py'''': takes GFF - only mRNA - file, complete gene ID and line type (S, K)
 +
 +
'''Coverage(.cov) files of ''all the scaffolds'' should be present in the same working directory!'''
 +
 +
python getdata.py <GFF.onlymRNA> <Gene ID> <nowiki><S/K></nowiki>
 +
python getdata.py ../Vradi_ver6.onlymrna.gff Vradi01g05390.1 S
 +
 +
This produces <Gene ID>_<nowiki><S/K></nowiki> file with following format:
 +
 +
'Vradi01g05390.1_S'
 +
''Position    Number of counts    Met/Unmet''
 +
#strand:-
 +
8291109 4      met
 +
8291109 -0      unmet
 +
8291141 5      met
 +
8291141 -0      unmet
 +
8291162 4      met
 +
8291162 -2      unmet
 +
8291182 12      met
 +
8291182 -0      unmet
 +
 +
5. Edit the file suitable for R graphs
 +
 +
python dataforbar.py <Gene ID>_<nowiki><S/K></nowiki> <Gene ID>_<nowiki><S/K></nowiki>.bar.tsv
 +
python dataforbar.py Vradi0185s00060.1_K Vradi0185s00060.1_K.bar.tsv
 +
 +
Columns:
 +
pos    count    type
 +
 +
6. Make R script and run Rscript &rarr; requires ''original'' GFF file
 +
 +
'''tsv files of all genes should be present in the same working directory!'''
 +
 +
python makerscript.py <gene list file> <GFF file> <output R script>
 +
python makerscript.py CpG_deg.genes ../Vradi_ver6.gff CpG_genes.rscript.R
 +
 +
Rscript CpG_genes.rscript.R
 +
 +
This produces <Gene ID>_<nowiki><S/K></nowiki>.graph.pdf files.
 +
 +
== Average methylation profiles among high/low synchrony groups ==
 +
 +
High syn group: Vr-Met-S Vr-Met-SK10 Vr_Met_SK49 Vr_Met_SK68 Vr-Met-SK186
 +
Low syn group: Vr-Met-K Vr_Met_SK94 Vr_Met_SK152 Vr-Met-SK156 Vr_Met_SK176
 +
D: 193:/data2/chojam96/methylation/metilene/metilene_v0.2-7/
 +
 +
WD: 63:/data6/chojam96/methylation/highlow_methylation
 +
 +
1. Create 'genelst' containing five genes for methylation profile figures
 +
 +
Vradi01g09470.1
 +
Vradi03g00420.1
 +
Vradi11g00660.1
 +
Vradi01g11940.1
 +
Vradi09g08580.1
 +
 +
2. Get high syn group bismark coverage information from 193
 +
 +
15 files: CpG(H)_Vr-Met-S(SK##)_bismark.cov.gz
 +
 +
 +
3. Combine all cytosine context and group files, set minimum threshold as (met + unmet) >= 3
 +
 +
run 'cut_min3.py' and concatenate
 +
 +
creates ''''highgroup.min3.cov''''
 +
 +
 +
4. Divide into chromosomes and scaffolds as 'Vr##' or 'scaffold'
 +
 +
run 'dividescaffolds.py'
 +
 +
 +
5. '''Run 'getdata.py''''
 +
 +
'''Coverage(.cov) files of ''all the scaffolds'' should be present in the same working directory!'''
 +
 +
python getdata.py <GFF.onlymRNA> <Gene_ID>
 +
python getdata.py ../Vradi_ver6.onlymra.gff Vradi01g09470.1
 +
 +
Creates ''''<Gene_ID>.tsv''''
 +
 +
 +
6. Run dataforbar.py
 +
 +
python dataforbar.py <Gene_ID.tsv> <Gene_ID.bar.tsv>
 +
 +
7. Run combinegroups.py
 +
 +
python combinegroups.py <Gene_ID.bar.tsv> <output_with_sum_values> <output_with_avg_values>
 +
python combinegroups.py Vradi03g00420.1.bar.tsv Vradi03g00420.1.sum.tsv Vradi03g00420.1.avg.tsv
 +
 +
8. Run makerscript.py
 +
 +
python makerscript.py <Gene_ID> <GFF file> <output>
 +
python makerscript.py Vradi03g00420.1 ../Vradi_ver6.gff Vradi03g00420.1.rscript
 +
 +
9. Repeat the same procedures with low syn group

Latest revision as of 04:49, 7 June 2019

Contents

Tuxedo

WD: 63:/data6/chojam96/methylation/tuxedo


Basic Bowtie2 pipeline

Refer to 2108 Hakyung TUXEDO

HTSeq

For detailed information, refer to HTSeq page.

Basic command:

htseq-count -t exon (-f bam) (--stranded=no) (-q) <sorted.bam> <annotation.gff/gtf> > <htseq.txt>

Default input file format is SAM, so specifying '-f bam' for BAM input file is necessary.

For files not following normal HTSeq pipeline, '--stranded=no' option is also required. Refer to 2108 Hakyung TUXEDO.

However, for V. radiata Illumina-sequenced GFF3 files, ID attributes are mixed up, so additional '--idattr' option should be used!

htseq-count -t exon -f bam --stranded=no -q --idattr transcript_id <sorted.bam> <annotation.gff/gtf> > <htseq.txt>

ID attributes may be:

  • transcript_id for GTF format
  • Name for GFF format
  • etc

edgeR

Refer to 2108 Hakyung TUXEDO

If edgeR not installed properly due to some version issues, follow instructions from edgeR installation.


DEG and methylation

DEG

From HTSeq, for six samples (three for K, three for S), files *_htseq.txt were produced and combined into combined_htseq.txt

edgeR was performed on 'combined_htseq.txt' on R studio and 'combined_htseq.count.edgeR' (n=23106) was produced.

DEG was extracted using:

python finddeg.py combined_htseq.count.edgeR combined_htseq.deg

DEGs from each cytosine contexts were extracted using:

python extractmetccount.py <DEG file> <*_genes> <*_count.deg>

CpG_count.deg (n=21)

CHG_count.deg (n=23)

CHH_count.deg (n=4)

Gene methylation count

WD: 63:/data6/chojam96/methylation/expression_methylation


1. From 193:/data2/chojam96/methylation/metilene/metilene_v0.2-7/Vr-Met-S(K), get bismark coverage files

Six files: 'CpG(H)_Vr-Met-S(K)_bismark.cov'

Chr     Pos     Pos     MetPercentage    Met_read    Unmet_read
Vr01    44      44      100     1       0
Vr01    45      45      100     1       0
Vr01    359     359     100     1       0
Vr01    360     360     100     2       0
Vr01    371     371     100     1       0

2. Combine all cytosine context files → 'Vr-Met-S(K)_bismark.cov' and set minimum threshold as (met + unmet) >= 3 → 'Vr-Met-S(K)_bismark.min3.cov'

concatenate and run 'cut_min3.py'

Chr     Pos     Met_read    Unmet_read
Vr01    417     4       0
Vr01    459     2       1
Vr01    467     1       2
Vr01    606     6       0
Vr01    608     3       1

3. Divide into chromosomes and scaffolds → 'Vr#(scaffold)_S(K).min3.cov'

python dividescaffolds.py Vr-Met-S(K)_bismark.min3.cov

4. Run 'getdata.py': takes GFF - only mRNA - file, complete gene ID and line type (S, K)

Coverage(.cov) files of all the scaffolds should be present in the same working directory!
python getdata.py <GFF.onlymRNA> <Gene ID> <S/K>
python getdata.py ../Vradi_ver6.onlymrna.gff Vradi01g05390.1 S

This produces <Gene ID>_<S/K> file with following format:

'Vradi01g05390.1_S'
Position    Number of counts    Met/Unmet 
#strand:-
8291109 4       met
8291109 -0      unmet
8291141 5       met
8291141 -0      unmet
8291162 4       met
8291162 -2      unmet
8291182 12      met
8291182 -0      unmet

5. Edit the file suitable for R graphs

python dataforbar.py <Gene ID>_<S/K> <Gene ID>_<S/K>.bar.tsv
python dataforbar.py Vradi0185s00060.1_K Vradi0185s00060.1_K.bar.tsv 

Columns:

pos    count    type

6. Make R script and run Rscript → requires original GFF file

tsv files of all genes should be present in the same working directory!
python makerscript.py <gene list file> <GFF file> <output R script>
python makerscript.py CpG_deg.genes ../Vradi_ver6.gff CpG_genes.rscript.R
Rscript CpG_genes.rscript.R

This produces <Gene ID>_<S/K>.graph.pdf files.

Average methylation profiles among high/low synchrony groups

High syn group: Vr-Met-S Vr-Met-SK10 Vr_Met_SK49 Vr_Met_SK68 Vr-Met-SK186
Low syn group: Vr-Met-K Vr_Met_SK94 Vr_Met_SK152 Vr-Met-SK156 Vr_Met_SK176
D: 193:/data2/chojam96/methylation/metilene/metilene_v0.2-7/
WD: 63:/data6/chojam96/methylation/highlow_methylation

1. Create 'genelst' containing five genes for methylation profile figures

Vradi01g09470.1
Vradi03g00420.1
Vradi11g00660.1
Vradi01g11940.1
Vradi09g08580.1

2. Get high syn group bismark coverage information from 193

15 files: CpG(H)_Vr-Met-S(SK##)_bismark.cov.gz


3. Combine all cytosine context and group files, set minimum threshold as (met + unmet) >= 3

run 'cut_min3.py' and concatenate

creates 'highgroup.min3.cov'


4. Divide into chromosomes and scaffolds as 'Vr##' or 'scaffold'

run 'dividescaffolds.py'


5. Run 'getdata.py'

Coverage(.cov) files of all the scaffolds should be present in the same working directory!
python getdata.py <GFF.onlymRNA> <Gene_ID>
python getdata.py ../Vradi_ver6.onlymra.gff Vradi01g09470.1

Creates '<Gene_ID>.tsv'


6. Run dataforbar.py

python dataforbar.py <Gene_ID.tsv> <Gene_ID.bar.tsv>

7. Run combinegroups.py

python combinegroups.py <Gene_ID.bar.tsv> <output_with_sum_values> <output_with_avg_values>
python combinegroups.py Vradi03g00420.1.bar.tsv Vradi03g00420.1.sum.tsv Vradi03g00420.1.avg.tsv

8. Run makerscript.py

python makerscript.py <Gene_ID> <GFF file> <output>
python makerscript.py Vradi03g00420.1 ../Vradi_ver6.gff Vradi03g00420.1.rscript

9. Repeat the same procedures with low syn group