Difference between revisions of "BZIP study"
KangHeum Cho (Talk | contribs) |
Latest revision as of 07:19, 22 April 2019
Identification
Accession IDs for bZIP transcription factor domain:
Pfam-based approach
HMM domain clans
Protein domain information obtained from Pfam database.
- bZIP_1 (PF00170)
Regarded as the main domain for bZIP transcription factor. Clearly distinguishes basic region and leucine zipper.
→ Used for further analysis.
- bZIP_2 (PF07716)
Conserved basic region, weaker leucine zipper.
- bZIP_Maf (PF03131)
No distinct basic region. No clear leucine zipper interface.
Downloading and generating HMM consensus sequence file
WD: 63:/data6/chojam96/bZIP/identification
- Download Stockholm alignment file from Pfam bZIP_1 > Alignment > Download options > Seed
Seed: the curated alignment from which the HMM for the family is built.
- Use 'hmmbuild' command from HMMER software to convert Stockholm alignment into a profile HMM
After gunzipping, use the following command.
hmmbuild [options] hmmfile alignfile
hmmbuild reads a multiple sequence alignment file (alignfile), builds a new profile HMM, and saves the HMM in hmmfile.
For the HMMER commands that follow, refer to this page.
hmmbuild PF00170.hmm PF00170.seed
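The gunzip and hmmbuild steps above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the name of the downloaded .gz file is not given in these notes and is therefore a parameter, and hmmbuild is assumed to be on PATH.

```python
import gzip
import shutil
import subprocess


def gunzip(src, dest):
    """Decompress a gzip file (e.g. the downloaded seed alignment) to dest."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def hmmbuild_command(hmmfile, alignfile):
    """Argument list for 'hmmbuild hmmfile alignfile' (no options)."""
    return ["hmmbuild", hmmfile, alignfile]


def build_profile(seed_gz, alignfile, hmmfile):
    """Gunzip the seed alignment, then run hmmbuild on it.

    Assumes the HMMER 'hmmbuild' executable is installed and on PATH.
    """
    gunzip(seed_gz, alignfile)
    subprocess.run(hmmbuild_command(hmmfile, alignfile), check=True)
```

With the file names used here, the call would be build_profile(<downloaded .gz>, "PF00170.seed", "PF00170.hmm").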
hmmscanning the protein sequence
- Use 'hmmpress' command for preparing HMM database
hmmpress [options] hmmfile
This produces four binary auxiliary files (.h3f, .h3i, .h3m, .h3p) next to the HMM file.
- Run 'hmmscan' command against the Vigna radiata peptide sequence FASTA file
hmmscan [options] hmmdb seqfile
hmmscan --tblout PF00170.out PF00170.hmm Vradi_ver6.fa.cds.primary.fasta.pepshorten.fa
No primary options were used. The peptide file used for hmmscan contained only the main transcripts (*.1, n=22427).
PF00170.out (result file of hmmscan) contained 147 nonredundant genes.
Out of the 147 genes, 144 had a complete peptide sequence, marked by the termination codon sign (*).
- Get corresponding protein sequences
python getgenes.py PF00170.out Vradi_ver6.fa.cds.primary.fasta.pepshorten.fa bZIP.primary.fa
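getgenes.py itself is not shown on this page. A minimal sketch of what such a script might do is given below, assuming the standard hmmscan --tblout layout (comment lines start with '#', and the third whitespace-separated column is the query sequence name) and FASTA headers whose first word is the gene ID. All function names are hypothetical.

```python
def read_tblout_queries(tblout_path):
    """Return unique query (sequence) names from an hmmscan --tblout file, in order."""
    names, seen = [], set()
    with open(tblout_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comment and blank lines
            fields = line.split()
            if len(fields) < 3:
                continue
            query = fields[2]  # for hmmscan, column 3 is the query sequence name
            if query not in seen:
                seen.add(query)
                names.append(query)
    return names


def read_fasta(path):
    """Yield (id, sequence) pairs; id is the first word after '>'."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                parts = line[1:].split()
                name, chunks = (parts[0] if parts else ""), []
            elif name is not None and line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_hits(tblout_path, fasta_path, out_path):
    """Write the sequences of all hmmscan hits to out_path.

    Returns (number written, number ending with the '*' termination sign).
    """
    wanted = set(read_tblout_queries(tblout_path))
    n_written = n_complete = 0
    with open(out_path, "w") as out:
        for name, seq in read_fasta(fasta_path):
            if name in wanted:
                out.write(f">{name}\n{seq}\n")
                n_written += 1
                n_complete += seq.endswith("*")
    return n_written, n_complete
```

The second return value corresponds to the count of complete peptide sequences, under the assumption that completeness is judged only by a trailing '*'.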
Confirming domain presence
Trickiest part of identifying bZIP genes - the best way is to pick and choose gene by gene, based on explicit measures.
Align peptide sequence file (bZIP.primary.fa, n=147) by 'MUSCLE' to manually see the alignment
muscle -in bZIP.primary.fa -out bZIP.primary.aligned.fa
Toggled sequence at 50% level to check conserved domains, but no consensus found.
Use 'interproscan' and 'NCBI CD search' for domain detection
- Interproscan
- NCBI CD (conserved domain) search: For one query, use CD-search. For multiple queries, use Batch CD-search.
19.04.21 → search ID: QM3-qcdsearch-1F885D58AB029684
Two types of results: hitdata.tsv, featdata.tsv
hitdata.tsv contains 140 genes; the remaining 7 genes were confirmed as queries 'with no domain hits'.
hitdata.tsv contains domains, whereas featdata.tsv contains types of regions (e.g. DNA-binding).
1. From hitdata.tsv, exclude any gene that does not have 'bZIP' or 'BRLZ (SMART accession)' in the domain description
python getbzip.py hitdata.tsv hitdata.bzip.txt
74 queries detected in hitdata.bzip.txt
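getbzip.py is not shown here. The filter in step 1 could look roughly like the sketch below. It assumes the usual Batch CD-search hitdata layout: tab-separated, '#' comment lines at the top, a header row containing 'Query' and 'Short name' columns, and query fields of the form 'Q#1 - >GeneID'. These column names, the query format and the function names are assumptions and should be checked against the actual file.

```python
import csv


def gene_id(query_field):
    """Extract the gene ID from a query field such as 'Q#1 - >Vradi01g12230.1'."""
    text = query_field.split(">", 1)[-1]
    parts = text.split()
    return parts[0] if parts else ""


def bzip_queries(hitdata_path, keywords=("bzip", "brlz")):
    """Return unique gene IDs with a hit whose short name mentions any keyword."""
    kept, seen = [], set()
    header = None
    with open(hitdata_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comment lines
            if header is None:
                header = row  # first non-comment row is taken as the header
                q_col = header.index("Query")
                name_col = header.index("Short name")
                continue
            if len(row) <= max(q_col, name_col):
                continue
            short_name = row[name_col].lower()  # case-insensitive match
            if any(k in short_name for k in keywords):
                gid = gene_id(row[q_col])
                if gid not in seen:
                    seen.add(gid)
                    kept.append(gid)
    return kept


def write_ids(ids, out_path):
    """Write one gene ID per line."""
    with open(out_path, "w") as out:
        for gid in ids:
            out.write(gid + "\n")
```

Matching is done case-insensitively on the short name only; if the description of interest sits in another column, name_col would need to change.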
2. Extract description, accession information from hitdata.tsv
python extractbzip.py hitdata.tsv hitdata.bzip.txt hitdata.bzip.tsv
python getaccession.py hitdata.bzip.tsv hitdata.bzip.accession
- hitdata.bzip.accession
Gene_ID          Domain_start  Domain_end  Domain_length  Accession  Description
Vradi01g12230.1  63            108         46             cd14703    bZIP_plant_RF2
-                63            108         46             cl21462    bZIP
-                66            108         43             cd14704    bZIP_HY5-like
-                66            107         42             cd14707    bZIP_plant_BZIP46
-                63            107         45             cd14702    bZIP_plant_GBF1
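The layout of hitdata.bzip.accession can be reproduced with a short sketch like the one below. It assumes that Domain_length is computed as end - start + 1 (consistent with the values listed, e.g. 63 to 108 gives 46) and that '-' stands for "same gene as the row above". getaccession.py is not shown, so the function name and input structure are hypothetical.

```python
HEADER = ["Gene_ID", "Domain_start", "Domain_end",
          "Domain_length", "Accession", "Description"]


def format_accession_table(hits):
    """Format hits as tab-separated lines.

    hits: list of (gene_id, start, end, accession, description) tuples,
    already grouped by gene. Repeated gene IDs are printed as '-'.
    """
    lines = ["\t".join(HEADER)]
    previous = None
    for gene, start, end, accession, description in hits:
        label = "-" if gene == previous else gene
        length = end - start + 1  # inclusive coordinates
        lines.append("\t".join(
            [label, str(start), str(end), str(length), accession, description]))
        previous = gene
    return lines


if __name__ == "__main__":
    example = [
        ("Vradi01g12230.1", 63, 108, "cd14703", "bZIP_plant_RF2"),
        ("Vradi01g12230.1", 63, 108, "cl21462", "bZIP"),
        ("Vradi01g12230.1", 66, 108, "cd14704", "bZIP_HY5-like"),
    ]
    print("\n".join(format_accession_table(example)))
```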