Bin mapping
Latest revision as of 01:49, 13 July 2020
Basic principle
Reference paper: High-throughput genotyping by whole-genome resequencing, Huang et al., Genome Res., 2009
- SNPs between the two genome sequences were identified as potential markers for genotyping.
- The initial marker pool is the set of SNPs between the two parents
- This is because the parents have high sequencing quality
Procedures
WD: 244:/hayasen/chojam/bin_mapping
Files in the WD:
1) variant.vcf -> /hayasen/Workspace/YoonMY/variant.vcf (variants called from the resequencing data of Cheongja 3 and Buseok)
2) UV.new.chojam.variant.vcf -> ../new_GBS/UV.new.chojam.variant.vcf (CB population GBS data)
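Analysis steps:
1-1) Filter out variants of very low quality. A minimum quality of 30 (q30) is always applied as the baseline. Depth is not considered.
1-2) Keep SNPs only. InDels are not used in the bin mapping strategy.
 vcftools --vcf variant.vcf --out variant.q30.SNP --minQ 30 --remove-indels --recode
 vcftools --vcf UV.new.chojam.variant.vcf --out UV.new.chojam.q30.SNP --minQ 30 --remove-indels --recode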
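To make the two filters in step 1 concrete, here is a minimal Python sketch of roughly what they do to a VCF. It is not a replacement for vcftools: the function names are hypothetical, and the exact boundary handling (QUAL exactly 30, unusual ALT alleles) is an assumption that may differ from vcftools.

```python
def is_snp(ref, alt):
    """True when REF and every ALT allele are single nucleotides."""
    alleles = [ref] + alt.split(",")
    return all(len(a) == 1 and a in "ACGTN" for a in alleles)


def filter_vcf_lines(lines, min_qual=30.0):
    """Yield header lines plus records with QUAL >= min_qual that are SNPs.

    Assumption: records with a missing QUAL ('.') are dropped.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        if qual == "." or float(qual) < min_qual:
            continue
        if is_snp(ref, alt):
            yield line
```

With a file on disk, `filter_vcf_lines(open("variant.vcf"))` could be written straight to an output file with `writelines`.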
2) From the parental variants, select only those that are homozygous and polymorphic.
python parent_homo.py variant.q30.SNP.recode.vcf variant.q30.SNP.homo.vcf
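The source of parent_homo.py is not shown on this page, so the following is only a sketch of what a "homozygous and polymorphic" test might look like. It assumes the parental VCF has exactly two sample columns (the two parents) and that GT is the first FORMAT field; the function names are hypothetical.

```python
def homozygous_allele(sample_field):
    """Return the allele index if the genotype is homozygous, else None."""
    gt = sample_field.split(":")[0].replace("|", "/")
    alleles = gt.split("/")
    if len(alleles) == 2 and alleles[0] == alleles[1] and alleles[0] != ".":
        return alleles[0]
    return None


def is_homo_polymorphic(record_line):
    """True when both parents are homozygous and carry different alleles.

    Assumption: columns 10 and 11 of the VCF record are the two parents.
    """
    fields = record_line.rstrip("\n").split("\t")
    first = homozygous_allele(fields[9])
    second = homozygous_allele(fields[10])
    return first is not None and second is not None and first != second
```

A site such as 0/0 versus 1/1 passes, while 0/1 versus 1/1 (heterozygous parent) or 1/1 versus 1/1 (not polymorphic) does not.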
Number of SNPs remaining after these filters:
variant.q30.SNP.homo.vcf -> 317,561 SNPs
UV.new.chojam.q30.SNP.recode.vcf -> 153,384 SNPs
3) The parents come from resequencing; the population comes from GBS
CJ3, BS resequencing data: 244:/hayasen/Workspace/YoonMY/variant.vcf
Filtering options: SNP-only, minQ = 30, homozygous & polymorphic (parent_homo.py)
Raw files prepared so far:
Parent resequencing data -> variant.q30.SNP.homo.vcf
Population GBS data -> UV.new.chojam.q30.SNP.recode.vcf
4) Keep only the loci positions shared between the parent variants and the population variants
- Extracting the intersection
 python common_loci.py [parent_resequencing_vcf] [population_GBS_vcf]
 python common_loci.py variant.q30.SNP.homo.vcf UV.new.chojam.q30.SNP.recode.vcf
 -> produces parent_vcf.out and population_vcf.out
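common_loci.py itself is not reproduced here. As an illustration of the intersection step, the sketch below keeps only records whose (CHROM, POS) pair occurs in both inputs. The function names are hypothetical, and dropping the header lines from the output is an assumption.

```python
def locus_key(line):
    """Return (chromosome, position) for one tab-separated VCF record."""
    chrom, pos = line.split("\t", 2)[:2]
    return chrom, int(pos)


def records(lines):
    """Drop VCF header lines (those starting with '#')."""
    return [line for line in lines if not line.startswith("#")]


def common_loci(parent_lines, population_lines):
    """Return (parent_records, population_records) restricted to shared loci."""
    parent = records(parent_lines)
    population = records(population_lines)
    shared = {locus_key(l) for l in parent} & {locus_key(l) for l in population}
    return (
        [l for l in parent if locus_key(l) in shared],
        [l for l in population if locus_key(l) in shared],
    )
```

Both returned lists keep the order of their own input, which is why a sorting step is still needed before the two files are read side by side.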
- Sorting, excluding scaffolds
 sort -V variant.q30.SNP.homo.vcf.out > parent.out
 grep -v 'sca' parent.out > parent.chr.out
 -> sort -V variant.q30.SNP.homo.vcf.out | grep -v 'sca' > parent.chr.out

 sort -V UV.new.chojam.q30.SNP.recode.vcf.out > popult.out
 grep -v 'sca' popult.out > popult.chr.out
 -> sort -V UV.new.chojam.q30.SNP.recode.vcf.out | grep -v 'sca' > popult.chr.out
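For readers who want the same effect inside Python, the sketch below approximates the shell pipeline: it drops every line containing 'sca' (as grep -v 'sca' does) and orders the rest by chromosome name and numeric position. It is not identical to sort -V, which version-sorts whole lines; the function names are hypothetical.

```python
import re


def natural_key(text):
    """Split digits from non-digits so that 'chr2' sorts before 'chr10'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", text)]


def sort_and_drop_scaffolds(lines):
    """Remove lines containing 'sca', then sort by (chromosome, position)."""
    kept = [line for line in lines if "sca" not in line]

    def key(line):
        fields = line.split("\t")
        return natural_key(fields[0]), int(fields[1])

    return sorted(kept, key=key)
```

Applying the same ordering to both files matters more than the exact ordering chosen, because the merge step pairs the files line by line.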
- Merging
 python iteration.py [parent_resequencing_vcf] [population_GBS_vcf] [common_loci_out]
 python iteration.py parent.chr.out popult.chr.out parent.popult.GT.out
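Python izip/zip (iterating over two files simultaneously): https://stackoverflow.com/questions/3322419/how-to-iterate-across-lines-in-two-files-simultaneously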
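The merge step reads the two sorted files in lock step. The sketch below shows that pattern with the built-in zip (Python 3; itertools.izip is the Python 2 counterpart). The output layout, CHROM and POS followed by the parent genotypes and then the population genotypes, is an assumption, since the format of parent.popult.GT.out is not documented here.

```python
def genotypes(fields):
    """Extract the GT part of every sample column (columns 10 onward)."""
    return [sample.split(":")[0] for sample in fields[9:]]


def merge_records(parent_lines, population_lines):
    """Pair records line by line and join their genotype columns.

    Note: zip stops at the shorter input, so both inputs are expected
    to hold the same loci in the same order.
    """
    merged = []
    for p_line, q_line in zip(parent_lines, population_lines):
        p = p_line.rstrip("\n").split("\t")
        q = q_line.rstrip("\n").split("\t")
        if (p[0], p[1]) != (q[0], q[1]):
            raise ValueError(f"locus mismatch: {p[0]}:{p[1]} vs {q[0]}:{q[1]}")
        merged.append("\t".join([p[0], p[1]] + genotypes(p) + genotypes(q)))
    return merged
```

Raising on a locus mismatch is a deliberate choice in this sketch: a silent misalignment would assign genotypes to the wrong marker.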
5) Change into .loc form (genotypes 'A' and 'B') and transpose
 python loc_form.py parent.popult.GT.out parent.popult.GT.loc
 python transpose.py parent.popult.GT.loc parent.popult.GT.loc.transposed
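As a rough illustration of the .loc conversion, the sketch below recodes each population genotype as 'A' when it equals the first parent's genotype and 'B' when it equals the second parent's. The input layout (CHROM, POS, two parent genotypes, then population genotypes), the marker name built from CHROM and POS, and the '-' code for every other genotype are all assumptions; loc_form.py may handle these differently.

```python
def to_loc_row(fields):
    """Convert one merged record into a marker name plus A/B codes.

    Assumed layout: [chrom, pos, parent1_GT, parent2_GT, sample_GT, ...].
    """
    chrom, pos = fields[0], fields[1]
    parent1 = fields[2].replace("|", "/")
    parent2 = fields[3].replace("|", "/")
    codes = []
    for gt in fields[4:]:
        gt = gt.replace("|", "/")
        if gt == parent1:
            codes.append("A")
        elif gt == parent2:
            codes.append("B")
        else:
            codes.append("-")  # heterozygous or missing: assumption
    return [f"{chrom}_{pos}"] + codes
```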
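Numpy array: https://www.pluralsight.com/guides/different-ways-create-numpy-arrays
Numpy transpose, shape: https://numpy.org/doc/stable/reference/generated/numpy.transpose.html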
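The transpose step can be sketched with a NumPy string array: rows (one per marker) become columns, so each output row holds one individual across all markers. This is only an outline of what transpose.py might do; the function name is hypothetical and every input row is assumed to have the same number of fields.

```python
import numpy as np


def transpose_table(rows):
    """Turn a list of equal-length rows into its transposed 2-D array."""
    table = np.array(rows, dtype=str)
    if table.ndim != 2:
        raise ValueError("all rows must have the same number of fields")
    return table.T  # shape (n_columns, n_rows)
```

For example, two markers scored on two individuals give an array of shape (2, 2) before the transpose and, with the marker-name column included, shape (3, 2) after it.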