PLINK GWAS data processing workflow

Uploaded by: 二***  Document ID: 4549497  Upload date: 2024-09-28  Format: DOCX  Pages: 25  Size: 38.75 KB

Data management

Generate binary fileset

--make-bed

--make-bed creates a new PLINK 1 binary fileset, after applying sample/variant filters and other operations below. For example,

  plink --file text_fileset --maf 0.05 --make-bed --out binary_fileset

does the following:

1. Autogenerate binary_fileset-temporary.bed + .bim + .fam. (The MAF filter has not yet been applied at this stage. See the Order of operations page for more details.)
2. Read binary_fileset-temporary.bed + .bim + .fam. Calculate MAFs. Remove all variants with MAF < 0.05 from the current analysis.
3. Generate binary_fileset.bed + .bim + .fam. Any samples/variants removed from the current analysis are also not present in this fileset. (This is the --make-bed step.)
4. Delete binary_fileset-temporary.bed + .bim + .fam.

In contrast, the fileset left behind by --keep-autoconv is just the result of step 1.

--make-just-bim
--make-just-fam

--make-just-bim is a variant of --make-bed which only generates a .bim file, and --make-just-fam plays the same role for .fam files. Unlike most other PLINK commands, these do not require the main input to include a .bed file (though you won't have access to many filtering flags when using these in no-.bed mode).

Use these cautiously. It is very easy to desynchronize your binary genotype data and your .bim/.fam indexes if you use these commands improperly. If you have any doubt, stick with --make-bed.

Generate text fileset

--recode
--recode-allele [filename]

--recode creates a new text fileset, after applying sample/variant filters and other operations. By default, the fileset includes a .ped and a .map file, readable with --file.

The '12' modifier causes A1 (usually minor) alleles to be coded as '1' and A2 alleles to be coded as '2', while '01' maps A1 to 0 and A2 to 1. (PLINK forces you to combine '01' with --output-missing-genotype when this is necessary to prevent missing genotypes from becoming indistinguishable from A1 calls.)

The '23' modifier causes a 23andMe-formatted file to be generated. This can only be used on a single sample's data (a one-line --keep file may come in handy here). There is currently no special handling of the XY pseudo-autosomal region.

The 'AD' modifier causes an additive (0/1/2) + dominant (het = 1, otherwise 0) component file, suitable for loading from R, to be generated. 'A' is the same, except without the dominance component.
  o By default, A1 alleles are counted; this can be customized with --recode-allele. --recode-allele's input file should have variant IDs in the first column and allele IDs in the second.
  o By default, the header line for .raw files only names the counted alleles. To include the alternate allele codes as well, add the 'include-alt' modifier.
  o Haploid additive components are 0/2-valued instead of 0/1-valued, to maintain a consistent scale on the X chromosome.
See also --R.

The 'A-transpose' modifier causes a variant-major additive component file to be generated. This can also be used with --recode-allele.

The 'beagle' modifier causes unphased per-autosome .dat and .map files, readable by BEAGLE 3.3 and earlier, to be generated, while 'beagle-nomap' generates a single .dat file (no chromosome splitting occurs in this case).

The 'bimbam' modifier causes a BIMBAM-formatted fileset to be generated. If your input data only contains one chromosome, you can use 'bimbam-1chr' instead to write a two-column .pos.txt file.

If all allele codes are single-character, you can use the 'compound-genotypes' modifier to omit the space between each pair of allele codes in a single genotype call when generating a .ped + .map fileset. You will need to use the --compound-genotypes flag to load this data in PLINK 1.07, but it's not needed for PLINK 1.9.

The 'fastphase' modifier causes per-chromosome fastPHASE files to be generated. If your input data only contains one chromosome, you can use 'fastphase-1chr' instead to exclude the chromosome number from the file extension.

The 'HV' modifier causes a Haploview-format .ped + .info fileset to be generated per chromosome. 'HV-1chr' is analogous to 'fastphase-1chr'.

The 'lgen' modifier causes a long-format fileset, loadable with --lfile, to be generated. 'lgen-ref' is equivalent to PLINK 1.07 --recode-lgen --with-reference.

The 'list' modifier causes a genotype-based list to be generated. This does not produce a .fam or .map file.

The 'oxford' modifier causes an Oxford-format .gen + .sample fileset to be generated. If you also include the 'gen-gz' modifier, the .gen file is gzipped.

The 'rlist' modifier causes a rare-genotype fileset to be generated (similar to --list's output, but with .fam and .map files, and without homozygous major genotypes).

With the 'list' and 'rlist' formats, the 'omit-nonmale-y' modifier causes nonmale genotypes to be omitted on the Y chromosome.

The 'structure' modifier causes a Structure-format file to be generated.

The 'transpose' modifier causes a transposed text fileset, loadable with --tfile, to be generated.

The 'vcf', 'vcf-fid', and 'vcf-iid' modifiers result in production of a VCFv4.2 file. 'vcf-fid' and 'vcf-iid' cause family IDs and within-family IDs respectively to be used for the sample IDs in the last header row, while 'vcf' merges both IDs and puts an underscore between them (in this case, a warning will be given if an ID already contains an underscore).
If the 'bgz' modifier is added, the VCF file is block-gzipped. (Gzipping of other --recode output files is not currently supported.)
The A2 allele is saved as the reference and normally flagged as not based on a real reference genome ('PR' INFO field value). When it is important for reference alleles to be correct, you'll usually also want to include --a2-allele and --real-ref-alleles in your command.

The 'tab' modifier makes the output mostly tab-delimited instead of mostly space-delimited when the format permits both delimiters. 'tabx' and 'spacex' force all tabs and all spaces, respectively. (See this page for guidelines on swapping tabs/spaces in other contexts.)

For example,

  plink --bfile binary_fileset --recode --out new_text_fileset

generates new_text_fileset.ped and new_text_fileset.map from the data in binary_fileset.bed + .bim + .fam, while

  plink --bfile binary_fileset --recode vcf-iid --out new_vcf

generates new_vcf.vcf from the same data, removing family IDs in the process.

Irregular output coding

--output-chr [MT code]

Normally, autosomal/sex/mitochondrial chromosome codes in PLINK output files are numeric, e.g. '23' for human X. --output-chr lets you specify a different coding scheme by providing the desired human mitochondrial code; supported options are '26' (default), 'M', 'MT', '0M', 'chr26', 'chrM', and 'chrMT'. (PLINK 1.9 correctly interprets all of these encodings in input files.)

--output-missing-genotype [char]
--output-missing-phenotype [string]

--output-missing-genotype allows you to change the character (normally the --missing-genotype value) used to represent missing genotypes in PLINK output files, while --output-missing-phenotype changes the string (normally the --missing-phenotype value) representing missing phenotypes.

Note that these flags do not affect --bmerge/--merge-list or the autoconverters, since they generate files that may be reloaded during the same run. Add --make-bed if you want to change missing genotype/phenotype coding when performing those operations.

Set blocks of genotype calls to missing

--zero-cluster [filename]

If clusters have been defined, --zero-cluster takes a file with variant IDs in the first column and cluster IDs in the second, and sets all the corresponding genotype calls to missing. See the PLINK 1.07 documentation for an example.

This flag must now be used with --make-bed and no other output commands (since PLINK no longer keeps the entire genotype matrix in memory).

Heterozygous haploid errors

--set-hh-missing

Normally, heterozygous haploid and nonmale Y chromosome genotype calls are logged to plink.hh and treated as missing by all analysis commands, but left undisturbed by --make-bed and --recode (since, once gender and/or chromosome code errors have been fixed, the calls are often valid). If you actually want --make-bed/--recode to erase this information, use --set-hh-missing. (The scope of this flag is a bit wider than for PLINK 1.07, since commands like --list and --recode-rlist which previously did not respect --set-hh-missing have been consolidated under --recode.)

Note that the most common source of heterozygous haploid errors is imported data which doesn't follow PLINK's convention for representing the X chromosome pseudo-autosomal region. This should be addressed with --split-x below, not --set-hh-missing.

--set-mixed-mt-missing

Mitochondrial DNA is subject to heteroplasmy, so PLINK 1.9 permits 'heterozygous' genotypes and treats MT more like a diploid than a haploid chromosome. However, some analytical methods don't use mixed MT genotype calls, and instead assume that no heterozygous MT calls exist. The --set-mixed-mt-missing flag can be used with --make-bed/--recode to export a dataset with mixed MT calls erased.

X chromosome pseudo-autosomal region

--split-x [last bp position of head] [first bp position of tail]
--split-x [build code]
--merge-x

PLINK prefers to represent the X chromosome's pseudo-autosomal region as a separate 'XY' chromosome (numeric code 25 in humans); this removes the need for special handling of male X heterozygous calls. However, this convention has not been widely adopted, and as a consequence, heterozygous haploid errors are commonplace when PLINK 1.07 is used to handle X chromosome data. The new --split-x and --merge-x flags address this problem.

Given a dataset with no preexisting XY region, --split-x takes the base-pair position boundaries of the pseudo-autosomal region, and changes the chromosome codes of all variants in the region to XY. As (typo-resistant) shorthand, you can use one of the following build codes:

  'b36'/'hg18': NCBI build 36/UCSC human genome 18, boundaries 2709521 and ...
  'b37'/'hg19': GRCh37/UCSC human genome 19, boundaries 2699520 and ...
  'b38'/'hg38': GRCh38/UCSC human genome 38, boundaries 2781479 and ...

By default, PLINK errors out if no variants would be affected by the split. This behavior may break data conversion scripts which are intended to work on e.g. VCF files regardless of whether or not they contain pseudo-autosomal region data; use the 'no-fail' modifier to force PLINK to always proceed in this case.

Conversely, in preparation for data export, --merge-x changes chromosome codes of all XY variants back to X (and 'no-fail' has the same effect). Both of these flags must be used with --make-bed and no other output commands.

Mendel errors

--set-me-missing

In combination with --make-bed, --set-me-missing scans the dataset for Mendel errors and sets implicated genotypes (as defined in the --mendel table) to missing. --mendel-duos causes samples with only one parent in the dataset to be checked, while --mendel-multigen causes (great-)^n grandparental data to be referenced when a parental genotype is missing. It is no longer necessary to combine this with e.g. '--me 1 1' to prevent the Mendel error scan from being skipped. Results may differ slightly from PLINK 1.07 when overlapping trios are present, since genotypes are no longer set to missing before scanning is complete.

Fill in missing calls

--fill-missing-a2

It can be useful to fill in all missing calls in a dataset, e.g. in preparation for using an algorithm which cannot handle them, or as a 'decompression' step when all variants not included in a fileset can be assumed to be homozygous reference matches and there are no explicit missing calls that still need to be preserved.

For the first scenario, a sophisticated imputation program such as BEAGLE or IMPUTE2 should normally be used, and --fill-missing-a2 would be an information-destroying operation bordering on malpractice. However, sometimes the accuracy of the filled-in calls isn't important for whatever reason, or you're dealing with the second scenario. In those cases you can use the --fill-missing-a2 flag (in combination with --make-bed and no other output commands) to simply replace all missing calls with homozygous A2 calls. When used in combination with --zero-cluster/--set-hh-missing/--set-me-missing, this always acts last.

You may want to combine this with --a2-allele below.

Update variant information

--set-missing-var-ids [template string]
--new-id-max-allele-len [n]
--missing-var-code [missing ID string]

Whole-exome and whole-genome sequencing results frequently contain variants which have not been assigned standard IDs. If you don't want to throw out all of that data, you'll usually want to assign them chromosome-and-position-based IDs.

--set-missing-var-ids provides one way to do this. The parameter taken by these flags is a special template string, with a '@' where the chromosome code should go, and a '#' where the base-pair position belongs. (Exactly one '@' and one '#' must be present.) For example, given a .bim file starting with

  chr1 . 0 10583 A G
  chr1 . 0 886817 C T
  chr1 . 0 886817 CATTTT C
  chrMT . 0 64 T C

"--set-missing-var-ids @:#[b37]" would name the first variant 'chr1:10583[b37]', the second variant 'chr1:886817[b37]', and then error out when naming the third variant, since it would be given the same name as the second variant. (Note that this position overlap is actually present in 1000 Genomes Project phase 1 data.)

To maintain unique IDs in this situation, you can include '$1' and '$2' in your template string as well; these refer to the first and second allele names in ASCII-sort order. So, if we're using a bash shell, we can try again with

  --set-missing-var-ids @:#[b37]\$1,\$2

which would name the first variant 'chr1:10583[b37]A,G', the second variant 'chr1:886817[b37]C,T', the third variant 'chr1:886817[b37]C,CATTTT', and the fourth variant 'chrMT:64[b37]C,T'. Note the extra backslashes: they are necessary in bash because '$' is a reserved character there.

You may still get a small number of duplicate ID errors when using '$1' and '$2'. If indels are involved, it is likely that the ambiguity cannot be resolved by PLINK 1 at all, because it matters which allele is the reference allele. Instead, you must e.g. use a shell script to manually name variants in your original VCF file; see this blog post by Giulio Genovese for a detailed discussion. We apologize for the inconvenience; PLINK 2.0 will extend --set-missing-var-ids
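
To preview which IDs the '@:#[b37]$1,$2' style of template would produce before running PLINK, the naming rule described above can be mimicked with a small awk sketch. This is not PLINK's implementation: the helper name name_missing_ids is hypothetical, and the script assumes a six-column whitespace-delimited .bim file with the variant ID in column 2 and '.' as the missing-ID code. It also reports any duplicate names on stderr and returns a non-zero status if one is found.

```shell
#!/bin/sh
# Sketch only: mimics the chr:pos[tag]allele1,allele2 naming rule.
# name_missing_ids is a hypothetical helper, not part of PLINK.
# Usage: name_missing_ids <bim file> <build tag>
# Assumes a 6-column .bim with '.' as the missing variant ID.
# Rewritten lines are re-joined with single spaces.
name_missing_ids() {
    # LC_ALL=C keeps string comparison in plain byte (ASCII) order.
    LC_ALL=C awk -v tag="$2" '
        {
            if ($2 == ".") {
                # Appending "" forces string (not numeric) comparison.
                a = $5 ""
                b = $6 ""
                if (b < a) { tmp = a; a = b; b = tmp }
                $2 = $1 ":" $4 "[" tag "]" a "," b
            }
            # Flag any ID that has already been seen.
            if (seen[$2]++) {
                printf "duplicate ID: %s\n", $2 > "/dev/stderr"
                dup = 1
            }
            print
        }
        END { exit dup }
    ' "$1"
}

# Demo on the four-variant .bim excerpt quoted in the text.
demo_bim=$(mktemp)
cat > "$demo_bim" <<'EOF'
chr1 . 0 10583 A G
chr1 . 0 886817 C T
chr1 . 0 886817 CATTTT C
chrMT . 0 64 T C
EOF
name_missing_ids "$demo_bim" b37
rm -f "$demo_bim"
```

For the excerpt, the second column becomes chr1:10583[b37]A,G, chr1:886817[b37]C,T, chr1:886817[b37]C,CATTTT and chrMT:64[b37]C,T, matching the names given in the text. As the text notes for indels, a sketch like this cannot tell which allele is the reference, so it does not remove the need to name such variants in the original VCF.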
