Here we post a harmonized and imputed dataset of PLCO GWAS and exome data. It consists of all harmonizable PLCO genotype data from each completed scan of cancer cases and controls, together with the key covariates of sex and participant ID. Because PLCO is a prospective cohort, incident cancers and other diseases continue to accrue over time, so researchers need current follow-up data to define cancer case/control status precisely. To use these data, researchers should therefore obtain the genetic data from dbGaP and, in parallel, obtain up-to-date data on cancer and other diseases through the PLCO Cancer Data Access System (CDAS): http://prevention.cancer.gov/major-programs/prostate-lung-colorectal/cancer-data-access-system. CDAS also provides a wide variety of covariates and endpoints, as well as published biomarker data, which can be used for both main-effect and gene x environment studies. We believe that, together, these data will serve as a helpful resource for the entire scientific community.
Item
boolean
C3274646 (UMLS CUI [1,1])
C0150098 (UMLS CUI [1,2])
C2350277 (UMLS CUI [1,3])
C1514515 (UMLS CUI [1,4])
C5446360 (UMLS CUI [1,5])
C0006826 (UMLS CUI [1,6])
C1706256 (UMLS CUI [1,7])
C1882979 (UMLS CUI [1,8])
C3165543 (UMLS CUI [1,9])
C1709709 (UMLS CUI [1,10])
C1551358 (UMLS CUI [1,11])
C0035173 (UMLS CUI [1,12])
C1522577 (UMLS CUI [1,13])
C3274646 (UMLS CUI [1,14])
C2698971 (UMLS CUI [1,15])
C4684740 (UMLS CUI [1,16])
C2349179 (UMLS CUI [1,17])
C1879847 (UMLS CUI [1,18])
C0596609 (UMLS CUI [1,19])
C1273305 (UMLS CUI [1,20])
This PLCO dataset contains data genotyped on Illumina GSA and Oncoarray, plus historical data genotyped on Illumina OmniExpress (OmniX), Omni2.5M (Omni25) and Omni5M (Omni5). Most of the platforms used in PLCO were run separately and were processed and QCed at different times. GSA data were generated at CGR within a relatively short period. Oncoarray data were genotyped at CGR and at multiple external institutes. OmniX, Omni25 and Omni5 data were genotyped at CGR historically. Genotype data from OmniX and Omni25 were generated with different clustering files.
Item
boolean
C1514515 (UMLS CUI [1,1])
C0150098 (UMLS CUI [1,2])
C4687476 (UMLS CUI [1,3])
C1285573 (UMLS CUI [1,4])
C2987304 (UMLS CUI [1,5])
C0179312 (UMLS CUI [1,6])
C3846158 (UMLS CUI [1,7])
C0035172 (UMLS CUI [1,8])
All genotype data were prepared in the binary PLINK file format. All released data should be on the GRCh37/hg19 genome build. Chip data generated within CGR have had internal QC measures applied (iterative sample- and variant-level call rate filters at 80% and 95%), but not the more stringent pre-imputation MAF and HWE filters; QC of external data is inconsistent because it depends on provenance. Samples present in multiple genotyping datasets are released in every applicable dataset under the same synchronized PLCO ID.
Item
boolean
C1285573 (UMLS CUI [1,1])
C5401465 (UMLS CUI [1,2])
C3844091 (UMLS CUI [1,3])
C3844095 (UMLS CUI [1,4])
C0600596 (UMLS CUI [1,5])
C3846158 (UMLS CUI [1,6])
C0034378 (UMLS CUI [1,7])
C0180860 (UMLS CUI [1,8])
C2699638 (UMLS CUI [1,9])
C0919481 (UMLS CUI [1,10])
C3846158 (UMLS CUI [1,11])
C1514515 (UMLS CUI [1,12])
C2348585 (UMLS CUI [1,13])
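The iterative call rate filtering mentioned for the CGR chip data can be illustrated with a minimal sketch. This is not the CGR pipeline: the function name call_rate_filter, the in-memory matrix layout (rows are samples, columns are variants, None marks a missing call), and the order of operations (samples first, then variants, at 80% and then at 95%) are all assumptions made for illustration.

```python
def call_rate_filter(calls, thresholds=(0.80, 0.95)):
    """Return (kept_sample_indices, kept_variant_indices).

    calls: list of rows, one per sample; each row holds one genotype call
    per variant, with None for a missing call. Hypothetical layout.
    """
    samples = list(range(len(calls)))
    variants = list(range(len(calls[0]))) if calls else []

    for threshold in thresholds:
        # Sample-level filter: call rate over the variants still retained.
        samples = [
            s for s in samples
            if variants
            and sum(calls[s][v] is not None for v in variants) / len(variants) >= threshold
        ]
        # Variant-level filter: call rate over the samples still retained.
        variants = [
            v for v in variants
            if samples
            and sum(calls[s][v] is not None for s in samples) / len(samples) >= threshold
        ]
    return samples, variants
```

Running the loose pass before the strict one means a sample is not discarded at 95% only because of a few poorly called variants that the 80% pass would have removed anyway. In practice such filters are usually applied with dedicated tools on the PLINK files rather than in pure Python.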
All subjects were split and cleaned by GRAF ancestry before imputation. Specifically, the data from each platform were split into 7 ancestry groups (African+African American, East Asian+Other Asian, European, Hispanic1, Hispanic2, Other, South Asian) based on ancestry assignment using GRAF (https://github.com/ncbi/graf).
Item
boolean
C5447420 (UMLS CUI [1,1])
C2699638 (UMLS CUI [1,2])
C1710360 (UMLS CUI [1,3])
C0085756 (UMLS CUI [1,4])
C0027567 (UMLS CUI [1,5])
C0078988 (UMLS CUI [1,6])
C4540996 (UMLS CUI [1,7])
C1519427 (UMLS CUI [1,8])
C0239307 (UMLS CUI [1,9])
C0086409 (UMLS CUI [1,10])
TOPMed reference panel release 5b was used for imputation with the Michigan Imputation Server (https://imputationserver.sph.umich.edu). Pre-phasing against phased reference data from TOPMed release 5b was conducted using EAGLE 2.4 (doi: 10.1038/ng.3679). Imputation was conducted against the same reference panel using minimac4 (https://genome.sph.umich.edu/wiki/Minimac4). Because the Michigan Imputation Server limits the allowed sample size, the GSA/European dataset was split into 4 batches for imputation.
Item
boolean
C1706462 (UMLS CUI [1,1])
C2699638 (UMLS CUI [1,2])
C1554143 (UMLS CUI [1,3])
C3846158 (UMLS CUI [1,4])
C0242618 (UMLS CUI [1,5])
C0449295 (UMLS CUI [1,6])
C0150098 (UMLS CUI [1,7])
C1534709 (UMLS CUI [1,8])
Each platform/ancestry pair was cleaned using the filtering method described in https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008500. Briefly, all variants with Rsq < 0.3 were removed, consistent with traditional quality filters on MACH-style output. The remaining variants were then partitioned into minor allele frequency (MAF) bins {[0,0.0005], (0.0005,0.002], (0.002,0.005], (0.005,0.01], (0.01,0.03], (0.03,0.05], (0.05, 0.5]}. Within each bin, variants were removed in order of increasing Rsq until the average Rsq of the remaining variants in that bin was at least 0.9 (the Kowalski et al. citation suggests 0.8; the more stringent threshold has no impact on common variation).
Item
boolean
C1710360 (UMLS CUI [1,1])
C5447420 (UMLS CUI [1,2])
C1709450 (UMLS CUI [1,3])
C0180860 (UMLS CUI [1,4])
C3846158 (UMLS CUI [1,5])
C4722262 (UMLS CUI [1,6])
C0205419 (UMLS CUI [1,7])
C0919481 (UMLS CUI [1,8])
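The Rsq and MAF-bin filtering described for each platform/ancestry pair can be sketched as follows. This is a minimal illustration, not the code used to produce the release: the function names maf_bin and filter_variants and the (variant_id, maf, rsq) tuple layout are hypothetical, while the 0.3 floor, the bin edges, and the 0.9 target come from the description above.

```python
from bisect import bisect_left

# Upper edges of the MAF bins [0,0.0005], (0.0005,0.002], ..., (0.05,0.5].
MAF_UPPER_EDGES = [0.0005, 0.002, 0.005, 0.01, 0.03, 0.05, 0.5]


def maf_bin(maf):
    """Index of the right-closed MAF bin that contains maf."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError(f"MAF out of range: {maf}")
    return bisect_left(MAF_UPPER_EDGES, maf)


def filter_variants(variants, min_rsq=0.3, target_mean_rsq=0.9):
    """Return the set of variant IDs that survive both filters.

    variants: iterable of (variant_id, maf, rsq) tuples (assumed layout).
    """
    bins = {}
    for variant_id, maf, rsq in variants:
        if rsq < min_rsq:  # hard floor on imputation quality
            continue
        bins.setdefault(maf_bin(maf), []).append((rsq, variant_id))

    kept = set()
    for members in bins.values():
        members.sort(reverse=True)  # highest Rsq first
        total = sum(rsq for rsq, _ in members)
        n = len(members)
        # Drop the lowest-Rsq variants until the bin mean reaches the target.
        while n > 0 and total / n < target_mean_rsq:
            n -= 1
            total -= members[n][0]
        kept.update(variant_id for _, variant_id in members[:n])
    return kept
```

Because the bins are right-closed, a variant with MAF exactly 0.0005 falls in the first bin. A bin whose best variant is still below the target mean ends up empty under this sketch.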