Genes for Good
After answering 15 health history and 20 health tracking surveys on Facebook, you are able to get free genetic testing which gives you your raw data. This data is compatible with Promethease, and according to Genes for Good, the raw data consists of ~550,000 genotypes, assayed by microarray.
Genes for Good provides users with two types of data: 'unphased' and 'imputed'. Files with 'unphased' in the name contain the data that was actually observed. Imputed files include many additional calls that are inferred by imputation. Promethease can accept either type of file, or a .zip file with both. When given the .zip with both, Promethease processes all of the unphased data, and then any rs#s that are in the imputed data but not in the unphased data. It also identifies which SNPs were imputed. This currently produces the best possible report. As of July 2017, the report will have roughly 32.5k SNPs.
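The precedence described above (observed calls first, imputed calls only for rs#s that were not observed) can be sketched roughly as follows. This is not Promethease's actual code; the function name `merge_calls`, the dict layout, and the rs#s are all hypothetical and chosen only for illustration.

```python
def merge_calls(unphased, imputed):
    """Combine two {rsid: genotype} dicts, preferring observed calls.

    Returns {rsid: (genotype, is_imputed)}.
    """
    # Every observed (unphased) call is kept and marked as not imputed.
    merged = {rsid: (gt, False) for rsid, gt in unphased.items()}
    for rsid, gt in imputed.items():
        # Imputed calls only fill in rs#s the observed file lacks.
        if rsid not in merged:
            merged[rsid] = (gt, True)
    return merged


# Hypothetical genotypes, for illustration only.
unphased = {"rs0000001": "AG", "rs0000002": "CC"}
imputed = {"rs0000001": "AA", "rs0000003": "TT"}
merged = merge_calls(unphased, imputed)
for rsid, (gt, is_imputed) in sorted(merged.items()):
    print(rsid, gt, "imputed" if is_imputed else "observed")
```

In this sketch the observed AG for rs0000001 wins over the imputed AA, and rs0000003 is carried over with an imputed flag.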
Note that the report will still contain a significant number of conflicts, because those conflicts are present in the imputed data itself. We consider this a notable deficiency of the current Genes for Good imputed file format. They are aware of the issue. To illustrate the concern:
grep rs4654925 ~/Downloads/gfg/GFG8_filtered_unphased_genotypes_vcf.txt
1 20227723 rs4654925 G C . PASS . GT 0/1
grep rs4654925 ~/Downloads/gfg/GFG8_filtered_imputed_genotypes_noY_noMT_vcf.txt
1 20227723 rs4654925 G T . PASS . GT 0|0
1 20227723 rs4654925 G C . PASS . GT 0|1
which is to say: we genotyped you for G/C and decided you are a heterozygote. Then we imputed you, and if we ask whether you are G or T, we think you're GG, while if we ask whether you are G or C, we think you're GC.
There are also cases like:
grep rs2501401 ~/Downloads/gfg/GFG8_filtered_imputed_genotypes_noY_noMT_vcf.txt
1 24220063 rs2501401 A C . PASS . GT 0|0
1 24220063 rs2501401 A G . PASS . GT 1|1
grep rs2501401 ~/Downloads/gfg/GFG8_filtered_imputed_genotypes_noY_noMT_23andMe.txt
rs2501401 1 24220063 AA
rs2501401 1 24220063 GG
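Rather than grepping one rs# at a time, a short script can list every rs# that appears on more than one line of a VCF file. The sketch below assumes the single-sample layout seen in the grep output above (ten whitespace-separated columns, with the genotype last); the function name `find_repeated_rsids` is hypothetical.

```python
from collections import defaultdict


def find_repeated_rsids(vcf_lines):
    """Return {rsid: [(ref, alt, gt), ...]} for rs#s on more than one line."""
    seen = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split()
        # Assumed columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
        rsid, ref, alt, gt = fields[2], fields[3], fields[4], fields[9]
        seen[rsid].append((ref, alt, gt))
    return {rsid: calls for rsid, calls in seen.items() if len(calls) > 1}


sample = [
    "1 24220063 rs2501401 A C . PASS . GT 0|0",
    "1 24220063 rs2501401 A G . PASS . GT 1|1",
]
repeated = find_repeated_rsids(sample)
print(repeated)
```

To scan a real download, pass an open file object instead of the `sample` list.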
The following is worth understanding, especially if you will be submitting one of the inner files instead of the entire .zip.
Some Genes for Good files contain imputed data. These files will yield the largest Promethease report, but since some of your genotypes have been 'imputed', they are assumed, and may not be true. Individuals from ethnic groups that were not part of the Genes for Good training sets can expect more errors than people from groups similar to those used for training. It's unclear exactly which groups were used for training, but it's safe to assume Western Europeans are well represented.
Files that don't mention 'imputed' should only include genotypes which were actually observed. This results in a smaller Promethease report, but one with higher confidence.
As of Feb 2017 Promethease reports for the unphased (non-imputed) Genes for Good raw data have ~16,700 genotypes reported, with ~2,700 of them being from ClinVar.
Some of their files also have 'noY_noMT' in the name, indicating that SNPs from these haploid chromosomes (Y and mitochondrial) are not included. Without them it is impossible to see any data related to your haplogroups.
For a small ($2) additional fee, you can combine (pool) additional files together, which might let you have the best of both worlds.
Some additional questions are addressed at https://www.reddit.com/r/freebies/comments/67v9c5/free_dna_test_from_the_university_of_michigan/
Which Files Can Be Uploaded to Promethease?
Genes for Good appears to offer several download options. If you download a zipped file containing all your raw data, you will need to unzip that file first. There are usually 9 unzipped files, as shown here:
While Promethease can use the files in the VCF and 23andMe .txt formats, as shown in the image, we recommend using the files that are in .gz format, since they are compressed and will upload more quickly.
Interpreting a Pooled Report
If you choose to add your imputed data to your original data to get a combined Promethease report, notice that many of the genotypes in your Promethease report say 'count 2'. Those were the ones in both files, original and imputed. The ones that were only in one file don't say that. Since we expect everything from the original file to also be in the imputed file, a genotype without 'count 2' came from the imputed file alone, and that is enough to let you know what's imputed.
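The 'count 2' reasoning can be mirrored in a few lines. How Promethease computes the count internally is not stated here, so this is only a sketch of the described behaviour; `count_sources` and the rs#s are hypothetical.

```python
from collections import Counter


def count_sources(*files):
    """Count how many pooled files contain each rs#.

    Each argument is a {rsid: genotype} dict for one file.
    """
    counts = Counter()
    for calls in files:
        counts.update(calls.keys())
    return counts


# Hypothetical genotypes, for illustration only.
original = {"rs0000001": "AG", "rs0000002": "CC"}
imputed = {"rs0000001": "AG", "rs0000002": "CC", "rs0000003": "TT"}
counts = count_sources(original, imputed)
# rs#s seen in exactly one file, and that file is the imputed one.
imputed_only = sorted(r for r, n in counts.items() if n == 1 and r in imputed)
print(imputed_only)
```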
In practice there are a few genotypes that differ between the files, especially if you combine data from different companies. This relates mostly to different representations of the same information; for example, 23andMe chooses to use II or DD or DI to indicate indels (insertions and deletions). Genes for Good will usually use the actual genotype, so you'll probably see (for example) rs1234(G;G) or rs1234(;) or rs1234(-;G). By clicking the checkbox for conflicts, you can find these.
Imputed Multiple Calls
The imputed file sometimes contains multiple lines with the same rs#, but different genotypes. A specific example looks like this:
1 20227723 rs4654925 G T . PASS . GT 0|0
1 20227723 rs4654925 G C . PASS . GT 1|1
Promethease interprets this to mean that both rs4654925(G;G) and rs4654925(C;C) are reported. GfG says that they intend a different interpretation:
- The example you describe is a multi-allelic SNP. Most SNPs are biallelic, meaning that there are two possible alleles, for instance A or T. In your example, the possible alleles are G, T, and C. Since the imputed genotypes are not directly measured and are instead our best "guess" - using statistics and a reference genome of course - this results in a slightly tricky situation when dealing with multi-allelic SNPs. The first line is the imputed value if estimating between G and T. The second line is the imputed value if estimating between G and C. We discussed removing multi-allelic SNPs from the imputed files, but decided against it.
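To see how each line is read on its own, the sketch below decodes the GT field against that line's REF and ALT alleles, which is the line-by-line reading described above. The helper name `decode_genotype` is hypothetical, and missing calls ('.') are not handled.

```python
import re


def decode_genotype(ref, alt, gt):
    """Translate a VCF GT such as '0|1' into an '(allele;allele)' string."""
    # Index 0 is REF; indexes 1.. are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")
    # '/' separates unphased calls, '|' separates phased calls.
    indexes = re.split(r"[/|]", gt)
    called = [alleles[int(i)] for i in indexes]
    return "(" + ";".join(called) + ")"


lines = [
    "1 20227723 rs4654925 G T . PASS . GT 0|0",
    "1 20227723 rs4654925 G C . PASS . GT 1|1",
]
decoded = []
for line in lines:
    chrom, pos, rsid, ref, alt, *_rest, gt = line.split()
    decoded.append(rsid + decode_genotype(ref, alt, gt))
print(decoded)  # ['rs4654925(G;G)', 'rs4654925(C;C)']
```

Because each line carries only one ALT allele, decoding them independently yields two different genotypes for the same rs#, which is why the report shows a conflict.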
See also VCF.