|Summary||a 41k year old member of the genus Homo|
Explained at http://snpedia.blogspot.com/2012/02/denisova.html
I've read your blogpost on the Denisova genome and I used your Rs-ID-annotated VCF file you provide to transform it into a 23andMe-style file which I uploaded to openSNP (you can find the page here: http://opensnp.org/users/278) as I thought users might be interested in comparing their SNPs with Denisova. Thanks for your work on this!
But I've found some strange things after converting the file and either I'm not clever enough to parse VCF or this is an upstream-error. The thing: You mentioned in your blogpost that there are Y-chromosomal SNP-calls. But I also found that those Y-chromosomal SNPs seem to be diploid and heterozygotous for some sites. And the same is true for the mitochondrial SNPs
This is a line of the annotated VCF as an example:
Y 59033159 rs28447236 A G 1585.24 . AC=1;AF=0.50;AN=2;BaseQRankSum=2.001;DB;DP=151;Dels=0.01;FS=14.641;HRun=0;HaplotypeScore=15.4950;MQ=33.20;MQ0=2;MQRankSum=0.678;QD=10.50;ReadPosRankSum=-0.114 GT:AD:DP:GQ:PL 0/1:89,60:149:99:1615,0,2746
As far as I've understood VCF the genotype is encoded by the cell 0/1:89,60:149:99:1615,0,2746 with 0/1 being the genotype (0 = major allele, 1 = minor allele), is this right? If I'm right: Do you have a clue why those SNPs got called as diploid?
When in doubt, the X&Y weirdness is due to the pseudoautosomal region
While experimenting with my own confusion in the regions these commands came in handy, so I repaste them here
grab the Y snps
gzip -dc denisovan1.vcf.gz | grep '^Y' | cut -f2 > thepos.txt
count how many are in each 1mb region
perl -p -e'$_ = int($_/1000000)."\n"' thepos.txt | uniq -c
43 snps in megabase 2 540 snps in megabase 3 368 snps in megabase 4 273 ... 5 205 ... 6 116 7 40 8 3370 9 3084 10 11389 13 168 14 111 15 83 16 87 17 61 18 49 19 6 20 91 21 46 22 59 23 9 24 3 25 6 26 1 27 1039 28 1513 58 748 59
Note that megabase 13 has a ton of snps. (betcha didn't know 11389 snps weighs 2000lbs).
My weird SNPs weren't well clustered to the PARs, but there were some spikes around 13Mb and 58Mb.
In the wiki text they talk about "with proper, full data processing as per the best practices in the GATK should lead to very good calls". Well I didn't do best practices (never found an obvious link to any) I ran with defaults, and I figured that contributed to the weirdness.
I've not been able to explain it, except to conclude that Unified Genotyper doesn't do well with haploid, and hoped others might dig deeper. So far it seems you're the first.
I think this conversation is worthy of wider awareness, so I'm reposting to http://www.snpedia.com/index.php?title=User:Denisova