With openSNP surpassing 1000 users, it's now the biggest public source of western/Caucasian genome-wide genotyping data, which opens up some new possibilities. One of the biggest challenges, though, is that most users don't even have their ethnicity marked down (although it would be suspect even if they had). I've run ADMIXTURE analysis of Razib Khan's PHYLOCORE dataset with K = 1 to 60 with an eye towards stratifying the openSNP data by ethnicity, although I've not yet decided how to classify the samples: which K to choose, and whether to cluster the results further or just take the highest ADMIXTURE cluster for each sample. At this point the expected number of non-Caucasian genotypes in openSNP data is so low that this isn't a huge concern yet. At present my test data includes 428 usable individual genotypes extracted from openSNP data.
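To make the "just choose the highest ADMIXTURE cluster" option concrete, here is a minimal Python sketch. It assumes the usual ADMIXTURE .Q layout (one row per sample, K whitespace-separated ancestry fractions); the `min_fraction` threshold and the function names are my own illustration, not something ADMIXTURE provides.

```python
from typing import List, Optional


def parse_q_matrix(text: str) -> List[List[float]]:
    """Parse .Q content: one row per sample, K ancestry fractions per row."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def assign_cluster(row: List[float], min_fraction: float = 0.5) -> Optional[int]:
    """Return the index of the largest component, or None when even the
    largest one is below min_fraction (a hypothetical 'admixed' cutoff)."""
    best = max(range(len(row)), key=lambda k: row[k])
    return best if row[best] >= min_fraction else None


# Made-up fractions for three samples at K = 3.
example_q = """0.90 0.05 0.05
0.40 0.35 0.25
0.10 0.10 0.80
"""
labels = [assign_cluster(row) for row in parse_q_matrix(example_q)]
print(labels)  # [0, None, 2]
```

Samples that come back as None are the ones where further clustering, or a different K, would actually matter.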
Here's an example of an early product of the haplotyping effort from another page:
The new PLINK 1.9 version now supports 23andMe to VCF conversion out of the box, although "conversion" is a little grandiose in this case. PLINK simply takes the first allele of each genotype and marks it as the main allele for each location rather than determining major and minor alleles, accepting the rest of the data as is. While most tools certainly won't care about this shortcut, I'm not particularly happy with the approach. In my own pipeline I have the reference locations, SNPs, and major and minor alleles pre-extracted from the Genome Reference Consortium Human Reference, and I match the 23andMe data into that in the same order. This has the effect that unexpected alleles cannot be parsed, though I consider that a feature - so far one of the openSNP files includes SNPs that don't exist in any other.
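A rough Python sketch of the reference-matching idea follows. It is not my actual pipeline code: the in-memory `reference` table, the function names, and the example rsids are all hypothetical, and it assumes the tab-separated 23andMe raw layout (rsid, chromosome, position, genotype) with alleles already on the same strand as the reference.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical pre-extracted table: (chromosome, position) -> (ref, alt)
Reference = Dict[Tuple[str, int], Tuple[str, str]]


def genotype_to_gt(genotype: str, ref: str, alt: str) -> Optional[str]:
    """Map a 23andMe genotype string to a VCF-style GT value.
    Returns None when an allele matches neither ref nor alt."""
    if set(genotype) <= {"-"}:          # "--", "-" or empty: no call
        return "./."
    codes = []
    for allele in genotype:
        if allele == ref:
            codes.append("0")
        elif allele == alt:
            codes.append("1")
        else:
            return None                 # unexpected allele: refuse to parse
    return "/".join(sorted(codes))


def convert(lines: Iterable[str], reference: Reference):
    """Return (calls keyed by site, list of rejected rsids)."""
    calls: Dict[Tuple[str, int], str] = {}
    rejected: List[str] = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.rstrip("\r\n").split("\t")
        key = (chrom, int(pos))
        gt = genotype_to_gt(genotype, *reference[key]) if key in reference else None
        if gt is None:
            rejected.append(rsid)
        else:
            calls[key] = gt
    return calls, rejected


reference = {("1", 1000): ("A", "G"), ("1", 2000): ("C", "T"), ("1", 2500): ("T", "C")}
raw = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t1000\tGA",   # heterozygous, alleles in non-reference order
    "rs2\t1\t2000\tCC",   # homozygous reference
    "rs3\t1\t3000\tAA",   # site not in the reference table
    "rs4\t1\t2500\tTG",   # G is neither ref nor alt
]
calls, rejected = convert(raw, reference)
print(calls)     # {('1', 1000): '0/1', ('1', 2000): '0/0'}
print(rejected)  # ['rs3', 'rs4']
```

The rejected list is the "feature" mentioned above: anything that doesn't fit the reference is set aside rather than silently accepted.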
Other benefits of the method include the ability to parse FTDNA files as well with only small changes, although at present openSNP data contains only about 5 individuals with Build 37 data. Build 36 data is parseable too, but would involve lifting the data over to Build 37. That would be possible to integrate into the current pipeline, but seems unjustified when so far it would only provide genotypes of about a dozen individuals while raising questions about differences between sample groups. No doubt I'll integrate them at a later stage when more samples are available from openSNP, though. Using this method also allows me to directly map haploid X to the diploid X required by BEAGLE4 and to match the chromosome names to the reference. In future it will also allow me to build a merged VCF file directly in one pass for a faster pipeline.
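The haploid-to-diploid mapping and chromosome renaming are simple enough to sketch in Python. This assumes the reference panel uses bare names such as "1", "X" and "MT"; both helper names are hypothetical, and representing a haploid call as a homozygous diploid one is a convenience for the phasing tool, not a biological claim.

```python
def diploidize(gt: str) -> str:
    """Turn a haploid GT such as '1' into a homozygous diploid '1/1'.
    Diploid calls are returned unchanged."""
    if "/" in gt or "|" in gt:
        return gt
    if gt == ".":
        return "./."
    return f"{gt}/{gt}"


def normalize_chrom(name: str) -> str:
    """Strip a 'chr' prefix and rename 'M' to 'MT' so that names match
    the assumed reference naming."""
    name = name.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    return "MT" if name == "M" else name


print(diploidize("1"), diploidize("0/1"), diploidize("."))
print(normalize_chrom("chrX"), normalize_chrom("chrM"), normalize_chrom("7"))
```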
As hinted before, the pipeline currently employs BEAGLE4. A major benefit of BEAGLE4 is that it can do pseudo-phasing, IBD detection and imputation, serving as a multi-purpose tool in the pipeline. Also, 23andMe's choice of BEAGLE as the basis of their Finch phasing tool indicates it's at least modifiable to handle the computational complexity of million-sample sets. There are some studies which suggest BEAGLE3 is marginally worse than some alternatives, although they still suggest using it. BEAGLE4 has apparently not yet been independently compared to other methods, but in their own introductory paper the authors describe it as significantly improved over BEAGLE3 and urge using it instead. One future option would be to compare BEAGLE4's performance to the competition and determine whether there is cause to substitute some tools in the pipeline.
One direction of experimentation is that current tools make use of the 1000 Genomes Phase 1 Version 3 reference panels, with 1092 samples; seeing as the Phase 3 alignment with 2535 samples has just been released, the question of how using it instead would affect results presents itself. It's a bit unclear why BEAGLE4 is still using the Phase 1 reference, although there are certainly diminishing returns from larger reference panels. The standard reference panel doesn't include 35,280 of the autosomal locations tested on 23andMe and FTDNA chips in any case. That should be because they fall under BEAGLE's MAF < 0.5% cutoff in the data, since dbSNP 137 should cover all SNPs of interest, although this assumption remains to be checked as well.
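For a sense of scale, the arithmetic behind a 0.5% MAF cutoff on a 1092-sample panel can be worked through in a few lines of Python. The 0.5% figure is the one quoted above; the function names are mine, and this is only the frequency arithmetic, not how the panel was actually filtered.

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Frequency of the rarer allele given the alternate-allele count."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    af = alt_count / total_alleles
    return min(af, 1.0 - af)


def below_cutoff(alt_count: int, total_alleles: int, cutoff: float = 0.005) -> bool:
    """True when the site would fall under the MAF cutoff."""
    return minor_allele_frequency(alt_count, total_alleles) < cutoff


total = 2 * 1092  # diploid autosomes: 2184 alleles in the Phase 1 panel
largest_excluded = max(c for c in range(total // 2 + 1) if below_cutoff(c, total))
print(largest_excluded)  # 10
```

In other words, under this assumption any autosomal variant seen in 10 or fewer of the 2184 panel chromosomes would be dropped, which makes it at least plausible that chip SNPs go missing for that reason.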
Another challenge is the lack of marked relationship status in the openSNP data. I'm still trying to determine whether it even affects BEAGLE4 in particular, although I would certainly feel much better about the allele frequency and haplotype determinations if relationships were marked. Fortunately, detection of cryptic relationships is fairly routine now, and PLINK 1.9 runs its IBD analysis very fast. It reveals 9 parent-child pairs, one parents-child trio and two pairs that appear to be grandparent-grandchild. Determining which member of an unmarked pair is the parent doesn't seem possible, but I don't think it makes a particular difference from a statistical point of view as long as they're unrelated to anyone else. Automated rules should be able to generate a pedigree file from the PLINK IBD output, though there are a number of special cases that need to be taken into account. For fine IBD determination this certainly doesn't matter, because results from multiple different runs are routinely combined to improve detail.
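As a starting point for such automated rules, here is a Python sketch that labels pairs from the Z0/Z1/Z2 columns of a PLINK `--genome` report, using the textbook expected IBD-sharing proportions. The tolerance, the labels and the example rows are my own assumptions; note that grandparent-grandchild, half-sibling and avuncular pairs all share the same expectation (0.5, 0.5, 0), so they end up under one "second-degree" label.

```python
from typing import Iterable, List, Tuple


def classify_pair(z0: float, z1: float, z2: float, tol: float = 0.15) -> str:
    """Label a pair from P(IBD=0), P(IBD=1), P(IBD=2). tol is a hypothetical slack."""
    def near(value: float, target: float) -> bool:
        return abs(value - target) <= tol

    pi_hat = z2 + 0.5 * z1              # proportion of genome shared IBD
    if near(z2, 1.0):
        return "duplicate-or-twin"
    if near(z1, 1.0) and near(z0, 0.0):
        return "parent-child"
    if near(z0, 0.25) and near(z1, 0.5) and near(z2, 0.25):
        return "full-sibling"
    if near(z0, 0.5) and near(z1, 0.5) and near(z2, 0.0):
        return "second-degree"
    if pi_hat < 0.1:
        return "unrelated"
    return "unclassified"


def classify_genome(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Read whitespace-separated .genome rows (header first) and label each pair."""
    rows = [line.split() for line in lines if line.strip()]
    col = {name: i for i, name in enumerate(rows[0])}
    return [(r[col["IID1"]], r[col["IID2"]],
             classify_pair(float(r[col["Z0"]]), float(r[col["Z1"]]), float(r[col["Z2"]])))
            for r in rows[1:]]


# Made-up rows in the .genome column layout.
genome = [
    " FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT",
    " F1 A F2 B UN NA 0.0021 0.9979 0.0000 0.4990",
    " F1 A F3 C UN NA 0.5102 0.4898 0.0000 0.2449",
    " F2 B F3 C UN NA 1.0000 0.0000 0.0000 0.0000",
]
relations = classify_genome(genome)
print(relations)
```

Anything labelled "unclassified" is exactly the kind of special case that would need a manual look before it goes into a pedigree file.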
A major concern, of course, is that the publicly shared genotype files could have been tampered with or simply contain poor quality genotypes. There are suggestions that genotyping quality is not so crucial for many of these approaches, because they determine the likelihood of a genotype from the available data anyway. It would require a major, concentrated effort to manipulate enough genotype files, without triggering IBD or HBD traps, to budge the average haplotype consensus significantly. Nevertheless this can't be ignored when using public sources. Systematic genotyping errors would certainly be a concern, and it would likely be beneficial to exclude HWE violations and extremely rare genotypes from haplotype and IBD analysis. Unfortunately they're often also some of the most interesting SNPs. The clearest way to address this will be to compare results from runs with and without those quality controls.
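To illustrate what excluding HWE violations means in practice, here is a small Python sketch of the classic one-degree-of-freedom chi-square test on genotype counts. It is a simplification (an exact test is more appropriate for rare alleles), and the function names are hypothetical; the default 3.84 is the standard p = 0.05 critical value, whereas a real QC run would normally use a much stricter threshold.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)     # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Classes with zero expectation (monomorphic sites) are skipped.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def violates_hwe(n_aa: int, n_ab: int, n_bb: int, threshold: float = 3.84) -> bool:
    """True when the statistic exceeds the chosen critical value."""
    return hwe_chi_square(n_aa, n_ab, n_bb) > threshold


print(hwe_chi_square(25, 50, 25))   # 0.0   (perfect equilibrium)
print(hwe_chi_square(50, 0, 50))    # 100.0 (no heterozygotes at all)
```

Running the phasing with and without sites flagged this way is the comparison suggested above.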
Finally, something should certainly be said about the privacy of the individuals in the sample group. People who submit their genotypes to openSNP implicitly grant permission for their use and analysis, although it would be of interest to determine just how well they understand this. Most of them have already anonymized their submission, though not all, and I doubt many of them have given consideration to the possibility of IBD analysis identifying them at a future date. Certainly the data is in the open now, so such analysis cannot be avoided. One of the paradoxes of the issue is that, whether we want to or not, we're leaving our genetic fingerprint everywhere we go, making me question whether any of us have "genetic privacy" to begin with. Certainly artists have already raised this issue, for example by exhibiting sculptures made on the basis of DNA extracted from chewing gum or cigarette stubs. What then of parties with more resources? Given the falling price of genotyping, I have serious doubts whether any of us can escape being genetically profiled by parties we'd rather not have do so. Although for now, such cases are certainly an exception.
[PMID 22883141] Phasing of Many Thousands of Genotyped Samples - suggests HAPI-UR could be faster for pure haplotype inference, but also Beagle3 is fastest on moderate sample sizes.
Somewhat related to haplotypes, certainly all these rely on solid pseudo-phasing. BEAGLE3 doesn't seem to fare too well, although it's noted that "Interestingly, unlike IMPUTE2 and the MaCH programs, the overall imputation quality of BEAGLE was not negatively affected by the inclusion of more distantly related reference subjects" and "When using BEAGLE, imputation quality of low frequency SNPs was not negatively affected by the inclusion of more distantly related populations in the AFR+EUR and ALL panels" [PMID 23226329]. It seems they filtered out low-frequency SNPs, which made BEAGLE4 perform more poorly, and given the potential interest in low-frequency SNPs and the non-homogeneous pool of samples on openSNP, this seems to somewhat justify BEAGLE's use. One comparison suggests that IMPUTE2's memory usage would be prohibitive, although most papers, like [PMID 22384356] "Genotype imputation with thousands of genomes", claim just the opposite. This might be because BEAGLE allows imputing multiple overlapping regions, which is slower but potentially allows high scaling with low memory.
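The memory argument for overlapping regions is easiest to see with a toy example. The Python sketch below only shows how a chromosome's markers could be cut into overlapping windows so that one window needs to be held in memory at a time; the function and its parameters are hypothetical and are not BEAGLE's implementation.

```python
from typing import List, Tuple


def overlapping_windows(n_markers: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) marker index ranges that cover
    0..n_markers, with consecutive windows sharing `overlap` markers."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if n_markers <= 0:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, n_markers)
        windows.append((start, end))
        if end == n_markers:
            break
        start += step
    return windows


print(overlapping_windows(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

The cost is that the overlapping markers are processed twice, which is the slowdown traded for the lower memory footprint.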
Comparing BEAGLE, IMPUTE2, and Minimac Imputation Methods for Accuracy, Computation Time, and Memory Usage - Note David Hinds' criticism.
[PMID 23226329] Assessment of Genotype Imputation Performance Using 1000 Genomes in African American Studies - Considerations for hard imputation, where IMPUTE2 seems to win.
Arguably of greatest public interest right now; who wouldn't want to know they're related 5000 years ago... even if we're all related several times over within the last 1000 years alone, it just rarely survives recombination. [PMID 23667324] "The Geography of Recent Genetic Ancestry across Europe" (incidentally, BEAGLE3 was used). IBD performance is certainly related to the other uses as well. Since most tools don't implement IBD, BEAGLE3's performance here is hard to put into context with the other uses, although IBD results are easier to score objectively provided a known pedigree is available. In the French Canadian founder population study [PMID 24129432] it was noted "Indeed, data phased with ShapeIT provided IBD results that were more strongly correlated with the genealogical information than those phased with Beagle." The pre-phasing for GERMLINE suggests BEAGLE3 might perform worse than ShapeIT, but that might be true only for this specific use and the phasing procedure used, such as the number of iterations.
The original paper on BEAGLE3's FastIBD, on the contrary, noted "If the input data are phased, GERMLINE is between 2 and 3 orders of magnitude faster than ten runs of fastIBD. Phasing data with BEAGLE takes time that is similar to one run of fastIBD, so in practice, the computation time for ten runs of fastIBD is approximately one order of magnitude larger than the computation time for GERMLINE when the phasing step is included. However, the greatly improved accuracy of fastIBD compensates for the increased computing time." [PMID 21310274] Since BEAGLE4 with its improved algorithm has been out for some time, with a further improved BEAGLE4.1 promised for early 2015, new comparisons are needed to see whether those claims can be reproduced.
I've finally had a chance to put some effort into this line of study again, and am currently finishing phasing and IBD runs of 514 23andMe V3 tested individuals from OpenSNP with the whole, unfiltered 1000 Genomes Phase 3 final release as the reference panel, using Beagle 4 r1398. The runtime per iteration of a chromosome has risen from around 10 minutes to two hours; a further order of magnitude slowdown. Given that the number of OpenSNP samples hasn't increased very much and the Beagle revision notes mostly detail improved computational efficiency (albeit also a change of singlescale from 1 to 0.8), I assume most of the change in runtime is due to using the unfiltered reference panel. Regardless, I believe using the whole panel for phasing to be beneficial, if computationally taxing, and at least any novel IBD segments can be merged with those from other runs.
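Merging IBD segments across runs can be as simple as taking the union of overlapping intervals per sample pair and chromosome. The Python sketch below does just that; the tuple layout and names are my own illustration, not Beagle's output format, and a real merge would probably also want to carry the per-segment scores along.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical segment layout: (sample1, sample2, chromosome, start_bp, end_bp)
Segment = Tuple[str, str, str, int, int]


def merge_ibd_segments(runs: Iterable[Iterable[Segment]]) -> Dict[Tuple[str, str, str], List[Tuple[int, int]]]:
    """Union the segments of all runs, per (pair, chromosome)."""
    by_pair = defaultdict(list)
    for run in runs:
        for s1, s2, chrom, start, end in run:
            a, b = sorted((s1, s2))      # treat (A, B) and (B, A) as the same pair
            by_pair[(a, b, chrom)].append((start, end))

    merged = {}
    for key, intervals in by_pair.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:      # overlaps or touches the previous segment
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[key] = [(s, e) for s, e in out]
    return merged


run1 = [("A", "B", "1", 100, 500)]
run2 = [("B", "A", "1", 400, 900), ("A", "B", "1", 2000, 2500)]
merged = merge_ibd_segments([run1, run2])
print(merged)  # {('A', 'B', '1'): [(100, 900), (2000, 2500)]}
```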
diCal-IBD from 2014 is worth looking at, if only for the demographic inference approach. Naturally it claims to be better than all other methods, though it DID consider RefinedIBD. It has some extra features like TMRCA calculation and detection of putative positive selection, which certainly make it a worthy contender if it's computationally feasible. The Parente2 paper was published in 2015; once again it's a study that claims to outperform all state-of-the-art competitors, but it looks like Beagle4's RefinedIBD was not included in the "state of the art", only the older FastIBD method. Parente2 looks interesting at least in that it claims to work directly on unphased data, though if phasing is a goal anyway then that isn't a huge benefit. Currently the program seems to exist in "Linux" binary form only, and I see no mention of a license, so it would be a hard choice to include in an analysis pipeline, but I'll keep it in mind.
[PMID 24129432] Genome-wide patterns of identity-by-descent sharing in the French Canadian founder population - FastIBD (BEAGLE3) recommended
[PMID 23535385] Improving the Accuracy and Efficiency of Identity by Descent Detection in Population Data - BEAGLE4 RefinedIBD, says better than 10 runs of BEAGLE3 for IBD
There's quite a flood of new techniques and software for related analysis now. I might just list these on the off chance I have time to delve into the details. Comments as I have them...
[PMID 23608192] Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach