Population Frequency Edit Policy
The population diversity box is used on 31116 SNPs in SNPedia. This page tries to capture how and where the information comes from and how to work with it. The formalized policy originates with a discussion between User:Jlick, User:JohnLloydScharf and User:Cariaso. At present this text is a best effort at capturing what has been said and clarifying areas of confusion, uncertainty and disagreement. In time this page should become less discussion-oriented and more of a pure policy statement.
Authors below are generally
User:Jlick (top level)
What is Obsolete
- I did not name the FOX3A gene and it is cited by the article. If FOX03A is obsolete, why did you use it in the first place? If you are saying FOX03A is obsolete, then isn't the research obsolete?
- No. HUGO changes gene names, and that doesn't mean every paper which used the old names is suddenly obsolete. Sometimes papers mislabel protein names as gene names, or use non HUGO names. In the response above, in addition to the question of the trailing A, I also note there is some confusion about O vs 0 (capital o vs zero). These are just some of the sources of chaos we have to deal with.
To allow for backwards compatibility, population template MUST use exact order for the first four groups: CEU, HCB, JPT, YRI.
- I do not know what importance or relevance "backward compatibility," has for a user of SNPedia and nothing fell apart when I changed the order, except to make the differences more visible between groups...i.e., more information.
- SNPedia's primary reason for existing is to enable Promethease and similar programs to exist. We don't build databases for the fun of it; we do it to automate the answering of questions. Backwards compatibility helps to ensure changes to SNPedia don't break dozens of tools you'll never see. The fact you couldn't see anything fall apart doesn't mean it didn't happen. In this case I think Jlick is holding to advice I gave him when he first introduced the extra populations with HapMap3. I think at this point Promethease would be unaffected, and I'm willing to see that rule relaxed, but I'm not certain it should be. There is value in having all population diversity boxes follow a given order, so that people can trust the top line is CEU even when it's missing, etc. This allows more quickly interpreting the boxes. I'm open to discussion on this, but for the moment backwards compatibility should be treated as a very important attribute.
- So, you are supposing people trust CEU for some purpose? It is certainly not an indication of European results. Listing them in the order of the major or ancestral allele makes better sense for those exploring their results. In fact, it makes the information more visible and therefore quicker to read, if you want them to be "quickly interpreting the boxes."
- You have files that are not part of these individual articles. In fact, Promethease, from what I can see, calls up files stored on SNPedia completely separately, and there are many examples of SNPs used in those files that are not documented in these articles. If you are having it read "CEU" rather than a position, then the data can be trusted. You are, in effect, throwing away a great deal of data having internally and externally valid information by being so dependent on HapMap.
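The ordering rule under discussion (CEU, HCB, JPT, YRI first) is simple enough to check mechanically. The following is only a sketch in Python; the function names are hypothetical and are not part of SNPedia, SNPediaBot, JlickBot or Promethease.

```python
# Hypothetical helpers for illustration only; not taken from any SNPedia tool.
LEGACY_ORDER = ["CEU", "HCB", "JPT", "YRI"]


def follows_legacy_order(populations):
    """Return True if the first four populations are exactly CEU, HCB, JPT, YRI."""
    return list(populations[:4]) == LEGACY_ORDER


def reorder(populations):
    """Put the legacy groups first (those present), keep the rest in given order."""
    head = [p for p in LEGACY_ORDER if p in populations]
    tail = [p for p in populations if p not in LEGACY_ORDER]
    return head + tail


if __name__ == "__main__":
    box = ["GIH", "YRI", "CEU", "JPT", "HCB"]
    print(follows_legacy_order(box))  # False for this example
    print(reorder(box))
```

A tool that reads the box by population label rather than by position would not need such a check, which is the alternative raised in the reply above.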
Accuracy and Reputation of Source
rs805264 I'm not sure if you meant to add this one as the rsid you used is for rs805297. Currently the population template is designed only for HapMap data sources. It will need to be redesigned before other sources such as POPU2 can be used. Any new sources must have good coverage and reputation for accuracy.
- Resolving the rs# confusion is important, but I'd like to strongly support the statement about good coverage and reputation.
- CEU is of doubtful use and not externally validated. If you question NCBI's inclusion of other sources, then it is your reputation that will be called into question, not theirs. Think about it. John Lloyd Scharf 17:14, 7 November 2011 (UTC)
rs10947055 and rs1109324 and rs1155974 and rs34490746 Change "HapMapRevision" or remove it if you can't figure out the data source. (If you can't figure it out, there's probably a reason it shouldn't be changed.)
- If "HapMapRevision" is important, you should be able to point out what the source is. If you cannot document it, how do we know it exists?
- That something is documented by NCBI should be enough.
- Disagreed. It's a good start, but there is too much garbage in NCBI to use that as a hard and fast rule.
- NCBI is the gold standard on this, not HapMap. It is not garbage. Excluding competing information is not scientific by modern standards of research. Mendel could throw out data because the math for probability and statistics to discover his creaming of data could not be discovered for 100 years. John Lloyd Scharf 17:14, 7 November 2011 (UTC)
Automation and Bot Reliability
- The primary reason is that if population frequency mixes versions, or includes non-standard populations, the information is very likely to be overwritten and lost in the future when new HapMap data is released. There are currently 25k SNPs in SNPedia, most of which have HapMap data. Manually editing that is hopeless, and I favor pure automation over a hopeless effort at perfection.
- There is rarely anything "pure" about automation. It compounds a single error over and over. John Lloyd Scharf 17:14, 7 November 2011 (UTC)
rs884460 It isn't necessary to add http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs884460 as it is already included in the template.
rs34490746 You got the orientation wrong. Should be plus. Also the hapmap data is suspect as dbsnp claims it is from r27 but the source data from hapmap does not include it. This is usually because someone submitted the QC- results. In any case, I've edited it for order and removed the incorrect r28 designation.
In general it is usually best to just add new SNPs with just the rsid and let the bots fill in the details later to avoid simple typos as well as other problems of manually adding entries. If there is something the bots get wrong or leave out then the change should be documented to show why the added information is more up to date or correct. This allows the bots to be improved. Jlick 20:15, 6 November 2011 (UTC)
- Let me know what criteria you are using for your decisions rather than blaming the "bot" for it. You know and admit it has flaws. If you create it, you are responsible for it, but "bot" is not an authoritative source. Y text was truncated here
- All software has bugs, MediaWiki, Semantic Mediawiki, SNPedia, SNPediaBot, Promethease, and JLickBot are no exceptions. So far User:JlickBot works amazingly well. The bugs in question are most often due to some remarkably pathological cases in the upstream data. It is important to document the specific failures you find, because they can silently be affecting many other articles.
But you believe your bot is more authoritative than that of NCBI. John Lloyd Scharf 05:24, 9 November 2011 (UTC)
The issues below are not clear to User:Cariaso please expand or remove.
- There is no reason to expand or remove this unless or until the SNP article is repaired. It is documented in detail that the population data is in error and contradicted by NCBI data. If it is not clear to you, perhaps you should read these personally for a while rather than using a bot, because you are compounding errors and making SNPedia unreliable. My original was in error, it seems, but this made it far worse. Perhaps you do not know that 0.283=28.3%. I cannot understand why this is not clear to you, so I cannot make it any more concise. John Lloyd Scharf 17:14, 7 November 2011 (UTC)
- rs884460 It isn't necessary to add http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs884460 as it is already included in the template.
- I think I must have put that there for a reason. Your new data on alleles does not conform with
- 1HapMap-CEU European 120 IG AA=0.283 AG=0.450 GG=0.267
- That is the data as of Jan 30, 2007
- Your data is as of Sep 6 2000. I only used it because it had more samples.
- Sep 06, 2000-HapMap-HCB Asian 86 IG AA=0.488 AG=0.465 GG=0.047
- Oct 29, 2006-HapMap-HCB Asian 90 IG AA=0.511 AG=0.444 GG=0.044
- Jan 30, 2007-HapMap-HCB Asian 90 IG AA=0.511 AG=0.444 GG=0.044
- Nov 06, 2011-Unknw-HCB Asian ?? IG AA=0.555 AG=0.416 GG=0.029
- The last is your "HapMapRevision=28" version.
- Sep 06, 2000-HapMap-JPT Asian 172 IG AA=0.651 AG=0.326 GG=0.023
- Oct 29, 2006-HapMap-JPT Asian 090 IG AA=0.622 AG=0.356 GG=0.022
- Jan 30, 2007-HapMap-JPT Asian 090 IG AA=0.622 AG=0.356 GG=0.022
- Nov 06, 2011-Unknw-HCB Asian ??? IG AA=0.664 AG=0.301 GG=0.035
- The last is your "HapMapRevision=28" version.
- Sep 06, 2000-HapMap-YRI Sub-Saharan African 226 IG AA=0.664 AG=0.283 GG=0.053
- Oct 29, 2006-HapMap-YRI Sub-Saharan African 120 IG AA=0.617 AG=0.317 GG=0.067
- Jan 30, 2007-HapMap-YRI Sub-Saharan African 120 IG AA=0.617 AG=0.317 GG=0.067
- Nov 06, 2011-Unknw-YRI Sub-Saharan African 120 IG AA=0.673 AG=0.272 GG=0.054
- The last is your "HapMapRevision=28" version.
- The rest of them are also not in conformance with the Sep 06, 2000 revision.
John Lloyd Scharf 21:57, 6 November 2011 (UTC)
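Two mechanical checks help when comparing rows like those above: whether the three genotype frequencies sum to roughly 1, and how the dbSNP-style fractions map to the percentages shown in the SNPedia box (0.283 = 28.3%). The sketch below uses the HCB rows quoted in this thread; the function names and the tolerance are assumptions made for illustration.

```python
# Genotype frequencies for rs884460 (HCB), copied from the rows quoted above.
hcb = {
    "Sep 06, 2000": {"AA": 0.488, "AG": 0.465, "GG": 0.047},
    "Jan 30, 2007": {"AA": 0.511, "AG": 0.444, "GG": 0.044},
    "Nov 06, 2011": {"AA": 0.555, "AG": 0.416, "GG": 0.029},
}


def sums_to_one(freqs, tol=0.005):
    """True if the frequencies add up to 1 within a rounding tolerance (assumed)."""
    return abs(sum(freqs.values()) - 1.0) <= tol


def as_percent(freqs):
    """Convert fractions (0.283) to percentages with one decimal place (28.3)."""
    return {genotype: round(f * 100, 1) for genotype, f in freqs.items()}


if __name__ == "__main__":
    for date, freqs in hcb.items():
        print(date, sums_to_one(freqs), as_percent(freqs))
```

A row that passes both checks is internally consistent, which says nothing about which release or submission it came from.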
I don't have time at the moment to reply to all of the issues, but I would like to document the current practice of my bot and the general issues with other approaches. Currently the source of data is from HapMap directly, specifically HapMap Release 28 (Phase I, II, III) data released August 2010 from ftp://ftp.ncbi.nlm.nih.gov/hapmap/frequencies/2010-08_phaseII+III/
- Here you are not making sense. That is an NCBI site. If that "August 2010" data file is on the site of the NCBI, why would they not update http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs884460 ? Your data file came from an August 18 directory, but it was created on July 8th 2010. That entry has a build update of 36 while the dbSNP page of the same SNP shows a build update of 37.3.
- The listing in that file you are using for rs884460 is in "allele_freqs_chr2_YRI_r28_nr.b36_fwd.txt" and says
- urn:LSID:dcc.hapmap.org:Panel:Yoruba-30-trios:1 QC+ T 0.983 116 G 0.017 2 118
rs884460 chr2 140642488 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3
- http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=884460 shows a maximum of A=.805 and a minimum of A=0.755
- Your bot says in SNPedia: | YRI | 67.3 | 27.2 | 5.4
- This means the file you say you are using gives the major allele T as 98.3% and the minor allele G as 1.7%. How does your bot deconstruct this into 67.3% or 0.673 for A[G-]?
John Lloyd Scharf 19:53, 8 November 2011 (UTC)
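The comparison above mixes allele frequencies (the HapMap record) with genotype frequencies (the SNPedia box). For a two-allele SNP, the allele frequency implied by genotype frequencies is f(A) = f(AA) + f(AG)/2. A minimal sketch of that arithmetic, using only the numbers quoted in this thread; the function name is hypothetical, and the code does not say which source is correct.

```python
def allele_freqs_from_genotypes(hom_a, het, hom_b):
    """Allele frequencies implied by genotype frequencies.

    Inputs may be fractions or percentages; they are normalised by their sum,
    so small rounding gaps (e.g. 99.9% in total) do not matter.
    """
    total = hom_a + het + hom_b
    freq_a = (hom_a + het / 2) / total
    return freq_a, 1 - freq_a


if __name__ == "__main__":
    # SNPedia box for YRI as quoted above: 67.3 / 27.2 / 5.4 (percent).
    freq_a, freq_g = allele_freqs_from_genotypes(67.3, 27.2, 5.4)
    print(round(freq_a, 3), round(freq_g, 3))  # roughly 0.81 and 0.19

    # Counts in the quoted HapMap record: 116 and 2 out of 118 chromosomes.
    print(round(116 / 118, 3), round(2 / 118, 3))  # matches the 0.983 / 0.017 in the record
```

On these numbers the box implies an A frequency of about 0.81, which is close to the 0.755 to 0.805 range quoted from NCBI and far from the 0.983 in the quoted file record.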
October 2011: NCBI released an update for the human genome annotation. The chromosome sequence is unchanged; however, this update includes more incremental updates released by the Genome Reference Consortium (GRC) as Fix or Novel Patches (assembly GRCh37.p5).
- Help me understand why we should assume your bot is correct and the NCBI bot is incorrect. What makes you think these files on the ftp are the gold standard when they are clearly behind the build indicated on the page? John Lloyd Scharf 17:48, 8 November 2011 (UTC)
Ordering of Ethnic Groups
Prior to Phase III, there were only four hapmap groups, CEU, HCB, JPT and YRI. The ordering chosen was based on alphabetical order and follows the conventions used by SNPediaBot prior to my involvement. Phase III added additional population groups, and HCB was also renamed to CHB. Phase III data is also available separately but is less extensive, so the Phase I, II, and III combined release has been preferred.
There are extensive consistency problems with the HapMap data found in dbSNP. The primary problem is that the latest data they have as of this writing is from HapMap release 27 which is from February 2009.
- If the data does not change, why should the release number? Again, given the files you have used are build 36 and the page documents it as build 37.3, why would we conclude your bots got the right file from the ftp and NCBI's bot has the wrong file? John Lloyd Scharf 18:05, 8 November 2011 (UTC)
In many cases it is difficult to determine basic facts like: 1) when the data was loaded (the submission date is often for the first upload, not the latest update); 2) the release uploaded (sometimes in the submission report but not always); 3) which phases were included; 4) who uploaded the data (sometimes the HapMap data was submitted by a third party, not HapMap themselves, and sometimes submissions are mixed with other data sources); 5) whether the uploaded data passed quality control or not (older releases of HapMap included all results and had a column noting whether the results were QC+ or QC-; current releases are QC+ only, and in any case dbSNP doesn't include this information); and other issues I'm probably not remembering at the moment.
Expanding from HapMap
By sourcing the data directly from HapMap, all of these problems are avoided. However, that does result in some cases where there is no data in the current release for certain SNPs or certain population groups. The big question is whether it is preferable to have more data, some of which may be wrong, or to have the latest data which is most likely to be correct?
- But you are NOT sourcing the data directly from HapMap. You are sourcing it from a file on the NCBI site that is far behind on its build number: 36, when we are at 37.3. That file comes from the Sanger Institute, which is a team of HapMap, but HapMap is an NCBI project. Your assumption that this directory is up to date or correct is not evident. John Lloyd Scharf 18:05, 8 November 2011 (UTC)
There are also architectural issues with the population template. Currently the template can hold only one version field ("HapMapRevision") for the entire template. This becomes a problem when you want to use different sources of data in one template, as there is no way to say that, for example, the GIH data is from release 27 and the rest is from release 28. This also becomes an issue when adding additional sources of data to the template, as it would be useful to be able to track the source and release of these as well.
So this probably means that our best way forward is to develop a new template architecture to accommodate more flexible sources of data. Off the top of my head, here are some of what I think are the requirements: allele frequencies and genotype frequencies both, as there seems to be a growing number of sources of allele frequencies such as 1000 Genomes, ALFRED, etc. More than two alleles/three genotypes, and support for indels and variable-length alleles. All fields defined, rather than a template where fields are implied by position. Each data blob should include a population label (e.g. CEU), source (e.g. "HapMap") and release/version/build to allow more flexible mixes of data. -- Jlick 10:09, 8 November 2011 (UTC)
- With 1000 Genomes you could start here:
John Lloyd Scharf 20:01, 8 November 2011 (UTC)
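The requirements Jlick lists above (explicit fields, a population label, source and release per data blob, both allele and genotype frequencies, more than two alleles) can be pictured as a small record type. This is a hypothetical sketch in Python, not an existing SNPedia template or bot structure; every class and field name here is an assumption. The example values are the CEU genotype row and the YRI allele record quoted earlier on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PopulationFrequency:
    """One data blob: every field is named, nothing is implied by position."""
    population: str                    # e.g. "CEU"
    source: str                        # e.g. "HapMap"
    release: Optional[str] = None      # e.g. "28"; None when it cannot be determined
    # String keys allow more than two alleles and variable-length alleles.
    allele_freqs: Dict[str, float] = field(default_factory=dict)
    genotype_freqs: Dict[str, float] = field(default_factory=dict)

    def is_consistent(self, tol: float = 0.01) -> bool:
        """Each non-empty frequency set should sum to about 1 (tolerance assumed)."""
        for freqs in (self.allele_freqs, self.genotype_freqs):
            if freqs and abs(sum(freqs.values()) - 1.0) > tol:
                return False
        return True


# Records with different sources or releases can sit side by side.
records = [
    PopulationFrequency("CEU", "dbSNP",
                        genotype_freqs={"AA": 0.283, "AG": 0.450, "GG": 0.267}),
    PopulationFrequency("YRI", "HapMap", release="28",
                        allele_freqs={"T": 0.983, "G": 0.017}),
]

if __name__ == "__main__":
    for r in records:
        print(r.population, r.source, r.release, r.is_consistent())
```

Carrying source and release on each record, rather than once per template, is what would let GIH data from release 27 coexist with the rest from release 28.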