Testing the reliability of the NevGen predictor

April 2017 testing, and general notes

The first test of NevGen's prediction reliability was carried out this month, using SNP-analyzed samples from several projects whose subclade is known. Only samples that were previously unknown to NevGen, and that are not part of NevGen's statistics for its subclades, were used for testing.
Not all such haplotypes were used: haplotypes that are too close to any haplotype already in the statistics of their known subclade were ignored. For example, if a 67-marker haplotype H1 has a known subclade SUBCL1 (supported by NevGen), and NevGen's statistics for SUBCL1 contain a haplotype H2 whose genetic distance from H1 is less than 6, then H1 is ignored and not used for testing. For 37-marker haplotypes, those at a distance of 3 or less are ignored, and for 111-marker haplotypes, those at a distance of 8 or less. Since almost all such haplotypes are predicted right (although NevGen's prediction algorithm has nothing to do with genetic distances), excluding them lowers the percentage of right predicted haplotypes, so these results are rather conservative. The number of such haplotypes is not small. They were not counted, but in the R1b U106 testing there were probably between 50 and 100 of them.
Haplotypes that belong to subclades not supported by NevGen (usually smaller ones: with too few SNP-proven haplotypes, with fewer than three full 111-marker haplotypes or none at all, or only with haplotypes of the same surname...) are also ignored, since using them in the predictor yields false positives.
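
The closeness rule described above can be sketched in a few lines of Python. This is only an illustration, not NevGen's code: the function names are made up, genetic distance is assumed to be a plain count of markers whose values differ, and known haplotypes shorter than the candidate are simply skipped. The text states no rule for lengths other than 37, 67 and 111, so such haplotypes are never excluded here.

```python
# Thresholds from the text: a candidate is dropped when its distance to any
# haplotype already in its subclade's statistics is BELOW this value
# (37 markers: 3 or less, 67 markers: less than 6, 111 markers: 8 or less).
EXCLUSION_THRESHOLD = {37: 4, 67: 6, 111: 9}


def genetic_distance(a, b):
    """Assumed simplification: number of positions with different values."""
    return sum(1 for x, y in zip(a, b) if x != y)


def is_too_close(candidate, subclade_haplotypes):
    """Return True if the candidate should be left out of the test set."""
    n = len(candidate)
    threshold = EXCLUSION_THRESHOLD.get(n)
    if threshold is None:
        return False  # no rule is stated for this length
    for known in subclade_haplotypes:
        if len(known) < n:
            continue  # assumption: compare only against equally long or longer ones
        if genetic_distance(candidate, known[:n]) < threshold:
            return True
    return False


# Made-up 67-marker haplotypes: H2 differs from H1 in 5 markers, H3 in 6.
h1 = [10] * 67
h2 = [10] * 62 + [11] * 5
h3 = [10] * 61 + [11] * 6
print(is_too_close(h1, [h2]))  # True, distance 5 is less than 6
print(is_too_close(h1, [h3]))  # False, distance 6 is not less than 6
```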

It only remains to state when a haplotype is considered "right" or "wrong" predicted in this testing. For simplicity, a prediction is considered "right" if the haplotype's already known subclade was given the greatest probability (and of course if that probability is greater than zero), that is, if it is first in the results list. In many cases SNP-proven subclades were given substantial probabilities (for example more than 20% or 30%), but they are nevertheless considered wrong if they are not first in the results list.
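
As a minimal sketch, the "right"/"wrong" criterion above could be written like this. The function name and the representation of the results list as (subclade, probability) pairs are assumptions made for illustration. The example data is the 37-marker I1 case reported in the I1 section of this page, whose confirmed subclade is S14887> FGC21980.

```python
def is_right_prediction(results, known_subclade):
    """results: list of (subclade_name, probability_percent) pairs.

    "Right" means the known subclade has the greatest probability and that
    probability is greater than zero.
    """
    if not results:
        return False
    top_subclade, top_probability = max(results, key=lambda r: r[1])
    return top_subclade == known_subclade and top_probability > 0.0


results = [
    ("I1 P109>Y3662> S14887> S18218", 55.96),
    ("I1 P109>Y3662> S14887> FGC21980", 12.71),
]
# The confirmed subclade is only second, so this counts as "wrong".
print(is_right_prediction(results, "I1 P109>Y3662> S14887> FGC21980"))  # False
```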

After testing was finished, all haplotypes used in it were added to the statistics of NevGen's supported subclades, whether they were predicted right or wrong. That way the predictor's statistics are improved.

A careful reader might notice that 111-marker haplotypes perform, unexpectedly, not much better (or even worse) than 67-marker or sometimes even 37-marker haplotypes (the case of R1a). Logically, more markers should mean more reliable predictions, and a greater percentage of right predicted 111-marker haplotypes than 67-marker ones. But that would be the case only if NevGen's statistics held only 111-marker haplotypes, which is not the case. In some haplogroups and subclades only a small percentage of haplotypes have the full 111 markers. The majority of haplotypes from poorer countries (non-European, Eastern Europe, Southern Europe and the Balkans) have at most 67 markers, and in some countries even 67-marker haplotypes are not the majority. This shortage of 111-marker haplotypes is the reason why the statistics for markers 68-111 are not as good as those for markers 1 to 37: every 111-marker haplotype is also a 67- and 37-marker haplotype, but not vice versa. There are many inactive NevGen subclades (currently not used, because statistics for a NevGen-supported subclade cannot be made without at least one full 111-marker haplotype) with the status "waiting for any 111-markers haplotype to show up". In haplogroup G there is one subclade with more than 20 haplotypes in its statistics, but only one of them has 111 markers.
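
The point that a longer haplotype also counts as a shorter one, but not the other way round, can be shown with a small counting sketch. The function name and the sample lengths below are invented for illustration and are not taken from NevGen's statistics.

```python
PANELS = (37, 67, 111)


def samples_per_panel(haplotype_lengths):
    """Count how many haplotypes can contribute to each marker panel.

    A haplotype with N markers contributes to every panel of size <= N.
    """
    counts = {panel: 0 for panel in PANELS}
    for length in haplotype_lengths:
        for panel in PANELS:
            if length >= panel:
                counts[panel] += 1
    return counts


# Hypothetical subclade: ten 37-marker, nine 67-marker, one 111-marker sample.
lengths = [37] * 10 + [67] * 9 + [111] * 1
print(samples_per_panel(lengths))  # {37: 20, 67: 10, 111: 1}
```

With such a mix, the statistics for markers 1 to 37 rest on all 20 samples, while those for markers 68-111 rest on a single one.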

I1 from Sweden

The first test was done with I1 haplotypes from the Sweden Project that satisfied the strict rules defined above. There were 41 of them, plus an additional six from the Ulster Heritage Project, for a total of 47. The haplotypes were of different lengths. 39 out of 47 are considered right predicted, which makes 82.98%.

Here is the structure by length of those 39 that are considered "right predicted" in this testing:
1 x 12 markers
10 x 37
16 x 67
12 x 111

Among those considered right predicted is one with only 12 markers, with a probability of 95.81%, which surprised us very much. It belongs to subclade FGC10430.

Now the structure of the eight considered wrong predicted:
1 x 12 markers
3 x 37
2 x 67
2 x 111
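
The two lists above can be combined into a success rate per haplotype length. The following is a small sketch with a made-up helper name; the counts are the ones listed above, and with so few samples per length the percentages should be read with caution.

```python
def accuracy_by_length(right, wrong):
    """Return {marker_count: (right, total, percent_right)} per length."""
    table = {}
    for length in sorted(set(right) | set(wrong)):
        r = right.get(length, 0)
        total = r + wrong.get(length, 0)
        if total:
            table[length] = (r, total, round(100.0 * r / total, 2))
    return table


# I1 counts from the lists above.
i1_right = {12: 1, 37: 10, 67: 16, 111: 12}
i1_wrong = {12: 1, 37: 3, 67: 2, 111: 2}

for length, (r, total, pct) in accuracy_by_length(i1_right, i1_wrong).items():
    print(f"{length:>3} markers: {r}/{total} = {pct}%")

overall = round(100.0 * sum(i1_right.values())
                / (sum(i1_right.values()) + sum(i1_wrong.values())), 2)
print("overall:", overall)  # 82.98
```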

Of the 8 wrong, three were cases where a very close subclade was given the highest probability. For example, one haplotype with 37 markers gave the following results, where the right subclade S14887> FGC21980 is in second place with 12.71%, and in first place is the very close subclade S14887> S18218:

I1 P109>Y3662> S14887> S18218 55.96%
I1 P109>Y3662> S14887> FGC21980 12.71%

The other 5 haplotypes gave results with more distant subclades in first place. The wrongly predicted haplotype with 12 markers, whose confirmed subclade is L813, gave the following results:

I1 P109> Y3662>Y4045 31.03%
I1 L22>L813 15.45%
...

R1b U106

Now we move on to the results of testing with R1b U106 haplotypes from the U106 Project. R1b is a much harder nut to crack. In total 168 haplotypes were used in testing. 123 of them are considered right predicted, which makes 73.21%.

Here is the structure by length of those 123 that are considered "right predicted" in this testing:
1 x 12 markers
11 x 37
36 x 67
75 x 111

Structure of 45 considered wrong predicted:
15 x 37 markers
11 x 67
19 x 111

Here too, many wrongly predicted sample haplotypes gave most of their probability to very close subclades, as in the following case, where a haplotype with 111 markers belongs to S1911:

R1b U106>Z381> Z156>DF98> S18823 55.64%
R1b U106>Z381> Z156>DF98> S1911 14.71%

In some cases the biggest probability went to subclades that are not even U106, as in this case, where a 111-marker haplotype belongs to FGC8512:

R1b L51>L151> CTS4528> S14328 25.63%
R1b U106>Z381> Z301>FGC8512 12.82%

But in the majority of wrongly predicted cases, the biggest probability went to subclades of U106. Nevertheless, the results of U106 predictions are not satisfactory in general. Many subclades in NevGen's statistics are represented by a small number of samples, and that number must be increased to achieve better prediction.

R1b samples from Ulster Heritage Project (mostly L21)

From the publicly available haplotypes in the Ulster Heritage Project, 49 R1b samples were used, of which 38 are considered right predicted, which makes 77.55%.

Structure of the 38 that are considered "right predicted" in this testing:
5 x 37 markers
13 x 67
20 x 111

Structure of 11 considered wrong predicted:
6 x 67 markers
5 x 111

Of those 11, nine were cases where the SNP-proven subclade is downstream of M222, and the first predicted subclade is another, very close subclade downstream of M222, as in this example, where a 67-marker haplotype belongs to S588:

R1b L21>DF13> Z39589>DF49>> M222>FGC4077 46.16%
R1b L21>> Z39589>DF49>> M222>S658>> S588 32.47%

Subclades of M222 are very young, and even harder to distinguish from one another, even at 111 markers. Nevertheless, NevGen recognized the right subclade in more than half of the samples (12) downstream of M222.

Ulster's R1b samples scored much better than the R1b U106 haplotypes, not only by percentage (77.55% vs 73.21%) but, more significantly, in the structure of those considered wrong predicted. Those from Ulster, which are mostly downstream of L21, gave the highest probability to very close subclades in all 11 cases, which was not always the case with haplotypes from U106. What is the reason for such a difference in prediction quality? Most probably it is that the average number of samples per subclade is considerably greater in subclades typical of the British Isles (where L21 is very strong) than in subclades of U106, which is spread more uniformly across Northern Europe. So it is clear that, to be more reliable, the predictor needs more samples for its statistics! This is obvious when the huge U198 subclade of U106 is considered. It has the highest number of samples in the statistics of all U106 subclades in NevGen, and so far no sample SNP-tested positive for it (with at least 37 markers) has been predicted wrong by NevGen, at least not in our tests.

In a few weeks, similar testing of NevGen's R1b prediction will be done on an even bigger public project, such as the Irish, Scottish, or British Isles one.

R1a from Sweden

For R1a prediction testing, 36 haplotypes from the Sweden Project were used, plus two additional ones from the Ulster Project (who are descendants of Scandinavians anyway), 38 in total. 35 of them are considered right predicted, or 92.1%.

Structure of right predicted:
1 x 12 markers
10 x 37
9 x 67
15 x 111

Structure of wrong predicted:
1 x 67 markers
2 x 111

For all three wrongly predicted sample haplotypes, the top probability went to very close subclades, as in this example:

R1a Y2395>Z284> L448>CTS4179 (all others) 61.24%
R1a Y2395>Z284> L448>CTS4179> YP386 16.97%
R1a Y2395>Z284> L448>CTS4179> YP704 1.67%


May 2017 testing of prediction of R1b from British Isles

In the first week of May 2017, NevGen R1b Level was tested using deep SNP-analyzed haplotypes from three public FTDNA projects of British Isles background, which were previously unknown to NevGen's statistics.
The projects used are:
Ireland Y-DNA Project
Scottish Y-DNA Project
British Isles DNA Project by County

For every haplotype used in testing, it is known which of the subclades supported by NevGen R1b Level it belongs to. The rules of testing and the criteria for selecting haplotypes are already described in the general notes at the beginning of this page and will not be repeated here. In total 219 haplotypes passed the strict rules, and 165 of them are considered right predicted, or 75.34%. Besides them, another 168 haplotypes (all of which were right predicted) were rejected because they were too close to some haplotype from their already known subclade supported by NevGen (distance less than 4/37, 6/67 or 9/111).
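
To see how much the rejection of close haplotypes lowers the reported rate, the figures of this test can be recomputed with the 168 rejected (and right predicted) haplotypes counted in. This is only a back-of-the-envelope sketch; the variable and function names are made up.

```python
def rate(right, total):
    """Percentage of right predicted haplotypes, rounded to two decimals."""
    return round(100.0 * right / total, 2)


tested_right, tested_total = 165, 219
rejected_right = 168  # rejected as too close; all were right predicted

strict_rate = rate(tested_right, tested_total)
inclusive_rate = rate(tested_right + rejected_right,
                      tested_total + rejected_right)

print(strict_rate)     # 75.34
print(inclusive_rate)  # 86.05
```

So with the close haplotypes included, 333 of 387 haplotypes would count as right predicted, which shows how conservative the strict figure is.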

Here is the structure by length of those 165 that are considered "right predicted" in this testing:
20 x 37
48 x 67
97 x 111

Structure of 54 considered wrong predicted:
12 x 37 markers
23 x 67
19 x 111

The majority of tested haplotypes belong to the L21 subclade of R1b, as expected for persons of British Isles background. A substantial part of the wrong predictions comes from errors in distinguishing among subclades of M222. The final result (75.34% right predicted) is relevant only for haplotypes originating from the British Isles. For R1b haplotypes from other parts of Europe the rate is expected to be lower, because fewer haplotypes are available.


February 2018 testing of prediction of R1b L21

In the first week of February 2018, NevGen R1b Level was tested using deep SNP-analyzed haplotypes from the public L21 FTDNA project, which were previously unknown to NevGen's statistics.

As in the earlier tests, for every haplotype used it is known which of the subclades supported by NevGen R1b Level it belongs to. The rules of testing and the criteria for selecting haplotypes are unchanged; they are described in the general notes at the beginning of this page. In total 71 haplotypes passed the strict rules, and 54 of them are considered right predicted, or 76.06%. Besides them, another 49 haplotypes (all of which were right predicted) were rejected because they were too close to some haplotype from their already known subclade supported by NevGen (distance less than 4/37, 6/67 or 9/111).

Here is the structure by length of those 54 that are considered "right predicted" in this testing:
2 x 37
4 x 67
48 x 111

Structure of 17 considered wrong predicted:
1 x 67
16 x 111

Here we can see that the percentage of "right predicted" haplotypes is similar to that in the other tests of the R1b predictor. All 17 haplotypes considered "wrong predicted" gave the greatest probability to another L21 subclade. In 5 of the 17 cases the right subclade was in second place, as in this case:

Probability = 51.64% Fitness=66.70 [0.92] R1b L21>FGC11134>> S1121>> Z18170
Probability = 32.27% Fitness=66.34 [0.99] R1b L21>FGC11134>> A151
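
Result lines in the form shown above can be read into records with a small parser. This is a sketch only: the pattern is inferred from the lines quoted on this page and is not a documented NevGen output format, the function and field names are invented, and the meaning of the bracketed number is not stated in the text, so it is kept under a neutral name.

```python
import re

# Pattern inferred from the quoted lines, e.g.
# "Probability = 51.64% Fitness=66.70 [0.92] R1b L21>FGC11134>> S1121>> Z18170"
LINE_RE = re.compile(
    r"Probability\s*=\s*(?P<prob>[\d.]+)%\s+"
    r"Fitness\s*=\s*(?P<fitness>[\d.]+)\s+"
    r"\[(?P<bracket>[\d.]+)\]\s+"
    r"(?P<subclade>.+)"
)


def parse_result_line(line):
    """Return a dict for one result line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "probability": float(m.group("prob")),
        "fitness": float(m.group("fitness")),
        "bracket_value": float(m.group("bracket")),  # meaning not given in the text
        "subclade": m.group("subclade").strip(),
    }


lines = [
    "Probability = 51.64% Fitness=66.70 [0.92] R1b L21>FGC11134>> S1121>> Z18170",
    "Probability = 32.27% Fitness=66.34 [0.99] R1b L21>FGC11134>> A151",
]
records = [parse_result_line(line) for line in lines]
# Sort by probability so the first record is the predicted subclade.
records.sort(key=lambda rec: rec["probability"], reverse=True)
print(records[0]["subclade"])  # R1b L21>FGC11134>> S1121>> Z18170
```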

Among the "right predicted" are two haplotypes with 111 markers that belong to the S1051 subclade. They were very, very distant from the other haplotypes in NevGen's statistics: 29 and 36 markers from their closest. Such a thing seldom happens in R1b. In other haplogroups it sometimes happens that NevGen predicts right a haplotype that is 40+/111 or even 50+/111 markers distant from its nearest in the predictor's statistics, but such distances are hard to find in (western) R1b. This is the result for the one with a distance of 29:

Probability = 52.71% Fitness=46.49 [0.82] R1b L21>DF13> Z39589>S1051
Probability = 0.00% Fitness=41.96 [0.77] R1b L21>DF13> FGC5494

From our experience, it seems that the rates of "right predicted" haplotypes for U106, DF19 and DF99 are similar to those for L21. Rates for Z2103 might be even greater. But rates for DF27 and U152 are much lower, especially the former, where I think they fall below 30%. Those two are so unreliable partly because fewer deep-SNP tested haplotypes are available. But even in them there are some subclades that are predicted right much more often. Unsurprisingly, those are subclades that are common in the British Isles.