Predictor's Generated Visual Statistics for haplotypes


03. 10. 2021.


In this article I shall try to explain the meaning of the visual statistics that show how well any haplotype fits into any haplogroup. The example uses a single full haplotype (111 STR markers long) which belongs to haplogroup R1b > U106.


There are two kinds of visual statistics: a simpler one, and a more complicated one which might be more useful.



Let us start with the simpler one.



Here we can see small sections for all 111 STR markers of our sample haplotype, against the background of the statistics of haplogroup R1b.


Now we shall take a closer look at one of them.


This is the section for the DYS19 STR marker. On the right side the name of the STR marker is displayed, together with the haplotype's value on that marker (14 in this case).

On the left side there are three wide green columns, of which only the middle one is tall. The columns represent the distribution of the values of the STR marker (DYS19 in this case) in the current haplogroup (R1b in this case).

In the middle wide green column (the tallest one) we can see a narrow black line. It marks the column for the value of the current haplotype on the current marker (14 in this case, which is displayed on the far right side). Accordingly, the short column on the left represents the share of value 13, and the short column on the right represents the share of value 15. We can deduce from this chart that most R1b haplotypes have value 14 on DYS19, and that values 13 and 15 are in the minority in haplogroup R1b, probably less than 10% each (there is not enough space on the image to display exact percentages for each value and each marker, so they are shown only in this imprecise way). Beware that these statistics do not display columns for STR values whose frequency is less than 1%, due to the small number of available vertical pixels. In this case, our haplotype fits very well into haplogroup R1b on this STR marker, because it possesses the marker's most frequent value.

From the previous chart we can deduce one more fact: DYS19 is a slowly mutating STR marker, because in the whole of R1b only three of its values (13, 14 and 15) have a frequency of more than 1%.
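The reading of one marker section described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the toy frequencies are invented (they are not real R1b data), and nothing here is taken from NevGen's actual code.

```python
from collections import Counter

def displayed_columns(values, min_share=0.01):
    """Return {STR value: share} for values whose share reaches min_share.

    Values rarer than the threshold are dropped, mirroring the rule that
    columns under 1% are not drawn.
    """
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in sorted(counts.items()) if c / total >= min_share}

def modal_value(values):
    """Return the most frequent (modal) value of a marker."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical toy sample of DYS19 values in a haplogroup (invented numbers).
dys19_toy = [13] * 70 + [14] * 850 + [15] * 75 + [16] * 5

columns = displayed_columns(dys19_toy)
print(sorted(columns))         # values that would get a column: [13, 14, 15]
print(modal_value(dys19_toy))  # 14
print(len(columns))            # 3 columns: few columns hint at a slowly mutating marker
```

In this toy data value 16 has a share of 0.5%, so it gets no column, just as rare values are left out of the real images.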




The second zoomed-in section (previous image) is for DYS456. It has 5 values with a frequency of more than 1% in R1b (all of them under 50%), so this STR marker obviously mutates faster than DYS19 (FTDNA considers it part of the group of Faster Changing STR Markers).

In this case the haplotype also has value 14 on this marker (DYS456), and the (mostly) red line on the far left of the image denotes that value 14 is not very frequent in haplogroup R1b. So on this marker our sample haplotype fits haplogroup R1b rather badly, and such “bad fitting”, or “outlying value”, is shown on the chart by a red line, which serves as a warning. From the chart we can also deduce that the most frequent value for DYS456 in R1b is 16 (middle column), closely followed by value 15 (second column from the left; see footnote 1).


So, our sample haplotype has 23 markers (out of 111) with a smaller or bigger 'red line' on this image. These represent non-modal values (that is, values which are not the most frequent in the target haplogroup). The remaining 111 - 23 = 88 STR markers have the modal values of R1b.
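Counting red lines amounts to comparing the haplotype with the modal values of the haplogroup, marker by marker. Here is a minimal sketch, assuming both are stored as plain dictionaries (the function name and the data layout are assumptions made for this example; only the two markers discussed above are included):

```python
def non_modal_markers(haplotype, modal_values):
    """Return the markers on which the haplotype differs from the modal value."""
    return [m for m, v in haplotype.items()
            if m in modal_values and v != modal_values[m]]

# Values taken from the two zoomed-in sections above.
sample = {"DYS19": 14, "DYS456": 14}     # our sample haplotype
r1b_modal = {"DYS19": 14, "DYS456": 16}  # modal values in R1b

red = non_modal_markers(sample, r1b_modal)
print(red)  # ['DYS456']

# The arithmetic for the full 111-marker image.
total_markers, red_lines = 111, 23
print(total_markers - red_lines)  # 88 markers carry the modal value
```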


Now we shall see our sample haplotype against the background of the statistics of haplogroup R1a (to which it does not belong).



As we can see, the haplotype has as many as 51 non-modal values (those with red lines), far more than the 23 in the case of R1b. Also, only one of the red lines here is short, which denotes a small difference from the modal value.


And the final example for the simpler visual statistics is the fitting of our sample haplotype into a quite distant haplogroup: I2a1a > L621 Slavic-Carpathian & Disles.


Here we have plenty of red lines (non-modal values), 76 in total, and only 2 of them are small. Also, many of them are far away from the modal values for I2a1a > L621 Slavic-Carpathian & Disles. So, our sample haplotype fits into this haplogroup just like an elephant fits into a glass shop.


So, if you see too many of those red lines, then it is doubtful, to say the least, that the haplotype belongs to the haplogroup.

Please beware that NevGen's way of predicting has nothing to do with the number of red lines, nor with their combined length. In fact, it has nothing to do with them at all. The algorithm is much more complicated.



Now about the more complicated visual statistics.


While the previous visual statistics contain only one background statistics for the haplotype, those we shall speak about now use two. One is the statistics of a subclade (for example R1b U106>Z381> Z156>DF96>> FGC13326), and the other is the statistics of the subclade's superclade (or parent clade, for example R1b). Such images with double statistics make it easier to pinpoint a deeper subclade when its more general haplogroup is certain or obvious (again, R1b is a good example).


Let us see the sections for the CDYb STR marker of our sample haplotype against two different statistics on two different images. This time their colours are not green: the first is saturated blue and represents the statistics of the whole of R1b. The second is white and represents the statistics of R1b's deep subclade R1b U106 >> FGC13326.



On the third image we see the two previous statistics combined. Parts where both statistics overlap are displayed in a less saturated blue (in fact, the colour is a mixture of the saturated blue and the white of the two previous images). Parts of the two statistics which do NOT overlap are displayed in their own original colours.
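One way to think about the combined image is to split each column into an overlapping part and the two non-overlapping remainders. The sketch below assumes that both columns are drawn from the same baseline, so that the overlap for a value is simply the smaller of the two shares; this is my reading of the image, not a description of how the drawing is really implemented. The function name is hypothetical and the percentages are invented toy numbers, not real CDYb statistics.

```python
def split_columns(parent, sub):
    """Split parent/subclade column heights into overlap and remainders.

    parent, sub: {STR value: share in percent}.
    Returns {value: {"overlap", "parent_only", "sub_only"}}.
    """
    parts = {}
    for value in sorted(set(parent) | set(sub)):
        p = parent.get(value, 0)
        s = sub.get(value, 0)
        overlap = min(p, s)  # assumed to be drawn in the mixed, less saturated blue
        parts[value] = {
            "overlap": overlap,
            "parent_only": p - overlap,  # saturated blue
            "sub_only": s - overlap,     # white
        }
    return parts

# Invented toy shares (percent) for one marker; 38 is modal in both.
parent_toy = {36: 10, 37: 25, 38: 30, 39: 20, 40: 15}
sub_toy = {37: 20, 38: 50, 39: 30}

parts = split_columns(parent_toy, sub_toy)
print(parts[38])  # {'overlap': 30, 'parent_only': 0, 'sub_only': 20}
print(parts[36])  # {'overlap': 0, 'parent_only': 10, 'sub_only': 0}
```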



But this particular statistics is not very useful. Why? Because value 38 is modal (most frequent) for both R1b and R1b U106 >> FGC13326. The real usefulness comes when those modal values are different.


Let us now see the full two-statistics image of our sample haplotype.












The real use of images with two statistics comes from the green and orange fields which you can see on the previous image. They can be found on the columns for the values that the haplotype possesses (those which contain thin black or red lines). Let us see the zoomed-in section for DYS492, which is famous as the STR marker that (if its value is 13) may indicate that the haplotype belongs to the U106 subclade of R1b.


The statistics of the deeper subclade (the part which does not overlap with the statistics of the parent clade) for the haplotype's value of 13 is displayed not in white but in green, and with good reason: values of all markers which are equal to the subclade's signature (see footnote 2) are displayed in green. Likewise, values of all markers which do NOT satisfy the subclade's signature are displayed in orange, as a sign of warning.
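Following the definition of a signature in footnote 2, the green and orange fields can be sketched as below. The marker names STR_A, STR_B and STR_C and all their values are placeholders invented for this illustration, and the function names are hypothetical; this is not NevGen's code, only a small model of the rule described above.

```python
def signature(sub_modal, parent_modal):
    """Markers whose modal value in the subclade differs from the parent's modal value."""
    return {m: v for m, v in sub_modal.items()
            if m in parent_modal and parent_modal[m] != v}

def colour_fields(haplotype, sig):
    """Green where the haplotype matches the signature value, orange where it does not."""
    return {m: ("green" if haplotype.get(m) == v else "orange")
            for m, v in sig.items()}

# Placeholder modal values (invented, not real haplogroup data).
parent_modal = {"STR_A": 12, "STR_B": 14, "STR_C": 16}
sub_modal = {"STR_A": 13, "STR_B": 14, "STR_C": 15}

sig = signature(sub_modal, parent_modal)
print(sig)  # {'STR_A': 13, 'STR_C': 15}

# A placeholder haplotype: matches the signature on STR_A, keeps the parent's modal value on STR_C.
haplotype = {"STR_A": 13, "STR_B": 14, "STR_C": 16}
print(colour_fields(haplotype, sig))  # {'STR_A': 'green', 'STR_C': 'orange'}
```

STR_B gets no green or orange field at all, because its modal value is the same in the subclade and in the parent clade, so it is not part of the signature.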


Let us see another example: the fitting of our sample haplotype against the statistics of subclade R1b U106>Z381> Z156>DF96> S11515> FGC8410. Note that our sample does not belong to it.



From this image we can deduce that the STR signature of R1b U106 >> FGC8410 (compared with the whole of R1b) contains the following eight STR values: DYS437 = 14, YCAIIb = 22, DYS452 = 29, Y-A10 = 14, DYS461 = 13 (our sample haplotype does not satisfy these five STR values, having instead the modal values for the whole of R1b, so they are displayed in orange), and DYS492 = 13, DYS717 = 18 and DYS504 = 16. Of the eight STR values in the STR signature, our sample haplotype contains only 3 (the last three, so they are displayed in green), and two of them are also members of the STR signature of R1b U106 >> FGC13326.


From this image we can deduce that it is very doubtful that our sample haplotype belongs to R1b U106 >> FGC8410, because our sample does not satisfy a big part of its signature. Or, in short, it has too much orange on the image instead of green. So, a good prediction should contain as many green parts and as few orange parts as possible.


Also, beware that NevGen's way of predicting has nothing to do with the number of green or orange fields. But those can be useful when trying to assess whether a prediction is right or wrong.


1. Values always grow by one from left to right.

2. The signature of a subclade (relative to its parent clade) consists of the values which are modal (most frequent) in the subclade but are NOT modal for the parent clade. Thus, the signature for R1b U106 is DYS492 = 13, because it is the most common value for this marker (more than 90%) in the U106 subclade, but is a minority value for the whole of R1b. Also, the signature for R1b U106 >> FGC13326 consists of the values DYS442 = 11, DYS492 = 13, DYS485 = 14 and DYS717 = 18.