I am trying to estimate the similarity (overlap) or, conversely, the difference between two probability histograms in Mathematica.
I found two built-in measures in Mathematica, namely KolmogorovSmirnovTest and PearsonChiSquareTest. The problem is that when the two histograms are computed from data sets with very different sample sizes, the returned values become much smaller even though the two distributions are almost the same (example below).
Example:
n1 = 90000;
n2 = 20000;
SeedRandom[100];
ls1 = RandomInteger[{1, 10}, n1];
SeedRandom[101];
ls2 = RandomInteger[{1, 10}, n2];
hist1 = Histogram[ls1, {1}, "Probability",
   AxesLabel -> {"value", "probability"}, ChartStyle -> {Yellow},
   ChartLegends -> {"List 1"}];
hist2 = Histogram[ls2, {1}, "Probability",
   AxesLabel -> {"value", "probability"},
   ChartStyle -> {Directive[Red, Opacity[0.5]]},
   ChartLegends -> {"List 2"}];
a) Sample sizes n1 = 90000 and n2 = 20000 (very different):
KolmogorovSmirnovTest[ls1, ls2]
PearsonChiSquareTest[ls1, ls2]
0.603708
0.389257
b) Sample sizes n1 = 30000 and n2 = 20000 (more comparable):
KolmogorovSmirnovTest[ls1, ls2]
PearsonChiSquareTest[ls1, ls2]
0.999966
0.993693
These values are closer to what I would intuitively expect, whereas case a) gives much lower values, far below $1$.
When visually comparing the histograms, however, they overlap almost perfectly and are almost equally uniform in both cases:
Applying these measures directly to hist1 and hist2 is not possible, but the following
KolmogorovSmirnovTest[HistogramList[ls1, {1}, "Probability"][[2]],
 HistogramList[ls2, {1}, "Probability"][[2]]]
PearsonChiSquareTest[HistogramList[ls1, {1}, "Probability"][[2]],
 HistogramList[ls2, {1}, "Probability"][[2]]]
leads to similar results, that is, values far below $1$.
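For reference, as far as I understand the documentation, the number returned by default is the p-value of the test, not a similarity score. A minimal sketch of requesting the p-value and the test statistic separately through the "PValue" and "TestStatistic" properties is given here. The variable names dist1, dist2 and ksManual are my own, and the manual computation only works this simply because the data take the integer values 1 to 10:

```mathematica
SeedRandom[100];
ls1 = RandomInteger[{1, 10}, 90000];
SeedRandom[101];
ls2 = RandomInteger[{1, 10}, 20000];

(* p-value: this is what the two-argument call returns by default *)
KolmogorovSmirnovTest[ls1, ls2, "PValue"]

(* test statistic of each test *)
KolmogorovSmirnovTest[ls1, ls2, "TestStatistic"]
PearsonChiSquareTest[ls1, ls2, "TestStatistic"]

(* manual check of the KS statistic: largest gap between the two empirical CDFs,
   evaluated at the only possible data values 1..10 *)
dist1 = EmpiricalDistribution[ls1];
dist2 = EmpiricalDistribution[ls2];
ksManual = Max[Table[Abs[CDF[dist1, x] - CDF[dist2, x]], {x, 1, 10}]]
```

The manual value depends only on the two empirical distributions, so it may be closer to the sample-size-independent quantity I am looking for.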

Is there a way to keep using these built-in functions when the sample sizes from which the individual histograms are computed differ greatly, so that the distributions are compared regardless of the sample size they come from?

Are there other related metrics in Mathematica that can be used to quantify how similar two probability histograms are, or how much they overlap?
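To make concrete what I mean by "overlap", here is a minimal sketch that works directly on the bin probabilities. It is not one of the built-in tests. The names p1, p2, overlap, tvDistance and hellinger are my own, and the explicit bin specification {1, 11, 1} is an assumption chosen so that both probability vectors have exactly ten entries:

```mathematica
SeedRandom[100];
ls1 = RandomInteger[{1, 10}, 90000];
SeedRandom[101];
ls2 = RandomInteger[{1, 10}, 20000];

(* bin probabilities on the common bins [1,2), [2,3), ..., [10,11) *)
p1 = HistogramList[ls1, {1, 11, 1}, "Probability"][[2]];
p2 = HistogramList[ls2, {1, 11, 1}, "Probability"][[2]];

(* overlap coefficient: shared probability mass, 1 for identical histograms *)
overlap = Total[MapThread[Min, {p1, p2}]];

(* total variation distance: 0 for identical histograms; equals 1 - overlap *)
tvDistance = Total[Abs[p1 - p2]]/2;

(* Hellinger distance: 0 for identical histograms, 1 for disjoint ones *)
hellinger = Norm[Sqrt[p1] - Sqrt[p2]]/Sqrt[2];

N[{overlap, tvDistance, hellinger}]
```

These quantities depend only on the two normalized histograms, not on n1 and n2 directly, although each estimated histogram is of course noisier for a smaller sample.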