Applied Statistics

A company with two work shifts (day and night) conducts a study to determine whether absenteeism among day-shift workers follows a pattern similar to that of the night shift. A sample of 150 day workers and 120 night workers is taken. The results show that 48 day workers were absent at least five times during the previous year, while only 33 night-shift workers were absent under the same conditions.

a) Find a 95% confidence interval for the true difference in proportions of workers absent at least five times during a year.
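
For reference, a minimal sketch of the standard large-sample (Wald) interval for a difference of two proportions, worked here in Mathematica:

(* normal-approximation 95% CI for p1 - p2 *)
n1 = 150; x1 = 48;   (* day shift *)
n2 = 120; x2 = 33;   (* night shift *)
p1 = N[x1/n1]; p2 = N[x2/n2];
se = Sqrt[p1 (1 - p1)/n1 + p2 (1 - p2)/n2];
z = Quantile[NormalDistribution[], 0.975];   (* ~1.96 *)
{p1 - p2 - z se, p1 - p2 + z se}

Under this approximation the interval comes out to roughly (-0.064, 0.154); since it contains 0, the data do not show a significant difference between the shifts at the 95% level.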

probability or statistics – How to generate a random variate in a custom domain for a distribution?

I have a distribution defined on a particular domain of the variable, but now I want to generate random variates not over the entire domain, only over a subset of it. Here is what I’m trying to do:

a = 0.25*Pi;
pdf = ProbabilityDistribution[1/Cos[x]^2, {x, -a, a}, Method -> "Normalize"];
RandomVariate[pdf]

This will generate a random variate from the given distribution on (-a, a). But I want to generate variates on some subset, say (-a/2, a/2). How do I do this?

I have tried changing the domain of the distribution itself, but it is easy to see that this redefines the whole distribution and is not the same as what I want.
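
One approach that may fit (a sketch using the built-in TruncatedDistribution, which restricts the support and renormalizes the density over the sub-interval):

a = 0.25*Pi;
pdf = ProbabilityDistribution[1/Cos[x]^2, {x, -a, a}, Method -> "Normalize"];
(* restrict sampling to (-a/2, a/2) without touching pdf's definition *)
trunc = TruncatedDistribution[{-a/2, a/2}, pdf];
RandomVariate[trunc, 10]

Note that any proper distribution on the sub-interval must integrate to 1 there, so some renormalization is unavoidable; truncation keeps the shape of the original density while rescaling it.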

probability or statistics – Problem with KarhunenLoeveDecomposition’s output

I have a matrix called realizationMat; it contains 101 measurements for each of 90 realizations of a stochastic process.

I apply KarhunenLoeveDecomposition to it as follows:

VA1 = realizationMat[[1]];
VA2 = realizationMat[[2]];
VA3 = realizationMat[[3]];
KLVariablesB = 
  KarhunenLoeveDecomposition[{VA1, VA2, VA3, ...}];

Note that there are more VA elements, but I omit them here for the sake of brevity.

The documentation says that the rows of the transformation matrix m are the eigenvectors of the covariance matrix formed from the arrays ai, and also that the transformed arrays bi are uncorrelated, are given in order of decreasing variance, and have the same total variance as ai.

However, if I write ListPlot[{KLVariablesB[[2, 1]]}, Joined -> True], it doesn’t look like an eigenvector at all (see Figure 1, first row of m). Also, if I instead plot ListPlot[{KLVariablesB[[1, 1]]}, Joined -> True], it looks much more like an eigenvector (see Figure 2), yet these aren’t the eigenvectors of the original covariance matrix.

Can someone please tell me what is wrong with my code?

Best regards.
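
For reference, here is a minimal sketch on toy data showing which part of the output is which: KarhunenLoeveDecomposition returns a pair {b, m}, so [[1]] indexes the list of transformed arrays b and [[2]] indexes the transformation matrix m.

data = RandomReal[{0, 1}, {5, 101}];  (* 5 toy realizations, 101 samples each *)
{b, m} = KarhunenLoeveDecomposition[data];
Dimensions[b]  (* {5, 101}: the transformed, uncorrelated arrays *)
Dimensions[m]  (* {5, 5}: rows are the eigenvectors the documentation refers to *)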

statistics – How many BTC transactions happened in 2019?

I’m interested in a comparison between different blockchains and payment networks.

Bitcoin has a block time of around 10 minutes, so there should be roughly 52,560 blocks in one year. However, the number of transactions per block varies.

Is the number of BTC transactions that happened in 2019 documented somewhere?

(Extra question: Is the volume in BTC, or even in USD at the time of the transaction, available?)

statistics – Sliding window max/min algorithm without dynamic allocations

I’m working on a suite of DSP tools in Rust, and one of the features of the library is a collection of windowed statistics (mean, rms, min, max, median, etc.) on streams of floating-point samples. The user provides an iterator of floats and a fixed-size mutable slice/array of floats as inputs, and gets back an iterator that lazily calculates the desired stat one sample at a time. The contents of the float slice/array become the initial window and can be used as scratch space internally. This slice/array also determines the sliding window size.

// A sliding mean with a window size of 64, initially filled with 0.42.
let sample_iter = /* an iterator that yields floats */;
let buffer = [0.42; 64];
let sliding_mean_iter = SlidingMean::new(sample_iter, buffer);

I’d ideally like my library to also be usable in an embedded environment, and so far I’ve managed that by avoiding any dynamic memory allocations.

As of now, I have efficient mean and rms implementations, but I’m stumped on the windowed min/max. Looking up sliding window min/max algorithms online [1][2], I see that:

  1. A naive approach of scanning the window after each update leads to an O(nk) runtime, where n is the length of the input iterator and k is the window length. This could be inefficient for larger window sizes.
  2. Using a deque allows for an efficient O(n) implementation (with O(1) amortized insertion), but a deque would require dynamic allocations.
  3. Since the passed-in buffer is overloaded to both set the initial window contents and serve as scratch space, I can’t change its type; it has to stay a slice/array of plain floats. This means I can’t make it an array of (int, float) pairs to store array indices alongside the data, for example.

Are there any clever tricks or optimizations I can use to have my cake (efficient sliding min/max) and eat it too (avoid having memory allocations)?

EDIT: For reference, since it came up in the comments, here’s a walkthrough example of a sliding max with window length 3:

Initial samples: {3 4 3 1 5 2 7 3 4 1}

Step 1: {(3 4 3) 1 5 2 7 3 4 1} -> The max of (3 4 3) is 4.
Step 2: {3 (4 3 1) 5 2 7 3 4 1} -> The max of (4 3 1) is 4.
Step 3: {3 4 (3 1 5) 2 7 3 4 1} -> The max of (3 1 5) is 5.
Step 4: {3 4 3 (1 5 2) 7 3 4 1} -> The max of (1 5 2) is 5.
Step 5: {3 4 3 1 (5 2 7) 3 4 1} -> The max of (5 2 7) is 7.
Step 6: {3 4 3 1 5 (2 7 3) 4 1} -> The max of (2 7 3) is 7.
Step 7: {3 4 3 1 5 2 (7 3 4) 1} -> The max of (7 3 4) is 7.
Step 8: {3 4 3 1 5 2 7 (3 4 1)} -> The max of (3 4 1) is 4.
Done.

The output is therefore {4 4 5 5 7 7 7 4}.
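
As a sanity check for any implementation, a naive reference (sketched here in Mathematica) reproduces this expected output:

Max /@ Partition[{3, 4, 3, 1, 5, 2, 7, 3, 4, 1}, 3, 1]
(* {4, 4, 5, 5, 7, 7, 7, 4} *)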

probability or statistics – Stochastic noise with known probability density function

How can I generate samples of a random function whose functional “probability density” is known? When I say “probability density” for a random function $\xi(\mathbf{q},t)$, I mean that there is a functional $P(\xi(\mathbf{q},t))$ such that all mean values over realizations of $\xi$ are calculated as the following path integral:

$\langle A(\xi)\rangle_{\xi} = \int \mathcal{D}\xi\, A(\xi)\, P(\xi)$.

In my particular case this PDF is the following:

$P(\xi(\mathbf{q},t)) = \exp\left(-\int dz \int \frac{d^{2}q}{(2\pi)^{2}}\, q^{11/3}\, |\xi|^{2}\right)$

So the process is Gaussian in a functional sense. I understand that it is always possible to discretize the model and simulate the process manually, but is there some built-in instrument in Mathematica that can deal with such things?

Thanks to everyone in advance!
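
For what it’s worth: since the functional is Gaussian and diagonal in Fourier space, one common do-it-yourself route (not a built-in; the 1D grid, its size, and the infrared cutoff below are all illustrative assumptions) is to draw independent complex Gaussian Fourier modes with variance proportional to $q^{-11/3}$ and transform back:

n = 64;
q = (2. Pi/n) Range[n];       (* illustrative 1D grid of |q| values, avoiding q = 0 *)
sigma = Sqrt[q^(-11/3)/2];    (* per-mode standard deviation *)
modes = sigma RandomVariate[NormalDistribution[], n] +
   I sigma RandomVariate[NormalDistribution[], n];
xi = InverseFourier[modes];   (* one realization on the spatial grid *)

For a real-valued field, the modes would additionally need Hermitian symmetry; as written, xi is complex.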

probability or statistics – Unknown statistical function – groupings of permutations

Please bear with the vagueness of this question’s title, as the question itself stems from the fact that I don’t know what to call the operation I’m looking for. I have a statistical operation involving groupings and permutations, and I’m not sure what its name is. I have some déjà vu from statistics classes that what I’m trying to do is a “named mathematical operation”, but I’m too rusty to remember what it might be called.

I suspect that there is a named function in Mathematica that could get me my desired results in a one-liner, but I don’t even know what to search for in the manual. Unfortunately, I’m going to have to simply post an example of the desired results and some code, and hope somebody recognizes it.

The desired results are:

Given any integer n, first form the list of all integers from 1 through n. Call this the input to the unknown function. Then form the set of permutations (order matters) of this list, and all possible contiguous groupings within each permutation. For example, for n=3, the desired results are:

{
 {{{1, 2, 3}}, {{1, 2}, {3}}, {{1}, {2, 3}}, {{1}, {2}, {3}}},
 {{{1, 3, 2}}, {{1, 3}, {2}}, {{1}, {3, 2}}, {{1}, {3}, {2}}},
 {{{2, 1, 3}}, {{2, 1}, {3}}, {{2}, {1, 3}}, {{2}, {1}, {3}}},
 {{{2, 3, 1}}, {{2, 3}, {1}}, {{2}, {3, 1}}, {{2}, {3}, {1}}},
 {{{3, 1, 2}}, {{3, 1}, {2}}, {{3}, {1, 2}}, {{3}, {1}, {2}}},
 {{{3, 2, 1}}, {{3, 2}, {1}}, {{3}, {2, 1}}, {{3}, {2}, {1}}}
}

I have been able to achieve this with the following ad-hoc function:

PermutationGroupings = Function[range,
   With[
    {
     groupings = 
      Thread@Unevaluated@Splice@*Permutations@# &@
       IntegerPartitions@range,
     perms = Permutations@Range@range
     },
    Outer[TakeList, perms, groupings, 1]
    ]];

PermutationGroupings@3

The above function relies on what I know about list manipulation rather than on core mathematical functions. It’s at least better than some earlier attempts that used Groupings and Permutations together to produce a massive list which then needed to be manipulated and pruned down to remove duplicates. I’ve been binging on the reference manual, and came up with a bunch of other inefficient and half-baked alternatives involving everything from FrobeniusSolve[...] to Subsequences[...] of DeBruijnSequence[...]s, and all I’ve concluded is that I have a lot to learn. But I could’ve sworn that there is a term for what I’m trying to do, and probably a dedicated Mathematica function to do it.

Any leads? Thanks!

Again, for display purposes, the output can be rendered as:

Grid[#, Frame -> All] &@
     Map[Column[#, Alignment -> Center, 
        Spacings -> {0, 0}] &, #, {2}] &@
   Map[Grid[List@#, Frame -> All] &, #, {3}] &@
 PermutationGroupings@3

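One possible lead: the block-length patterns being threaded over each permutation are exactly the compositions of n (ordered integer partitions), so an equivalent sketch (the helper names here are made up) is:

compositions[n_] := Join @@ (Permutations /@ IntegerPartitions[n]);
orderedGroupings[n_] := 
  Outer[TakeList, Permutations[Range[n]], compositions[n], 1];
orderedGroupings[3]  (* matches the desired output above *)

For searching purposes, splitting a sequence into contiguous non-empty blocks like this is the classic “compositions” construction; the related notion where blocks are internally unordered is usually called an ordered set partition.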

What are the most useful descriptive statistics when analysing performance benchmark results?

You are interested in measuring request latency. A typical latency metric is the duration between a network request being received and the first byte of the response being returned.

Such latencies typically form a highly skewed distribution that has a long tail. Thus, measures like mean or standard deviation can be misleading.

Furthermore, some statistics are not very robust, meaning that they are sensitive to outliers. Such statistics include min/max, and to some degree the mean as well.

So let’s talk about percentiles instead. Percentiles are typically fairly robust. The 50th percentile is the median. It sits right in the middle of the dataset, in the sense that half of all measurements are above it and half are below. This is a very robust metric, since outliers don’t change the ranking very much. But it’s also not a very interesting metric. Usually, most requests have acceptable latencies. It’s usually more important to have consistently good latencies, i.e. to get outliers under control. That’s why the 95th or 99th percentiles are often monitored. For the 95th percentile, 95% of all requests are faster than the statistic; only 1 in 20 requests is slower. Of course, percentiles further away from 50% become less robust, so it is important to consider the size of the data set. E.g. the 99th percentile in a data set of 100 measurements is effectively meaningless, but might become useful when you have 500 measurements or more.

Percentiles are great for visually inspecting “tail latencies”, but they are mathematically inconvenient. When you run an experiment to reduce latencies, you may or may not see a clear difference. Statistical tests let you quantify the change. If your data set is large enough that the sample means are approximately normal (central limit theorem), you can use tests for normally distributed data such as the t-test. The Mann-Whitney U-test is very flexible and can be used for datasets with arbitrary distributions, but it is less sensitive. Use statistical software such as R to do the calculations.

So in conclusion: gather metrics for percentiles (e.g. 50%, 90%, 95%). Do experiments to reduce such tail latencies. If the change isn’t obvious, use statistical tests to see if your changes are moving in the right direction. Note that real-world benchmarking can be very tricky. In particular, the real cause of slowdowns might only show itself under load, or might not be visible in test environments.
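
As a small illustration of both steps, sketched in Mathematica with synthetic lognormal “latencies” (R’s quantile and wilcox.test do the same job):

before = RandomVariate[LogNormalDistribution[0, 1], 1000];    (* toy latencies *)
after = RandomVariate[LogNormalDistribution[-0.1, 1], 1000];  (* after a change *)
Quantile[before, {0.50, 0.90, 0.95, 0.99}]  (* tail percentiles to monitor *)
MannWhitneyTest[before, after]              (* p-value of the distribution-free test *)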

statistics – Showing Function Minimizes Distance

In the context of K-modes clustering, how can I show that for a set $C = (X_1, X_2, …, X_n)$ of categorical observations, the mode of C which is the set $m = (q_1, q_2,…, q_n)$ minimizes the distance function $D(C, m) = sum dist(X_i, m)$? My initial idea was to expand D(C, m) like $|X_1 – q_1| + |X_2 – q_2| + … $ and then set it equal to 0, take partial derivatives and solve, but I’m not sure that works in this case. Any help is appreciated!