statistics – Benchmarking, why discard lowest time?

The lowest timing might indeed represent the “true” timing without outside interference, or it might be a measurement error. E.g. the boosting behaviour of a CPU might speed up the first run in a larger benchmarking suite, or a less congested network might speed up a network-dependent benchmark at certain times of day. Similarly, the highest timing might represent the true worst case, or non-representative interference. E.g. a background service might have started during the benchmark, or an SMR hard drive's cache might be flushed during an IO-based benchmark.

A statistic like the mean (average) is very sensitive to outliers. It is thus common to use a trimmed mean, where outliers are removed in the hope of getting closer to the “true” mean. Various methods for determining outliers exist, with the simplest approach being to remove the top/bottom p% of values, for some value p.
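As a minimal sketch of that idea in Python (the timings and the 20% cut-off are invented for illustration, not a recommendation):

```python
import numpy as np
from scipy import stats

# Invented example timings, with one obvious outlier (30.5).
timings = np.array([12.1, 11.9, 12.0, 11.8, 30.5, 12.2, 11.7, 12.0])

# Trimmed mean: discard the lowest and highest 20% of values before averaging.
trimmed = stats.trim_mean(timings, proportiontocut=0.2)

# The plain mean is pulled upwards by the single outlier.
plain = timings.mean()
print(f"mean={plain:.2f}  trimmed mean={trimmed:.2f}")
```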

However, it is not generally necessary to calculate the mean run time when benchmarking. To compare typical behaviour, we can use measures like the median or other quantiles. Especially when measuring latencies, quantiles such as the 95th percentile are often used (meaning: 95% of measurements were this fast or faster).
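For example, a quick sketch with NumPy (the latency values are made up):

```python
import numpy as np

# Invented latency measurements in milliseconds.
latencies_ms = np.array([12.1, 11.9, 12.0, 11.8, 30.5, 12.2, 11.7, 12.0, 12.3, 11.9])

median = np.median(latencies_ms)        # typical latency
p95 = np.percentile(latencies_ms, 95)   # 95% of measurements are at or below this value
print(f"median={median:.1f} ms  p95={p95:.1f} ms")
```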

It is also unnecessary to calculate the mean when trying to determine whether one program is significantly faster than another. Instead of parametric statistical tests, which require such parameters to be estimated from the sample, it is possible to use non-parametric tests. E.g. the Mann–Whitney rank-sum test only considers the order of values from the two samples, not their actual values. While this is more flexible and more rigorous, it does lose some statistical power (you might not be able to detect a significant difference even if it exists).
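A sketch of such a comparison using SciPy's implementation of the test (the timing arrays are invented for illustration):

```python
import numpy as np
from scipy import stats

# Invented example timings (seconds) for two programs, several runs each.
program_a = np.array([1.21, 1.19, 1.23, 1.20, 1.22, 1.18, 1.24, 1.21])
program_b = np.array([1.30, 1.28, 1.27, 1.31, 1.29, 1.33, 1.26, 1.30])

# Mann–Whitney U test: non-parametric, only uses the ranks of the pooled values.
# alternative="less" asks whether program A's timings tend to be smaller (faster).
result = stats.mannwhitneyu(program_a, program_b, alternative="less")
print(f"U={result.statistic}, p-value={result.pvalue:.4f}")
```

A small p-value would suggest program A tends to be faster; with few runs, the test may simply fail to detect a real difference, which is the loss of power mentioned above.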