I want to use all 40 physical cores of my dual Xeon Gold 6230 system with 64 GB (2 × 32 GB) of installed memory. The operating system is Ubuntu 18.04.
The task is to compute the eigenvalues of as many random matrices as possible. For small matrices (300×300 in my test), performance does not suffer any significant loss as the number of independent eigensolver worker instances increases. For example, ten workers each solving 1000 matrices (1E+4 matrices in total), twenty workers each solving 1000 matrices (2E+4 in total), and forty workers each solving 1000 matrices (4E+4 in total) all finish in about the same wall-clock time.
However, when the matrices are large (2000×2000), MKL performance drops significantly as more workers are used. MKL_NUM_THREADS = 1 in all tests.
- 1 worker, 10 matrices each: 1m15s to finish (CPU 100%)
- 10 workers, 10 matrices each: 2m23s to finish (CPU 1000%)
- 20 workers, 10 matrices each: 5m34s to finish (CPU 2000%)
Twenty workers take more than twice as long as ten workers, even though each worker solves the same number of matrices.
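To make the slowdown concrete, the timings above can be converted into aggregate throughput and parallel efficiency. This is only arithmetic on the three reported runs; the names `runs`, `throughput` and `summary` are mine, not part of the original tests.

```python
# Wall-clock times reported above, converted to seconds.
runs = {
    1: 1 * 60 + 15,   # 1 worker:   1m15s
    10: 2 * 60 + 23,  # 10 workers: 2m23s
    20: 5 * 60 + 34,  # 20 workers: 5m34s
}
MATRICES_PER_WORKER = 10  # every worker solves 10 matrices of size 2000x2000


def throughput(workers, seconds):
    """Total matrices solved per second across all workers."""
    return workers * MATRICES_PER_WORKER / seconds


base = throughput(1, runs[1])  # single-worker reference rate
summary = {}
for workers, seconds in runs.items():
    rate = throughput(workers, seconds)
    speedup = rate / base            # relative to one worker
    efficiency = speedup / workers   # 1.0 would be perfect scaling
    summary[workers] = (rate, speedup, efficiency)
    print(f"{workers:2d} workers: {rate:.3f} matrices/s, "
          f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these numbers, the total throughput with twenty workers comes out lower than with ten, so the extra ten workers do not just fail to help, they reduce the aggregate rate.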
The tests were performed in Mathematica 10, MATLAB 2019b, Python 3.7 and Eigen3 (linked against Intel MKL). Memory usage stays below 12%.
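For reference, a minimal Python sketch of this kind of test might look like the following. It is an illustration under assumptions, not the exact script I ran: it assumes NumPy (whether it is linked against MKL depends on the installed build), uses Gaussian random matrices, and relies on the Linux `fork` start method. The function names `solve_batch` and `run_benchmark` are made up for this example, and the sizes in the final loop are kept small so it finishes quickly; the real tests used 2000×2000 matrices and 10 matrices per worker.

```python
import os

# Must be set before NumPy is imported so the BLAS/LAPACK backend sees it.
os.environ.setdefault("MKL_NUM_THREADS", "1")

import multiprocessing as mp
import time

import numpy as np


def solve_batch(count, n, seed):
    """One worker: compute the eigenvalues of `count` random n x n matrices."""
    rng = np.random.default_rng(seed)
    for _ in range(count):
        a = rng.standard_normal((n, n))
        np.linalg.eigvals(a)  # result discarded, only the time matters


def run_benchmark(workers, matrices_per_worker, n):
    """Run `workers` independent processes and return the wall-clock time."""
    ctx = mp.get_context("fork")  # assumption: Linux, as on the Ubuntu test box
    procs = [
        ctx.Process(target=solve_batch, args=(matrices_per_worker, n, seed))
        for seed in range(workers)
    ]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Small illustrative sizes; raise n to 2000 to approach the real test.
    for workers in (1, 2, 4):
        elapsed = run_benchmark(workers, matrices_per_worker=2, n=300)
        print(f"{workers} workers: {elapsed:.2f} s")
```

If the workers scaled perfectly, the printed time would stay roughly constant as the worker count grows, which is what I see for small matrices but not for large ones.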
The test code is simple; for example, the Mathematica code is:
Any ideas on how to improve MKL's performance or identify the hardware bottleneck are welcome.