performance tuning – Parallelizing large sum to use all cpu cores


I’m looking to compute the the following expectation over many samples $x$

$$E((I-alpha xx’)text{cov}(I-alpha xx’))$$

Simple example is below. samples controls number of $x$‘s sampled, and the issue is that when samples is large compared to d, Mathematica doesn’t use all cpu cores — I’m only seeing utilization on a single core. How do I make this computation better parallelized?

samples = 1000000;
alpha = 0.5;
d = 2;
ii = IdentityMatrix(d);
cov = RandomReal({-1, 1}, {d, d});
X = {{0, 3}, {2, 0}, {-1, -1}};
batchStepCov :=
  
  Mean((ii - alpha Outer(Times, #, #)).cov.(ii - 
        alpha Outer(Times, #, #)) & /@ RandomChoice(X, samples));
batchStepCov // Timing
```