numerical integration – Parallelizing or optimizing use of Outer and Integrate

I’m trying to optimize or parallelize the following code as it’s slow for larger n

{ind = indexToRep /@ Range(n^2), limits = Table(1-k/(a-1/2),{k,0,a-1}), overlap, matrixoflists, listofmatrices},
    overlap = Function({r1,r2},integralsForListofLimits(r1, r2, limits));
    matrixoflists = Outer(overlap,ind,ind,1);
    listofmatrices = Transpose(matrixoflists,{3,2,1});
    MapThread(convertToProjection(#) & , {listofmatrices}))

It requires the following helper functions:

indexToRep(n_) := indexToRep(n) = Module({j,m},j=Floor(Sqrt(n-1));


(* Outputs a list of length Length(limits) where each non-zero element is the result of a spherical harmonic integral with an upper limit given by the element of list limits *)
integralsForListofLimits(r1_?ListQ,r2_?ListQ,limits_) := integralsForListofLimits(r1, r2, limits) = integralsForListofLimits(r2, r1, limits) = Block(
{(Theta), (Phi), j1 = First(r1), m1 = Last(r1), j2 = First(r2), m2 = Last(r2)},
Return(If(Last(r1)!=Last(r2), ConstantArray(0.0, Length(limits)),
Subtract @@ (2*Pi*sphNormCoeffs(j2,m2)*sphNormCoeffs(j1,m1)*Integrate(LegendreP(j1,m1,x)LegendreP(j2,m2,x),x) /. x -> {1, limits})))

convertToProjection(mat_) := Block({u, w, v},
{u, w, v} = SingularValueDecomposition(mat);
u . DiagonalMatrix(If(Re(#)>0.5,1.,0.) & /@ Diagonal(w)) . ConjugateTranspose(v)

I tried replacing Outer with Parallelize @ Outer and similarly MapThread with Parallelize @ MapThread, but it was actually much slower on 8 parallel cores than just on a single one.

projOuterAllAngles(n,a) outputs a list of dimensions {a, n^2,n^2} and it basically computes a list of a matrices of size n^2 by n^2. These matrices are block-diagonal with non-zero entries being the result of the integrals from integralsForListofLimits.

Any help on how I can make this code faster would be much appreciated.