## graphs – Partitioning a Boolean circuit for automatic parallelization

tl;dr: I have a Boolean circuit that I need to implement with a very specific set of single-threaded primitives, where SIMD evaluation becomes much cheaper above a certain threshold. I want to optimize the cost of the implementation.

In detail, the input is a combinational Boolean circuit (so no loops, no state, etc.). I implement it in software with a rather unusual set of primitives, such that:

• Logic gates must be evaluated one at a time, since the "engine" is single-threaded
• NOT gates are free
• AND, OR, XOR each cost 1 (e.g., 1 second)
• N identical gates can be evaluated simultaneously at a cost of 10 plus a small proportional term (e.g., a batch of 20 distinct AND gates in ~10 seconds, 50 distinct XOR gates in ~10 seconds, and so on)

The goal is to implement the circuit with the given primitives while minimizing the total cost.
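As a rough sketch of the cost model above (the proportional constant `eps` is a placeholder I made up, not part of the original problem), batching identical gates starts to pay off once the batch cost drops below the per-gate cost:

```mathematica
eps = 0.05;  (* hypothetical proportional term per gate in a batch *)
individualCost[n_] := n;        (* n gates evaluated one at a time, cost 1 each *)
batchCost[n_] := 10 + eps n;    (* one SIMD batch of n identical gates *)
(* batching wins once 10 + eps*n < n, i.e. n > 10/(1 - eps), so roughly n >= 11 here *)
```

With these placeholder numbers the break-even point is around a dozen identical gates; the real threshold depends on the actual proportional term.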

### What I tried

This problem is vaguely related to bin packing, but the differences – ordering constraints on the items, and a cost per "bin" that depends on the number of items in it – make me think the analogy does not really hold.

It has been suggested to me that integer linear programming sounds like the best fit so far, but I'm not sure how to formulate the problem. In particular, I would use binary variables to represent whether implementation gate/batch M realizes circuit gate N, but then I do not know how to express the objective to be minimized.

## Parallelization – Calling an MPI process from a Mathematica notebook (RunProcess / Run) on a Mac

I need to launch an MPI + C process from my Mathematica notebook and retrieve its output. In the simple case without MPI, this is easy:

```
RunProcess[{"/name_of_executable", ToString[argument]}, "StandardOutput"]
```

But my command line is: `mpirun -n #threads ./name_of_executable argument`

How do I get Mathematica to do that? Thanks in advance!
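One way this might be wired up (a sketch, assuming 4 ranks; the rank count and paths are placeholders) is to pass `mpirun` and its arguments as separate list elements to `RunProcess`:

```mathematica
nRanks = 4;  (* hypothetical number of MPI ranks *)
out = RunProcess[
   {"mpirun", "-n", ToString[nRanks], "./name_of_executable", ToString[argument]},
   "StandardOutput"]
```

One caveat worth checking: on macOS the Mathematica front end does not necessarily inherit the shell's `PATH`, so an absolute path to `mpirun` may be needed if the bare name is not found.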

## Parallelization – Core usage across multiple CPU sockets limited to a single socket (not 100%)

I have a machine with two physical sockets, 32 physical cores, and 64 logical cores; the CPU is an Intel E5-2686 v4.

There are two problems:
1. Mathematica sees only 16 physical processors when checking `$ProcessorCount`.
2. When running an intensive operation on a single kernel, only 50% of the cores reach 100% load while the other 16 cores stay idle.

Is there a way to let Mathematica use all cores across both sockets?

Many thanks.
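One thing that may be worth trying (a sketch, assuming the limit is only the default kernel-pool size and not the OS scheduler pinning the process to one socket): `$ProcessorCount` merely reports what Mathematica detected, but a larger pool of parallel kernels can be launched explicitly.

```mathematica
$ProcessorCount        (* what Mathematica detected, here reportedly 16 *)
CloseKernels[];        (* shut down the default kernel pool *)
LaunchKernels[32]      (* explicitly launch one parallel kernel per physical core *)
```

Whether the extra kernels actually land on the second socket is up to the operating system's scheduler, so this addresses point 1 but not necessarily point 2.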

## Parallelization – DistributeDefinitions evaluates the definitions, but only for a large number of definitions

I use Mathematica 11.3 and this looks like a bug to me. If possible, I'd also appreciate an idea for a workaround.

Here is an example of trivial code that works as expected:

```
nI = 10;
(NM[#] := Print[#]) & /@ Range[1, nI];
LaunchKernels[];
DistributeDefinitions[NM];
```

As expected, the above code produces no output.

However, if the first line is changed to

```
nI = 20;
```

the same code prints 40 lines: the numbers 1 to 20, twice.

For some reason, DistributeDefinitions forces evaluation of the definitions of NM, and I do not want that to happen before I use ParallelSubmit and WaitAll. I have tried this on two computers, both with Mathematica 11.3. Any idea what is going on?
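A possible workaround (a sketch only, not verified against 11.3) is to bypass DistributeDefinitions entirely and create the definitions directly on each subkernel with ParallelEvaluate, so nothing that could be evaluated is shipped from the master kernel:

```mathematica
nI = 20;
LaunchKernels[];
(* create NM[1], ..., NM[nI] on every subkernel; the := keeps Print unevaluated *)
With[{n = nI},
 ParallelEvaluate[(NM[#] := Print[#]) & /@ Range[1, n];]
 ];
```

This only helps if the subkernels do not also need the master kernel's copy of NM; whether it fits depends on how the definitions are used with ParallelSubmit later.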

## Parallelization – Using ExternalEvaluate in ParallelDo

I've had great success using Python from Mathematica (for certain calculations for which there are highly optimized Python packages but no Mathematica equivalents). I have some Mathematica functions that wrap calls to Python functions, similar to the example in the Applications section of this page: https://reference.wolfram.com/language/ref/ExternalEvaluate.html.

I would now like to call these functions inside a ParallelDo, but unfortunately that does not work. Here is a MWE illustrating the issue:

```
session = StartExternalSession["Python-NumPy"];
ExternalEvaluate[session, "def double(x):
    return x*2"];
doublePython[arg_] := ExternalEvaluate[session, "double(" <> ToString[arg] <> ")"];
Do[Pause[1]; Print[doublePython[i]], {i, Range[4]}] // AbsoluteTiming
ParallelDo[Pause[1]; Print[doublePython[i]], {i, Range[4]}] // AbsoluteTiming
```

## Parallelization – ParallelDo gives a different result for Eigensystem

I'm trying to compute the eigensystem of a large matrix (e.g., 256×256). I've found that if I do this inside a ParallelDo (because I actually compute many such eigensystems), the result differs from what I get on the main kernel.
Example:

```
SeedRandom[1234];
matrix = RandomReal[NormalDistribution[0, 1], {256, 256}];
(* make the matrix Hermitian *)
matrix = matrix + ConjugateTranspose@matrix;
mainkernel = Eigensystem[matrix];
parallelkernel = Table[0, {j, 4}];
SetSharedVariable[parallelkernel];
ParallelDo[parallelkernel[[j]] = Eigensystem[matrix], {j, 4}];

(* check eigenvalues *)
Max@Abs@(mainkernel[[1]] - parallelkernel[[1, 1]])
(* != 0 *)
Max@Abs@(parallelkernel[[1, 1]] - parallelkernel[[2, 1]])
(* = 0 *)

(* make the largest entry of each eigenvector positive so they point the same way *)
signofmain = Map[Sign@Total@MinMax[#] &, mainkernel[[2]]];
signofparallel = Map[Sign@Total@MinMax[#] &, parallelkernel[[1, 2]]];

(* check eigenvectors *)
Max@Abs@(signofmain*mainkernel[[2]] - signofparallel*parallelkernel[[1, 2]])
(* != 0 *)
Max@Abs@(parallelkernel[[1, 2]] - parallelkernel[[2, 2]])
(* = 0 *)
```

Notably, all calculations done inside ParallelDo agree exactly with each other, but differ from the result on the main kernel.

I realize that the differences here are extremely small. However, a subsequent division by the difference of two eigenvalues (in my case, self-energies) can in some cases amplify the error to as much as 10^-2, which is of course not negligible.

Where does this difference come from and how can I avoid it?