I have to write a query which must rely on regular expressions and I am not allowed to use CLR or use anything outside the SQL Server environment. So I must use a R or Python external script. I have experience with Python. So I went with that.
I wrote the script it works but the input data is extremely large so the external script execution takes a long time. I already heavily optimized the script, but it is still slow. So I plan to parallelize it.
Python’s multiprocessing module doesn’t work inside the external script. So I got a strange idea:
What if I horizontally partition the temp table which holds the input data and execute the external script on the partitions, then I just get back their union?
Or perhaps there is a better way to parallel execution of an exterenal script?
I work on SQL Server 2019 Development version we also have an enterprise version which I guess doesn’t matter here.
@J.D rightly asked for more details. I am only allowed to share a limited amount:
The temp table providing the input has different columns for the actual-, birth- and mother names for different people.
I also have a dataframe in the external script containing valid first names and associated genders parsed from a JSON which is passed as a nvarchar(max). This method described here.
My script extracts titles, surnames, given names and guess biological sex and martial status. The names are not English names.
I apply a function to the first axis of the input Pandas Dataframe which does the extraction.
I use only Pandas provided methods.
With a standard python environment I would use the np.array_split function to split the dataframe and use the multiprocessing module to apply extraction function on the splits in parallel processes, then concatenate the dataframes. As you can see this is a text processing problem and cannot be vectored as mathematical can be. So tools like Dask cannot be used here.
Using CLR would be a sane choice. So a college of mine and I also recommended to use CLR. Our expert said using CLR is very dangerous. Yes I know about the CLR strict security option, but our expert does not. So I am forced to use external script against my better judgement.