regex – Request for help to speed up batch program for 17,000 TXT files

I have over 17,000 pages that have been scanned (for a local history archive), which I have OCRed using Tesseract to individual TXT files. I want to be able to locate every page containing a given search word of more than 3 lowercase letters. So for each TXT file I need to:

Delete all rubbish from the OCR text, i.e. non-alphanumeric characters - jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -
Remove 1, 2 and 3 letter words - jrepl "\b\w{1,3}\b" "" /x /f %%G /O -
Change all characters to lower case - jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -
To be able to sort the remaining words they need to be on separate new lines - jrepl "\s" "\n" /x /f %%G /O -
Finally sort all unique words into alphabetic order and create the modified TXT file - sort /UNIQUE %%G /O %%G

I have a batch file that does the above using JREPL, but it is very slow. It has been running for over 100 HOURS and I’m not even halfway. Any suggestions for speeding up the processing? I am running Windows 10. Thanks.
This is the batch file I am running:

Setlocal EnableDelayedExpansion
for %%G in (*.txt) do (
set old=%%G
echo !old!
@echo on

rem remove non-alphanumeric
call jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -

rem remove 1, 2 and 3 letter words
call jrepl "\b\w{1,3}\b" "" /x /f %%G /O -

rem all to lowercase
call jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -

rem replace spaces with new lines
call jrepl "\s" "\n" /x /f %%G /O -

rem reduce to unique words
sort /UNIQUE %%G /O %%G

)
pause
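
For comparison, the five steps can be combined into a single pass per file. The batch file runs JREPL four times and SORT once for every file, so each file is read and rewritten five times, and each run starts a new process. That repeated startup is likely where much of the time goes. The following is a minimal Python sketch of the same pipeline, not a drop-in replacement for the batch file. The names extract_words and process_folder are hypothetical, and it assumes the OCR output is UTF-8 text (undecodable bytes are dropped).

```python
import re
from pathlib import Path

# Anything that is not an ASCII letter, digit or whitespace counts as OCR rubbish.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")


def extract_words(text: str) -> list[str]:
    """Return the sorted, unique, lower-cased words longer than 3 characters."""
    cleaned = NON_ALNUM.sub("", text)
    # split() with no arguments splits on any run of whitespace.
    words = {word.lower() for word in cleaned.split() if len(word) > 3}
    return sorted(words)


def process_folder(folder: str) -> None:
    """Rewrite every *.txt file in folder in place, one word per line."""
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        path.write_text("\n".join(extract_words(text)) + "\n", encoding="utf-8")
```

Like the batch file, process_folder overwrites the originals, so it is worth trying it on a copy of a few files first.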

7 – Batch delete users

When I received this Drupal 7 site to manage, it had been open for years for anyone to create an account, and now I have ~ 300 PAGES of users with names like ismehacker1234568 and such, with under a dozen legitimate accounts I need to keep.

Once, on another site long ago I tried removing the users from the database table using a SQL query. That was a disaster and broke the entire site. I want to avoid that this time around.

I also need to delete any content they made, but not anything the legitimate users made.

python – Processing Batch Job

Validation

This function _is_player_id_list_valid is stuck between two useful concepts – validating and returning bool, and throwing-or-not. Don’t attempt to do a half-measure of both. Given its current name, it would be less surprising to do

for player_id in provided_player_id_list:
    if player_id not in all_player_id_list:
        return False
return True

If you want to keep the exception, then

  • Delete the return
  • Change the return type to None
  • Use a more specific type than Exception
  • Rename the method to something like check_player_id_list.

For this method you should also get rid of the loop, cast the lists to sets, use set intersection, and then base your error message off of all of the missing elements instead of just the first.
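
A minimal sketch of the exception-raising variant, assuming the IDs are hashable. The exception class UnknownPlayerIdError is a hypothetical name, and the parameter names are taken from the loop above. Note that the sketch uses set difference rather than intersection, since the difference gives the missing elements directly.

```python
from typing import Hashable, Iterable


class UnknownPlayerIdError(ValueError):
    """Raised when one or more provided player IDs are not known (hypothetical name)."""


def check_player_id_list(
    provided_player_id_list: Iterable[Hashable],
    all_player_id_list: Iterable[Hashable],
) -> None:
    # Everything that was provided but is not in the known list.
    missing = set(provided_player_id_list) - set(all_player_id_list)
    if missing:
        # Report every missing ID, not just the first one.
        raise UnknownPlayerIdError(
            f"Unknown player IDs: {sorted(missing, key=str)}"
        )
```

The function returns None on success, so callers rely on the exception rather than on a boolean result.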

Late serialization

Validation should be done on tranche_date. The sanest way to do this is expect a date of a specific format (which you probably already do, though you haven’t shown it); parse it into a real datetime (or perhaps date), and then re-serialize it in _get_all_player_id. Its representation should only be str at the extreme edges of your program – in your argument parsing, and in your S3 call. In the middle it should be a real date type.
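
As a rough illustration of parsing at the edge and serializing late: the original code does not show the expected format, so the sketch assumes YYYY-MM-DD. The helper names parse_tranche_date and serialize_tranche_date are hypothetical.

```python
from datetime import date, datetime

# Assumed format; substitute whatever the real program expects.
DATE_FORMAT = "%Y-%m-%d"


def parse_tranche_date(raw: str) -> date:
    """Validate the string at the program edge; raises ValueError if malformed."""
    return datetime.strptime(raw, DATE_FORMAT).date()


def serialize_tranche_date(tranche_date: date) -> str:
    """Convert back to str only where a string is required, e.g. for the S3 call."""
    return tranche_date.strftime(DATE_FORMAT)
```

With argparse, parse_tranche_date can be passed as the type= of the argument, so an invalid date is rejected before any other work starts.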

Error messages

except BatchTimeoutException:
    print(f"Jobs are known to fail due data missingness.")

is… a little strange. You could print this and it would still be valid even if there were no timeout. Instead perhaps consider

    print('Batch timed out. Data may be missing.')

Note that this does not need to be an f-string. Also, what data?

How can I get a uniform white balance on a batch of JPEG images?

If you find that hitting the “auto” button in the GIMP levels dialog generally does the thing you’re looking for, you can batch that as described here.

Specifically, you would put this script:

(define (batch-auto-levels pattern)
(let* ((filelist (cadr (file-glob pattern 1))))
  (while (not (null? filelist))
         (let* ((filename (car filelist))
                (image (car (gimp-file-load RUN-NONINTERACTIVE
                                            filename filename)))
                (drawable (car (gimp-image-get-active-layer image))))
           (gimp-levels-stretch drawable)
           (gimp-file-save RUN-NONINTERACTIVE
                           image drawable filename filename)
           (gimp-image-delete image))
         (set! filelist (cdr filelist)))))

into the GIMP scripts directory (~/.gimp-x.x/scripts/ or %appdata%\GIMP\x.x\scripts on Windows) named ‘batch-auto-levels.scm’ and then run

gimp -ifd -b '(batch-auto-levels "*.jpg")' -b '(gimp-quit 0)'

within the directory containing the images. Note that this will overwrite the images – copy them to a test directory and work on that until you know the results are what you want. Also make sure that your metadata is intact (GIMP is pretty good about this these days). You can set the JPEG quality you want to use as the default in the GIMP JPEG export dialog, then quit GIMP before running the batch script.

batch file – Wget or similar for WinCE 6.0 ARM

We have a bunch of handheld computers running WinCE 6.0 on ARM CPUs.

  • Device = Datalogic Skorpio
  • Input = Stylus + On-Screen keyboard + Numeric pad

They need to be reset from time to time due to errors, and this requires the end-user to ship the device to us at IT, as a reset also wipes the configuration and files.

Loading files onto them is getting harder and harder for us, as Microsoft is making it more and more difficult to use “Windows Mobile Device Center” on Windows 10.

As they are Wi-Fi connected, they can download files from a web server using IE. But this takes a long time when there are many files to download. Plus, I would like it to be DIY for the end-user, with some easy guidance over the phone.

I have searched the web for FTP clients, wget, unZIP, and/or other software, but came up empty.

I need a simple way to load files to these devices, preferably one I can implement into a batch script.

End a process started with START command in a Windows Batch File

I have the following batch file, which uses ADB to monitor device logs and searches for a string:

@ECHO OFF
ECHO Starting log monitor...
START /B adb.exe logcat > log

:LOOP
(TYPE log | FIND "string to find") > NUL
IF "%errorlevel%" == "1" GOTO LOOP

:END
ECHO String found!

The script starts the logcat command, which runs asynchronously and in the background, using START /B.

After the string is found, I would like to end the asynchronous logcat command, as it is no longer needed.

Is there any way for the main script to tell the asynchronous command to end?


I know that I could technically use adb.exe kill-server or taskkill /F /IM adb.exe to end all ADB processes, but I need to only end the logcat command and continue running all other instances of ADB.
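
For comparison, outside of batch the same goal is straightforward when the script keeps a handle to the one process it started. The following is a minimal Python sketch, not a batch solution. The function name wait_for_line is hypothetical, and it assumes the command writes its log lines to standard output, as adb.exe logcat does.

```python
import subprocess


def wait_for_line(command: list[str], needle: str) -> bool:
    """Start command, read its output until needle appears, then stop only it."""
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, text=True, errors="replace"
    )
    try:
        for line in proc.stdout:
            if needle in line:
                return True
        # The command exited without ever printing the needle.
        return False
    finally:
        # Ends just this child process; other running instances are untouched.
        proc.terminate()
        proc.stdout.close()
        proc.wait()
```

A call such as wait_for_line(["adb.exe", "logcat"], "string to find") would then stop that single logcat process once the string is seen, without resorting to adb.exe kill-server or taskkill /IM.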

batch – Timeout detected. (data connection) command line winscp

I am downloading files from an FTP server using a WinSCP script. It worked fine until yesterday, but today it keeps disconnecting, and I do not know what changed. However, I am able to connect with a GUI client such as FileZilla.

script:

"C:Program Files (x86)WinSCPWinSCP.com" /rawconfig InterfaceSessionReopenAutoStall=90000  /log="D:oldwinscp_logwinscp_%yymmdd_hhmmss%.log"  /command ^
    "option batch abort" ^
    "open ftps://***t@***ws.com:%Password%@***.sharefileftp.com/DBS/DB_BACKUP -hostkey="*****"  -passive=on" ^
    "get * E:DBSDB_BACKUP"  ^
    "exit"
< 2020-12-28 10:35:15.080 230-Connection established from static-*****.
< 2020-12-28 10:35:15.158 230-You are connected as **** (invoices@****.com).
< 2020-12-28 10:35:15.158 230 Welcome to the iTech FTP site.
> 2020-12-28 10:35:15.158 SYST
< 2020-12-28 10:35:15.752 215 UNIX Type: L8
> 2020-12-28 10:35:15.752 FEAT
< 2020-12-28 10:35:15.799 211-Extensions supported:
< 2020-12-28 10:35:15.877  EPSV
< 2020-12-28 10:35:15.877  MDTM
< 2020-12-28 10:35:15.877  PASV
< 2020-12-28 10:35:15.877  REST STREAM
< 2020-12-28 10:35:15.877  SIZE
< 2020-12-28 10:35:15.877  UTF8
< 2020-12-28 10:35:15.877  PBSZ
< 2020-12-28 10:35:15.877  PROT
< 2020-12-28 10:35:15.877  X-NOVELLABS
< 2020-12-28 10:35:15.877  X-CITRIX
< 2020-12-28 10:35:15.877 211 End.
> 2020-12-28 10:35:15.877 OPTS UTF8 ON
< 2020-12-28 10:35:15.924 200 OK.
> 2020-12-28 10:35:15.924 PBSZ 0
< 2020-12-28 10:35:15.971 200 OK.
> 2020-12-28 10:35:15.971 PROT P
< 2020-12-28 10:35:16.018 200 Data connections set to secure (SSL) mode
< 2020-12-28 10:35:16.018 Script: Connected
. 2020-12-28 10:35:16.018 Connected
. 2020-12-28 10:35:16.018 Doing startup conversation with host.
< 2020-12-28 10:35:16.018 Script: Starting the session...
> 2020-12-28 10:35:16.018 PWD
< 2020-12-28 10:35:16.064 257 "/"
. 2020-12-28 10:35:16.064 Changing directory to "/".
> 2020-12-28 10:35:16.064 CWD /
< 2020-12-28 10:35:16.830 250 "/" is the current directory.
. 2020-12-28 10:35:16.830 Getting current directory name.
> 2020-12-28 10:35:16.830 PWD
< 2020-12-28 10:35:16.877 257 "/"
. 2020-12-28 10:35:16.877 Startup conversation with host finished.
< 2020-12-28 10:35:16.877 Script: Session started.
. 2020-12-28 10:35:16.877 Retrieving directory listing...
> 2020-12-28 10:35:16.877 CWD /DBS/DB_BACKUP
< 2020-12-28 10:35:17.705 250 "/DBS/DB_BACKUP" is the current directory.
> 2020-12-28 10:35:17.705 PWD
< 2020-12-28 10:35:17.736 257 "/DBS/DB_BACKUP"
> 2020-12-28 10:35:17.736 TYPE A
< 2020-12-28 10:35:17.783 200 ASCII mode selected. (Note: This server treats ASCII mode identically to Binary mode.)
> 2020-12-28 10:35:17.783 PASV
< 2020-12-28 10:35:17.830 227 Entering Passive Mode (***)
> 2020-12-28 10:35:17.830 LIST
. 2020-12-28 10:35:17.830 Connecting to **** ...
. 2020-12-28 10:35:32.236 Timeout detected. (data connection)
. 2020-12-28 10:35:32.236 Could not retrieve directory listing
. 2020-12-28 10:35:32.236 Connection was lost, asking what to do.
. 2020-12-28 10:35:32.236 Asking user:
. 2020-12-28 10:35:32.236 Lost connection. ("Timeout detected. (data connection)","Could not retrieve directory listing")
< 2020-12-28 10:35:32.236 Script: Lost connection.
< 2020-12-28 10:35:32.236 Script: Timeout detected. (data connection)

< 2020-12-28 10:35:32.236 Could not retrieve directory listing