Python parser using Regex for a course catalog

I’ve been trying to develop a parser for old course catalogs and have an idea of what I want to do but cannot figure it out. Basically, the premise is that I want to parse and find the course abbreviations, so Computer Science would be abbreviated as “(CSC)”. Next, I would need to find the course numbers, course title, and course units. My regex patterns for these are simple:

course_abbrev = re.compile(r'\(([A-Z]{3})\)')
course_num = re.compile(r'[0-9]{3},?')
course_title = re.compile(r'.+?(?=I )')
course_units = re.compile(r'\d')

The format of the catalogs all differ slightly, but they are roughly as follows:

"""
Computer Science (CSC)  
Chairman: ...
201 Introduction to Computing I, 3
(Information of the course)...

220 Another Comp Class I, 3
(Information)... 
...  
...
...

Dental Hygiene (DHY)
Chairman: ...
101...
"""

The text of the catalog is somewhat jumbled because it is being read via PyPDF2 (the catalogs are in PDF format), so I am reading one page of information at a time. What would be an efficient method for finding the abbreviations, then the number after each abbreviation, then the title after that number, and then the course units? The re module has ways to list all matches of a pattern (re.findall()) or search for one (re.search()), but I am unsure how to find one pattern, store the result, then find a different pattern from there, store it, and so on.
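One way to chain the searches is a single pass over the page: split on the abbreviation headers, then pull the number, title, and units out of each course line with one combined pattern. A minimal sketch, assuming a header like `Computer Science (CSC)` and course lines like `201 Introduction to Computing I, 3` (the sample text and exact patterns are illustrative, not your real catalog):

```python
import re

# Header like "Computer Science (CSC)" and course lines like
# "201 Introduction to Computing I, 3"
abbrev_re = re.compile(r'\(([A-Z]{3})\)')
course_re = re.compile(r'^(\d{3})\s+(.+?),\s*(\d+)\s*$', re.MULTILINE)

def parse_page(text):
    """Return a list of (abbrev, number, title, units) tuples."""
    courses = []
    # split() with a capturing group yields [before, abbrev1, body1, abbrev2, body2, ...]
    blocks = abbrev_re.split(text)
    for abbrev, body in zip(blocks[1::2], blocks[2::2]):
        for num, title, units in course_re.findall(body):
            courses.append((abbrev, num, title, units))
    return courses

page = """Computer Science (CSC)
Chairman: ...
201 Introduction to Computing I, 3
(Information of the course)...

Dental Hygiene (DHY)
Chairman: ...
101 Oral Anatomy, 2
"""
print(parse_page(page))
```

Splitting with a capturing group keeps each abbreviation in the result list, so the department code travels with the block of text that follows it and the "find one, store it, find the next" chaining falls out of the loop structure.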

javascript – Regex to match image id from url

I have a URL as follow:

https://res.cloudinary.com/frivillighet-norge/image/upload/v1501681528/5648f10ae4b09f27e34dd22a.jpg

and I want to match only the id of the picture at the end of the string, without including .jpg. So far I have written something like this: ^[A-Za-z0-9]{24}$, which matches a string of letters and digits with a length of 24 (my id always has length 24), but this does not work, since it only matches strings that are exactly 24 characters long.

Any help would be appreciated.
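One option is to anchor on the extension rather than the whole string, so the 24-character run is only matched when it sits immediately before `.jpg` at the end of the URL. A sketch of the idea in Python (the same pattern works unchanged in JavaScript):

```python
import re

url = "https://res.cloudinary.com/frivillighet-norge/image/upload/v1501681528/5648f10ae4b09f27e34dd22a.jpg"

# Capture 24 alphanumerics that sit right before ".jpg" at the end.
m = re.search(r'([A-Za-z0-9]{24})\.jpg$', url)
print(m.group(1))  # 5648f10ae4b09f27e34dd22a
```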

regex – Spark Scala: SQL rlike vs Custom UDF

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this table needs to be joined against an incoming dataset. Initially I was using Spark SQL’s rlike method as below, and it was able to handle the load while incoming record counts were under 50K.

PS: The regular expression reference data is a broadcasted dataset.

dataset.join(regexDataset.value, expr("input_column rlike regular_exp_column"))

Then I wrote a custom UDF to transform the data using Scala’s native regex search, as below.

The val below collects the reference data as an array of tuples:
val regexPreCalcArray: Array[(Int, Regex)] = {
    regexDataset.value
        .select("col_1", "regex_column")
        .collect
        .map(row => (row.get(0).asInstanceOf[Int], row.get(1).toString.r))
}

Implementation of Regex matching UDF,

    def findMatchingPatterns(regexDSArray: Array[(Int, Regex)]): UserDefinedFunction = {
        udf((input_column: String) => {
            for {
                text <- Option(input_column)
                matches = regexDSArray.filter(_._2.findFirstIn(text).isDefined)
                if matches.nonEmpty
            } yield matches.map(_._1).min
        }, IntegerType)
    }

Joins are done as below: the UDF returns a unique ID from the reference data (the minimum, in case of multiple regex matches), which is then joined back against the reference data on that unique ID to retrieve the other columns needed for the result.

dataset.withColumn("min_unique_id", findMatchingPatterns(regexPreCalcArray)($"input_column"))
.join(regexDataset.value, $"min_unique_id" === $"unique_id" , "left")

But this too gets very slow, with skew in execution (one executor task runs for a very long time), when the record count spikes to 1M+. Spark advises against UDFs since they degrade performance. Are there any other best practices I should apply here, or any suggestions to do this efficiently?
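For reference, here is the UDF's matching logic restated as a plain Python sketch (not Spark code; the ids and patterns are made up). It makes explicit that every input value is tested against every pattern, i.e. the work per row grows linearly with the size of the reference table:

```python
import re

# Hypothetical reference data: (unique_id, compiled_regex) pairs,
# standing in for the broadcast regexPreCalcArray.
regex_pre_calc = [
    (1, re.compile(r'^foo')),
    (2, re.compile(r'bar$')),
    (3, re.compile(r'foo.*bar')),
]

def min_matching_id(value):
    """Return the smallest unique_id whose regex matches, else None."""
    if value is None:
        return None
    ids = [uid for uid, rx in regex_pre_calc if rx.search(value)]
    return min(ids) if ids else None

print(min_matching_id("foo and bar"))  # 1 (all three patterns match; min id wins)
```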

regex – MS Word terminate wildcard in advanced search

Is there a way to terminate the wildcard in the search when I want to search, for example, for two words?

Let’s say I have the following words

awesome mouse
awful albatros
awesome albatros
awful mouse

Now I’m trying to find specifically the combination of

aw* alba*

so my expected matches would be awful albatros and awesome albatros as two separate results

(Obviously, in this example ? would be more suitable as the wildcard, but I’m using * for a search in a foreign language where the endings of words can have different lengths depending on the noun. So please stick with the * wildcard.)

I tried all sorts of things, including beginning-of-word searches and nested expressions, but I keep getting false matches, because the * wildcard keeps treating everything after it as a match and does not terminate the word.

This is the closest I could get to making sense of it –

<(aw)*{1,} (alba)*

The issue is that the first * accepts everything after it and eats it up, when I in fact want it to terminate at the first space and then begin searching for the second word (alba*).


How should I go about doing this?
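Word's wildcard syntax has a negated character list ([! ] for "any character except a space") and @ for "one or more", so something like `aw[! ]@ alba*` may behave as intended, though I can't verify it here. For comparison, the same intent in a conventional regex engine, sketched in Python, simply replaces each unbounded wildcard with "zero or more non-space characters":

```python
import re

words = ["awesome mouse", "awful albatros", "awesome albatros", "awful mouse"]

# "aw* alba*" with each wildcard stopped at a space: [^ ]* instead of *
pattern = re.compile(r'\baw[^ ]* alba[^ ]*')

matches = [w for w in words if pattern.search(w)]
print(matches)  # ['awful albatros', 'awesome albatros']
```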

bash – Can the regex matching pattern for awk be placed above the opening brace of the action line, or must it be on the same line?

I’m studying awk pretty fiercely to write a git diffn implementation which will show line numbers for git diff, and I want confirmation on whether or not this Wikipedia page on awk is wrong:

(pattern)
{
   print 3+2
   print foobar(3)
   print foobar(variable)
   print sin(3-2)
}

Output may be sent to a file:

(pattern)
{
   print "expression" > "file name"
}

or through a pipe:

(pattern)
{
   print "expression" | "command"
}

Notice (pattern) is above the opening brace. I’m pretty sure this is wrong but need to know for certain before editing the page. What I think that page should look like is this:

/regex_pattern/ {
    print 3+2
    print foobar(3)
    print foobar(variable)
    print sin(3-2)
}

Output may be sent to a file:

/regex_pattern/ {
    print "expression" > "file name"
}

or through a pipe:

/regex_pattern/ {
    print "expression" | "command"
}

Here’s a test to “prove” it. I’m on Linux Ubuntu 18.04.

test_awk.sh

gawk \
'
BEGIN
{
    print "START OF AWK PROGRAM"
}
'

Test and error output:

$ echo -e "hey1\nhello\nhey2" | ./awk_super_simple.sh
gawk: cmd. line:3: BEGIN blocks must have an action part

But with this:

test_awk.sh

gawk \
'
BEGIN {
    print "START OF AWK PROGRAM"
}
'

Test and output (it works fine!):

$ echo -e "hey1\nhello\nhey2" | ./awk_super_simple.sh
START OF AWK PROGRAM

Another example (fails to provide expected output):

test_awk.sh

gawk \
'
/hey/ 
{
    print $0
}
'

Output:

$ echo -e "hey1\nhello\nhey2" | ./awk_super_simple.sh
hey1
hey1
hello
hey2
hey2

(passes):

gawk \
'
/hey/ {
    print $0
}
'

Output:

$ echo -e "hey1\nhello\nhey2" | ./awk_super_simple.sh
hey1
hey2

regex – Set string from another file as variable in Powershell

I have a text file containing the following lines:

group1,name1
group2,name2
group3,name3

How do I iterate over those strings in a PowerShell script to print matching lines? If I were assigning the variables manually, I would do something like this:

get-content data.csv -ReadCount 1000 | foreach { $_ -match "group1" } | Out-File name1.txt -encoding Utf8
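The shape of the loop, sketched here in Python for illustration (an in-memory string stands in for data.csv, and the log lines are invented):

```python
import csv
import io

# In the real script this would be open("data.csv"); a string stands in here.
csv_text = "group1,name1\ngroup2,name2\ngroup3,name3\n"
pairs = list(csv.reader(io.StringIO(csv_text)))  # [['group1', 'name1'], ...]

log_lines = ["group1 event A", "group2 event B", "group1 event C"]

def filter_lines(lines, pairs):
    """For each (group, name) pair, collect the lines mentioning that group."""
    return {name: [ln for ln in lines if group in ln] for group, name in pairs}

result = filter_lines(log_lines, pairs)
print(result["name1"])  # ['group1 event A', 'group1 event C']
```

The same structure in PowerShell would be a foreach over the imported CSV rows, writing each filtered set to a file named after the row's second column.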

How do I debug a slow MongoDB regex query?

I have two simple queries on a collection of 22 million documents.

query 1:

db.audits.find({"w.em": /^name.lastname/i})

returns in less than 1 second.

query 2:

db.audits.find({"w.d": /^name.lastname/i})

runs for more than 30 seconds (and correctly finds no results).

The only difference between the two queries is the field I am searching on. Both fields are indexed, and the explain output for both queries is identical!

How can the queries perform so differently?

I am on MongoDB 3.4.23.

regex – Why are tagged expressions not found in SQL Server Management Studio?

I understand from my reading that curly brackets denote a tagged expression in a “find” in SQL Server Management Studio (SSMS) with “use regular expressions” toggled on, and that a backslash followed by the tag number (e.g. \1) is a placeholder for that matched text in the replace. But even before I get to the replace, I cannot get the find to work. For instance, with this text …

this that

… and this find …

{[a-z]*}

… I get “the following specified text was not found”. If I remove the curly brackets, the find gets a hit on each of those two words as expected. What am I doing wrong? This is SSMS v18.5.

Regex in a log file using batches and queues in Java

This is for a MUD client that has elements of a telnet bot.

The idea is to analyze a log file line by line, starting with the newest. As soon as a trigger fires, the parser and the actions stop, for the sake of simplicity.

I thought about a dedicated record type for the triggers, but for now the triggers live in a map, as follows.

I’m not worried much about resource efficiency, since these are ultimately just text files, but more about extensibility and flexibility.

package net.bounceme.dur.files;

import java.util.Iterator;
import java.util.List;
import java.util.ListIterator;
import java.util.Map;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BotActions {

    private final static Logger log = Logger.getLogger(BotActions.class.getName());

    private Map<String, String> triggers = null;
    private boolean triggered = false;

    private BotActions() {
    }

    public BotActions(Map<String, String> triggers) {
        this.triggers = triggers;
    }

    private void pullTrigger(String line, Map.Entry<String, String> entry) {
        log.info(line);
        log.info(entry.toString());
    }

    private void triggers(String line) {
        Pattern pattern = null;
        Matcher matcher = null;

        Iterator<Map.Entry<String, String>> triggerEntries = triggers.entrySet().iterator();

        while (triggerEntries.hasNext() && !triggered) {
            Map.Entry<String, String> entry = triggerEntries.next();
            pattern = Pattern.compile(entry.getKey());
            matcher = pattern.matcher(line);
            if (matcher.matches()) {
                pullTrigger(line, entry);
                triggered = true;
            }
        }
    }

    public void everyLine(List<String> list) {
        ListIterator<String> listIterator = list.listIterator(list.size());
        while (listIterator.hasPrevious() && !triggered) {
            triggers(listIterator.previous().toString());
        }
    }

}

I keep thinking of the lines of text as a stack and the triggers as a queue, although the trigger sequence is more flexible than that. But given how the log file is initially parsed, those don’t really seem to be options.

Anyway, I’m interested in any comments.

Regex – Powershell removes double quotes if the line begins with double quotes

I need to remove double quotes in a text file only when lines start with "https".
The file content looks like this:

...
    "bla, bla, bla"
    "https://example.com"
    "bar, bar, bar"
...

I have to match "https://example.com", remove both double quotes, and leave the double quotes on the other lines so the rest of the content stays in place.

I've tried many methods, but I'm stuck: I don't know how to handle the double quotes in a regular expression, or how to declare a filter in an "if" or "where" statement and then replace the existing text.

Can someone help me please?
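The core of it is a substitution that only fires when the quoted text begins with https, so every other quoted line passes through untouched. A sketch in Python (in PowerShell the same pattern can be used with the -replace operator; the sample lines come from the question):

```python
import re

lines = [
    '    "bla, bla, bla"',
    '    "https://example.com"',
    '    "bar, bar, bar"',
]

def strip_https_quotes(line):
    """Remove the surrounding quotes only when the quoted text starts with https."""
    return re.sub(r'"(https[^"]*)"', r'\1', line)

cleaned = [strip_https_quotes(ln) for ln in lines]
print(cleaned)
```

Lines whose quoted content does not start with https never match the pattern, so their quotes survive unchanged.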