Doing a re_sub in C using pcre

I’m trying to implement a re_sub function in C, and it has been quite tricky. I want to take the Python call re.sub(r";.+", " ", s), which replaces every match of a pattern in a string, and do the same thing in C. Examples of grabbing a single regex match were easy to find (and to write), but finding multiple matches and then substituting each one has proved much harder (for me at least).

Here is what I have so far:

#include <pcre.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <stdarg.h>

#define OFFSETS_SIZE 90
#define BUFFER_SIZE (1024*10)
char buffer[BUFFER_SIZE];

void exit_with_error(const char* msg, ...)
{
    va_list args;
    va_start(args, msg);
    vfprintf(stderr, msg, args);
    va_end(args);
    exit(EXIT_FAILURE);
}

void re_sub(char buffer[], const char* pattern, const char* replacement)
{
    const char *error;
    int error_offset;
    int offsets[OFFSETS_SIZE];
    size_t replacement_len = strlen(replacement);

    // 1. Compile the pattern
    pcre *re_compiled = pcre_compile(pattern, 0, &error, &error_offset, NULL);
    if (re_compiled == NULL)
        exit_with_error("Pattern compile error at offset %d: %s\n", error_offset, error);

    // 2. Execute the regex pattern against the string (repeatedly...)
    int rc;
    size_t buffer_size = strlen(buffer);
    char tmp[buffer_size * 2 + 1];  // VLA scratch buffer; assumes the result fits in twice the input length
    int tmp_offset=0, buffer_offset=0;

    while ((rc = pcre_exec(re_compiled, NULL, buffer, strlen(buffer), buffer_offset, 0, offsets, OFFSETS_SIZE)) > 0) {
        int start=offsets[0];
        int end=offsets[1];
        int len =  start - buffer_offset;

        // 1. copy up until the first match if the first match starts after
        //    where our buffer is
        printf("Start: %d | End: %d | Len: %d | BufferOffset: %d\n", start, end, len, buffer_offset);
        if (start > buffer_offset) {
            strncpy(&tmp[tmp_offset], buffer+buffer_offset, len);
            tmp_offset += len;
        }

        // 2. copy over the replacement text instead of the matched content
        strncpy(&tmp[tmp_offset], replacement, replacement_len);
        tmp_offset += replacement_len;
        buffer_offset = end;
    }

    // exit early if there was an error
    if (rc < 0 && rc != PCRE_ERROR_NOMATCH) {
        pcre_free(re_compiled);
        exit_with_error("Matching error code: %d\n", rc);
    }
    // now copy over whatever follows the last match; strcpy also writes the
    // terminating NUL, which the strncpy calls above do not
    strcpy(&tmp[tmp_offset], buffer+buffer_offset);

    pcre_free(re_compiled);
    strcpy(buffer, tmp);
}


int main(int argc, char* argv[])
{
    strcpy(buffer, "Hello ;some comment");
    re_sub(buffer, ";.+", " ");
    printf("Buffer: %s", buffer);
    return 0;
}

And a working example: https://onlinegdb.com/SJtW_8Tr_ (note that you have to click the settings gear in the upper-right corner and add -lpcre to the Extra Compiler Flags, since the flags don’t persist on the share link).

A few comments about the code:

  • I find it tedious and hard to keep track of all the offsets, for example &tmp[tmp_offset]. Is there a better way to do this? I tried adding a pointer, char *tmp_ptr = tmp, but for whatever reason I couldn’t get that approach to work properly.
  • I think I need to have the user pass in a size_t buffer_len; otherwise it’s very easy to get a buffer overflow, since the only thing I know about the buffer is its strlen. In this example the buffer is 10K, so it’s not going to overflow, but in other cases it easily could.
  • Is there something like a ‘package manager’ (like pip or npm) for C? For example, if I move this code onto another server, how do I build it with pcre, which may not even be installed there? Is this what cmake is for, or what’s a common way to manage non-stdlib packages?