Here is a perfect example, but far from the only case. I want to remove all “emojis” from a string. Because they are not confined to a neat single range, or even many well-defined ranges, this is impossible to do with a simple regular expression.
Many of the purported solutions online simply do not work; even if they worked when they were written, new ranges and individual emoji code points are added to the Unicode standard all the time.
For this reason, I (apparently) have to figure out how to automate all this:
- Load http://www.unicode.org/Public/emoji/
- Parse that HTML output to get the latest version, currently “13.1”. It’s not obvious how to do that, and they could change how that directory listing is presented any day, breaking my mechanism.
- Load the right file, currently: https://www.unicode.org/Public/emoji/13.1/emoji-sequences.txt
- Somehow parse out all code points and ranges, which vary in format such as “2B1B..2B1C”, “27B0”, “1F1FE 1F1EA”, etc.
- Somehow construct a new, massive regular expression containing all of these ranges and code points.
- Incorporate this regular expression into my actual code as a function.
- Now I am able to remove emojis by calling my function.
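For reference, the steps above can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: the function names (`latest_version`, `fetch_emoji_sequences`, `parse_emoji_data`, `build_emoji_regex`, `remove_emojis`) are made up for illustration, the directory listing is assumed to contain links of the form `href="13.1/"` (exactly the kind of detail that could change), and the data file is assumed to keep the `code points ; type ; description # comment` layout seen in the 13.1 file.

```python
import re
from urllib.request import urlopen

BASE_URL = "https://www.unicode.org/Public/emoji/"


def latest_version(listing_html):
    """Pick the highest version from the directory listing HTML.

    Assumption: versions appear as links such as href="13.1/".
    Raises ValueError if no such link is found.
    """
    versions = re.findall(r'href="(\d+(?:\.\d+)*)/"', listing_html)
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


def fetch_emoji_sequences(version):
    """Download emoji-sequences.txt for one version (needs network access)."""
    with urlopen(f"{BASE_URL}{version}/emoji-sequences.txt") as response:
        return response.read().decode("utf-8")


def parse_emoji_data(text):
    """Return (ranges, sequences) from emoji-sequences.txt-style text."""
    ranges = []     # (start, end) code point pairs, inclusive
    sequences = []  # strings made of several code points
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        field = line.split(";", 1)[0].strip()  # first field: code points
        if ".." in field:                      # e.g. "2B1B..2B1C"
            start, end = field.split("..")
            ranges.append((int(start, 16), int(end, 16)))
            continue
        points = [int(cp, 16) for cp in field.split()]
        if len(points) == 1:                   # e.g. "27B0"
            ranges.append((points[0], points[0]))
        else:                                  # e.g. "1F1FE 1F1EA"
            sequences.append("".join(chr(cp) for cp in points))
    return ranges, sequences


def build_emoji_regex(ranges, sequences):
    """Compile one pattern: multi-code-point sequences first, then a class."""
    # Longest sequences first, so a long sequence wins over its own prefix.
    parts = [re.escape(s) for s in sorted(sequences, key=len, reverse=True)]
    char_class = "".join(
        f"\\U{start:08X}" if start == end else f"\\U{start:08X}-\\U{end:08X}"
        for start, end in ranges
    )
    if char_class:
        parts.append(f"[{char_class}]")
    return re.compile("|".join(parts))


def remove_emojis(text, pattern):
    """Delete every match of the compiled emoji pattern from text."""
    return pattern.sub("", text)
```

Used on the three line formats mentioned above, without touching the network:

```python
sample = """\
2B1B..2B1C ; Basic_Emoji ; large square # comment
27B0 ; Basic_Emoji ; curly loop
1F1FE 1F1EA ; RGI_Emoji_Flag_Sequence ; flag: Yemen
"""
pattern = build_emoji_regex(*parse_emoji_data(sample))
print(remove_emojis("a\u2B1Bb\u27B0c\U0001F1FE\U0001F1EAd", pattern))  # abcd
```

The result only covers whatever is in the file it is given, and the version lookup remains as fragile as described: it breaks as soon as the listing stops using links of the assumed form.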
Note that they don’t have a convenient, easily parseable file with a permanent URL such as:
If they at least had that, this would be much simpler, nicer and more logical. In fact, one might reasonably expect them to provide a ready-made regular expression themselves for common regexp engines.
In this, and many other situations, I get the feeling that the world is actively working against me. That I’m somehow “doing something I’m not supposed to do”. That my mentality is fundamentally incompatible with others’.
Can you think of some reasonable explanation as to why they make it so hard and seemingly go out of their way to make it cumbersome to automate? Why am I not allowed to easily block/remove emojis from a text if I don’t want to see them? Isn’t that a very reasonable and common desire? Emojis work in certain contexts, but in others, they are a pure annoyance and make it visually unpleasant to read a text.
Please note that, again, all existing “emoji removal” solutions are broken. They don’t block all of them. Or they block way too many, including legitimate Unicode, non-emoji characters. I’m primarily asking about why they make it so hard, rather than asking for a solution. I already know that I will have to make my own mechanism, and I’m going to do it, and I can do it, but it will take me a lot of time and energy for seemingly no good reason.
To give another example, I often want to figure out the latest version of some open source software. Frequently, the project has no simple plaintext/JSON URL for this; I have to parse an HTML webpage made for humans. Even if we disregard the initial work that has to be invested by me and every other individual who needs to automate these kinds of things, a major secondary issue is that such scrapers can break at any time in the future, so we need to keep updating our code as the projects change their “not really made for computers” website layouts.
I fundamentally don’t understand this, and I hope that there is some reasonable explanation.