On Windows, PowerShell misinterprets non-ASCII characters in mosquitto_sub output

Problem:

While the mosquitto_sub man page makes no mention of character encoding as of this writing, it seems that on Windows mosquitto_sub exhibits nonstandard behavior in that it uses the system’s active ANSI code page to encode its string output rather than the OEM code page that console applications are expected to use.(1)

There also appears to be no option that would allow you to specify what encoding to use.

PowerShell decodes output from external applications into .NET strings, based on the encoding stored in [Console]::OutputEncoding, which defaults to the OEM code page. Therefore, when it sees the ANSI byte representation of character é, 0xe9, in the output, it interprets that byte as OEM-encoded, where it represents character Θ (this assumes that the active ANSI code page is Windows-1252 and the active OEM code page is IBM437, as is the case on US-English systems, for instance).

You can verify this as follows:

# 0xe9 is "é" in the (Windows-1252) ANSI code page, and coincides with *Unicode* code point
# U+00E9; in the (IBM437) OEM code page, 0xe9 represents "Θ".
PS> $oemEnc = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage OEMCP));
    $oemEnc.GetString([byte[]] 0xe9)

Θ

Note that the decoding to .NET strings (System.String) that invariably happens means that the characters are stored as UTF-16 code units in memory, essentially as [uint16] values underlying the System.Char instances that make up a .NET string. Such a code unit encodes either a whole Unicode character or, for characters outside the so-called BMP (Basic Multilingual Plane), one half of a Unicode character, as part of a so-called surrogate pair.

In the case at hand, this means that the mis-decoded byte ends up in the string as Unicode code point U+0398 (Θ, Greek capital letter theta) rather than as the intended U+00E9 (é).
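
The mix-up can also be reproduced without mosquitto_sub. The following is a minimal sketch that hard-codes code pages 1252 and 437, matching the US-English assumption above (other systems may use different code pages); the variable names are illustrative only. It also suggests that a string that has already been mis-decoded can be repaired after the fact, by re-encoding it with the OEM encoding and decoding the resulting bytes as ANSI:

```powershell
# Assumption: ANSI = Windows-1252, OEM = IBM437 (hard-coded for illustration).
$ansi = [System.Text.Encoding]::GetEncoding(1252)
$oem  = [System.Text.Encoding]::GetEncoding(437)

# "é" (U+00E9), built from its code point to avoid source-file encoding issues.
$original = [string] [char] 0xE9

# What a program emitting ANSI-encoded output writes: a single byte.
$bytes = $ansi.GetBytes($original)
'0x{0:X2}' -f $bytes[0]               # 0xE9

# What PowerShell makes of that byte when it decodes it as OEM.
$misdecoded = $oem.GetString($bytes)
'U+{0:X4}' -f [int] $misdecoded[0]    # U+0398 (Θ)

# Repair: back to the original bytes via OEM, then decode them as ANSI.
$repaired = $ansi.GetString($oem.GetBytes($misdecoded))
$repaired -ceq $original              # True
```

Such an after-the-fact repair relies on every byte surviving the round trip, so telling PowerShell the right encoding up front, as shown below, is the more dependable approach.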


Solution:

PowerShell must be told which character encoding to use in this case, which can be done as follows:

PS> $msg = & { 
      # Save the original console output encoding...
      $prevEnc = [Console]::OutputEncoding
      # ... and (temporarily) set it to the active ANSI code page.
      # Note: In *Windows PowerShell* - only - [System.Text.Encoding]::Default works as the RHS too.
      [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))

      # Now PowerShell will decode mosquitto_sub's output correctly.
      mosquitto_sub -h test.mosquitto.org -t tofol/test

      # Restore the original encoding.
      [Console]::OutputEncoding = $prevEnc
    }; $msg

{ "label": "eé" }  # OK

Note: The Get-ItemPropertyValue cmdlet requires PowerShell version 5 or higher; in earlier versions, either use [Console]::OutputEncoding = [System.Text.Encoding]::Default or, if the code must also run in PowerShell (Core), [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP).ACP)
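
Because mosquitto_sub normally keeps running until it is interrupted, the restore step in the script block above may never be reached, which would leave the session with the modified [Console]::OutputEncoding. A hedged sketch of a more robust variant wraps the call in try / finally. Invoke-WithOutputEncoding is a hypothetical name chosen for illustration (it is not the Invoke-WithEncoding function discussed below), and the code page is passed in explicitly rather than read from the registry:

```powershell
# Hypothetical helper (illustration only): runs a script block with a
# temporary console output encoding and always restores the previous one.
function Invoke-WithOutputEncoding {
  param(
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
    [Parameter(Mandatory)] [int] $CodePage
  )
  # Save the original console output encoding.
  $prevEnc = [Console]::OutputEncoding
  try {
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding($CodePage)
    & $ScriptBlock
  }
  finally {
    # Runs even if the script block fails or is interrupted with Ctrl-C.
    [Console]::OutputEncoding = $prevEnc
  }
}

# Example call, assuming the active ANSI code page is 1252
# (-C 1 makes mosquitto_sub exit after receiving one message):
# $msg = Invoke-WithOutputEncoding -CodePage 1252 -ScriptBlock {
#   mosquitto_sub -h test.mosquitto.org -t tofol/test -C 1
# }
```

Passing the code page as a parameter keeps the sketch independent of the registry lookup; on a real system you would likely pass the ACP value obtained as shown above.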


Helper function Invoke-WithEncoding can encapsulate this process for you. You can install it directly from a Gist as follows (I can assure you that doing so is safe, but you should always check):

# Download and define advanced function Invoke-WithEncoding in the current session.
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex

The workaround then simplifies to:

PS> Invoke-WithEncoding { mosquitto_sub -h test.mosquitto.org -t tofol/test } -Encoding Ansi -WindowsOnly

{ "label": "eé" }  # OK

A similar function focused on diagnostic output is Debug-NativeInOutput, discussed in this answer.


As an aside:

While PowerShell isn’t the problem here, it too can exhibit problematic character-encoding behavior.

GitHub issue #7233 proposes making PowerShell (Core) console windows default to UTF-8 to minimize encoding problems with most modern command-line programs (it wouldn’t help with mosquitto_sub, however), and this comment fleshes out the proposal.


(1) Note that Python too exhibits this nonstandard behavior, but it offers UTF-8 encoding as an opt-in, either by setting environment variable PYTHONUTF8 to 1, or via the v3.7+ CLI option -X utf8 (must be specified case-exactly!).