How do I read an email log whose bodies use different combinations of transfer encodings and underlying formats?


I have a log of spam emails collected from various sources, stored as JSON. The goal is to convert it all to plain text so it can serve as a training corpus for a machine learning exercise. The subject and other header fields are already plain text. The bodies, however, are encoded.

As an example:

"email_raw_body": "--000_6fbc20cd885642f3aec9ff4c5f2facafUW01ME1360001bronzeusce\nContent-Type: text/plain; charset=\"utf-8\"\nContent-Transfer-Encoding: base64\n\nDQpfX19fX19fX19fX19fX19fX…",

Figuring these out by hand will be time-consuming. There appear to be multiple encoding schemes, possibly tied to different email clients and services. Is there a good package, library, or API that will reliably decode these to plain text, XML, or HTML? I work primarily in Python, but will use whatever is necessary.
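
To make the structure concrete: the sample above looks like a single MIME part, that is, a boundary line, some headers, a blank line, and then a base64 payload. Below is a minimal sketch of decoding one such part with Python's standard-library email package. It assumes the body holds real newlines once the JSON is parsed and that each record contains exactly one non-multipart text part. The helper name decode_part and the sample strings are made up for illustration and are not taken from my log.

```python
import base64
import email
from email import policy


def decode_part(raw_part: str) -> str:
    """Decode one MIME part (headers, blank line, body) to text.

    Hypothetical helper: assumes a single text part, not a full
    multipart message.
    """
    lines = raw_part.splitlines()
    # Drop a leading MIME boundary line such as "--000_6fbc..." if present.
    if lines and lines[0].startswith("--"):
        lines = lines[1:]
    msg = email.message_from_string("\n".join(lines), policy=policy.default)
    # For text parts, get_content() undoes the Content-Transfer-Encoding
    # (base64, quoted-printable, ...) and decodes using the declared charset.
    return msg.get_content()


# Made-up sample shaped like the log entry above (base64 body).
b64_body = base64.b64encode("Hello, world!\r\n".encode("utf-8")).decode("ascii")
sample_b64 = (
    "--000_exampleboundary\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n" + b64_body
)

# Made-up sample using quoted-printable instead of base64.
sample_qp = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=C3=A9"
)

print(decode_part(sample_b64).strip())
print(decode_part(sample_qp).strip())
```

The same call handles both transfer encodings because the package reads the Content-Transfer-Encoding and charset headers itself. Bodies that are full multipart messages or HTML would presumably need more than this.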

Also, I'm new to this. Are there other considerations I should be aware of?