I have a log of spam emails, collected from various sources, in JSON. The goal is to convert it all to plaintext to use as a training corpus for a machine learning exercise. The mail subject and similar fields are already plaintext. The bodies, however, are encoded…
As an example,
"email_raw_body": "--000_6fbc20cd885642f3aec9ff4c5f2facafUW01ME1360001bronzeusce\nContent-Type: text/plain; charset=\"utf-8\"\nContent-Transfer-Encoding: base64\n\nDQpfX19fX19fX19fX19fX19fX…",
Figuring these out by hand will be time-consuming. There appear to be multiple encoding schemes, possibly related to different email clients and services? Is there a good package/library/API out there that I can use to reliably decode these to plaintext/XML/HTML? I’m primarily in Python, but will use whatever is necessary.
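For what it's worth, the sample looks like a MIME part (a boundary line, part headers, then a base64 payload), and Python's standard-library `email` package seems to be built for parsing that. Here is a minimal sketch of what I have in mind, not a tested solution. It assumes the field has already been loaded with `json.loads`, so the `\n` escapes are real newlines, and that the body starts at a boundary line with the message's own headers missing. `decode_raw_body` is a hypothetical helper name, and the synthetic `multipart/mixed` header it prepends is my own assumption about how to make the fragment parseable.

```python
import email
from email import policy


def decode_raw_body(raw_body: str) -> list[tuple[str, str]]:
    """Return (content_type, decoded_text) pairs for the text parts of a body.

    Assumption: raw_body is a MIME body whose top-level headers were dropped,
    so it begins with a '--<boundary>' line. Anything else is treated as
    already-plain text.
    """
    stripped = raw_body.lstrip()
    first_line = stripped.splitlines()[0].strip() if stripped else ""

    if not first_line.startswith("--"):
        # No boundary line: assume the body is unencoded plain text.
        return [("text/plain", raw_body)]

    # Rebuild the missing top-level header so the parser can find the parts.
    boundary = first_line[2:]
    header = f'Content-Type: multipart/mixed; boundary="{boundary}"\n\n'
    msg = email.message_from_string(header + raw_body, policy=policy.default)

    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes base64 / quoted-printable and applies
            # the charset declared in the part's Content-Type header.
            parts.append((part.get_content_type(), part.get_content()))
    return parts
```

Calling `decode_raw_body(record["email_raw_body"])` on each JSON record (where `record` is a hypothetical name for one parsed log entry) would give a list of decoded text parts. HTML parts would come back as HTML source and would still need tags stripped before going into a plaintext corpus.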
Also, I’m new to this. Are there other considerations I should be aware of?