beginner – Python script to compress and encrypt a list of files/directories (tar + gpg)

I’ve tried, as practice, to write a Python 3 script which takes a list of paths to files/directories and compresses them with tar, before encrypting the resultant tarball with gpg (for backing up to an external location). The script allows the user to choose which compression and encryption algorithms to use, and where to save the resultant file.

I’m fairly new to Python and come from a C-style background, so I’m hoping for feedback on proper use of loops, branches, object management, and the like. If I’ve made any glaring errors in my use of tar or gpg, I would appreciate them being pointed out too.

This is how my script runs:

  1. The read_arguments() function:
    1. Read and parse arguments and argument parameters.
    2. If there’s an error with arguments, state so and exit.
    3. Depending on options, read from the standard input stream and/or format paths to be in absolute form.
    4. Return an instance of Configuration with the configuration options.
  2. The compress(config) function, passed that Configuration object:
    1. Check that the specified output directory exists and try to create it if it doesn’t.
    2. Make 64 attempts at finding a unique file path for the compressed output using a random filename (we’re trying not to overwrite any existing files).
    3. Call tar to compress the paths we were passed, using the specified compression algorithm.
    4. Return a pathlib.Path instance pointing to the tarball.
  3. The encrypt(config, tarball_path) function, passed the Configuration object and the tarball path:
    1. Call gpg to symmetrically encrypt the tarball with the specified encryption algorithm, writing output to the specified location. gpg then invokes gpg-agent to ask for a password.
  4. The cleanup(config, tarball_path) function, passed the Configuration object and the tarball path:
    • Also called if any part of compress() or encrypt() fails.
    1. Check to see if the user specified for the temporary tarball to not be deleted.
    2. Delete the tarball file if it’s alright for us to do so.
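
To make step 2.2 concrete, here is a minimal standalone sketch of the "try up to 64 random names" idea. It is not the code from the script below: the helper name find_unused_path and its parameters are made up for illustration. Note that a check followed by a later write is not atomic, so another process could still create the same file in between.

```python
import pathlib
import random
import string

# Characters used for the random part of the file name.
ALPHABET = string.ascii_letters + string.digits

def find_unused_path(directory, prefix="tarball_", length=16, attempts=64):
    """Return a path inside `directory` that does not exist yet.

    Hypothetical helper: tries `attempts` random names and returns None
    if every candidate already exists.
    """
    for _ in range(attempts):
        name = prefix + "".join(random.choice(ALPHABET) for _ in range(length))
        candidate = pathlib.Path(directory) / name
        if not candidate.exists():
            return candidate
    return None
```

Returning None instead of exiting inside the helper leaves the decision about the exit code to the caller.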

I also want to explain the --force-absolute, --, and -@ options because they’re the ones that might be most confusing.

  • --force-absolute causes paths passed to the script to be resolved to their absolute forms before being passed to tar. Using this option is meant to help prevent naming collisions if a user passes files from multiple directories, and it’s also something I like because then I can see the full path of where files came from if I need to use the backup later on.
  • -- tells the script to interpret all further arguments as paths, so if you have a file/directory path starting with a dash “-”, it isn’t read as an option, and you don’t get a “hey this option doesn’t exist” error.
  • -@ tells the script to read file paths from the standard input stream (separated by newline characters). It’s useful if you’ve got a directory/file with a space in its name and you’re calling this script from another script: you can just pipe the paths in and avoid the hassle of shell quoting.
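
As a rough illustration of how these three options interact, here is a simplified sketch. It is not the parsing code from the script: collect_paths is a hypothetical name, and unlike the real script it treats every other argument (including unknown options) as a path and does no existence checks.

```python
import io
import pathlib

def collect_paths(arguments, stdin):
    """Return the list of paths selected by `arguments` and `stdin`.

    Simplified sketch: only "--", "-@" and "--force-absolute" are
    recognised; anything else is treated as a path.
    """
    paths = []
    read_stdin = False
    force_absolute = False
    only_paths = False  # becomes True once "--" has been seen

    for argument in arguments:
        if only_paths:
            paths.append(argument)
        elif argument == "--":
            only_paths = True
        elif argument == "-@":
            read_stdin = True
        elif argument == "--force-absolute":
            force_absolute = True
        else:
            paths.append(argument)

    # One path per line; spaces inside a line are kept as part of the name.
    if read_stdin:
        paths.extend(line.rstrip("\n") for line in stdin if line.rstrip("\n"))

    result = [pathlib.Path(p) for p in paths]
    if force_absolute:
        result = [p.resolve() for p in result]
    return result

# "-odd-name" is taken as a path because it follows "--"; the piped-in
# name keeps its space because stdin is split on newlines only.
example = collect_paths(["-@", "--", "-odd-name"], io.StringIO("my file.txt\n"))
```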

I made this script for personal use; it’s not a work project that’s going to be distributed to users or anything. Still, I try to write and describe programs as if they were going to be used by other people, since that seems like a good habit for writing good code.

My environment is just Ubuntu 20.10 with Python 3.8.6, GNU bash 5.0.17, GNU tar 1.30, GnuPG 2.2.20 (libgcrypt 1.8.5), and Parallel BZIP2 1.1.13.

Here’s the code of the script (in a file extract.py):

#!/usr/bin/env python3

################################################################################################
##
##  This script takes a list of file/directory paths and compresses them with tar, before
##  encrypting that tarball with gpg. It includes options for:
##   * The output directory (which temporary and output files are written to);
##   * The name to give the final encrypted file;
##   * Whether or not to delete the temporary tarball file;
##   * If paths should be resolved to their absolute forms before being run through tar;
##   * The compression algorithm for tar to use;
##   * The encryption algorithm for gpg to use.
##
################################################################################################

import os
import pathlib
import random
import subprocess
import sys

SCRIPT_NAME = pathlib.Path(sys.argv[0]).name

# Enumeration of return codes for various errors that may occur during this script.
class ReturnCode:
    SUCCESS = 0
    ARGUMENT_ERROR = 1
    PATH_NOT_FOUND = 2
    CANNOT_CREATE_OUTPUT_DIRECTORY = 4
    CANNOT_LOCATE_PATH_FOR_TEMPORARY_FILES = 5
    COMPRESSION_ERROR = 6
    ENCRYPTION_ERROR = 7
    CLEANUP_ERROR = 8

# Describes the configuration for this script as specified by command-line arguments and the
# standard input stream.
class Configuration:
    output_directory = pathlib.Path.cwd()
    output_name = "files"
    delete_temporary_files = True
    enforce_absolute_paths = False
    compression_algorithm = "gzip"
    encryption_algorithm = "AES256"
    paths = list()

# Reads and parses command-line arguments to interpret how the user wants this script to run.
# May also read from the standard input stream, depending on command-line arguments.
#
# Returns an instance of `Configuration`.
def read_arguments():
    # Prints help information for this script. Lines of help information should be at most 80
    # characters in length.
    def print_help():
        #     " -------------- This commented-out string is 80 characters long -------------- "
        print("Python 3 script for packaging and encrypting a set of files using tar/gpg.")
        print("")
        print("Usage:")
        print("  extract.py (option|file|directory)* (-- (file|directory)*)")
        print("")
        print("Options:")
        print("  --help                Print this help information and exit")
        print("  --out-dir PATH        Output directory path (default: './')")
        print("  --out-name NAME       Output file name (default: 'files')")
        print("  --no-deletion         Don't delete temporary files")
        print("  --force-absolute      Resolve relative paths before passing them to tar")
        print("  --compress-with ALGO  The program to use for compression (default: gzip)")
        print("  --encrypt-with ALGO   The algorithm to use for encryption (default: AES256)")
        print("  -@                    Read paths from stdin (separated by newlines)")
        print("  --                    Specifies that all following arguments are paths")

    # If the user didn't specify any arguments, print help information and exit.
    if len(sys.argv) == 1:
        print_help()
        sys.exit(ReturnCode.ARGUMENT_ERROR)

    config = Configuration()
    read_from_stdin = False

    index = 1
    while index < len(sys.argv):
        argument = sys.argv[index]
        index += 1

        # If we ran into an argument that needs a parameter, this function ensures that a parameter
        # is specified and returns it. If a parameter is not specified, the script exits.
        def retrieve_parameter(index):
            if len(sys.argv) <= index:
                print(SCRIPT_NAME + ": the '" + argument + "' option requires an argument.")
                sys.exit(ReturnCode.ARGUMENT_ERROR)
            return sys.argv[index]

        if argument == "-h" or argument == "--help":
            print_help()
            sys.exit(ReturnCode.SUCCESS)

        if argument == "--out-dir":
            config.output_directory = pathlib.Path(retrieve_parameter(index)).resolve()
            index += 1
        elif argument == "--out-name":
            config.output_name = retrieve_parameter(index)
            index += 1
        elif argument == "--no-deletion":
            config.delete_temporary_files = False
        elif argument == "--force-absolute":
            config.enforce_absolute_paths = True
        elif argument == "--compress-with":
            config.compression_algorithm = retrieve_parameter(index)
            index += 1
        elif argument == "--encrypt-with":
            config.encryption_algorithm = retrieve_parameter(index)
            index += 1
        elif argument == "-@":
            read_from_stdin = True
        elif argument == "--":
            for index in range(index, len(sys.argv)):
                config.paths.append(pathlib.Path(sys.argv[index]))
            break
        elif argument.startswith("-"):
            print(SCRIPT_NAME + ": the option '" + argument + "' does not exist.")
            sys.exit(ReturnCode.ARGUMENT_ERROR)
        else:
            config.paths.append(pathlib.Path(argument))

    # If the user specified to read paths from stdin, then we interpret each line of stdin as
    # being a path.
    if read_from_stdin:
        for line in sys.stdin:
            config.paths.append(pathlib.Path(line.strip("\n")))

    existing_paths = list()
    nonexistent_paths = list()

    # Run through each path and check that it exists.
    for path in config.paths:
        if path.exists():
            existing_paths.append(path.resolve() if config.enforce_absolute_paths else path)
        else:
            nonexistent_paths.append(path)

    # If one or more paths doesn't exist (or isn't valid), print them and exit.
    if nonexistent_paths:
        if not existing_paths:
            print(SCRIPT_NAME + ": none of the specified paths seem to exist.")
            sys.exit(ReturnCode.PATH_NOT_FOUND)

        print(SCRIPT_NAME + ": the following paths do not seem to exist:")
        for path in nonexistent_paths:
            print(SCRIPT_NAME + ":   " + str(path))
        sys.exit(ReturnCode.PATH_NOT_FOUND)

    # Also exit if there aren't any valid or existing paths.
    if not existing_paths:
        print(SCRIPT_NAME + ": you need to specify one or more paths.")
        sys.exit(ReturnCode.ARGUMENT_ERROR)

    config.paths = existing_paths
    return config

# Returns a string of length `length` randomly populated with characters from `characters`.
def random_string(length = 16, characters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"):
    # This seems to be the fastest method as per https://stackoverflow.com/a/19926932
    return "".join((characters[random.randrange(len(characters))] for i in range(0, length)))

# Cleans up temporary files from running this script.
def cleanup(config, tarball_path):
    if config.delete_temporary_files:
        try:
            tarball_path.unlink(True)
        except:
            print(SCRIPT_NAME + ": exception occurred when trying to delete temporary files.")
            print(SCRIPT_NAME + ": attempted to delete: " + str(tarball_path))
            print(sys.exc_info()[0])
            sys.exit(ReturnCode.CLEANUP_ERROR)

# Compresses the specified files/directories into a tarball, and returns the path of that
# tarball.
def compress(config):

    # Check that the specified output directory exists and is a directory. If it isn't, then
    # attempt to create that directory.
    if not config.output_directory.is_dir():
        try:
            config.output_directory.mkdir(parents=True, exist_ok=True)
        except:
            print(SCRIPT_NAME + ": exception occurred when trying to create the output directory.")
            print(SCRIPT_NAME + ": attempted to create directory: " + str(config.output_directory))
            print(sys.exc_info()[0])
            sys.exit(ReturnCode.CANNOT_CREATE_OUTPUT_DIRECTORY)

    # We need a path to store our compressed tarball in, so we'll make 64 attempts at finding a
    # random file name not in use by another file.
    destination = None
    attempts = 0
    while attempts < 64:
        destination = config.output_directory / ("tarball_" + random_string())
        if not destination.exists():
            break
        attempts += 1
    # If all those attempts fail, we assume something went wrong and just exit.
    if destination.exists():
        print(SCRIPT_NAME + ": could not find a valid path for temporary files.")
        sys.exit(ReturnCode.CANNOT_LOCATE_PATH_FOR_TEMPORARY_FILES)

    # Run tar on the specified files, using the specified compression algorithm.
    print(SCRIPT_NAME + ": compressing paths...")
    completed_process = subprocess.run([
        "tar",
        "-c",
        "--use-compress-program=" + config.compression_algorithm,
        "-f", str(destination)] + config.paths)
    # If tar returned a non-zero exit code, we print that code to the user and exit.
    if completed_process.returncode != 0:
        print(SCRIPT_NAME + ": attempting to compress paths failed with return code '" + str(completed_process.returncode) + "'.")
        cleanup(config, destination)
        sys.exit(ReturnCode.COMPRESSION_ERROR)

    return destination

# Encrypts the tarball file with gpg using the specified encryption algorithm
def encrypt(config, tarball_path):
    print(SCRIPT_NAME + ": encrypting tarball...")
    completed_process = subprocess.run([
        "gpg",
        "--s2k-mode", "3",
        "--s2k-count", "65011712",
        "--s2k-digest-algo", "SHA512",
        "--s2k-cipher-algo", config.encryption_algorithm,
        "--output", str(config.output_directory / config.output_name),
        "--symmetric", str(tarball_path)])
    # If gpg returned a non-zero exit code, we print that code to the user and exit.
    if completed_process.returncode != 0:
        print(SCRIPT_NAME + ": attempting to encrypt tarball failed with return code '" + str(completed_process.returncode) + "'.")
        cleanup(config, tarball_path)
        sys.exit(ReturnCode.ENCRYPTION_ERROR)

def main():
    config = read_arguments()
    tarball_path = compress(config)
    encrypt(config, tarball_path)
    cleanup(config, tarball_path)

if __name__ == "__main__":
    main()

And here, as supporting code, is the script I use to invoke encrypt.py (not so much looking for feedback on it, but it seemed like a good idea to include just for completeness):

#!/bin/bash

cd "$(dirname "$0")";

prefix="Backup_";
datestr=$(date +%Y-%m-%d);
suffix="";
extension=".tar.bz2.gpg"

for id in {a..z};
do :
    suffix=$id;
    if [ ! -f "$prefix$datestr$suffix$extension" ];
    then
        break;
    fi
    if [ "$id" == "z" ];
    then
        printf "backup.sh: too many existing backups!\n";
        exit 1;
    fi
done

printf "backup.sh: creating archive...\n";

files=$(
    ls |
    grep -v "^Backup_[[:digit:]]\{4\}-[[:digit:]]\{2\}-[[:digit:]]\{2\}[[:lower:]]$extension$" |
    grep -v "^tarball_[[:alnum:]]\{16\}$");

printf "$files" | python3 encrypt.py \
    --out-name "$prefix$datestr$suffix$extension" \
    --force-absolute \
    --compress-with pbzip2 \
    -@;
code=$?

if [ $code != 0 ];
then
    printf "backup.sh: encrypt.py failed with exit code $code\n";
else
    printf "backup.sh: done!\n";
fi

Much appreciated!