bash – Shell script to back up local files to public cloud

This is a script I’ve written for personal use and to educate myself. It is a bit comment-heavy, to help future me remember all the details & design choices made at the time of writing.

The use case is simply to synchronize/copy files to an AWS S3 bucket using the AWS CLI. There’s a separate CDK part of the project, which sets up the AWS infrastructure, but that isn’t really relevant here. The script reads some configuration items from a properties file, checks whether everything is in place on the AWS end, and, if so, walks through a config folder structure that defines which folders to back up and how (include & exclude patterns in their respective files).
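
The script is meant to be callable from any directory; a typical test run from the project root would look something like this (a dry run with debug output; the flags are defined in the getopts section at the end of the script):

./scripts/sync.sh --dryrun --debug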

Going with Bash instead of a basic shell script was a deliberate choice, since this wouldn’t be run on any production server, and extreme portability wasn’t the main point here.

Folder structure of the overall project is:

aws-infra
    (various things here that are out of the scope of the question)
config
    backup
        Documents
            includes.txt
        Pictures
            includes.txt (example at the end)
            excludes.txt (example at the end)
        (more files/folders following the same structure)
    configuration.properties
scripts
    sync.sh

Theoretically I could’ve just run aws s3 sync on the base path, but since it’s a recursive command and there are a lot of unnecessary files underneath (about 500k), it would take a long time to go through each of them separately.
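
Just to illustrate the alternative I rejected: a single recursive sync over the whole home folder would look roughly like the command below (the bucket name is made up here), and it would crawl through all ~500k files on every run:

aws s3 sync "$HOME" "s3://my-backup-bucket$HOME" --exclude "*" --include "*.jpg" --profile=personal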

#!/bin/bash

# Get the directory where this file lives, so that the script can be
# called from other directories without breaking.
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

CONFIG_FOLDER="$DIR/../config"
PROP_FILE='configuration.properties'

# This is an associative array, i.e., keys can be arbitrary strings
# think Java HashMap or JavaScript Object
declare -A properties

# These are Bash arrays, i.e., with auto-numbered keys
# think Java or JavaScript array
declare -a includes excludes params

function loadProperties {
    local file="$CONFIG_FOLDER/$PROP_FILE"

    if [[ ! -f "$file" ]]; then
        echo "$PROP_FILE not found!"
        return 2
    fi

    while IFS='=' read -r origKey value; do
        local key="$origKey"
        # Replace all non-alphanumerical characters (except underscore)
        # with an underscore
        key="${key//(!a-zA-Z0-9_)/_}"

        if (( "$origKey" == "#"* )); then
            local ignoreComments
        elif (( -z "$key" )); then
            local emptyLine
        else
            properties("$key")="$value"
        fi
    done < "$file"

    if (( "${properties(debug)}" = true )); then
        declare -p properties
    fi
}

function getBucketName {
    # Declare inside a function automatically makes the variable a local
    # variable.
    declare -a params
    params+=(--name "${properties[bucket_parameter_name]}")
    params+=(--profile="${properties[aws_profile]}")

    # Get the bucket name from SSM Parameter Store, where it's stored.
    # Logic is:
    # 1) run the AWS CLI command
    # 2) grab 5th line from the output with sed
    # 3) grab the 2nd word of the line with awk
    # 4) substitute first all double quotes with empty string,
    #    and then all commas with empty string, using sed
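    #
    # For reference, the default JSON output this is assumed to parse
    # looks roughly like this (abridged, bucket name made up):
    #
    # {
    #     "Parameter": {
    #         "Name": "...",
    #         "Type": "String",
    #         "Value": "my-backup-bucket",
    #         ...
    #
    # i.e., the 5th line is the "Value" line, and its 2nd word is the
    # quoted bucket name with a trailing comma.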
    local bucketName=$(aws ssm get-parameter "${params[@]}" |
                       sed -n '5p' | 
                       awk '{ print $2 }' | 
                       sed -e 's/"//g' -e 's/,//g')

    properties[s3_bucket]="$bucketName"
}

function checkBucket {
    declare -a params
    params+=(--bucket "${properties[s3_bucket]}")
    params+=(--profile="${properties[aws_profile]}")

    # Direct stderr to stdout by using 2>&1
    local bucketStatus=$(aws s3api head-bucket "${params[@]}" 2>&1)
    
    # The 'aws s3api head-bucket' call returns an empty response if
    # everything's ok, or an error message if something went wrong.
    if [[ -z "$bucketStatus" ]]; then
        echo "Bucket ${properties[s3_bucket]} owned and exists"
        return 0
    elif echo "${bucketStatus}" | grep 'Invalid bucket name'; then
        return 1
    elif echo "${bucketStatus}" | grep 'Not Found'; then
        return 1
    elif echo "${bucketStatus}" | grep 'Forbidden'; then
        echo "Bucket exists but not owned"
        return 1
    elif echo "${bucketStatus}" | grep 'Bad Request'; then
        echo "Bucket name specified is less than 3 or greater than 63 characters"
        return 1
    else
        return 1
    fi
}

function create_params {
    local local_folder="$HOME/$1"
    local bucket_folder="s3://${properties[s3_bucket]}$local_folder"

    params+=("$local_folder" "$bucket_folder")

    if (( ${#excludes[@]} > 0 )); then
        params+=("${excludes[@]}")
    fi

    if (( ${#includes[@]} > 0 )); then
        params+=("${includes[@]}")
    fi

    params+=("--profile=${properties(aws_profile)}")

    if (( "${properties(dryrun)}" = true )); then
        params+=(--dryrun)
    fi

    if (( "${properties(debug)}" = true )); then
        declare -p params
    fi
}

# Sync is automatically recursive, and it can't be turned off. Sync
# checks whether any files have changed since latest upload, and knows
# to avoid uploading files, which are unchanged.
function sync {
    aws s3 sync "${params[@]}"
}

# Copy can be run for individual files, and recursion can be avoided,
# when necessary. Copy doesn't check whether the file in source has
# changed since the last upload to target, but will always upload
# the files. Thus, use only when necessary to avoid sync.
function copy {
    local basePath="${params[0]}*"

    # Loop through files in given path.
    for file in $basePath; do
        # Check that file is not a folder or a symbolic link.
        if [[ ! -d "$file" && ! -L "$file" ]]; then
            # Remove first parameter, i.e., local folder, since with
            # copy, we need to specify individual files instead of the
            # base folder.
            unset params[0]
            aws s3 cp "$file" "${params[@]}"
        fi
    done
}

function process_patterns {
    # If the second parameter is not defined, it's pointless to even
    # read anything, since there's no guidance on what to do with the data.
    if [[ -z "$2" ]]; then
        return 1;
    fi

    # If the file defined in the first parameter exists, then loop
    # through its content line by line, and process it.
    if (( -f "$1" )); then
        while read line; do
            if (( $2 == "include" )); then
                includes+=(--include "$line")
            elif (( $2 == "exclude" )); then
                excludes+=(--exclude "$line")
            fi
        done < $1
    fi
}

# Reset the variables used in global scope.
# To be called after each cycle of the main loop.
function reset {
    unset includes excludes params
}

# The "main loop" that goes through folders that need to be
# backed up.
function handleFolder {
    process_patterns "${1}/${properties(exclude_file_name)}" exclude
    process_patterns "${1}/${properties(include_file_name)}" include
    
    # Remove the beginning of the path until the last forward slash.
    create_params "${1##*/}"
    
    if (( "$2" == "sync" )); then
        sync
    elif (( "$2" == "copy" )); then
        copy
    else
        echo "Don't know what to do."
    fi

    reset
}

function usage {
    cat << EOF
Usage: ${0##*/} [-dDh]

    -d, --debug   enable debug mode
    -D, --dryrun  execute commands in dryrun mode, i.e., don't upload anything
    -h, --help    display this help and exit

EOF
}

while getopts ":dDh-:" option; do
    case "$option" in
        -)
            case "${OPTARG}" in
                debug)
                    properties[debug]=true
                    ;;
                dryrun)
                    properties[dryrun]=true
                    ;;
                help)
                    # Send output to stderr instead of stdout by
                    # using >&2.
                    usage >&2
                    exit 2
                    ;;
                *)
                    echo "Unknown option --$OPTARG" >&2
                    usage >&2
                    exit 2
                    ;;
            esac
            ;;
        d)
            properties[debug]=true
            ;;
        D)
            properties[dryrun]=true
            ;;
        h)
            usage >&2
            exit 2
            ;;
        *)
            echo "Unknown option -$OPTARG" >&2
            usage >&2
            exit 2
            ;;
    esac
done

# set -x shows the actual commands executed by the script. Much better
# than trying to run echo or printf with each command separately.
if (( "${properties(debug)}" = true )); then
    set -x
fi

loadProperties

# $? gives the return value of the previous function call; a non-zero
# value means that an error of some type occurred.
if (( $? != 0 )); then
    exit
fi

getBucketName

if (( $? != 0 )); then
    exit
fi

checkBucket

if (( $? != 0 )); then
    exit
fi

# Add an asterisk at the end for the loop to work, i.e.,
# to loop through all files in the folder.
backup_config_path="$CONFIG_FOLDER/${properties[backup_folder]}*"

# Change shell options (shopt) to include filenames beginning with a
# dot in the file name expansion.
shopt -s dotglob

# Loop through the folders in the backup config path; each one
# corresponds to a subfolder of the home folder.
for folder in $backup_config_path; do
    # Check that file is a folder, and that it's not a symbolic link.
    if (( -d "$folder" && ! -L "$folder" )); then
        handleFolder "$folder" "sync"
    fi
done

# Also include the files in home folder itself, but use copy to avoid
# recursion. The home folder & all subfolders contain over 500k files,
# and it takes forever to go through them all with sync, even with an
# exclusion pattern.
# Remove the last character (asterisk) from the end of the config path.
handleFolder "${backup_config_path::-1}" "copy"

Properties file (SSM parameter name censored):

# AWS profile to be used
aws_profile=personal

# Bucket to sync files to
bucket_parameter_name=(my_ssm_parameter_name)

# Config folder where backup folders & files are found.
backup_folder=backup/

# Names of the files defining the include & exclude patterns for each folder.
include_file_name=includes.txt
exclude_file_name=excludes.txt
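
With the debug flag on, loadProperties dumps the resulting associative array with declare -p; for the file above the output should look roughly like this (key order varies, and getBucketName adds s3_bucket afterwards):

declare -A properties=([debug]="true" [aws_profile]="personal" [bucket_parameter_name]="(my_ssm_parameter_name)" [backup_folder]="backup/" [include_file_name]="includes.txt" [exclude_file_name]="excludes.txt" )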

Example include file:

*.gif
*.jpg

Example exclude file to pair with above:

*
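
So for the Pictures folder, with those two pattern files, create_params ends up building roughly this command (bucket name and home directory are made up here):

aws s3 sync /home/user/Pictures s3://my-backup-bucket/home/user/Pictures --exclude "*" --include "*.gif" --include "*.jpg" --profile=personal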

The script works, but I’m interested in how to improve it. For instance, the error handling feels a bit clumsy.