dgp.utils.cloud package

dgp.utils.cloud.s3 module

Various utility functions for AWS S3

class dgp.utils.cloud.s3.RemoteArtifactFile(artifact)

Bases: object

dgp.utils.cloud.s3.convert_uri_to_bucket_path(uri, strip_trailing_slash=True)

Parse a URI into a bucket and path.

uri: str

A full s3 path (e.g. ‘s3://<s3-bucket>/<s3-prefix>’)

strip_trailing_slash: bool, optional

If True, we strip any trailing slash in s3_path before returning it (i.e. s3_path=’<s3-prefix>’ is returned for uri=’s3://<s3-bucket>/<s3-prefix>/’). Otherwise, we do not strip it (i.e. s3_path=’<s3-prefix>/’ is returned for uri=’s3://<s3-bucket>/<s3-prefix>/’).

If there is no trailing slash in uri then there will be no trailing slash in s3_path regardless of the value of this parameter. Default: True.

str:

The bucket of the S3 object (e.g. ‘<s3-bucket>’)

str:

The path within the bucket of the S3 object (e.g. ‘<s3-prefix>’)
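
A minimal usage sketch (the bucket and key names are hypothetical):

    from dgp.utils.cloud.s3 import convert_uri_to_bucket_path

    # Split a full S3 URI into its bucket and key components.
    bucket, path = convert_uri_to_bucket_path('s3://my-bucket/datasets/scene.json')
    # bucket == 'my-bucket', path == 'datasets/scene.json'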

dgp.utils.cloud.s3.delete_s3_object(bucket_name, object_key, verbose=True)

Delete an object in s3.

bucket_name: str

S3 bucket of the object to delete.

object_key: str

Key name of the object to delete.

verbose: bool, optional

If True print messages. Default: True.
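
For example, deleting a single object (bucket and key are placeholders):

    from dgp.utils.cloud.s3 import delete_s3_object

    # Remove one object; pass verbose=False to suppress log messages.
    delete_s3_object('my-bucket', 'datasets/obsolete.json', verbose=False)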

dgp.utils.cloud.s3.download_s3_object_to_path(bucket_name, s3_path, local_path)

Download an object from s3 to a path on the local filesystem.

bucket_name: str

S3 bucket from which to fetch.

s3_path: str

The path on the s3 bucket to the file to fetch.

local_path: str

Path on the local filesystem to place the object.

bool

True if successful, False if not.
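
A usage sketch that checks the boolean result (names are placeholders):

    from dgp.utils.cloud.s3 import download_s3_object_to_path

    ok = download_s3_object_to_path('my-bucket', 'datasets/scene.json', '/tmp/scene.json')
    if not ok:
        raise RuntimeError('Download failed')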

dgp.utils.cloud.s3.exists_s3_object(bucket_name, url)

Uses a valid S3 bucket to check the existence of a remote file at an AWS URL.

bucket_name: str

AWS S3 root bucket name

url: str

The remote URL of the object to be fetched

bool:

Whether or not the object exists in S3.
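
For example, guarding a fetch with an existence check (this sketch assumes url is the object’s key within the bucket, as with the other helpers here):

    from dgp.utils.cloud.s3 import exists_s3_object

    if exists_s3_object('my-bucket', 'datasets/scene.json'):
        print('Object exists')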

dgp.utils.cloud.s3.get_s3_object(bucket_name, url)

Uses a valid S3 bucket to retrieve a file from a remote AWS URL. Raises ValueError if non-existent.

bucket_name: str

AWS S3 root bucket name

url: str

The remote URL of the object to be fetched

S3 object

ValueError

Raised if url cannot be found in S3.
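
Since a missing object raises ValueError, callers can handle it explicitly (names are placeholders):

    from dgp.utils.cloud.s3 import get_s3_object

    try:
        obj = get_s3_object('my-bucket', 'datasets/scene.json')
    except ValueError:
        obj = None  # Object not found in S3.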

dgp.utils.cloud.s3.get_string_from_s3_file(bucket_name, url)

Uses a valid S3 bucket to retrieve a UTF-8 decoded string from a remote AWS URL. Raises ValueError if non-existent.

bucket_name: str

AWS S3 root bucket name

url: str

The remote URL of the object to be fetched

str:

A string representation of the remote file.
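
A short sketch (bucket and key are hypothetical):

    from dgp.utils.cloud.s3 import get_string_from_s3_file

    # Fetch the object and decode its bytes as UTF-8.
    text = get_string_from_s3_file('my-bucket', 'datasets/README.txt')
    print(text.splitlines()[0])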

dgp.utils.cloud.s3.init_s3_client(use_ssl=False)

Initiate an AWS S3 client.

use_ssl: bool, optional

Use Secure Sockets Layer. Provides better security for S3, but can fail intermittently in a multithreaded environment. Default: False.

service: boto3.client

S3 resource service client.
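
The returned client is a standard boto3 S3 client, so the usual boto3 calls apply (bucket and prefix below are placeholders):

    from dgp.utils.cloud.s3 import init_s3_client

    s3_client = init_s3_client(use_ssl=True)
    # Any boto3 S3 client operation works, e.g. listing objects under a prefix.
    response = s3_client.list_objects_v2(Bucket='my-bucket', Prefix='datasets/')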

dgp.utils.cloud.s3.is_false(value)

Return True if value is False.

dgp.utils.cloud.s3.list_prefixes_in_s3(s3_prefix)

List prefixes in an S3 path. CAVEAT: This function was only tested on S3 prefixes that point to files. E.g., if the bucket looks like the following,

  • s3://aa/bb/cc_v01.json

  • s3://aa/bb/cc_v02.json

  • s3://aa/bb/cc_v03.json

then list_prefixes_in_s3(“s3://aa/bb/cc_v”) returns [‘cc_v01.json’, ‘cc_v02.json’, ‘cc_v03.json’].

s3_prefix: str

An S3 prefix.

prefixes: List of str

List of basename prefixes that start with s3_prefix.

dgp.utils.cloud.s3.list_prefixes_in_s3_dir(s3_path)

List prefixes in an S3 path. CAVEAT: This function was only tested on S3 paths that contain purely one-level prefixes, i.e. no regular objects.

s3_path: str

An S3 path.

prefixes: List of str

List of prefixes under s3_path

dgp.utils.cloud.s3.list_s3_objects(bucket_name, url)

List all files within a valid S3 bucket.

bucket_name: str

AWS S3 root bucket name

url: str

The remote URL of the object to be fetched

dgp.utils.cloud.s3.open_remote_json(s3_path)

Load a remote JSON file.

s3_path: str

Full s3 path to JSON file

dict:

Loaded JSON
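
For example (the path is hypothetical):

    from dgp.utils.cloud.s3 import open_remote_json

    metadata = open_remote_json('s3://my-bucket/datasets/scene_dataset.json')
    print(metadata.keys())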

dgp.utils.cloud.s3.parallel_download_s3_objects(s3_files, destination_filepaths, bucket_name, process_pool_size=None)

Download a list of s3 objects using a process pool.

s3_files: list

List of all files from s3 to download

destination_filepaths: list

List of where to download all files locally

bucket_name: str

S3 bucket to pull from

process_pool_size: int, optional

Number of processes to use to fetch these files. If not specified, defaults to the number of cores on the machine. Default: None.
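
A sketch that downloads two objects in parallel (keys and paths are placeholders); s3_files and destination_filepaths are matched by position:

    from dgp.utils.cloud.s3 import parallel_download_s3_objects

    keys = ['images/0001.png', 'images/0002.png']
    destinations = ['/tmp/0001.png', '/tmp/0002.png']
    parallel_download_s3_objects(keys, destinations, 'my-bucket', process_pool_size=4)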

dgp.utils.cloud.s3.parallel_s3_copy(source_paths, target_paths, threadpool_size=None)

Copy files from local to s3, s3 to local, or s3 to s3 using a threadpool. Retry the operation if any files fail to copy; raise an AssertionError if it fails a second time.

source_paths: List of str

Full paths of files to copy.

target_paths: List of str

Full paths to copy files to.

threadpool_size: int, optional

Number of threads to use to copy these files. If not specified, defaults to the number of cores on the machine. Default: None.
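
Because each source is paired with the target at the same index, mixed copy directions can be expressed in one call (all paths are hypothetical):

    from dgp.utils.cloud.s3 import parallel_s3_copy

    sources = ['/data/a.json', 's3://my-bucket/b.json']
    targets = ['s3://my-bucket/a.json', '/data/b.json']
    parallel_s3_copy(sources, targets, threadpool_size=8)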

dgp.utils.cloud.s3.parallel_s3_sync(source_paths, target_paths, threadpool_size=None)

Copy directories from local to s3, s3 to local, or s3 to s3 using a threadpool. Retry the operation if any files fail to sync; raise an AssertionError if it fails a second time.

source_paths: List of str

Directories from which we want to sync files.

target_paths: List of str

Directories to which all files will be synced.

threadpool_size: int, optional

Number of threads to use to sync these files. If not specified, defaults to the number of cores on the machine. Default: None.

dgp.utils.cloud.s3.parallel_upload_s3_objects(source_filepaths, s3_destinations, bucket_name, process_pool_size=None)

Upload a list of s3 objects using a process pool.

source_filepaths: list

List of all local files to upload.

s3_destinations: list

Paths relative to bucket in S3 where files are uploaded.

bucket_name: str

S3 bucket to upload to.

process_pool_size: int, optional

Number of processes to use to upload these files. If not specified, defaults to the number of cores on the machine. Default: None.
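
A sketch mirroring the download helper (all paths are placeholders):

    from dgp.utils.cloud.s3 import parallel_upload_s3_objects

    local_files = ['/data/0001.png', '/data/0002.png']
    s3_keys = ['images/0001.png', 'images/0002.png']
    parallel_upload_s3_objects(local_files, s3_keys, 'my-bucket', process_pool_size=4)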

dgp.utils.cloud.s3.return_last_value(retry_state)

Return the result of the last call attempt.

retry_state: tenacity.RetryCallState

Retry-state metadata for a flaky call.
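
A sketch of how such a helper is typically wired into tenacity, together with is_false above; the decorated function below is hypothetical:

    from tenacity import retry, retry_if_result, stop_after_attempt

    from dgp.utils.cloud.s3 import is_false, return_last_value

    @retry(
        stop=stop_after_attempt(2),
        retry=retry_if_result(is_false),          # Retry when the call returns False.
        retry_error_callback=return_last_value,   # Give back the last result instead of raising.
    )
    def flaky_download():
        ...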

dgp.utils.cloud.s3.s3_bucket(bucket_name)

Instantiate an S3 bucket object from its bucket name.

bucket_name: str

Bucket name to instantiate.

S3 bucket

dgp.utils.cloud.s3.s3_copy(source_path, target_path, verbose=True)

Copy single file from local to s3, s3 to local, or s3 to s3.

source_path: str

Path of file to copy

target_path: str

Path to copy file to

verbose: bool, optional

If True print some helpful messages. Default: True.

bool

True if successful.
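
For example, a single file copy in either direction (paths are hypothetical):

    from dgp.utils.cloud.s3 import s3_copy

    # Local to S3; swapping the arguments copies from S3 to local.
    s3_copy('/data/scene.json', 's3://my-bucket/datasets/scene.json', verbose=False)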

dgp.utils.cloud.s3.s3_recursive_list(s3_prefix)

List all files contained in an s3 location recursively and also return their md5 sums. NOTE: this is different from ‘aws s3 ls’ in that it will not return directories, but instead the full paths to the files contained in any directories (which is what S3 actually tracks).

s3_prefix: str

s3 prefix which we want the returned files to have

all_files: list[str]

List of files (with full path including ‘s3://…’)

md5_sums: list[str]

md5 sum for each of the files, as returned by the boto3 ‘ETag’ field.
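
A sketch that pairs each file with its md5 sum (the prefix is hypothetical):

    from dgp.utils.cloud.s3 import s3_recursive_list

    all_files, md5_sums = s3_recursive_list('s3://my-bucket/datasets/')
    for path, md5 in zip(all_files, md5_sums):
        print(path, md5)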

dgp.utils.cloud.s3.sync_dir(source, target, file_ext=None, verbose=True)

Sync a directory from source to target (either local to s3, s3 to s3, s3 to local)

source: str

Directory from which we want to sync files

target: str

Directory to which all files will be synced

file_ext: str, optional

Only sync files ending with this extension, e.g. ‘csv’ or ‘json’. If None, sync all files. Default: None.

verbose: bool, optional

If True, log some helpful messages. Default: True.

bool

True if the operation was successful.
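
For example, syncing only JSON files from a local directory to S3 (paths are placeholders):

    from dgp.utils.cloud.s3 import sync_dir

    sync_dir('/data/scenes', 's3://my-bucket/scenes', file_ext='json', verbose=True)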

dgp.utils.cloud.s3.sync_dir_safe(source, target, verbose=True)

Sync a directory from local to s3 by first ensuring that NONE of the files in target have been edited in source (new files may exist in source that are not in target). NOTE: only checks files from target that exist in source.

source: str

Directory from which we want to sync files

target: str

Directory to which all files will be synced

verbose: bool, optional

If True, log some helpful messages. Default: True.

files_fail_to_sync: list

List of files that failed to sync to S3 due to an md5sum mismatch.
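
A sketch that inspects the returned list (paths are hypothetical):

    from dgp.utils.cloud.s3 import sync_dir_safe

    failed = sync_dir_safe('/data/scenes', 's3://my-bucket/scenes')
    if failed:
        print('Files with md5sum mismatch:', failed)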

dgp.utils.cloud.s3.upload_file_to_s3(bucket_name, local_file_path, bucket_rel_path)

Upload a file to S3 and return the s3_url to which it was uploaded.

bucket_name: str

AWS S3 root bucket name

local_file_path: str

Local path to file we want to upload

bucket_rel_path: str

Path where file is uploaded, relative to S3 bucket root

s3_url: str

s3_url to which file was uploaded e.g. “s3://<s3-bucket>/<task-name>/<wandb-run>/<model-weights-file>”
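
For example (all names are placeholders):

    from dgp.utils.cloud.s3 import upload_file_to_s3

    s3_url = upload_file_to_s3('my-bucket', '/tmp/model.pth', 'my-task/run-01/model.pth')
    print(s3_url)  # e.g. 's3://my-bucket/my-task/run-01/model.pth'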

Module contents