dgp.utils.cloud package¶
dgp.utils.cloud.s3 module¶
Various utility functions for AWS S3
- class dgp.utils.cloud.s3.RemoteArtifactFile(artifact)¶
Bases: object
- dgp.utils.cloud.s3.convert_uri_to_bucket_path(uri, strip_trailing_slash=True)¶
Parse a URI into a bucket and path.
- uri: str
A full s3 path (e.g. ‘s3://<s3-bucket>/<s3-prefix>’)
- strip_trailing_slash: bool, optional
If True, we strip any trailing slash in s3_path before returning it (i.e. s3_path=’<s3-prefix>’ is returned for uri=’s3://<s3-bucket>/<s3-prefix>/’). Otherwise, we do not strip it (i.e. s3_path=’<s3-prefix>/’ is returned for uri=’s3://<s3-bucket>/<s3-prefix>/’).
If there is no trailing slash in uri then there will be no trailing slash in s3_path regardless of the value of this parameter. Default: True.
- str:
The bucket of the S3 object (e.g. ‘<s3-bucket>’)
- str:
The path within the bucket of the S3 object (e.g. ‘<s3-prefix>’)
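A minimal usage sketch (the bucket and prefix names below are placeholders):
from dgp.utils.cloud.s3 import convert_uri_to_bucket_path

# Split a full S3 URI into its bucket and in-bucket path.
bucket, s3_path = convert_uri_to_bucket_path("s3://my-bucket/datasets/scene/")
# bucket == "my-bucket"; s3_path == "datasets/scene" (trailing slash stripped by default)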
- dgp.utils.cloud.s3.delete_s3_object(bucket_name, object_key, verbose=True)¶
Delete an object in s3.
- bucket_name: str
S3 bucket of the object to delete.
- object_key: str
Key name of the object to delete.
- verbose: bool, optional
If True print messages. Default: True.
- dgp.utils.cloud.s3.download_s3_object_to_path(bucket_name, s3_path, local_path)¶
Download an object from s3 to a path on the local filesystem.
- bucket_name: str
S3 bucket from which to fetch.
- s3_path: str
The path on the s3 bucket to the file to fetch.
- local_path: str
Path on the local filesystem to place the object.
- bool
True if successful, False if not.
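A minimal usage sketch (the bucket, key, and local path below are placeholders):
from dgp.utils.cloud.s3 import download_s3_object_to_path

# Fetch a single object into a local file; returns True on success.
success = download_s3_object_to_path(
    bucket_name="my-bucket",
    s3_path="datasets/scene/scene.json",
    local_path="/tmp/scene.json",
)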
- dgp.utils.cloud.s3.exists_s3_object(bucket_name, url)¶
Uses a valid S3 bucket to check the existence of a remote file, given its AWS URL.
- bucket_name: str
AWS S3 root bucket name
- url: str
The remote URL of the object to be fetched
- bool:
Whether or not the object exists in S3.
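A minimal existence check, assuming url is the object’s path within the bucket (names are placeholders):
from dgp.utils.cloud.s3 import exists_s3_object

# Check for an object before attempting to fetch it.
if exists_s3_object("my-bucket", "datasets/scene/scene.json"):
    print("object found")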
- dgp.utils.cloud.s3.get_s3_object(bucket_name, url)¶
Uses a valid S3 bucket to retrieve a file from a remote AWS URL. Raises ValueError if non-existent.
- bucket_name: str
AWS S3 root bucket name
- url: str
The remote URL of the object to be fetched
S3 object
- ValueError
Raised if url cannot be found in S3.
- dgp.utils.cloud.s3.get_string_from_s3_file(bucket_name, url)¶
Uses a valid S3 bucket to retrieve a UTF-8-decoded string from a remote AWS URL. Raises ValueError if non-existent.
- bucket_name: str
AWS S3 root bucket name
- url: str
The remote URL of the object to be fetched
A string representation of the remote file
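A minimal sketch for reading a small text object into memory (names are placeholders):
from dgp.utils.cloud.s3 import get_string_from_s3_file

# Raises ValueError if the object does not exist.
text = get_string_from_s3_file("my-bucket", "datasets/scene/metadata.txt")
print(text)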
- dgp.utils.cloud.s3.init_s3_client(use_ssl=False)¶
Initialize an AWS S3 client.
- use_ssl: bool, optional
Use Secure Sockets Layer. Provides better security for S3, but can fail intermittently in a multithreaded environment. Default: False.
- service: boto3.client
S3 resource service client.
- dgp.utils.cloud.s3.is_false(value)¶
- dgp.utils.cloud.s3.list_prefixes_in_s3(s3_prefix)¶
List prefixes in an S3 path. CAVEAT: This function has only been tested on S3 prefixes for files. E.g., if the bucket looks like the following,
s3://aa/bb/cc_v01.json
s3://aa/bb/cc_v02.json
s3://aa/bb/cc_v03.json
then list_prefixes_in_s3(“s3://aa/bb/cc_v”) returns [‘cc_v01.json’, ‘cc_v02.json’, ‘cc_v03.json’]
- s3_prefix: str
An S3 prefix.
- prefixes: List of str
List of basenames that start with the basename of s3_prefix.
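A sketch mirroring the example above (the bucket layout is hypothetical):
from dgp.utils.cloud.s3 import list_prefixes_in_s3

# For the bucket layout shown above, this returns the matching basenames.
prefixes = list_prefixes_in_s3("s3://aa/bb/cc_v")
# prefixes == ["cc_v01.json", "cc_v02.json", "cc_v03.json"]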
- dgp.utils.cloud.s3.list_prefixes_in_s3_dir(s3_path)¶
List prefixes in an S3 path. CAVEAT: This function has only been tested on S3 paths that contain purely one-level prefixes, i.e. no regular objects.
- s3_path: str
An S3 path.
- prefixes: List of str
List of prefixes under s3_path
- dgp.utils.cloud.s3.list_s3_objects(bucket_name, url)¶
List all files within a valid S3 bucket.
- bucket_name: str
AWS S3 root bucket name
- url: str
The remote URL prefix of the objects to be listed
- dgp.utils.cloud.s3.open_remote_json(s3_path)¶
Loads a remote JSON file.
- s3_path: str
Full s3 path to JSON file
- dict:
Loaded JSON
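A minimal usage sketch (the path is a placeholder):
from dgp.utils.cloud.s3 import open_remote_json

# Load a remote JSON file into a dict.
metadata = open_remote_json("s3://my-bucket/datasets/scene/scene.json")
print(sorted(metadata.keys()))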
- dgp.utils.cloud.s3.parallel_download_s3_objects(s3_files, destination_filepaths, bucket_name, process_pool_size=None)¶
Download a list of s3 objects using a process pool.
- s3_files: list
List of all files from s3 to download
- destination_filepaths: list
List of where to download all files locally
- bucket_name: str
S3 bucket to pull from
- process_pool_size: int, optional
Number of worker processes to use to fetch these files. If not specified, defaults to the number of cores on the machine. Default: None.
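A minimal sketch downloading two objects in parallel (bucket, keys, and paths are placeholders):
from dgp.utils.cloud.s3 import parallel_download_s3_objects

s3_files = ["datasets/scene/scene.json", "datasets/scene/calibration.json"]
destination_filepaths = ["/tmp/scene.json", "/tmp/calibration.json"]
# With process_pool_size=None, one worker per core is used.
parallel_download_s3_objects(s3_files, destination_filepaths, bucket_name="my-bucket")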
- dgp.utils.cloud.s3.parallel_s3_copy(source_paths, target_paths, threadpool_size=None)¶
Copy files from local to s3, s3 to local, or s3 to s3 using a threadpool. Retries the operation if any files fail to copy, and raises an AssertionError if the retry also fails.
- source_paths: List of str
Full paths of files to copy.
- target_paths: List of str
Full paths to copy files to.
- threadpool_size: int
Number of threads to use to fetch these files. If not specified, defaults to the number of cores on the machine.
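A minimal sketch mixing copy directions in one call (all paths are placeholders):
from dgp.utils.cloud.s3 import parallel_s3_copy

source_paths = ["/tmp/scene.json", "s3://my-bucket/old/calibration.json"]
target_paths = ["s3://my-bucket/new/scene.json", "s3://my-bucket/new/calibration.json"]
# Copies the local-to-s3 and s3-to-s3 pairs concurrently, retrying failures once.
parallel_s3_copy(source_paths, target_paths)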
- dgp.utils.cloud.s3.parallel_s3_sync(source_paths, target_paths, threadpool_size=None)¶
Copy directories from local to s3, s3 to local, or s3 to s3 using a threadpool. Retries the operation if any files fail to sync, and raises an AssertionError if the retry also fails.
- source_paths: List of str
Directories from which we want to sync files.
- target_paths: List of str
Directories to which all files will be synced.
- threadpool_size: int
Number of threads to use to fetch these files. If not specified, defaults to the number of cores on the machine.
- dgp.utils.cloud.s3.parallel_upload_s3_objects(source_filepaths, s3_destinations, bucket_name, process_pool_size=None)¶
Upload a list of s3 objects using a process pool.
- source_filepaths: list
List of all local files to upload.
- s3_destinations: list
Paths relative to bucket in S3 where files are uploaded.
- bucket_name: str
S3 bucket to upload to.
- process_pool_size: int, optional
Number of worker processes to use to upload these files. If not specified, defaults to the number of cores on the machine. Default: None.
- dgp.utils.cloud.s3.return_last_value(retry_state)¶
Return the result of the last call attempt.
- retry_state: tenacity.RetryCallState
Retry-state metadata for a flaky call.
- dgp.utils.cloud.s3.s3_bucket(bucket_name)¶
Instantiate an S3 bucket object from its bucket name.
- bucket_name: str
Bucket name to instantiate.
S3 bucket
- dgp.utils.cloud.s3.s3_copy(source_path, target_path, verbose=True)¶
Copy a single file from local to s3, s3 to local, or s3 to s3.
- source_path: str
Path of file to copy
- target_path: str
Path to copy file to
- verbose: bool, optional
If True print some helpful messages. Default: True.
bool: True if successful
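A minimal usage sketch (paths are placeholders):
from dgp.utils.cloud.s3 import s3_copy

# Copy one local file up to S3; returns True on success.
s3_copy("/tmp/scene.json", "s3://my-bucket/datasets/scene/scene.json")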
- dgp.utils.cloud.s3.s3_recursive_list(s3_prefix)¶
List all files contained in an s3 location recursively and also return their md5 sums. NOTE: this differs from ‘aws s3 ls’ in that it will not return directories, but instead the full paths to the files contained in any directories (which is what s3 is actually tracking).
- s3_prefix: str
s3 prefix which we want the returned files to have
- all_files: list[str]
List of files (with full path including ‘s3://…’)
- md5_sums: list[str]
md5 sum for each of the files, as returned by the boto3 ‘ETag’ field
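A minimal usage sketch (the prefix is a placeholder):
from dgp.utils.cloud.s3 import s3_recursive_list

# Recursively list full "s3://..." file paths and their ETag-based md5 sums.
all_files, md5_sums = s3_recursive_list("s3://my-bucket/datasets")
for path, md5 in zip(all_files, md5_sums):
    print(path, md5)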
- dgp.utils.cloud.s3.sync_dir(source, target, file_ext=None, verbose=True)¶
Sync a directory from source to target (either local to s3, s3 to s3, or s3 to local).
- source: str
Directory from which we want to sync files
- target: str
Directory to which all files will be synced
- file_ext: str, optional
Only sync files ending with this extension, e.g. ‘csv’ or ‘json’. If None, sync all files. Default: None.
- verbose: bool, optional
If True, log some helpful messages. Default: True.
bool: True if operation was successful
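A minimal usage sketch syncing only JSON files (paths are placeholders):
from dgp.utils.cloud.s3 import sync_dir

# Sync the JSON files from a local directory up to S3; returns True on success.
sync_dir("/tmp/scene", "s3://my-bucket/datasets/scene", file_ext="json")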
- dgp.utils.cloud.s3.sync_dir_safe(source, target, verbose=True)¶
Sync a directory from local to s3, first ensuring that NONE of the files in target have been edited relative to source (new files may exist in source that are not in target). NOTE: only files in target that also exist in source are checked.
- source: str
Directory from which we want to sync files
- target: str
Directory to which all files will be synced
- verbose: bool, optional
If True, log some helpful messages. Default: True.
- files_fail_to_sync: list
List of files that failed to sync to S3 due to an md5sum mismatch.
- dgp.utils.cloud.s3.upload_file_to_s3(bucket_name, local_file_path, bucket_rel_path)¶
Upload a local file to S3.
- bucket_name: str
AWS S3 root bucket name
- local_file_path: str
Local path to file we want to upload
- bucket_rel_path: str
Path where file is uploaded, relative to S3 bucket root
- s3_url: str
s3_url to which file was uploaded e.g. “s3://<s3-bucket>/<task-name>/<wandb-run>/<model-weights-file>”
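A minimal usage sketch (the bucket and paths are placeholders):
from dgp.utils.cloud.s3 import upload_file_to_s3

# Upload one local file; returns the resulting s3:// URL.
s3_url = upload_file_to_s3(
    bucket_name="my-bucket",
    local_file_path="/tmp/model.pth",
    bucket_rel_path="task/run/model.pth",
)
print(s3_url)  # e.g. "s3://my-bucket/task/run/model.pth"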