S3 - Accessing data in S3 quickly
The S3 R6 class is a wrapper over the standard AWS Python library, boto3, accessed through reticulate. It contains enhancements that are relevant for data-intensive applications:
- Supports accessing large amounts of data quickly through parallel operations (functions with the `_many` suffix). You can download at up to 20 Gbps on a large EC2 instance.
- Improved error handling.
- Supports versioned data through `S3$new(run=self)` and `S3$new(run=Run)`.
- User-friendly API with minimal boilerplate.
- Convenient API for advanced features such as range requests (downloading partial files) and object headers.
The S3 R6 Class
S3
The Metaflow S3 client.
This object manages the connection to S3 and a temporary directory that is used to download objects. Note that in most cases when the data fits in memory, no local disk IO is needed as operations are cached by the operating system, which makes operations fast as long as there is enough memory available.
The easiest way to use this object is as follows:
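A minimal sketch using an explicit prefix; the bucket, prefix, and key names below are hypothetical:

```r
library(metaflow)

# Initialize the client with an explicit S3 prefix
s3 <- S3$new(s3root = "s3://mybucket/some/path")

# Upload an R object under the prefix...
s3$put("greeting", "hello world")

# ...and download it again; the contents are accessible via $text
obj <- s3$get("greeting")
print(obj$text)
```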
You can customize the location of the temporary directory with `tmproot`. It defaults to the current working directory.
To make it easier to deal with object locations, the client can be initialized with an S3 path prefix. There are three ways to handle locations:
1. Use a `metaflow.Run` object or `self`, e.g. `S3$new(run=self)`, which initializes the prefix with the global `DATATOOLS_S3ROOT` path, combined with the current run ID. This mode makes it easy to version data based on the run ID consistently. You can use `bucket` and `prefix` to override parts of `DATATOOLS_S3ROOT`.
2. Specify an S3 prefix explicitly with `s3root`, e.g. `S3$new(s3root='s3://mybucket/some/path')`.
3. Specify nothing, i.e. `S3$new()`, in which case all operations require a full S3 url prefixed with `s3://`.
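The three modes might be sketched as follows (bucket and paths are hypothetical; the `run = self` form assumes execution inside a flow step):

```r
# Mode 1: inside a flow step, derive the prefix from the current run ID
s3 <- S3$new(run = self)

# Mode 2: set an explicit S3 prefix
s3 <- S3$new(s3root = "s3://mybucket/some/path")

# Mode 3: no prefix; every key must be a full s3:// url
s3 <- S3$new()
obj <- s3$get("s3://mybucket/some/path/object")
```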
Parameters
- `tmproot` (character, default: `'.'`): Where to store the temporary directory.
- `bucket` (character, optional): Override the bucket from `DATATOOLS_S3ROOT` when `run` is specified.
- `prefix` (character, optional): Override the path from `DATATOOLS_S3ROOT` when `run` is specified.
- `run` (FlowSpec or Run, optional): Derive the path prefix from the current or a past run ID, e.g. `S3$new(run=self)`.
- `s3root` (character, optional): If `run` is not specified, use this as the S3 prefix.
Downloading data
S3$get
`S3$get(key = NULL, return_missing = FALSE, return_info = TRUE)`

Get a single object from S3.
Parameters
- `key` (character or `S3GetObject`, optional, default NULL): Object to download. It can be an S3 url, a path suffix, or an `S3GetObject` that defines a range of data to download. If NULL or not provided, gets the S3 root.
- `return_missing` (logical, default FALSE): If set to TRUE, do not raise an exception for a missing key but return it as an `s3_object` with `$exists == FALSE`.
- `return_info` (logical, default TRUE): If set to TRUE, fetch the content-type and user metadata associated with the object at no extra cost, included for symmetry with `get_many`.
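As a sketch, `return_missing` lets you probe for a key without triggering an error (the key name is hypothetical):

```r
# Returns an s3_object with $exists == FALSE instead of raising an error
obj <- s3$get("maybe-missing.csv", return_missing = TRUE)
if (obj$exists) {
  # Read straight from the local temporary file
  df <- read.csv(obj$path)
}
```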
S3$get_many
`S3$get_many(keys, return_missing = FALSE, return_info = TRUE)`

Get many objects from S3 in parallel.
Parameters
- `keys` (list): Objects to download. Each object can be an S3 url, a path suffix, or an `S3GetObject` that defines a range of data to download.
- `return_missing` (logical, default FALSE): If set to TRUE, do not raise an exception for a missing key but return it as an `s3_object` with `$exists == FALSE`.
- `return_info` (logical, default TRUE): If set to TRUE, fetch the content-type and user metadata associated with each object at no extra cost.
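A sketch of a parallel download, assuming the keys below exist under the configured prefix:

```r
# Download several objects in one parallel call
objs <- s3$get_many(list("part-1.csv", "part-2.csv", "part-3.csv"))

# Each element is an s3_object; e.g. read and concatenate the files
dfs <- lapply(objs, function(obj) read.csv(obj$path))
combined <- do.call(rbind, dfs)
```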
S3$get_recursive
`S3$get_recursive(keys, return_info = FALSE)`

Get many objects from S3 recursively in parallel.
S3$get_all
`S3$get_all(return_info = FALSE)`

Get all objects under the prefix set in the S3 constructor.
This method requires that the S3 object is initialized either with run or s3root.
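For instance, a run-scoped client can fetch everything written under the current run ID; this sketch assumes it is executed inside a flow step:

```r
# Requires a prefix: initialize with run (or s3root)
s3 <- S3$new(run = self)

# Download every object stored under the run's prefix
all_objs <- s3$get_all()
for (obj in all_objs) {
  message(obj$key, ": ", obj$size, " bytes")
}
```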
Listing objects
S3$list_paths
`S3$list_paths(keys = NULL)`

List the next level of paths in S3.
If multiple keys are specified, listings are done in parallel. The returned s3_object instances have $exists == FALSE if the path refers to a prefix, not an existing S3 object.
For instance, if the directory hierarchy is:
```
a/0.txt
a/b/1.txt
a/c/2.txt
a/d/e/3.txt
f/4.txt
```
The `S3$list_paths(c('a', 'f'))` call returns:

```
a/0.txt (exists == TRUE)
a/b/ (exists == FALSE)
a/c/ (exists == FALSE)
a/d/ (exists == FALSE)
f/4.txt (exists == TRUE)
```
S3$list_recursive
`S3$list_recursive(keys = NULL)`

List all objects recursively under the given prefixes.
If multiple keys are specified, listings are done in parallel. All objects returned have $exists == TRUE as this call always returns leaf objects.
For instance, if the directory hierarchy is:
```
a/0.txt
a/b/1.txt
a/c/2.txt
a/d/e/3.txt
f/4.txt
```
The `S3$list_recursive(c('a', 'f'))` call returns:

```
a/0.txt (exists == TRUE)
a/b/1.txt (exists == TRUE)
a/c/2.txt (exists == TRUE)
a/d/e/3.txt (exists == TRUE)
f/4.txt (exists == TRUE)
```
Uploading data
S3$put
`S3$put(key, obj, overwrite = TRUE, content_type = NULL, metadata = NULL)`

Upload a single object to S3.
Parameters
- `key` (character or `S3PutObject`): Object path. It can be an S3 url or a path suffix.
- `obj` (any R object): An object to store in S3.
- `overwrite` (logical, default TRUE): Overwrite the object if it exists. If set to FALSE, the operation succeeds without uploading anything if the key already exists.
- `content_type` (character, optional, default NULL): Optional MIME type for the object.
- `metadata` (list, optional, default NULL): A list of additional headers to be stored as metadata with the object.
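For example, a JSON document could be uploaded with an explicit MIME type and a custom metadata header (the key and header below are hypothetical):

```r
# Store a string as application/json with a user-defined metadata header
s3$put(
  "results/summary.json",
  '{"rows": 100}',
  content_type = "application/json",
  metadata = list(producer = "myflow")
)
```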
S3$put_many
`S3$put_many(key_objs, overwrite = TRUE)`

Upload many objects to S3.
Each object to be uploaded can be specified in two ways:
- As a `list(key, obj)` where `key` is a string specifying the path and `obj` is any R object.
- As an `S3PutObject` which contains additional metadata to be stored with the object.
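A sketch of the plain `list(key, obj)` form, with hypothetical keys:

```r
# Upload several R objects in one parallel call
s3$put_many(list(
  list("alpha.txt", "first object"),
  list("beta.txt", "second object")
))
```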
S3$put_files
`S3$put_files(key_paths, overwrite = TRUE)`

Upload many local files to S3.
Each file to be uploaded can be specified in two ways:
- As a `list(key, path)` where `key` is a string specifying the S3 path and `path` is the path to a local file.
- As an `S3PutObject` which contains additional metadata to be stored with the file.
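For instance, local files could be shipped to S3 like this (the keys and local paths are hypothetical):

```r
# Map S3 keys to local file paths and upload them in parallel
s3$put_files(list(
  list("models/model.rds", "/tmp/model.rds"),
  list("logs/train.log", "/tmp/train.log")
))
```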
Querying metadata
S3$info
`S3$info(key = NULL, return_missing = FALSE)`

Get metadata about a single object in S3.
This call makes a single HEAD request to S3 which can be much faster than downloading all data with get.
S3$info_many
`S3$info_many(keys, return_missing = FALSE)`

Get metadata about many objects in S3 in parallel.

This call makes HEAD requests to S3 in parallel, which can be much faster than downloading all data with get_many.
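As a sketch, object sizes can be checked before deciding what to download (the key names are hypothetical):

```r
# HEAD requests only: no object data is transferred
infos <- s3$info_many(list("part-1.csv", "part-2.csv"))
for (info in infos) {
  message(info$key, " is ", info$size, " bytes")
}
```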
Handling results with s3_object
Most operations above return s3_object instances that encapsulate information about S3 paths and objects.
Note that the data itself is not kept in these objects; it is stored in a temporary directory and accessed through the properties of this object.
s3_object
This object represents a path or an object in S3, with an optional local copy.
s3_object instances are not instantiated directly, but they are returned by many methods of the S3 R6 class.
s3_object$downloaded
Has this object been downloaded?
If TRUE, the contents can be accessed through the $path, $blob, and $text properties.
s3_object$key
Key corresponds to the key given to the get call that produced this object.
This may be a full S3 URL or a suffix based on what was requested.
s3_object$path
Path to a local temporary file corresponding to the object downloaded.
This file gets deleted automatically when an S3 scope exits. Returns NULL if this s3_object has not been downloaded.
s3_object$blob
Contents of the object as a byte string or NULL if the object hasn’t been downloaded.
s3_object$text
Contents of the object as a string or NULL if the object hasn’t been downloaded.
The object is assumed to contain UTF-8 encoded data.
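For example, the same downloaded object can be read either as raw bytes or as decoded text (the key is hypothetical):

```r
obj <- s3$get("greeting")
if (obj$downloaded) {
  bytes <- obj$blob  # raw contents
  txt   <- obj$text  # contents decoded as UTF-8
}
```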
s3_object$size
Size of the object in bytes.
Returns NULL if the key does not correspond to an object in S3.
s3_object$has_info
Returns TRUE if this s3_object contains the content-type MIME header or user-defined metadata.

If FALSE, `$content_type`, `$metadata`, `$range_info`, and `$last_modified` will return NULL.
