MLFlowObjectStore#

class composer.utils.MLFlowObjectStore(path, multipart_upload_chunk_size=104857600)[source]#

Utility class for uploading and downloading artifacts from MLflow.

It can be initialized for an existing run, a new run in an existing experiment, the active run used by the mlflow module, or a new run in a new experiment. See the documentation for path for more details.

Note

At this time, only Databricks-managed MLflow with a 'databricks' tracking URI is supported. Using this object store requires configuring Databricks authentication through a configuration file or environment variables. See https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html#databricks-native-authentication

Unlike other object stores, the DBFS URI scheme for MLflow artifacts has no bucket, and the path is prefixed with the artifacts root directory for a given experiment/run, databricks/mlflow-tracking/<experiment_id>/<run_id>/. However, object names are also sometimes passed by upstream code as artifact paths relative to this root, rather than the full path.

To keep upstream code simple, MLFlowObjectStore accepts both relative MLflow artifact paths and absolute DBFS paths as object names. If an object name takes the form of databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<artifact_path>, it is assumed to be an absolute DBFS path, and the <artifact_path> is used when uploading objects to MLflow. Otherwise, the object name is assumed to be a relative MLflow artifact path, and the full provided name will be used as the artifact path when uploading to MLflow.
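For example, assuming a run with hypothetical experiment ID 123456789 and run ID abcdef123456, both object names below refer to the same artifact; the file name is likewise hypothetical:

    # Absolute DBFS path: the trailing artifact path ("checkpoints/ep0.pt") is extracted.
    object_name = "databricks/mlflow-tracking/123456789/abcdef123456/artifacts/checkpoints/ep0.pt"

    # Relative MLflow artifact path: used as the artifact path as-is.
    object_name = "checkpoints/ep0.pt"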

Parameters
  • path (str) –

    A DBFS path of the form databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<path>. experiment_id and run_id can be set as the format string placeholders {mlflow_experiment_id} and {mlflow_run_id} (see the construction sketch after this parameter list).

    If both experiment_id and run_id are set as placeholders, the MLFlowObjectStore will be associated with the currently active MLflow run if one exists. If no active run exists, a new run will be created under a default experiment name, or the experiment name specified by the MLFLOW_EXPERIMENT_NAME environment variable if one is set.

    If experiment_id is provided and run_id is not, the MLFlowObjectStore will create a new run in the provided experiment.

    Providing a run_id without an experiment_id will raise an error.

  • multipart_upload_chunk_size (int, optional) – The maximum size of a single chunk in an MLflow multipart upload. The maximum number of chunks supported by MLflow is 10,000, so the max file size that can be uploaded is 10,000 * multipart_upload_chunk_size. Defaults to 100 MB for a max upload size of 1 TB.
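A minimal construction sketch, assuming Databricks authentication is already configured through a configuration file or environment variables; the experiment ID and trailing path segment are hypothetical:

    from composer.utils import MLFlowObjectStore

    # Placeholder form: attaches to the active MLflow run if one exists, otherwise
    # creates a new run under a default experiment (or MLFLOW_EXPERIMENT_NAME).
    store = MLFlowObjectStore(
        "databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints",
    )

    # Explicit experiment, with a new run created in it (hypothetical experiment ID).
    store = MLFlowObjectStore(
        "databricks/mlflow-tracking/123456789/{mlflow_run_id}/artifacts/checkpoints",
    )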

get_artifact_path(object_name)[source]#

Converts an object name into an MLflow relative artifact path.

Parameters

object_name (str) – The object name to convert. If the object name is a DBFS path beginning with MLFLOW_DBFS_PATH_PREFIX, the path will be parsed to extract the MLflow relative artifact path. Otherwise, the object name is assumed to be a relative artifact path and will be returned as-is.
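For instance, assuming a store constructed as above and the hypothetical IDs and file name from the earlier example, both calls below return the same relative artifact path:

    # Absolute DBFS path: the relative artifact path is extracted.
    store.get_artifact_path(
        "databricks/mlflow-tracking/123456789/abcdef123456/artifacts/checkpoints/ep0.pt"
    )  # -> "checkpoints/ep0.pt"

    # Relative artifact path: returned unchanged.
    store.get_artifact_path("checkpoints/ep0.pt")  # -> "checkpoints/ep0.pt"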

get_dbfs_path(object_name)[source]#

Converts an object name to a full DBFS path.
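Conversely, a sketch of the reverse mapping for a store associated with the hypothetical experiment and run IDs used above:

    store.get_dbfs_path("checkpoints/ep0.pt")
    # -> "databricks/mlflow-tracking/123456789/abcdef123456/artifacts/checkpoints/ep0.pt"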

list_objects(prefix=None)[source]#

See list_objects().

MLFlowObjectStore does not support listing objects with a prefix, so the prefix argument is ignored.
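A short sketch; because the prefix is ignored, both calls below return the same listing of artifacts for the store's run:

    objects = store.list_objects()
    objects = store.list_objects(prefix="checkpoints")  # prefix is ignored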

static parse_dbfs_path(path)[source]#

Parses a DBFS path to extract the MLflow experiment ID, run ID, and relative artifact path.

The path is expected to be of the format databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<artifact_path>.

Parameters

path (str) – The DBFS path to parse.

Returns

(str, str, str) – (experiment_id, run_id, artifact_path)

Raises

ValueError – If the path is not of the expected format.
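A parsing sketch with the hypothetical IDs used above; a path that does not match the expected format raises ValueError:

    from composer.utils import MLFlowObjectStore

    experiment_id, run_id, artifact_path = MLFlowObjectStore.parse_dbfs_path(
        "databricks/mlflow-tracking/123456789/abcdef123456/artifacts/checkpoints/ep0.pt"
    )
    # experiment_id == "123456789", run_id == "abcdef123456"
    # artifact_path == "checkpoints/ep0.pt"

    MLFlowObjectStore.parse_dbfs_path("some/other/path")  # raises ValueError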