MLFlowObjectStore
- class composer.utils.MLFlowObjectStore(path, multipart_upload_chunk_size=104857600)
Utility class for uploading and downloading artifacts from MLflow.
It can be initialized for an existing run, a new run in an existing experiment, the active run used by the mlflow module, or a new run in a new experiment. See the documentation for path for more details.

Note

At this time, only Databricks-managed MLflow with a "databricks" tracking URI is supported. Using this object store requires configuring Databricks authentication through a configuration file or environment variables. See https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html#databricks-native-authentication
Unlike other object stores, the DBFS URI scheme for MLflow artifacts has no bucket, and the path is prefixed with the artifacts root directory for a given experiment/run, databricks/mlflow-tracking/<experiment_id>/<run_id>/. However, object names are also sometimes passed by upstream code as artifact paths relative to this root, rather than the full path. To keep upstream code simple, MLFlowObjectStore accepts both relative MLflow artifact paths and absolute DBFS paths as object names. If an object name takes the form databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<artifact_path>, it is assumed to be an absolute DBFS path, and the <artifact_path> is used when uploading objects to MLflow. Otherwise, the object name is assumed to be a relative MLflow artifact path, and the full provided name is used as the artifact path when uploading to MLflow.

- Parameters
path (str) – A DBFS path of the form databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<path>. experiment_id and run_id can be set as the format string placeholders {mlflow_experiment_id} and {mlflow_run_id}.
If both experiment_id and run_id are set as placeholders, the MLFlowObjectStore will be associated with the currently active MLflow run if one exists. If no active run exists, a new run will be created under a default experiment name, or the experiment name specified by the MLFLOW_EXPERIMENT_NAME environment variable if one is set.
If experiment_id is provided and run_id is not, the MLFlowObjectStore will create a new run in the provided experiment.
Providing a run_id without an experiment_id will raise an error.
multipart_upload_chunk_size (int, optional) – The maximum size of a single chunk in an MLflow multipart upload. The maximum number of chunks supported by MLflow is 10,000, so the max file size that can be uploaded is 10,000 * multipart_upload_chunk_size. Defaults to 100 MB for a max upload size of 1 TB.
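A minimal usage sketch of the two initialization styles described above (the experiment/run IDs and the checkpoints artifact path are hypothetical, and Databricks authentication is assumed to already be configured):

```python
from composer.utils import MLFlowObjectStore

# Pin the store to an existing run (IDs here are hypothetical).
store = MLFlowObjectStore(
    'databricks/mlflow-tracking/123456789012/abcdef0123456789/artifacts/checkpoints',
)

# Or use both placeholders, so the store resolves to the active MLflow run,
# creating a new run (and, if needed, experiment) when none is active.
store = MLFlowObjectStore(
    'databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/checkpoints',
)
```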
- get_artifact_path(object_name)
Converts an object name into an MLflow relative artifact path.
- Parameters
object_name (str) – The object name to convert. If the object name is a DBFS path beginning with MLFLOW_DBFS_PATH_PREFIX, the path will be parsed to extract the MLflow relative artifact path. Otherwise, the object name is assumed to be a relative artifact path and will be returned as-is.
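For example, assuming the hypothetical store from the sketch above:

```python
# An absolute DBFS object name is reduced to its relative artifact path.
store.get_artifact_path(
    'databricks/mlflow-tracking/123456789012/abcdef0123456789/artifacts/checkpoints/ep1.pt'
)  # -> 'checkpoints/ep1.pt'

# A relative artifact path is returned unchanged.
store.get_artifact_path('checkpoints/ep1.pt')  # -> 'checkpoints/ep1.pt'
```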
- list_objects(prefix=None)
See list_objects(). MLFlowObjectStore does not support listing objects with a prefix, so the prefix argument is ignored.
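A short sketch of the resulting behavior, again assuming the hypothetical store from above:

```python
# Every artifact under the run is listed; the prefix has no effect,
# so these two calls return the same names.
assert store.list_objects(prefix='checkpoints/') == store.list_objects()
```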
- static parse_dbfs_path(path)
Parses a DBFS path to extract the MLflow experiment ID, run ID, and relative artifact path.
The path is expected to be of the format databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<artifact_path>.
- Parameters
path (str) – The DBFS path to parse.
- Returns
(str, str, str) – (experiment_id, run_id, artifact_path)
- Raises
ValueError – If the path is not of the expected format.
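A quick sketch of the parse (the IDs and artifact path shown are hypothetical):

```python
from composer.utils import MLFlowObjectStore

experiment_id, run_id, artifact_path = MLFlowObjectStore.parse_dbfs_path(
    'databricks/mlflow-tracking/123456789012/abcdef0123456789/artifacts/checkpoints/ep1.pt'
)
# experiment_id == '123456789012'
# run_id == 'abcdef0123456789'
# artifact_path == 'checkpoints/ep1.pt'

# A path outside the expected format raises ValueError.
MLFlowObjectStore.parse_dbfs_path('s3://my-bucket/checkpoints/ep1.pt')  # raises ValueError
```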