
Azure Machine Learning Service — Where is My Data?

Part 4: An introduction to Datasets & Datastores

Abstract

Data handling is one of the most important aspects of a cloud machine learning platform. Every machine learning problem depends on its data, which usually comes from diverse sources and is then refined, pruned, and reshaped for analysis and for consumption by the ML model. For this reason, cloud machine learning platforms provide data management SDKs that can define, secure, and manage data sources on the platform.

This post is extracted from the Kaggle notebook hosted here. Use the link to set up and execute the experiment.


Photo by Lukas from Pexels

Datastores and Datasets

Datastores

Datastores are a data management capability, backed by an SDK, provided by the Azure Machine Learning service (AML). They let us connect to various data sources so that data can be ingested into an ML experiment, or outputs from an experiment can be written back. Azure offers many platform services that can act as a data source, e.g., Blob Storage, Data Lake, SQL Database, Databricks, and others.

The Azure ML workspace integrates naturally with Azure datastores such as Blob Storage and File Storage. However, running an ML model may require data and dependencies from other, external sources. The AML SDK therefore provides a way to register these external sources as datastores for model experiments. Defining a datastore lets us reuse the data across multiple experiments, regardless of the compute context in which the experiment runs.
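As a quick orientation, the snippet below is a minimal sketch (azureml-core v1 SDK) that lists the datastores already registered with a workspace; it assumes a local config.json for the workspace is available.

from azureml.core import Workspace

# Load the workspace from the local config.json (assumption: the file exists)
ws = Workspace.from_config()

# ws.datastores is a dict of {datastore_name: Datastore object}
for ds_name, ds in ws.datastores.items():
    print(ds_name, ':', ds.datastore_type)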

Register Datastores

As discussed, datastores are of two types: default and user-provisioned, such as Blob Storage containers or file storage. To get the default datastore of a workspace:

# Get the default datastore associated with the workspace and its name.
default_ds = ws.get_default_datastore()
default_dsname = default_ds.name
print('default Datastore = ', default_dsname)

To register a Blob Storage container as a datastore using the AML SDK:

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore backed by an Azure Blob Storage container
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='11223312345cadas6abcde789…')

To set or change the default datastore —

ws.set_default_datastore('blob_data')
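A registered datastore can also be retrieved later by name; the line below is a small sketch using the SDK's Datastore.get:

from azureml.core import Datastore

# Retrieve a previously registered datastore by its name
blob_ds = Datastore.get(ws, datastore_name='blob_data')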


View from Azure ML — Datastores

Upload files to Datastores

Upload files from the local system to the remote datastore. This allows experiments to run directly against the remote data location. The target_path is the path of the files at the remote datastore location. A reference path is returned once the files are uploaded to the datastore. When we want to use a datastore in an experiment script, we must pass this data reference to the script.

default_ds.upload_files(files=['../input/iris-flower-dataset/IRIS.csv'],
                        target_path='flower_data/',
                        overwrite=True, show_progress=True)

flower_data_ref = default_ds.path('flower_data').as_download('ex_flower_data')
print('reference_path = ', flower_data_ref)
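Downloading is not the only access mode; the same path can also be mounted on the compute, which avoids copying large datasets. The line below is a sketch of that alternative:

# Mount the same datastore path instead of downloading it (alternative to as_download)
flower_data_mount = default_ds.path('flower_data').as_mount()
print('mount reference = ', flower_data_mount)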

Experiment with Data Store

Once we have the reference to the datastore as above, we pass it to the experiment script as a script parameter from the estimator. The value of this parameter can then be retrieved inside the script and used as a local folder:

import argparse
from azureml.core import Run

run = Run.get_context()

# Define the regularization parameter for the logistic regression.
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)

# Define the data_folder parameter referencing the path of the registered data folder.
parser.add_argument('--data_folder', type=str, dest='data_folder', help='Data folder reference')

args = parser.parse_args()
r = args.reg
ex_data_folder = args.data_folder
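Inside the training script, the referenced folder can then be read like any local directory. The snippet below is a sketch that assumes the uploaded IRIS.csv keeps its original name under the downloaded folder:

import os
import pandas as pd

# Read the Iris CSV from the referenced data folder (path supplied via --data_folder)
data = pd.read_csv(os.path.join(ex_data_folder, 'IRIS.csv'))
print('rows, cols = ', data.shape)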

Define the estimator as —

from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='iris_simple_experiment.py',
                    compute_target='local',
                    use_docker=False,
                    # '--data_folder' is assigned the reference path value defined above.
                    script_params={'--reg_rate': 0.07,
                                   '--data_folder': flower_data_ref})

The '--data_folder' parameter accepts the datastore folder reference, i.e., the path where the files were uploaded. The script loads the training data from the data reference passed to it as a parameter, so we must set up the script parameters to pass the file reference when running the experiment.
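To actually run the script, the estimator is submitted to an experiment. The snippet below is a minimal sketch; the experiment name 'iris-datastore-experiment' is illustrative, not from the original notebook:

from azureml.core import Experiment

# Submit the estimator as an experiment run and stream the logs
experiment = Experiment(workspace=ws, name='iris-datastore-experiment')
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)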

Datasets

Datasets are packaged data objects that are readily consumable by machine learning pipelines, and they are the recommended way to work with data. They enable data labeling, versioning, and drift monitoring (to be discussed in upcoming posts). Datasets are defined from data already stored in datastores.

Dataset Types: We can create two types of datasets.

  • Tabular: a structured form of data, mostly read from a table, CSV, RDBMS, etc., imported from the datastores. Example: a dataframe for a regression problem.
  • File: for unstructured data types; a list of file paths served through datastores. An example use case is reading images to train a CNN.

Create Dataset

Datasets are first created from the datastores and then need to be registered. The example below shows the creation of both tabular and file datasets.

from azureml.core import Dataset

# Create a tabular dataset from files in the datastore.
tab_dataset = Dataset.Tabular.from_delimited_files(path=(default_ds, 'flower_data/*.csv'))
tab_dataset.take(10).to_pandas_dataframe()

# Similarly, create a file dataset from the files already in the datastore.
# Useful in scenarios like image processing in deep learning.
file_dataset = Dataset.File.from_files(path=(default_ds, 'flower_data/*.csv'))
for fp in file_dataset.to_path():
    print(fp)

Register Datasets

Once datasets are defined, they need to be registered with the AML workspace. At registration, metadata such as the name, description, tags, and version of the dataset is attached. Versioning lets us keep track of the dataset a model was trained on and makes it possible to train on a specific version later. The tabular and file datasets are registered as follows:

# Register the tabular dataset
tab_dataset = tab_dataset.register(workspace=ws, name='flower tab ds',
                                   description='Iris flower Dataset in tabular format',
                                   tags={'format': 'CSV'}, create_new_version=True)

# Register the file dataset
file_dataset = file_dataset.register(workspace=ws, name='flower Files ds',
                                     description='Iris flower Dataset in Files format',
                                     tags={'format': 'CSV'}, create_new_version=True)


Registered Datasets in Azure Machine Learning Service
print("Datasets Versions:")
for dataset_name in list(ws.datasets.keys()):
dataset = Dataset.get_by_name(ws, dataset_name)
print("\t", dataset.name, 'version', dataset.version)
Output >>
Datasets Versions:
flower Files ds version 1
flower tab ds version 1
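A specific version can also be pulled back by name when needed; the snippet below is a small sketch using Dataset.get_by_name (omitting version, or passing version='latest', returns the newest one):

from azureml.core import Dataset

# Retrieve version 1 of the registered tabular dataset
tab_ds_v1 = Dataset.get_by_name(ws, name='flower tab ds', version=1)
print(tab_ds_v1.name, 'version', tab_ds_v1.version)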

Experiment with the Datasets

  • In the training script, which trains a classification model, the tabular dataset is passed as an input and read as a Pandas dataframe:
data = run.input_datasets['flower_ds'].to_pandas_dataframe()
  • Use the inputs parameter of the SKLearn estimator to pass the registered dataset that the training script consumes:
inputs=[tab_dataset.as_named_input('flower_ds')]
  • Also, use the pip_packages parameter so the runtime environment provisions the packages needed to work with the dataset as a Pandas dataframe. Since the script works with a Dataset object, we must include either the full azureml-sdk package or the azureml-dataprep package with the pandas extra in the script's compute environment; a consolidated sketch follows this list.
pip_packages=['azureml-dataprep[pandas]']
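Putting these pieces together, the sketch below shows how the estimator configuration might look with the dataset passed as a named input and the pandas extra installed; it is illustrative rather than the notebook's exact code:

from azureml.train.sklearn import SKLearn

# Estimator that passes the registered tabular dataset as a named input
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='iris_simple_experiment.py',
                    compute_target='local',
                    inputs=[tab_dataset.as_named_input('flower_ds')],
                    # provision pandas support for Dataset.to_pandas_dataframe()
                    pip_packages=['azureml-dataprep[pandas]'])

# Inside 'iris_simple_experiment.py' the dataset is read back as a dataframe:
# data = Run.get_context().input_datasets['flower_ds'].to_pandas_dataframe()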

Conclusion

In this post, we covered one of the most important aspects of the AML service: data management for models and experiments. Data is a fundamental element of any machine learning workload, so we learned how to create and manage datastores and datasets in an Azure Machine Learning workspace, and how to use them in model training experiments.

In the next post, we will touch upon the various compute environments supported by the AML service. Until then, stay tuned!

References

[1] Notebook & Code — Azure Machine Learning — Working with Data , Kaggle.

[2] Azure ML Service, Data reference guide , Official Documentation, Microsoft Azure.

[3] Azure Machine Learning Service Official Documentation, Microsoft Azure.

