SETTING UP AN EXPERIMENTATION DATA SCIENCE ENVIRONMENT IN AZURE
Cloud providers offer a wide variety of options when it comes to data science. While some generic tools can be adapted to data science needs, there has also been a surge in machine-learning-specific offerings, from optimized virtual machines to no-code drag-and-drop tools. This abundance of options means greater flexibility in setting up an environment for your specific needs. But it can also be daunting to navigate the different tools and answer a simple question: where to start? If you have some data on your hands and you want to do some data science NOW, this post will explain how to “switch on” the infrastructure you need.
Just so we are on the same page: this post assumes some familiarity with common data science tools such as Jupyter notebooks and command-line interfaces. We are not aiming to provide an exhaustive introduction to Azure or its components, nor are we trying to re-print the official documentation (which will be heavily referenced). We want the reader to be able to get messy with code in 20 minutes. This is the way we like to work on our projects, as it allows emerging needs to inform the infrastructural decisions and not the other way around. If that sounds like you, let’s begin.
Machine Learning Workspace
The quickest thing you can do in a cloud environment is to create a virtual machine to run some code on. We could have ended the post here, but that wouldn’t be exciting, would it? Well, neither are bare virtual machines; it’s 2020! Let’s take advantage of all the work that developers have put in to make our lives easier. For example, the people at Microsoft Azure have created a machine learning framework called Azure Machine Learning (duh), which we are going to step into today.
Azure ML is a resource designed to help you manage your data science workflow end-to-end. To create an Azure ML workspace, go to the Azure Portal and start typing ‘machine learning’ in the search field.
The first result is what we are looking for. After choosing a name, region and subscription for your workspace, click create and you should see a UI similar to the one below. This is the home for the full machine learning workflow. Inside the workspace you will notice different tabs: Experiments, Pipelines, Compute, Models and Images. Each of them has an intended use and deserves its own post. However, we promised you a quick start. So without further ado, let’s get our hands dirty with some Azure notebook virtual machines!
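If you prefer code over clicking, the workspace can also be created programmatically with the azureml-core package. A minimal sketch, assuming you have run `pip install azureml-core` and are logged in to an Azure subscription; all names and the region below are placeholders of our own choosing:

```python
from azureml.core import Workspace

# Placeholder names -- substitute your own subscription and resource group.
ws = Workspace.create(name="my-workspace",
                      subscription_id="<your-subscription-id>",
                      resource_group="my-resource-group",
                      create_resource_group=True,
                      location="westeurope")

# Write a config.json so later scripts can reconnect via Workspace.from_config().
ws.write_config()
```

Saving the config file once means every subsequent script or notebook can attach to the same workspace with a single `Workspace.from_config()` call.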
Azure Notebook VMs
The next step is to create a compute resource. Go to the Compute tab. There are four different compute types available to you. Make sure the ‘Compute Instances’ tab is selected and click ‘New’. You will then be prompted to fill in the details for your VM: pick a name, a preferred region, and the type and size of the machine.
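The portal form is the quickest route, but the same compute instance can also be provisioned from the SDK, which is handy once you start scripting your setup. A sketch assuming azureml-core is installed and a workspace config.json exists; the VM size and name are placeholders:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.from_config()  # reads the workspace config.json written earlier

# Placeholder size and name -- pick what matches your workload and budget.
config = ComputeInstance.provisioning_configuration(vm_size="STANDARD_DS3_V2")
instance = ComputeTarget.create(ws, "my-notebook-vm", config)
instance.wait_for_completion(show_output=True)  # blocks until the VM is ready
```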
At the bottom of the ‘New Compute Instance’ page there is a toggle for enabling SSH access. Choosing this option will allow you to access your virtual machine directly from your terminal and operate the VM’s terminal. It can be useful if you have a big training script to run and you don’t need the Jupyter interface. When you enable the toggle, you will be asked for an SSH public key, which will be used for the connection. If you are not familiar with SSH, you can follow Azure’s official guidance on creating SSH key pairs. Click create and wait for the virtual machine to be added to the list of compute instances.
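If you don’t have a key pair yet, one can be generated locally with ssh-keygen. A quick sketch; the file name is our own choice, and `-N ""` skips the passphrase for brevity (consider setting one in practice):

```shell
# Create a 4096-bit RSA key pair in the current directory.
# -N "" means no passphrase; -q suppresses the banner output.
ssh-keygen -t rsa -b 4096 -f azureml_vm -N "" -q

# Paste the contents of the public key into the compute instance form;
# the private key (azureml_vm) stays on your machine.
cat azureml_vm.pub
```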
These compute instances used to be called Notebook VMs because Jupyter (or RStudio) was the only entry point to them. Even then, running notebooks was not the limit of the virtual machine: you could use the Jupyter-integrated terminal, for example, to run .py files. Now you have an SSH interface which gives you pretty much full access to the machine, and the Jupyter entry point is a nice extra. But for data science this ‘extra’ is very important, so let’s start with that. When your VM is created, it will start automatically. All you have to do is pick an ‘Application URI’. Our personal favorite is JupyterLab.
You are then presented with the familiar JupyterLab UI. The storage loaded in the file browser is the storage account associated with the workspace we created. To find it, go to the resource group hosting the machine learning workspace and look for the storage account whose name includes the name of the workspace. Inside this storage account’s File Shares there is a directory whose name starts with ‘code-’; the ‘Users’ folder you see in JupyterLab lives there. You can use this Azure portal interface to upload files, or to download outputs after you’ve run some tests.
Now that you are inside JupyterLab, you can use the terminal there to clone your code from GitHub and work with git. To make sure you have all your dependencies, Conda is already installed on the machine, so you can recreate an environment identical to one you used elsewhere, or create a new environment that you can later export. Beware of operating system incompatibilities: a Conda environment .yaml file exported on Windows will NOT recreate an identical environment on Linux (Azure Notebook VMs are Linux-based). You can either manually edit the file, or use pip freeze to export the dependencies and install them on the new machine.
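One built-in way around this is `conda env export --no-builds`, which omits the platform-specific build strings. If you already have a fully pinned environment file, the manual editing can also be scripted. The helper below is our own sketch, not a Conda utility, and OS-specific packages may still need pruning by hand:

```python
def strip_build_strings(yaml_text: str) -> str:
    """Drop the build-string field from conda dependency lines,
    e.g. '- numpy=1.18.1=py37h95a1406_0' becomes '- numpy=1.18.1'."""
    cleaned = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # Conda pins look like 'name=version=build'; pip pins use '==',
        # so leave those (and all other lines) untouched.
        if stripped.startswith("- ") and stripped.count("=") == 2 and "==" not in stripped:
            name, version, _build = stripped[2:].split("=")
            indent = line[: len(line) - len(line.lstrip())]
            cleaned.append(f"{indent}- {name}={version}")
        else:
            cleaned.append(line)
    return "\n".join(cleaned)


print(strip_build_strings("  - numpy=1.18.1=py37h95a1406_0"))
# ->   - numpy=1.18.1
```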
Now that your data is in place and you have created your environment, you are ready to explore your data and play with the notebooks.
After you’ve done everything you want, you can end the session by simply closing the browser window with the lab. If you have left a notebook to run, you can start this URI again from the portal, but make sure you leave your virtual machine on. If you are not leaving anything to run on the background, make sure you stop the virtual machine to save the resources.
Obviously, we all love Jupyter, but we don’t always need notebooks; sometimes we just need to set a script running. While that can easily be done via the Jupyter terminal, there is another, less bulky way to connect to the VM: SSH. In the same compute overview, press the ‘SSH’ URI option; it will show information about the VM and a connection string at the bottom. Copy this connection string and run it from a terminal on your local machine. Enter the passphrase if you created one with your SSH key, and the virtual machine’s terminal is at your fingertips.
You can set a script running and simply end the SSH session. Just bear in mind that to keep the script running after you have logged out, you need to add ‘nohup’ (no hang-up) at the beginning of the command and ‘&’ at the end to run it in the background. So to run a ‘train.py’ script, you would run ‘nohup python train.py &’. Then you can press Ctrl-D to end the SSH session (or type ‘exit’ into the terminal) and come back later to reap the results of your runs.
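The pattern above can be sketched end-to-end. The tiny stand-in script and the log file name are our own choices for the demo; in practice you would substitute your real train.py and skip the `wait`:

```shell
# Stand-in training script -- replace with your real train.py.
printf 'print("epoch 1 done")\n' > train.py

# 'nohup' detaches the process from the terminal's hang-up signal,
# '&' sends it to the background, and the redirect captures its output.
nohup python3 train.py > train.log 2>&1 &
echo "started run with PID $!"

wait            # demo only: in a real session you would just log out here
cat train.log   # check progress later with: tail -f train.log
```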
A brief note on experiment and model tracking
So far, we have learned how to create an Azure Machine Learning workspace and use it in its core, but quite limited, function: running code in the cloud. However, this is just the tip of the iceberg. To take full advantage of the AML framework, we encourage you to explore a Python library called the Azure Machine Learning SDK. One of the things it allows you to do is track metrics throughout your experiments. If automatic experiment tracking is not yet a part of your workflow, we encourage you to make it one. Results of your experiments will pile up pretty quickly. Without a neat way to track them, you will end up with a ‘soup’ of parameters, losses and accuracies that is hard to sort through straight away, let alone go back in time to see why certain decisions were made.
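As a small taste of the SDK, the sketch below logs a couple of metrics to a run. It assumes azureml-core is installed and a workspace config.json is present, so it only runs against a live workspace; the experiment name and metric values are illustrative placeholders:

```python
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()                       # connect to the workspace
exp = Experiment(workspace=ws, name="quickstart")  # placeholder experiment name

run = exp.start_logging()        # start an interactive run (e.g. from a notebook)
run.log("learning_rate", 0.01)   # values here are illustrative only
run.log("val_accuracy", 0.93)
run.complete()                   # the run now shows up under the Experiments tab
```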
Tracking with AML, like tracking with MLflow, allows you to organize your work into experiments, each containing several runs that track the same metrics. The UI lets you take a quick look at the relationships between the different variables you are tracking and dig deeper into the details when you need to.