Azure ML: Creating and Retrieving Compute Instances, Compute Clusters, and Inference Clusters

大大源码 • April 18, 2023, 10:40 pm • Uncategorized

This post explains how to create and retrieve a Compute Instance, a Compute Cluster, and an Inference Cluster using the Azure ML Python SDK and the Azure Portal.

1 Azure Compute Instance

An Azure Compute Instance (officially, an Azure Machine Learning compute instance) is essentially a virtual machine on which we can run our everyday code.

In the Azure Portal, sign in to Azure Machine Learning Studio and click Compute. On the Compute Instance tab, click Create.

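Three things are worth noting. First, each VM size is priced differently. If you need a GPU, you do not have to choose one here, because GPU instances are expensive; you can select a GPU size for a compute cluster instead. Second, if your account's quota is insufficient, you must request an increase, otherwise some VM sizes cannot be selected. Third, VMs are billed while they run, so remember to stop a newly created instance when you are not using it, or you will keep being charged.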

We can also create a Compute Instance with the Python SDK:

from azureml.core.compute import ComputeInstance
from azureml.core.workspace import Workspace

# Choose a name for your instance
# Compute instance name should be unique across the azure region
compute_name = "XXX"
ws = Workspace.from_config()
compute_config = ComputeInstance.provisioning_configuration(
    vm_size='STANDARD_D3_V2',
    ssh_public_access=False )
instance = ComputeInstance.create(ws, compute_name, compute_config)
instance.wait_for_completion(show_output=True)

The following code retrieves an existing Compute Instance:

from azureml.core.compute import ComputeInstance
from azureml.core.workspace import Workspace
compute_name = "XXX"
ws = Workspace.from_config()
instance = ComputeInstance(workspace=ws, name=compute_name)
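
Since an instance is billed while it runs, it can help to stop it from code once the work is done. The following is a minimal sketch rather than an official recipe: the helper name `stop_if_running` is hypothetical, and it assumes that `get_status()` returns an object whose `state` attribute is a string such as "Running".

```python
def stop_if_running(instance):
    """Stop a compute instance only when it reports the "Running" state.

    `instance` is expected to behave like azureml.core.compute.ComputeInstance.
    Assumption: get_status() returns an object with a `state` string.
    Returns True if a stop was requested, False otherwise.
    """
    state = instance.get_status().state
    if state == "Running":
        # Block until the VM has actually shut down.
        instance.stop(wait_for_completion=True, show_output=True)
        return True
    return False
```

With the `instance` object retrieved above, `stop_if_running(instance)` would request a shutdown only when the VM is currently running.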

2 Azure Compute Cluster

An Azure Compute Cluster (officially, an Azure Machine Learning compute cluster) lets you distribute a training or batch inference job across a cluster of CPU or GPU compute nodes in the cloud.

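We usually do not create a Compute Cluster through the Azure Portal, because a cluster is typically created, or fetched, right before running a training job. Here we create it with the Azure ML Python SDK: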

from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.workspace import Workspace
cluster_name = "XXX"
ws = Workspace.from_config()
compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6",
    idle_seconds_before_scaledown=600,
    min_nodes=0,
    max_nodes=4,
)
compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute_target.wait_for_completion(
    show_output=True, min_node_count=None, timeout_in_minutes=20
)

Choose the GPU size according to your needs and the region you selected. Here we use Standard_NC6.
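
To see which GPU sizes are candidates, one option is to filter the list returned by `AmlCompute.supported_vmsizes`. The sketch below is illustrative only: `gpu_vm_sizes` is a hypothetical helper, and it assumes each entry is a dict with a `name` key and a `gpus` count, which should be verified against your SDK version.

```python
def gpu_vm_sizes(vm_sizes):
    """Return the sorted names of VM sizes that report at least one GPU.

    Assumption: each item is a dict with "name" and (optionally) "gpus" keys.
    """
    return sorted(s["name"] for s in vm_sizes if s.get("gpus", 0) > 0)

# Usage against a real workspace (requires azureml-core and a config.json):
# from azureml.core.compute import AmlCompute
# from azureml.core.workspace import Workspace
# ws = Workspace.from_config()
# print(gpu_vm_sizes(AmlCompute.supported_vmsizes(workspace=ws)))
```

The list only reflects sizes offered in the workspace region; whether you can actually provision one still depends on your quota.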

The following code retrieves an existing Compute Cluster:

from azureml.core.workspace import Workspace
ws = Workspace.from_config()
cluster_name = "XXX"
compute_target = ws.compute_targets[cluster_name]
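
Because a cluster is usually created or fetched right before a run, the two steps are often combined into a "get or create" helper. This is a minimal sketch under stated assumptions: `get_or_create_compute` and `create_fn` are hypothetical names, and the only workspace feature relied on is the `compute_targets` dictionary used above.

```python
def get_or_create_compute(ws, name, create_fn):
    """Return the compute target `name`, creating it only if it is absent.

    `ws.compute_targets` maps target names to target objects.
    `create_fn` is a hypothetical zero-argument callable that starts
    provisioning, e.g. a lambda wrapping ComputeTarget.create(...).
    """
    existing = ws.compute_targets
    if name in existing:
        return existing[name]
    target = create_fn()
    # Wait until provisioning has finished before handing the target back.
    target.wait_for_completion(show_output=True)
    return target

# Usage with the objects from the creation example above:
# compute_target = get_or_create_compute(
#     ws, cluster_name,
#     lambda: ComputeTarget.create(ws, cluster_name, compute_config),
# )
```

Passing the creation step as a callable keeps the helper independent of the compute type, so the same function could serve instances, clusters, and AKS targets.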

3 Azure Inference Cluster

We can deploy services to Kubernetes, including Azure Arc-enabled Kubernetes. A newly created Kubernetes cluster is listed under Azure Inference Cluster.

The code to create an AKS cluster is as follows:

from azureml.core.compute import ComputeTarget, AksCompute
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
# Choose a name for your cluster
aks_name = "XXX"
# Provision AKS cluster with a CPU machine
# 4 cores, 16GB RAM, 100GB storage, southcentralus
prov_config = AksCompute.provisioning_configuration(vm_size="Standard_D4_v3")
# Create the cluster
aks_target = ComputeTarget.create(
    workspace=ws, name=aks_name, provisioning_configuration=prov_config
)
aks_target.wait_for_completion(show_output=True)

In this example, we create an AKS cluster of CPU machines; the Standard_D4_v3 size provides 4 cores, 16 GB of RAM, and 100 GB of storage.

The following code retrieves an existing AKS cluster:

from azureml.core.compute import ComputeTarget
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
aks_name = "XXX"
aks_target = ComputeTarget(workspace=ws, name=aks_name)
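
To check which of the three kinds of compute already exist in a workspace, the targets can be grouped by their type. The sketch below is illustrative: `targets_by_type` is a hypothetical helper, and it assumes each target exposes a `type` string (values such as "ComputeInstance", "AmlCompute", and "AKS" are expected, but should be confirmed in your environment).

```python
from collections import defaultdict

def targets_by_type(ws):
    """Group the names of a workspace's compute targets by their type.

    Assumption: every value in `ws.compute_targets` has a `type` attribute.
    Returns a plain dict mapping type -> sorted list of target names.
    """
    groups = defaultdict(list)
    for name, target in ws.compute_targets.items():
        groups[target.type].append(name)
    return {kind: sorted(names) for kind, names in groups.items()}

# Usage against a real workspace (requires azureml-core and a config.json):
# from azureml.core.workspace import Workspace
# print(targets_by_type(Workspace.from_config()))
```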