Setting up AWS SageMaker with Terraform

What is AWS SageMaker?

AWS SageMaker is a fully managed service that acts as a one-stop shop for the tools needed to set up machine learning development environments, run experiments and deploy models.

The biggest benefit is that it removes the need to set up several separate tools, such as Kubeflow and MLflow, along with all the boilerplate needed to make them work together in an MLOps workflow.

In this article, I'll explain how to set up a bare-bones development environment in SageMaker, using Terraform to orchestrate the setup.

Basic setup

It all starts with a main.tf file.

terraform {
	required_providers {
		aws = {
			source = "hashicorp/aws"
			version = "5.38.0"
		}
	}
}

### Everything else is down here
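The terraform block above only pins the provider version, so the provider itself still needs to know which region to use. A minimal sketch, assuming eu-central-1 (chosen only because the subnet later in this article lives in eu-central-1c) and assuming credentials come from your environment or shared AWS config:

```hcl
# Assumption: eu-central-1 matches the availability zone used further down.
# Substitute your own region.
provider "aws" {
  region = "eu-central-1"
}
```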

There are two SageMaker-specific resources that we are going to set up: a domain and a user profile.

But before we can do that, we need to set up some boilerplate, namely a VPC, a subnet, a security group and a role.

Let's start with a role.

// BEGIN: Create a simple role
resource "aws_iam_role" "simple_role" {
	name = "simple_role"
	path = "/"
	assume_role_policy = data.aws_iam_policy_document.simple_role.json
	managed_policy_arns = ["arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"]
}

data "aws_iam_policy_document" "simple_role" {
	statement {
		actions = ["sts:AssumeRole"]
		principals {
			type = "Service"
			identifiers = ["sagemaker.amazonaws.com"]
		}
	}
}
// END: Create a simple role

This is a very simplistic role. For a production environment, you will need to apply the principle of least privilege.
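As a rough illustration of what narrowing the role could look like, here is a sketch of an inline policy that grants only a handful of Studio-related actions. The resource names (studio_minimal) and the chosen action list are assumptions made for this example, and the list is almost certainly not sufficient for a real workload. To actually narrow the role you would also drop the AmazonSageMakerFullAccess line from the role above.

```hcl
# Hypothetical, intentionally small action list. Extend it to match what
# your users really need.
data "aws_iam_policy_document" "studio_minimal" {
  statement {
    sid    = "StudioAppLifecycle"
    effect = "Allow"
    actions = [
      "sagemaker:CreateApp",
      "sagemaker:DeleteApp",
      "sagemaker:DescribeApp",
      "sagemaker:DescribeDomain",
      "sagemaker:DescribeUserProfile",
      "sagemaker:ListApps",
    ]
    # Assumption: "*" keeps the sketch short. Scope this down to the ARNs
    # of your own domain and user profiles.
    resources = ["*"]
  }
}

# Attach the policy inline to the role defined above.
resource "aws_iam_role_policy" "studio_minimal" {
  name   = "studio-minimal"
  role   = aws_iam_role.simple_role.id
  policy = data.aws_iam_policy_document.studio_minimal.json
}
```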

Next, we add a VPC using the aws_default_vpc resource, which brings the default VPC that every account has in each region under Terraform management. If the default VPC was deleted, it will be recreated.

// BEGIN: Create a simple vpc
resource "aws_default_vpc" "simple_vpc" {
	tags = {
		Name = "Default VPC"
	}
}
// END: Create a simple vpc

Now that we have the VPC, we can add the subnet and the security group, which we are also going to set up using default resources.

// BEGIN: Create a simple subnet
resource "aws_default_subnet" "simple_subnet" {
	availability_zone = "eu-central-1c" # substitute an availability zone in your own region
	tags = {
		Name = "Default subnet for eu-central-1"
	}
}
// END: Create a simple subnet
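The hardcoded availability zone above ties the configuration to one region. A sketch of one possible alternative is to look up the zones of whatever region the provider is configured for and take the first one. The names available and first_az_subnet are made up for this example, and this resource would replace simple_subnet rather than sit next to it.

```hcl
# Look up the availability zones that are currently usable in the
# provider's region.
data "aws_availability_zones" "available" {
  state = "available"
}

# Hypothetical replacement for simple_subnet that needs no hardcoded zone.
resource "aws_default_subnet" "first_az_subnet" {
  availability_zone = data.aws_availability_zones.available.names[0]

  tags = {
    Name = "Default subnet for ${data.aws_availability_zones.available.names[0]}"
  }
}
```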

// BEGIN: Create a simple security group
resource "aws_default_security_group" "simple_security_group" {
	# The default security group is always named "default", so no name is set here.
	vpc_id = aws_default_vpc.simple_vpc.id

	# Allow all traffic between members of this security group.
	ingress {
		protocol = "-1"
		self = true
		from_port = 0
		to_port = 0
	}

	# Allow all outbound traffic, but only to addresses inside the subnet.
	egress {
		from_port = 0
		to_port = 0
		protocol = "-1"
		cidr_blocks = [aws_default_subnet.simple_subnet.cidr_block]
	}
}
// END: Create a simple security group

Again, these are only for show. In production, create proper security groups that don't allow everything to go in and out.
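For comparison, a dedicated security group might look roughly like the sketch below. The resource name studio, the name prefix and the choice of outbound HTTPS only are assumptions for illustration. Which ports you really need depends on your network setup.

```hcl
# Hypothetical dedicated security group, used instead of the default one.
resource "aws_security_group" "studio" {
  name_prefix = "sagemaker-studio-"
  description = "Example security group for SageMaker Studio"
  vpc_id      = aws_default_vpc.simple_vpc.id

  # Members of this group may talk to each other on any port.
  ingress {
    description = "Traffic between members of this group"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    self        = true
  }

  # Assumption: outbound HTTPS is all that is needed.
  egress {
    description = "Outbound HTTPS only"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # name_prefix generates a fresh name, so a replacement can be created
  # before the old group is destroyed.
  lifecycle {
    create_before_destroy = true
  }
}
```

A group like this could then be passed to the domain through the security_groups argument of default_user_settings.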

Now we can start with setting up SageMaker itself.

First, we need to create the domain. The domain is best thought of as a project folder: it's the top-level entity that contains all the users, apps (notebooks) and models that go into a machine learning project.

// BEGIN: Create a SageMaker Domain
resource "aws_sagemaker_domain" "simple_domain" {
	domain_name = "simple-domain"
	auth_mode = "IAM"
	vpc_id = aws_default_vpc.simple_vpc.id
	subnet_ids = [aws_default_subnet.simple_subnet.id]
	
	default_user_settings {
		execution_role = aws_iam_role.simple_role.arn
	}
}
// END: Create a SageMaker Domain

Now that the domain is created, we can add a user profile.

The user profile is the SageMaker-level equivalent of an IAM user. It can use roles and be given exclusive access to certain resources within the SageMaker domain.

// BEGIN: Create a SageMaker UserProfile
resource "aws_sagemaker_user_profile" "simple_user_profile" {
	domain_id = aws_sagemaker_domain.simple_domain.id
	user_profile_name = "simple-user-profile"
	user_settings {
		execution_role = aws_iam_role.simple_role.arn
	}
}
// END: Create a SageMaker UserProfile

SageMaker Studio, now called SageMaker Studio Classic, is created automatically once the domain and user profile exist. The default instance type is ml.t3.medium. To use a larger one, you first have to request a quota increase for it in Service Quotas and then set the default instance type to the one you want. SageMaker supports both CPU and GPU instances, and the full list is in the SageMaker documentation.

When using those instances, make sure to assign them an image that is optimized for the workload. The managed SageMaker images are also listed in the SageMaker documentation.
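Setting a different default instance type and image can be sketched as an extra user profile with kernel_gateway_app_settings. The profile name, the ml.m5.large instance type and the kernel_image_arn variable are assumptions for this example. The image ARN is deliberately left as an input because it differs per region and image.

```hcl
# Hypothetical input: the ARN of the SageMaker image to use. No default is
# given on purpose, since the value depends on your region and image.
variable "kernel_image_arn" {
  description = "ARN of the SageMaker image for the kernel gateway app"
  type        = string
}

# Hypothetical second profile whose notebooks default to a larger instance.
resource "aws_sagemaker_user_profile" "large_instance_profile" {
  domain_id         = aws_sagemaker_domain.simple_domain.id
  user_profile_name = "large-instance-profile"

  user_settings {
    execution_role = aws_iam_role.simple_role.arn

    kernel_gateway_app_settings {
      default_resource_spec {
        # Assumption: your account's quota already allows ml.m5.large.
        instance_type       = "ml.m5.large"
        sagemaker_image_arn = var.kernel_image_arn
      }
    }
  }
}
```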

The SageMaker > Studio Classic console view should look like this.

[Screenshot: Studio Classic]

Now that we've set up all of the basics, we can spin everything up using Terraform.

First, we need to initialize the project.

terraform init

Then format it.

terraform fmt

Run plan to preview the changes.

terraform plan

Apply to create the resources.

terraform apply
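To see what was created without opening the console, a few output blocks can be added to main.tf. This is a small sketch and the output names are my own choice. The values are printed at the end of terraform apply.

```hcl
# ID of the SageMaker domain created above.
output "sagemaker_domain_id" {
  value = aws_sagemaker_domain.simple_domain.id
}

# URL of the domain.
output "sagemaker_domain_url" {
  value = aws_sagemaker_domain.simple_domain.url
}

# ARN of the user profile created above.
output "sagemaker_user_profile_arn" {
  value = aws_sagemaker_user_profile.simple_user_profile.arn
}
```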

To clean up the project, run the destroy command.

terraform destroy

Notes:

The AWS provider used in this tutorial wasn't very stable, so it's entirely possible that the code above won't work in a couple of weeks.
I'll check in next month to see if a more stable version exists by that time.