Perfect Spot Instance’s Imperfections | part-II

Hello friends, if you are reading this blog, I assume you have gone through the first part of this series. If you haven't, I suggest going through that link before reading this one.

Now let's recall the concept from the first part that we are going to implement here.

We will create all the components related to this project (as shown in the figure above). We will also go through how to build a Spot Fleet request wisely, choosing parameters that fit our purpose and are less prone to interruption. I assume that you have already created a VPC, subnets (at least two), an Internet Gateway associated with the VPC, a Target Group, an AMI (with an nginx server running on port 80), a Launch Configuration using that AMI, an ASG built from that launch configuration (with tags for the On-Demand instance and the target group attached), a Load Balancer (listening on HTTP and forwarding to the target group), and optionally a Route 53 record mapping an address to the Load Balancer's DNS name.

I have made my servers public; you can make them private if you want. Next we are going to create the IAM roles, the Spot Fleet request, the Lambda functions, and then the CloudWatch rules.

 

Let's create the IAM Roles

We are going to create two IAM roles: one for Lambda and the other for the Spot Fleet request.

  1. Go to the IAM console. Select Roles from the left-side navigation pane.
  2. Click on Create role and follow the screenshot below.

Here we are creating the role for the Lambda function.

  3. Click on Next: Permissions.
  4. Check AdministratorAccess to allow Lambda to access all AWS services.
  5. Click on Next: Tags.
  6. Add tags if you want, then click on Next: Review.
  7. Type your role name and then click on Create role. The role is created now.
  8. Now come back to the IAM console and select Roles from the navigation pane.
  9. Click on Create role and then follow the screenshot below.

This role is for the Spot Fleet request.

  10. Follow steps 3 to 7 above to finish creating this role as well.

So by this point we have two IAM roles; let's call them Lambda_role and Fleet_role.
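
If you prefer to script this instead of clicking through the console, here is a minimal boto3 sketch of the same two roles. It is only a sketch under the article's assumptions: the role names Lambda_role and Fleet_role and the broad AdministratorAccess policy simply mirror the console steps above, and in practice you may want tighter permissions.

import json
import boto3

iam = boto3.client('iam')

# Trust policy letting the Lambda service assume the role
lambda_trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(RoleName='Lambda_role',
                AssumeRolePolicyDocument=json.dumps(lambda_trust))
iam.attach_role_policy(RoleName='Lambda_role',
                       PolicyArn='arn:aws:iam::aws:policy/AdministratorAccess')

# Trust policy letting the Spot Fleet service assume the role
fleet_trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "spotfleet.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(RoleName='Fleet_role',
                AssumeRolePolicyDocument=json.dumps(fleet_trust))
iam.attach_role_policy(RoleName='Fleet_role',
                       PolicyArn='arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole')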

Now we come to the interesting part: placing the Spot Fleet request and going through every related factor that lets us customize the request wisely according to our needs.

Create Spot Fleet Request 


Move to the EC2 dashboard and click on Spot Requests. Then click on Request Spot Instances.

  1. You will see something like the screenshot below:

Load Balancing Workloads: For instances of the same size in any Availability Zone.
Flexible Workloads: For instances of any size in any Availability Zone.
Big Data Workloads: For instances of the same size in a single Availability Zone.

Decide among these options based on your requirements; if you are just learning, leave the choice at the default.

The last option, 'Defined duration workloads', is a bit different. It offers a way to reserve Spot Instances for a fixed duration, with choices ranging from 1 to 6 hours. AWS guarantees that you will not be interrupted during the duration you opted for, and because of that guarantee you pay a slightly higher price than for regular Spot Instances.

So, for this option AWS has another pricing category called Spot Block. Under this model, pricing is based on the requested duration and the available capacity, and is typically 30% to 45% less than On-Demand, with an additional 5% off during non-peak hours for the region. Observe the differences below.

Let us start with a brief price comparison across these categories for a few instance types.

Instance Type | Spot Instance Price | Spot Block Price (1 hour) | Spot Block Price (6 hours)
a1.medium     | $0.0084 per Hour    | $0.012 per Hour           | $0.016 per Hour
a1.large      | $0.0168 per Hour    | $0.025 per Hour           | $0.033 per Hour
c5.large      | $0.0392 per Hour    | $0.047 per Hour           | $0.059 per Hour
c1.medium     | $0.013 per Hour     | $0.065 per Hour           | $0.085 per Hour
t2.micro      | $0.0035 per Hour    | $0.005 per Hour           | $0.007 per Hour
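
To put the table in perspective, here is a quick back-of-the-envelope savings calculation. The On-Demand price used for t2.micro ($0.0116 per hour, us-east-1) is my assumption; plug in the current On-Demand price for your own region and instance type.

on_demand = 0.0116   # assumed us-east-1 On-Demand price for t2.micro, $/hour
spot      = 0.0035   # Spot price from the table above
block_6h  = 0.007    # 6-hour Spot Block price from the table above

print(f"Spot saving vs On-Demand:       {100 * (1 - spot / on_demand):.0f}%")      # ~70%
print(f"Spot Block saving vs On-Demand: {100 * (1 - block_6h / on_demand):.0f}%")  # ~40%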

2. Next we will configure our spot instances.

Launch Template: If you have a launch template you can select it here. One advantage of using a launch template is that you get the option to run part of your total capacity as On-Demand instances. If you don't have a launch template, you can go with an AMI and specify all the other parameters.
AMI: Don't have a launch template? Choose an AMI, but with this option you won't be able to run part of your total capacity as On-Demand instances.
Minimum Compute Unit: Specify how much capacity you need, either in terms of vCPUs and memory or as instance types. As the name suggests, this is the minimum capacity we need for our purpose; AWS will choose similar instances based on this option.
Then you will have options to choose the VPC, Availability Zone and key pair name. Under additional configuration you can choose security groups, an IAM instance profile, user data, tags and many more settings.

3. Next section is for defining target capacity.

Total target capacity: Specify how much capacity is needed. If you have chosen a launch template, you can also specify how much of the total capacity you want as On-Demand.
Maintain target capacity: When this is selected, AWS will always maintain your target capacity. Suppose AWS takes one of your instances back; under this option it automatically places a request for one Spot Instance to restore the target capacity. Selecting this option also lets you modify the target capacity after the Spot Fleet request has been created.
Maintain target cost for Spot: This optional setting lets you cap the maximum hourly amount you want to pay for your Spot Instances. (A short sketch of how these settings appear in the API follows this list.)
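
For reference, here is a minimal sketch of how these console settings map onto the keys of a boto3 SpotFleetRequestConfig; the values are placeholders, and OnDemandTargetCapacity is only honoured when the request is built from a launch template.

# Capacity-related keys of a SpotFleetRequestConfig (placeholder values)
capacity_settings = {
    'TargetCapacity': 2,           # "Total target capacity"
    'OnDemandTargetCapacity': 0,   # On-Demand portion of the capacity (launch template only)
    'Type': 'maintain',            # "Maintain target capacity": replace interrupted instances
}

A complete request_spot_fleet call that uses these keys is sketched after section 5 below.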

4. Next section is about customizing your spot fleet request.

If you don't care much about how AWS fulfils your request, leave everything at the defaults and proceed to the next step; otherwise uncheck the 'Apply recommendations' field, which will look like the screenshot below:

Since I chose c3.large as the minimum compute unit, AWS picks similar instances like c3.large, or even instances with more memory than specified (but never less) at roughly similar prices. You can also add more instance types by clicking Select instance types; this strengthens your probability of getting Spot Instances. Now let's see how!

Suppose you asked for only the t2.small instance type. You will get it only when that instance is available in the Availability Zone you specified. So, to increase your chances, specify as many Availability Zones and as many instance types as your workload allows. This increases the number of pools and ultimately the chances of success.

     Picture showing the instance pool an Availability Zone can contain.

Instance Pools: You can think of a pool as a bag full of the same instance type.
Every Availability Zone contains instance pools. Suppose you choose two Availability Zones and a total of three instance types; you then have 6 instance pools, three from each Availability Zone. The more instance pools, the better your chance of getting Spot Instances.
Fleet allocation strategy: AWS lets you choose the allocation strategy, i.e., how your capacity is going to be fulfilled. One thing to keep in mind is that AWS will try to distribute your capacity evenly across the Availability Zones you specified.
Lowest price: Instances come from the pools with the lowest price. Suppose you choose 2 Availability Zones and your capacity is 6; then 3 instances from each Availability Zone come from the lowest-priced pools.
Diversified across n number of pools: Suppose your capacity is 20, you selected 2 Availability Zones, and you chose diversified across 2 instance pools. Then 10 instances come from each Availability Zone, restricted to 2 pools per zone to fulfil those 10. This keeps at least one pool available for you even if the other pool runs dry, reducing the risk of interruption.
Capacity Optimized: This option provides instances from the pools with the most spare capacity. Suppose your capacity is 20 and 2 Availability Zones are selected; then 10 instances from each Availability Zone come from the pools that are most likely to remain available. (A small sketch of these strategies in API form follows below.)
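
Here is a tiny sketch of the pool arithmetic and of the allocation-strategy values the RequestSpotFleet API accepts; the zones and instance types are only illustrative.

# Number of Spot capacity pools = Availability Zones x instance types
availability_zones = ['us-east-1a', 'us-east-1b']
instance_types     = ['c3.large', 'c4.large', 'c5.large']
print(len(availability_zones) * len(instance_types), 'instance pools')   # 6

# Allocation strategy values accepted by the RequestSpotFleet API:
#   'lowestPrice', 'diversified', 'capacityOptimized'
# InstancePoolsToUseCount spreads capacity over the N cheapest pools
# and is used together with the lowestPrice strategy.
strategy_fragment = {
    'AllocationStrategy': 'lowestPrice',
    'InstancePoolsToUseCount': 2,
}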

5. Next section is for choosing the price and other additional configurations.

First of all, uncheck Apply defaults. Then customize your additional settings under these heads:

IAM fleet role: This role allows the fleet to tag your Spot Instances. Choose the default one or create your own.
Maximum price: Use the default maximum price (the On-Demand price) or set your own maximum price per instance hour.
Next you can specify your Spot Fleet request's validity period. Check Terminate instances on request expire.
Load balancing: If you check Receive traffic from one or more load balancers, you can choose the Classic Load Balancer or the target group under which you want to launch your instances. (A combined boto3 sketch of these settings follows below.)
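
Pulling sections 3 to 5 together, here is a hedged boto3 sketch of a complete request_spot_fleet call. Every concrete value (the role ARN, AMI id, key pair, subnets, security group, target group ARN and prices) is a placeholder for illustration; substitute your own.

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')

response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        'IamFleetRole': 'arn:aws:iam::123456789012:role/Fleet_role',    # placeholder account/role
        'AllocationStrategy': 'diversified',
        'TargetCapacity': 2,
        'Type': 'maintain',                                             # keep replacing interrupted instances
        'SpotPrice': '0.0116',                                          # optional max price per instance hour
        'ValidUntil': datetime.now(timezone.utc) + timedelta(days=365),
        'TerminateInstancesWithExpiration': True,                       # "Terminate instances on request expire"
        'LaunchSpecifications': [{
            'ImageId': 'ami-0123456789abcdef0',                         # placeholder nginx AMI
            'InstanceType': 'c3.large',
            'KeyName': 'my-key',                                        # placeholder key pair
            'SubnetId': 'subnet-aaaa1111,subnet-bbbb2222',              # one or more subnets
            'SecurityGroups': [{'GroupId': 'sg-0123456789abcdef0'}],    # placeholder security group
            'TagSpecifications': [{
                'ResourceType': 'instance',
                'Tags': [{'Key': 'Name', 'Value': 'fleet'}]             # tag checked by the Lambda code later
            }]
        }],
        'LoadBalancersConfig': {
            'TargetGroupsConfig': {
                'TargetGroups': [{
                    'Arn': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef'  # placeholder
                }]
            }
        }
    }
)
print(response['SpotFleetRequestId'])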

This is how you can customize a Spot Fleet request based on your requirements. Lastly, I advise you to visit the link below when deciding on suitable instance types; it will help you estimate your savings.

https://aws.amazon.com/ec2/spot/instance-advisor/

 

Let's create the Lambda functions for our purpose

We are going to create two Lambda functions. Follow the steps below:

  1. Have an IAM role ready that allows the Lambda function to modify the ASG. I prefer keeping a role with admin permissions ready (the Lambda_role we created earlier), because we are going to need it many times throughout this project.
  2. Head to the AWS Lambda console. In the left-side navigation pane click on Functions.
  3. Click on the Create function tab.
  4. Leave the default selection Author from scratch.
  5. Enter a function name of your choice.
  6. Select Python 3.7 as the Runtime.
  7. Expand Choose or create an execution role and select the role you created for this project.
  8. Click on Create function.
  9. Scroll down and put the code below into lambda_function.py.

    Copy code from below

import json
import boto3

def lambda_handler(event, context):
    # Set the desired capacity of the ASG named 'spot' to 1 so that
    # an On-Demand instance is launched as a stand-in.
    client = boto3.client('autoscaling')
    response = client.set_desired_capacity(
        AutoScalingGroupName='spot',   # replace with your ASG name
        DesiredCapacity=1
    )

This code increases the desired capacity of the ASG named 'spot' to 1.

It is triggered when AWS issues an interruption notice. As a result, an On-Demand instance is launched to maintain the total capacity of two.


10. In the upper-right corner click on Save to save the function, then go back to the AWS Lambda console, select Functions once more, and click Create function again.

11. Give your function a name and the same execution role as before, and select Create function; this second function holds the code that always keeps the Spot capacity at two.

12. Refer to the following snap.

Copy code from below

import boto3         ## Python sdk
import json

## If the number of running spot instances is >= 2 and an on-demand instance
## is running, set the ASG desired capacity to 0 to terminate the on-demand instance.
def asg_cap(fleet, od):
    print('in function',fleet)
    print('in function',od)
    if fleet >= 2 and od > 0:
        client = boto3.client('autoscaling')
        response = client.set_desired_capacity(
            AutoScalingGroupName='spot',
            DesiredCapacity=0
        )

##Beginning of the execution
def lambda_handler(event, context):
    cnt = 0
    ec3 = boto3.resource('ec2')
    fleet = 0
    od = 0
    instancename = []
    fleet_ltime = []
    od_ltime = []
    for instance in ec3.instances.all():    ##looping all instances
        print (instance.id)
        print (instance.state)
        print (instance.tags)
        print (instance.launch_time)
        abc = instance.tags                ##get tags of all instances
        ab = instance.state                ##get state of all instances
        print (ab['Name'])
        if ab['Name'] == 'running':        ## checks for the instances whose state is running 
            cnt += 1
            for tags in abc:
                if tags["Key"] == 'Name':  ## checks for tag key is 'Name'
                    instancename.append(tags["Value"])
                    inst = tags["Value"]
                    print (inst)
                    if inst == 'fleet':    ## checks if tag key 'Name' has value 'fleet'. Change 'fleet' to your own tag name       
                        fleet += 1
                        fleet_ltime.append(instance.launch_time)
                    if inst == 'Test':     ## checks if tag key 'Name' has value 'Test'. Change 'Test' to your own tag name
                        od += 1
                        od_ltime.append(instance.launch_time)
                    
    print('Total number of running instances: ', cnt)
    print(instancename)
    print('Number of spot instances: ', fleet)
    print('Number of on-demand instances: ', od)
    print('Launch time of Fleet: ', fleet_ltime)
    print('Launch time of on-demand: ', od_ltime)
    
    if od > 0:
        dt_od = od_ltime[0]
    else:
        dt_od = '0'
        
    if fleet > 1:
        dt_spot = fleet_ltime[0]
        dt_spot1 = fleet_ltime[1]
    elif (fleet > 0) and (fleet < 2):
        dt_spot = fleet_ltime[0]
        dt_spot1 = '0'
    else:
        dt_spot = '0'
        dt_spot1 = '0'
        
        
    if dt_od != '0':
        if dt_spot != '0':
            if dt_od > dt_spot:
                if dt_spot1 != '0':
                    if dt_od > dt_spot1:
                        print('On-Demand instance is Launched')
                        # do nothing
                    else:
                        print('Spot instance is Launched')
                        asg_cap(fleet, od)
                else:
                    print('Only 1 spot instance exist')
            else:
                print('Spot instance is Launched')
                asg_cap(fleet, od)
        else:
            print('No spot instance exist')
    else:
        print('No On-Demand instance exist')
        
        
    ## modify the spot fleet request capacity to two    
    client1 = boto3.client('ec2')
    response = client1.modify_spot_fleet_request(
        SpotFleetRequestId='sfr-92b7b2f1-163b-498a-ae7c-7bd1b4fdb227', ## replace with your spot fleet request id
        TargetCapacity=2
    )

13. Save this function.

So far we have created two Lambda functions. Now we are going to create CloudWatch rules that will call these Lambda functions on our behalf: one on a Spot interruption notice and one when an EC2 instance's state changes to running.

Let's create CloudWatch Rules.

Steps to create CloudWatch rules:

  1. Go to the CloudWatch console.
  2. Select Rules from the left-side navigation pane.
  3. Click on Create rule.
  4. Follow the screenshot below.

We are creating the first rule here to trigger on the interruption notice issued by AWS.

In the Targets area, select Lambda function and then pick the first function, the one that increases the desired capacity of the ASG to 1.
Save the first rule.
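
If you want to create the same rule without the console, here is a minimal boto3 sketch; the rule name and Lambda ARN are placeholders. The detail type 'EC2 Spot Instance Interruption Warning' is the event emitted two minutes before a Spot Instance is reclaimed.

import json
import boto3

events = boto3.client('events')

# Rule that fires when a Spot Instance is about to be reclaimed
events.put_rule(
    Name='spot-interruption-warning',                      # placeholder name
    EventPattern=json.dumps({
        'source': ['aws.ec2'],
        'detail-type': ['EC2 Spot Instance Interruption Warning']
    })
)

# Point the rule at the first Lambda function (the one that sets the ASG desired capacity to 1)
events.put_targets(
    Rule='spot-interruption-warning',
    Targets=[{'Id': '1',
              'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:first_function'}]  # placeholder ARN
)

Note that outside the console you also have to allow CloudWatch Events to invoke the function (the Lambda add_permission call); the console wires that up for you automatically.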

  1. Now let's create another CloudWatch rule.
  2. Click Create rule and follow the screenshot.

After this, add a target of type Lambda function and set the function name to the second, slightly lengthier function we created.

Save the second Rule.
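
The second rule is the same idea, keyed to the instance state-change event instead; again a sketch with placeholder names.

import json
import boto3

events = boto3.client('events')

# Rule that fires whenever an EC2 instance enters the 'running' state
events.put_rule(
    Name='instance-running',                               # placeholder name
    EventPattern=json.dumps({
        'source': ['aws.ec2'],
        'detail-type': ['EC2 Instance State-change Notification'],
        'detail': {'state': ['running']}
    })
)

events.put_targets(
    Rule='instance-running',
    Targets=[{'Id': '1',
              'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:second_function'}]  # placeholder ARN
)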

Now, to verify this automation, go to the Spot Fleet request you created with a target capacity of two. Select that request, click the Actions tab, click Modify capacity, and replace 2 with 1. This will terminate one Spot Instance, and before doing so AWS will send an interruption notice. Observe the changes on the Auto Scaling group, instance, and Spot request dashboards. Wait a couple of minutes; if everything goes according to plan, you will again have two Spot Instances under your belt.
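
If you prefer to trigger this test from code rather than the console, the same capacity change can be made with boto3; the request id below is a placeholder.

import boto3

ec2 = boto3.client('ec2')

# Drop the fleet's target capacity from 2 to 1; AWS sends an interruption
# notice before terminating the surplus Spot Instance.
ec2.modify_spot_fleet_request(
    SpotFleetRequestId='sfr-00000000-0000-0000-0000-000000000000',   # replace with your request id
    TargetCapacity=1
)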

If you do not end up with two Spot Instances, something is not right. Cross-check the following:

  1. Check the name of the Auto Scaling group you created. Copy its name, go to the first Lambda function you created (the one that increases the desired capacity of the ASG to 1), and check whether it contains the correct Auto Scaling group name; if not, paste the name you copied into the AutoScalingGroupName field.
  2. Check the tag Name value of the Spot Instances and note it somewhere. Also check the tag Name of the On-Demand instance that you configured while creating the Auto Scaling group and note that too. Now open the second Lambda function and go to line 40, where the tag Name value of the Spot Instance is given in single quotes; check that it matches yours. Then at line 43 check that the tag Name of the On-Demand instance matches yours.
  3. Go to Spot Requests, copy the Spot Fleet request id you created, go to line 94 of the second Lambda function, and make sure the id in single quotes matches yours.

Now test again; hopefully it will work this time. If you are still facing problems, or you have not created all of the above manually, there is no need to worry.

I have created Terraform code that will create the whole infrastructure needed for this project. After running the Terraform code successfully, you will have to make a few changes so that your infrastructure functions properly; for that, follow the steps below. Here is the link to the GitHub repository; clone the repo and then follow the steps stated below:

Link: https://github.com/sah-shatrujeet/infra_spot_fleet_terraform.git

  1. Make sure you have Terraform version 0.12.8 installed.
  2. You must have the AWS CLI configured too.
  3. Before running the Terraform code, go to the folder where you cloned the repo, then go to infra-spot/infra/infra and open vars.tf in your favourite editor.
  4. Go through the files and change the default values as per your choice.
  5. Run terraform apply from inside the infra-spot/infra/infra folder. Then follow the last three steps to make sure your infrastructure will automate properly.
  6. Go to the Lambda console, select first_function, and check whether the code contains the same Auto Scaling group name as the one created by Terraform. If not, update it to match yours.
  7. In the Lambda console select second_function and repeat the previous step; also check the tag Name of the On-Demand and Spot Instances. They are 'Test' and 'fleet' respectively in the Terraform code, so make sure the tag Names of your On-Demand and Spot Instances match the ones mentioned in the code.
  8. Lastly, head to Spot Requests, note the request id, and match it with the SpotFleetRequestId in second_function.

I believe Terraform will do everything right for you. If you are still facing problems, I will be happy to resolve your queries.

 

Good to know

AWS reports show that the average frequency of interruptions across all Regions and instance types is below 5%.

For any instance type, the On-Demand price is the default maximum price, and you can set a maximum price of up to 10 times the On-Demand price.

When you implement Spot Instance automation for your own project, you will come across different scenarios; you might need to monitor more events and trigger actions based on them. Unfortunately, CloudWatch does not cover every AWS event, but AWS CloudTrail solves this: CloudTrail records everything you do on AWS. To use it with CloudWatch, enable CloudTrail and then create a rule with service EC2 and event type AWS API Call via CloudTrail, adding any specific operation that is not available as a native CloudWatch event. However, I recommend you first go through the details of CloudTrail before implementing this. If you have any queries implementing CloudTrail, you can ask in the comment section. I will be happy to help you.
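
As an illustration, an event pattern for such a rule can look like the sketch below; the eventName shown (CancelSpotFleetRequests) is just an example of an EC2 API call you might want to react to.

import json

# CloudWatch Events pattern for an EC2 API call recorded by CloudTrail
pattern = {
    'source': ['aws.ec2'],
    'detail-type': ['AWS API Call via CloudTrail'],
    'detail': {
        'eventSource': ['ec2.amazonaws.com'],
        'eventName': ['CancelSpotFleetRequests']   # example API call; pick the operation you care about
    }
}
print(json.dumps(pattern, indent=2))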

While implementing Spot Instances for a database, you can configure your Spot Instances so that the instance's volume is not deleted upon Spot Instance termination, ensuring that you are not going to lose any data.
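
In a Spot Fleet launch specification this comes down to the EBS DeleteOnTermination flag; below is a minimal sketch of the relevant block device mapping, with a hypothetical device name and size.

# Block device mapping for a launch specification: keep the data volume
# around even after the Spot Instance is terminated.
block_device_mappings = [{
    'DeviceName': '/dev/sdf',                 # hypothetical data volume
    'Ebs': {
        'VolumeSize': 100,                    # GiB
        'VolumeType': 'gp2',
        'DeleteOnTermination': False          # volume survives Spot termination
    }
}]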

 

Conclusion

With smart automation and monitoring we can run our production servers on Spot Instances with guaranteed failover and high availability. However, anyone who does not want to take on any risk, or who lacks the idea or resources to automate around interruptions, can plan to run:

  1. Half, or some portion, of the total production servers on Spot Instances.
  2. Development servers on Spot Instances without any worry.
  3. QA servers on Spot Instances.
  4. Extra capacity on Spot Instances to ease the load on the main servers.
  5. Irregular short-term tasks on Spot Instances.

 

 

17 thoughts on “Perfect Spot Instance’s Imperfections | part-II”

      1. Thanks cmtopinka, you can post your queries or issues here too. I will be happy to help.

      2. I saw you closed your issue on github, you can post your issues you have here too.

  1. Working on experimenting with it as in v0.11 and also upgrading to current 0.12+. Need to make it workspace capable and insert terraform.workspace into names and tags. Not sure what’s going on with the issues I found during upgrade. I closed them because only an issue for upgrade. If I would suggest another issue/feature it would be more generally upgrade for 0.12+ and insert idea of workspace envs. But that’s not something I expect you to work on. When we get it working I’ll submit a pr if that’s ok. Or I can put in the issue and maybe we collaborate on it. Not sure how much commitment we have yet for this. It’s something we are interested in though. Excellent post by the way. I did work on a diagram for it if you are interested I can share it. Thanks!

  2. Something like this can work

    ```
    data "aws_ami" "ubuntu" {
      most_recent = true

      filter {
        name   = "name"
        values = ["ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-*"]
      }

      filter {
        name   = "virtualization-type"
        values = ["hvm"]
      }

      owners = ["099720109477"] # Canonical
    }
    ```

  3. Yes that works. Getting closer now. Currently have this error: (sorry no code blocks here?)

    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (8m30s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (8m40s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (8m50s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (9m0s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (9m10s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (9m20s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (9m30s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (9m40s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (9m50s elapsed)
    module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: Still creating… (10m0s elapsed)

    Error: Error applying plan:

    1 error occurred:
    * module.spot_fleet_request.aws_spot_fleet_request.cheap_compute: 1 error occurred:
    * aws_spot_fleet_request.cheap_compute: Error requesting spot fleet: Error creating Spot fleet request, retrying: InvalidSpotFleetRequestConfig: Parameter: SpotFleetRequestConfig.ValidUntil is invalid.
    status code: 400, request id: 882f7ff7-8c35-4149-9e43-6e8caeaa213e

  4. I think this will work

    valid_until = "${timeadd(timestamp(), "24h")}"

    Is 24h reasonable?

    A more difficult problem…

    On subsequent applies I’m getting

    Error: Error refreshing state: 1 error occurred:
    * module.archivetozip1.data.archive_file.zip1: 1 error occurred:
    * module.archivetozip1.data.archive_file.zip1: data.archive_file.zip1: error archiving file: could not archive missing file: od_running.py

    attempting destroy at that point results in same error. Workaround is create an empty file and then destroy, then remove the empty file. What’s a fix for this? Thanks

  5. I’d also be interested in hearing how you went about putting this together? Spot fleet first and then built incrementally? Were you able to borrow some pieces. Would be a good story in maybe a 3rd post if you can recall.

  6. !!!!!

    module.listner.aws_lb_listener.alb_listing: Creation complete after 0s (ID: arn:aws:elasticloadbalancing:us-east-1:…b-sf/ab386c6d7405898b/245bd0f9d005d934)

    Apply complete! Resources: 33 added, 0 changed, 0 destroyed.

    !!!!!!!!

    1. Excellent. I can see my spots now in the console. It appears they are the same region? Is it possible to spread them across multiple regions? Or is it good enough to have them in different availability zones?

      How can I experiment with this to see the reserve on-demand instance come into action? Is it possible to take down a spot instance intentionally that will trigger the rule?

      1. Going overboard on questions here… Would it make sense to spread the fleet across regions or availability zones? Is it more likely that multiple instances in the same region/zone could be interrupted at the same time due to capacity? And do I care since I have the on-demand reserve? Maybe not. What’s been your experience with the frequency that an on-demand instance is needed?
