Wednesday, October 11, 2023

How fast can I launch multiple OCI compute instances using Java SDK? #JoelKallmanDay

When I saw Tim Hall's blog post about #JoelKallmanDay, it touched my heart. If you have any interest in Oracle APEX, you probably know who Joel Kallman was. He meant a lot to the community, and he is missed by people all around the world, including those who never met him face to face. I wish this blog post were about APEX; maybe next year...

This one has been waiting in my stash for a long time, as I wasn't happy with the dirty POC code and didn't have the time to refactor it. It is about a really niche and cool requirement, and it is the second part of something I've posted in the past.

So let me start with a little context. Can you imagine how the big e-commerce platforms get ready for their peak seasons? This is about a software house that specializes in load testing e-commerce applications. They have their own platform where e-commerce users prepare their test scenarios and launch hundreds of thousands of individual web agents to test the application for something like 30 minutes. Under the hood, the testing platform provisions tens (or hundreds) of compute instances, deploys the test code and runs it. Once the desired testing duration ends, all the compute instances are terminated. A perfect use case that is only possible in the cloud! This was one of the cool use cases I've seen. Although the application is polyglot, they chose Java to code the instance creation part. So here we start.

My starting point is, as usual, the OCI online documentation for the SDK for Java. The documentation links to the Maven repository, so I can just include the dependencies in my POM file. There is also an Oracle GitHub repository with a quick start, installation instructions and examples that will get me started in minutes. I quickly located the example code for creating a compute instance. The sample code is huge; it creates everything from scratch, not only the compute instance but also the VCN, subnet, gateways, etc. It is a comprehensive example, kudos to the team.

1. I need a very quick test of how fast I can create instances. So here is a simplified test code that gets all the required inputs from environment variables (region, AD, subnet, compartment, image, and shape identifiers that already exist). The original sample uses waiters, so I keep them just to see how convenient it is to wait for my instances to reach a certain state (running).

And if I just test it with 5 instances to be created, the output is:

-----------------------------------------------------------------------
ocid1.instance.oc1.uk-london-1.... created in 36244 ms
ocid1.instance.oc1.uk-london-1.... created in 32845 ms
ocid1.instance.oc1.uk-london-1.... created in 32102 ms
ocid1.instance.oc1.uk-london-1.... created in 31995 ms
ocid1.instance.oc1.uk-london-1.... created in 62075 ms
Total execution time in seconds: 196
-----------------------------------------------------------------------

I am provisioning instances one by one and waiting for each instance to transition into the RUNNING state. It took around 30 seconds to provision a compute instance and see it in the running state (the last one took about 60 seconds). Not bad at all. But this is not good enough: for extreme cases my customer needs tens of instances. Can we do better?

2. I don't think I need to wait for a compute instance to reach the running state before provisioning the next one. As long as I have the OCIDs of the instances, I can come back and check their state later.
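To make the idea concrete, here is a minimal sketch of that launch-first, check-later pattern. It is not the original test code and it does not call the OCI SDK: `FakeCompute`, `fakeCompute()` and the `fake-instance-` ids are hypothetical stand-ins, so the sketch runs on the JDK alone. In a real test the two interface methods would wrap the SDK's launch and get-instance calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of "launch now, check state later" against a stand-in compute client. */
class LaunchThenPoll {

    /** Hypothetical stand-in for a compute client; NOT part of the OCI SDK. */
    interface FakeCompute {
        String launch(String displayName);  // returns an instance id immediately
        String state(String instanceId);    // "PROVISIONING" or "RUNNING"
    }

    /** Phase 1: issue every launch request back to back and keep only the ids. */
    static List<String> launchAll(FakeCompute compute, int count) {
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            ids.add(compute.launch("test-" + i));
        }
        return ids;
    }

    /** Phase 2: poll until every id reports RUNNING or the attempts run out. */
    static boolean waitUntilRunning(FakeCompute compute, List<String> ids,
                                    int maxAttempts, long sleepMillis) {
        List<String> pending = new ArrayList<>(ids);
        for (int attempt = 0; attempt < maxAttempts && !pending.isEmpty(); attempt++) {
            // Drop every instance that has reached RUNNING since the last check.
            pending.removeIf(id -> "RUNNING".equals(compute.state(id)));
            if (!pending.isEmpty()) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return pending.isEmpty();
    }

    /** In-memory fake: an instance reports RUNNING from its third state check onwards. */
    static FakeCompute fakeCompute() {
        final Map<String, Integer> checks = new HashMap<>();
        return new FakeCompute() {
            @Override
            public String launch(String displayName) {
                String id = "fake-instance-" + displayName;
                checks.put(id, 0);
                return id;
            }

            @Override
            public String state(String instanceId) {
                int seen = checks.merge(instanceId, 1, Integer::sum);
                return seen >= 3 ? "RUNNING" : "PROVISIONING";
            }
        };
    }

    public static void main(String[] args) {
        FakeCompute compute = fakeCompute();
        List<String> ids = launchAll(compute, 10);
        boolean allRunning = waitUntilRunning(compute, ids, 5, 100);
        System.out.println(ids.size() + " instances requested, all running: " + allRunning);
    }
}
```

The point of splitting the work into two phases is that the launch loop only pays for the request round trip, while the waiting happens once for the whole batch instead of once per instance.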

This time, since I expect to wait less, I test it with 10 instances. Here is the output:

-----------------------------------------------------------------------
ocid1.instance.oc1.uk-london-1.... created in 2427 ms
ocid1.instance.oc1.uk-london-1.... created in 878 ms
ocid1.instance.oc1.uk-london-1.... created in 1041 ms
ocid1.instance.oc1.uk-london-1.... created in 982 ms
ocid1.instance.oc1.uk-london-1.... created in 971 ms
ocid1.instance.oc1.uk-london-1.... created in 772 ms
ocid1.instance.oc1.uk-london-1.... created in 743 ms
ocid1.instance.oc1.uk-london-1.... created in 754 ms
ocid1.instance.oc1.uk-london-1.... created in 972 ms
ocid1.instance.oc1.uk-london-1.... created in 812 ms
Total execution time in seconds: 12
-----------------------------------------------------------------------

This is a lot better: it is down to ~1 second per instance from 30 seconds per instance. I wonder if this can get any better. These are still synchronous calls, made one by one.

3. What happens if we make it asynchronous? For this purpose I am using AsyncHandler, which lets you supply callback functions. The compute client also takes a different form, ComputeAsyncClient, while the input stays the same. I do some concurrent processing with Futures, just to see whether the threads are done and to collect the compute instance OCIDs.
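The shape of that concurrent step can be sketched with the JDK alone. Note the swap: this uses `CompletableFuture` and a thread pool rather than the SDK's AsyncHandler and ComputeAsyncClient, and `slowFakeLaunch` with its `fake-ocid-` ids is a hypothetical stand-in for a launch request that takes roughly 200 ms. It is an illustration of the pattern, not the original test code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Sketch: fire all launch requests at once and collect the ids from futures. */
class AsyncLaunchSketch {

    /** Submits every launch to a pool, then waits for all futures; ids come back in submission order. */
    static List<String> launchConcurrently(Function<String, String> launcher, int count) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, count));
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (int i = 1; i <= count; i++) {
                final String name = "test-" + i;
                futures.add(CompletableFuture.supplyAsync(() -> launcher.apply(name), pool));
            }
            List<String> ids = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                ids.add(future.join());  // blocks until this particular request is done
            }
            return ids;
        } finally {
            pool.shutdown();
        }
    }

    /** Hypothetical launcher (NOT an SDK call) that pretends a request takes about 200 ms. */
    static String slowFakeLaunch(String displayName) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "fake-ocid-" + displayName;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> ids = launchConcurrently(AsyncLaunchSketch::slowFakeLaunch, 10);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ids.size() + " launches finished in " + elapsedMs + " ms");
    }
}
```

Because the sketch joins the futures in submission order, the returned list is ordered even though the requests themselves may finish in any order. A callback-based version, like the one described above, reports results in completion order instead.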

I again test it with 10 instances. Here is the output:

-----------------------------------------------------------------------
work requested in 391 ms
work requested in 14 ms
work requested in 10 ms
work requested in 9 ms
work requested in 7 ms
work requested in 7 ms
work requested in 5 ms
work requested in 6 ms
work requested in 7 ms
work requested in 4 ms
test-9 - ocid1.instance.oc1.uk-london-1....
test-10 - ocid1.instance.oc1.uk-london-1....
test-1 - ocid1.instance.oc1.uk-london-1....
test-4 - ocid1.instance.oc1.uk-london-1....
test-5 - ocid1.instance.oc1.uk-london-1....
test-6 - ocid1.instance.oc1.uk-london-1....
test-7 - ocid1.instance.oc1.uk-london-1....
test-8 - ocid1.instance.oc1.uk-london-1....
test-3 - ocid1.instance.oc1.uk-london-1....
test-2 - ocid1.instance.oc1.uk-london-1....
Total execution time in seconds: 2
-----------------------------------------------------------------------

As you can see from the output, there is no particular order: the calls are asynchronous, so the instances are created in whatever order the threads happen to execute. It is blazing fast, taking 2 seconds in total to create 10 instances!

Notes

What if I get greedy and try a larger batch? Then I get an error message because of request throttling protection.
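One common way to live with throttling is to retry the rejected request after an increasing delay. The following is only a sketch of that idea under stated assumptions: `ThrottledException` is a hypothetical marker for a "too many requests" response, not an OCI SDK class, and the delays are illustrative rather than recommended values.

```java
import java.util.function.Supplier;

/** Sketch: retry a throttled call with exponential backoff. */
class BackoffSketch {

    /** Hypothetical marker for a throttled request; NOT an OCI SDK class. */
    static class ThrottledException extends RuntimeException {
        ThrottledException(String message) {
            super(message);
        }
    }

    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, maxMillis);
    }

    /** Runs the call, retrying only on ThrottledException, up to maxAttempts calls in total. */
    static <T> T callWithBackoff(Supplier<T> call, int maxAttempts, long baseMillis, long maxMillis) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (ThrottledException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e;  // out of attempts: give up and surface the error
                }
                try {
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        String result = callWithBackoff(() -> {
            // Pretend the first two requests are throttled.
            if (calls[0]++ < 2) {
                throw new ThrottledException("too many requests");
            }
            return "launched";
        }, 5, 100, 2000);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Adding some random jitter to each delay would keep a large batch of retries from hitting the service at the same instant; it is left out here to keep the sketch short.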

Here is a little clean-up script that can be used during tests.




References:
1. OCI Documentation: SDK for Java
2. Oracle GitHub Repository: SDK for Java
3. Oracle GitHub Repository: CreateInstanceExample.java
4. Tutorial: java.util.concurrent.Future
5. OCI Documentation: Request Throttling
6. OCI Documentation: Finding Instances

Monday, October 9, 2023

Back to the basics: How to clone boot volume cross tenancy including Free Tier

In this blog I try to write about unusual things, though that is not always possible. I write mostly for myself, to remember what the solution was, and second to share it with friends and customers. This is one of the interesting ones.

The question is: "One of my ex-employees has a demo environment in his Free Tier tenancy (which means the seeded credits are already spent or expired) and I want to move the compute instance (Always Free Micro shape) to my paid company tenancy". If you take a close look at the documentation, you will find that a block volume can be replicated across data centers and regions. Volume backups are regional, but you can also copy them across regions. All of this is only possible within the tenancy, though. To be honest, this is strange, because some customers use OCI with Organizations, a parent/child relationship between their tenancies. But Free Tier is a blocker.

The next thing that comes to my mind is creating a custom image, then exporting and importing the image using Object Storage as explained here.

But as you can see, since it is now a Free Tier tenancy, we have neither the limit nor the motivation.

So while searching for an alternative and talking to the PM, I came across this undocumented feature. Basically, the solution playbook says that if you set up the proper policies in both tenancies (define the other tenancy and authorize it to access the resources), then you can use the CLI or API to clone a volume from one tenancy to another, or to restore a volume backup from one tenancy to the other. So here is what I did.

1. I created the following policy in the source Free Tier tenancy. The policy defines the target tenancy and authorizes a group in the target tenancy to clone a volume.

Define tenancy NewTenancy as $TARGET_TENANCY_OCID
Define group NewTenancyIdentityGroup as $TARGET_TENANCY_GROUP_OCID
Admit group NewTenancyIdentityGroup of tenancy NewTenancy to use volumes 
in tenancy where ANY { request.operation='CreateVolume', 
request.operation='GetVolume', request.operation='CreateBootVolume', 
request.operation='GetBootVolume' }

2. I created the following policy in the target tenancy. The policy defines the source tenancy and authorizes a group to clone a volume, very similar to the first one.

Define tenancy OldTenancy as $SOURCE_TENANCY_OCID
Endorse group NewTenancyIdentityGroup to use volumes 
in tenancy where ANY { request.operation='CreateVolume', 
request.operation='GetVolume', request.operation='CreateBootVolume', 
request.operation='GetBootVolume' }

3. Then I invoked the API through the CLI to clone the boot volume in the source tenancy ($BOOT_VOLUME_ID), with my profile connected to the target tenancy.

oci bv boot-volume create --profile=cross_tenancy_user_profile --debug \
--region=eu-frankfurt-1 --source-boot-volume-id $BOOT_VOLUME_ID  \
--display-name Cross-Tenancy-vm-e2micro-5 --compartment-id $COMPARTMENT_ID

Notes
1. Don't forget to specify the compartment in your CLI command.
2. Also make sure that the group you are using in your target tenancy, and the user behind the profile, can create block volumes.
3. If you get a 404 - NotAuthorizedOrNotFound error message, it is most likely related to your policies.
4. Policies are replicated from the home region to the other regions. If you are working in a region other than your home region, take that into consideration.
5. For the same AD use clone; for a different AD use backup and restore.
6. Although this seems to be the only way to copy a block volume out of a Free Tier tenancy without converting it to a paid tenancy, this feature can also be very useful for moving large boot volumes and for bulk operations that move multiple boot volumes. It will definitely be easier than using Object Storage to export/import images, which also has a size limitation.
7. Just imagine what other interesting use cases can be achieved with this admit/endorse policy setup.


References:
1. Solution Playbook: Migrate Oracle Cloud Infrastructure volume data across tenancies
2. OCI CLI Command Reference: boot-volume » create
3. OCI Block Volume Documentation: BYOI Best Practices
4. OCI Block Volume Documentation: Copy Block Volume

Wednesday, October 4, 2023

Back to the basics: Should I use security list or network security group or both to secure my OCI deployment?

Today I was on a customer call with a pretty straightforward scenario. The session turned hands-on pretty fast, and I love that: sharing the curiosity and eagerness to solve a problem with technical people, few things can match that feeling. And, as always, we came to the point where we had to delve into troubleshooting. So here we go...

The requirement is simple, deploy an Ubuntu server to host a demo application over HTTP port 80, a small VM in a public subnet with a public IP Address and supporting security rules. VCN is created with the wizard, and it comes with a Default Security List which is populated with 3 stateful ingress rules:

First rule enables SSH access to my host, the other two ICMP rules are there for debugging and they don't enable a ping response. All of them are stateful. And this is the Egress part:

There is one stateful egress rule which enables outgoing traffic to any destination with any protocol on any port. State will be important as we will find out later...

A security list is attached to a subnet and enforced on all VNICs in that subnet. So setting general rules with a security list makes sense; however, we also need to open HTTP port 80 for one server, and we don't want this for all servers in the subnet. For this purpose we use network security groups (NSGs), another type of virtual firewall that Oracle recommends over security lists. You can use security lists and network security groups together. How do the rules apply? At its simplest: the union of all rules is applied to the VNIC. A security list is tied to a subnet, so it applies to all VNICs in the subnet; an NSG is attached to individual VNICs, so it is granular. Here are the rules in our NSG:

The first rule allows incoming TCP traffic on port 80 from any source, and the second rule allows outgoing TCP traffic. Both rules are stateless, which means connection tracking is disabled. Why would I want that? Maybe I am expecting high traffic, or maybe I was greedy and wanted everything at once.

Overall architecture can be simplified like this:

So we SSH into our Ubuntu server using its public IP, and also add Linux firewall rules by updating iptables, as explained in detail in this tutorial.

All set. For a really quick and dirty test, let's run Python

And it works, but we quickly find out there is another problem. We can't access Ubuntu repositories to update the packages or install new ones. Although IPv6 in the error message is distracting, it doesn't work with IPv4 either. It is a problem with accessing the internet.

So after some debugging, we soon realize the problem is having overlapping stateful and stateless rules. Our stateful egress rule on the security list should be providing all the access we need towards the internet. But it doesn't. Why? Because our stateless egress rule in the NSG overlaps with it and overrides the security list rule, as stateless takes precedence over stateful. The outgoing connection is therefore not tracked, and since no ingress rule matches the response traffic, the replies are dropped. This is exactly what the documentation warns us about.

If for some reason you use both stateful and stateless rules, 
and there's traffic that matches both a stateful and stateless rule 
in a particular direction (for example, ingress), the stateless rule 
takes precedence and the connection is not tracked. You would need 
a corresponding rule in the other direction (for example, egress, 
either stateless or stateful) for the response traffic to be allowed.

Lessons learned
1. Use stateful rules (which is the default) unless there is a good reason to use stateless ones.
2. If using a stateless ingress rule, always match it exactly with an egress rule; don't use a broader rule.
3. Don't use overlapping stateless and stateful rules: the stateless rule takes precedence and the connection is not tracked, so the behavior differs from what you expect.

How did we fix it?
On our NSG, we converted the ingress rule from stateless to stateful and removed the egress rule, as it is not needed anymore.

If we wanted to keep using stateless rules, a viable solution would be to restrict the egress rule so that it exactly matches the ingress rule: protocol TCP, source port 80 and destination port any.
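The whole story can be replayed with a toy model. This is a deliberately simplified sketch, not how OCI evaluates rules internally: it only looks at TCP ports (no CIDRs, no ICMP), the `Rule` class and the three rule sets are my own illustration, and port 40000 stands in for whatever ephemeral port the server picks for an outgoing connection.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of overlapping stateful and stateless rules on one VNIC. */
class SecurityRuleSketch {

    /** Simplified TCP-only rule; a port of -1 means "any". */
    static class Rule {
        final boolean ingress;
        final boolean stateless;
        final int srcPort;
        final int dstPort;

        Rule(boolean ingress, boolean stateless, int srcPort, int dstPort) {
            this.ingress = ingress;
            this.stateless = stateless;
            this.srcPort = srcPort;
            this.dstPort = dstPort;
        }

        boolean matches(boolean ingressPacket, int src, int dst) {
            return ingress == ingressPacket
                    && (srcPort == -1 || srcPort == src)
                    && (dstPort == -1 || dstPort == dst);
        }
    }

    /** Can the server open an outbound TCP connection and also receive the reply? */
    static boolean outboundConnectionWorks(List<Rule> rules, int localPort, int remotePort) {
        boolean allowed = false;
        boolean statelessMatch = false;
        for (Rule rule : rules) {
            if (rule.matches(false, localPort, remotePort)) {
                allowed = true;
                statelessMatch |= rule.stateless;  // a stateless match wins over a stateful one
            }
        }
        if (!allowed) return false;        // the request is not allowed out at all
        if (!statelessMatch) return true;  // tracked connection: the reply is allowed back
        // Untracked connection: the reply needs its own ingress rule.
        for (Rule rule : rules) {
            if (rule.matches(true, remotePort, localPort)) return true;
        }
        return false;
    }

    // Security list used in all three cases: stateful SSH ingress, stateful egress to anywhere.
    static final Rule SL_SSH_IN = new Rule(true, false, -1, 22);
    static final Rule SL_ALL_OUT = new Rule(false, false, -1, -1);

    /** Original setup: NSG with stateless ingress on 80 and a broad stateless egress. */
    static List<Rule> brokenRules() {
        return Arrays.asList(SL_SSH_IN, SL_ALL_OUT,
                new Rule(true, true, -1, 80), new Rule(false, true, -1, -1));
    }

    /** The fix we applied: NSG ingress on 80 made stateful, NSG egress removed. */
    static List<Rule> fixedRules() {
        return Arrays.asList(SL_SSH_IN, SL_ALL_OUT, new Rule(true, false, -1, 80));
    }

    /** Stateless alternative: NSG egress narrowed to source port 80. */
    static List<Rule> statelessFixRules() {
        return Arrays.asList(SL_SSH_IN, SL_ALL_OUT,
                new Rule(true, true, -1, 80), new Rule(false, true, 80, -1));
    }

    public static void main(String[] args) {
        System.out.println("original:      " + outboundConnectionWorks(brokenRules(), 40000, 443));
        System.out.println("stateful fix:  " + outboundConnectionWorks(fixedRules(), 40000, 443));
        System.out.println("stateless fix: " + outboundConnectionWorks(statelessFixRules(), 40000, 443));
    }
}
```

In the original setup the outgoing packet matches both egress rules, so the stateless one wins and the reply from port 443 finds no ingress rule. In both fixes the outgoing packet only matches the stateful security list rule, so the connection is tracked and the reply comes back.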


References:
1. OCI Security Rules: Stateful Versus Stateless Rules
2. Developer Tutorials: Free Tier: Install Apache and PHP on an Ubuntu Instance
3. Enabling Network Traffic to Ubuntu Images: Enabling Network Traffic to Ubuntu Images
