Developing for the Cloud in the Cloud

A BigData Development Environment with Docker in Amazon AWS

In this article, I show how you can use a project I developed and published on GitHub to build your own integrated development environment, entirely in the Amazon AWS cloud.

Why might you need it?

I am a developer, and I work daily in integrated development environments (IDEs), such as IntelliJ IDEA or Eclipse. These IDEs are desktop applications. Since the advent of Google Documents, I have seen more and more people move their work from the desktop versions of Word or Excel to the cloud, using an online equivalent of a word processor or a spreadsheet application.

There are many obvious reasons for keeping your work in the cloud. Today, compared to traditional desktop business applications, web applications give up little in functionality. The content is available wherever there is a web browser, and that usually means just about everywhere. Collaboration and sharing are easier, and losing files is less likely.

Unfortunately, these cloud advantages are not as common in the world of software development as they are for business applications. There are some attempts to provide an online IDE, but they are nowhere close to traditional IDEs.

That is a sort of paradox: while we are still bound to our desktops for daily coding, the software we write now spans multiple servers. Developers need to work with stuff they can no longer keep on their own computers. Indeed, laptops have stopped growing in power: having more than 16GB of RAM on a laptop is rare and expensive, and newer devices like tablets have even less.

However, even if it is not yet possible to replace the classical desktop applications for software development, it is entirely possible to move your whole development desktop into the cloud. The day I realized, noticing the availability of web versions of terminals and VNC, that I did not need to keep all the software I work with on my desktop machine, I started working to move everything to the cloud. Eventually, I developed a build kit for creating the environment in an automated way.

In this article, I present a set of scripts to build a cloud-based development environment for Scala and big data applications, running in Amazon AWS and comprising a web-accessible desktop with the IntelliJ IDE, Spark, Hadoop, and Zeppelin as services, plus command line tools like a web-based SSH, SBT, and Ammonite. The kit is freely available on GitHub, and I describe here the procedure for using it to build your own instance. You can build the environment as-is, or customize it to your particular needs. It should not take you more than 10 minutes to have it all up and running.

What is in the “BigDataDevKit”?

My primary goal in developing the kit was that my development environment should become something I can just fire up, with all the needed services and servers, work with, and then destroy when it is no longer needed. This is especially important when you work on different projects, some of which involve a large number of servers and services, as is typical of big data projects.

My ideal cloud-based environment should:

  • Include all the usual development tools, most importantly a graphical IDE.
  • Put all the servers and services I needed at my fingertips.
  • Be easy and fast to create from scratch, and expandable, to add more services.
  • Be entirely accessible only using a web browser.
  • Optionally, allow access with specialized clients (VNC client and SSH client).

Leveraging modern cloud infrastructure and software, the power of modern browsers, the widespread availability of broadband, and the invaluable Docker, I was able to create a development environment for Scala and big data work that replaces my development laptop, for the better.

Currently, I can work at any time from a MacBook Pro, a Surface tablet, or even an iPad (with a keyboard), although admittedly the last option is far from ideal. All these devices are just clients; the desktop and all the servers live in the cloud.

My current environment is built using the following online services:

  • Amazon Web Services for the servers.
  • GitHub for storing the code.
  • Dropbox to save files.

I also use a couple of free services: DuckDNS for dynamic IP addresses and Let’s Encrypt for free SSL certificates.

In this environment I currently have:

  • A graphical desktop with IntelliJ IDEA, accessible via a web browser.
  • Command line tools like SBT and Ammonite, also accessible via the web.
  • Hadoop for storing files and running MapReduce jobs.
  • Spark Job Server for scheduled jobs.
  • Zeppelin for a web-based notebook.

Most importantly, the web access is fully encrypted with HTTPS, both for web-based VNC and SSH, and there are multiple safeguards against losing data, a concern that is, of course, important when the content no longer lives on a physical hard disk you own. Note that getting a copy of all your work onto your computer is automatic and very fast, so even if you lose everything because someone stole your password, you still have a copy on your computer, provided you configured everything correctly.

Using a Web-Based Development Environment

Now, let’s look at how the environment works. When I start work in the morning, the first thing I do is log into the Amazon AWS console, where I see all my instances. Usually, I have several development instances configured for different projects, and I keep the unused ones turned off to save on billing. After all, I can work on only one project at a time. Well, sometimes two, but rarely more.

So, I select the instance I want to work with, start it, and wait a little, or go grab a cup of coffee. It is no different from turning on your computer. It usually takes just a handful of seconds for the instance to be up and running. Once I see the green icon, I open a browser and go to a well-known URL. Note that this is my URL; when you build your own kit, you will get your own unique URL.

Since AWS assigns a new IP to each machine when you start it, I configured a dynamic DNS service so you can always use the same URL to access your server, even after stopping and restarting it. You can even bookmark it in your browser. Furthermore, I use HTTPS with valid keys, to protect your work from sniffers in case you need to manage passwords and other sensitive data.
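
For the curious, DuckDNS exposes a simple HTTP update API, and refreshing the record amounts to a single request along these lines (a sketch using the documented DuckDNS endpoint; the hostname and token are the ones you register later in the setup):

# Point the DuckDNS record at this machine's current public IP.
# An empty "ip" parameter tells DuckDNS to use the caller's IP.
curl "https://www.duckdns.org/update?domains=YOURHOST&token=YOUR-DUCKDNS-TOKEN&ip="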

Once loaded, the system welcomes you with a web-based VNC client, noVNC. Just log in, and a desktop is shown. I intentionally use a minimal desktop: just a menu with the applications, the only luxury being virtual desktops (since I open a lot of windows when I develop). For mail and a few other tasks, I still rely on other applications, nowadays mostly other browser tabs.

In the virtual machine, I have what I need to develop big data applications. First and foremost, there is an IDE; in the build, I put the IntelliJ IDEA Community Edition. Also, there are the SBT build tool and a Scala REPL, Ammonite.

The key feature of this environment, however, is the backing services, deployed as containers in the same virtual machine. In particular, these are the three services listed earlier: Hadoop, the Spark Job Server, and Zeppelin.

Note that each service lives at a fixed URL, accessible only from a browser within the environment. You can see their web interfaces in the following screenshot.

Each service runs in a separate Docker container. Without getting too technical, you can think of them as three separate servers inside your virtual machine. The beauty of using Docker, however, is that you can add services, and even add virtual machines, two or three instead of one. Using Amazon’s container services, you can scale your environment easily.

Last, but not least, you have a web terminal available. Simply access your URL over HTTPS, and you will be welcomed with a terminal in a web page.

In the screenshot above, you can see me listing the containers, which are the three servers plus the desktop. This command line shell gives you access to the virtual machine holding the containers, allowing you to manage them. Think of it as your servers being “in the Matrix” (virtualized within containers), while this shell gives you an escape hatch outside the Matrix to manage the servers and the desktop. From here, you can restart the containers, access their filesystems, and do all the other manipulations Docker allows, as shown below. I will not discuss Docker in detail here; there is a vast amount of documentation available on the Docker website.
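
For instance, the usual Docker commands work as expected from that shell (a quick sketch; the container name used here is illustrative, as the actual names are whatever the build assigns):

# List the running containers: the three services plus the desktop.
docker ps

# Restart one of the services, e.g. a container named "zeppelin".
docker restart zeppelin

# Open a shell inside a container to inspect its filesystem.
docker exec -it zeppelin /bin/bash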

How to set up your instance

Do you like all of this so far and want your own instance? It is easy and cheap. You can get it for just the cost of the virtual machine on Amazon, plus the storage. The kit, in its current configuration, requires 4GB of RAM to run all the services. If you are careful to keep the virtual machine up only when you need it, and you work, say, 160 hours a month, a virtual machine at current rates will cost 160 × $0.052 ≈ $8.30 per month. You have to add the cost of storage: I use around 30GB, but the total should still stay under $10.

This math does not include the cost of an optional Dropbox Pro account, if you want to back up more than 2GB of code. That adds another $15 per month, but it provides invaluable safety for your data. You may also want a private repository, either a paid GitHub plan or another service, like Bitbucket, which offers free private repositories.

I want to stress that if you use it only when you need it, this is cheaper than a dedicated server. Yes, everything mentioned here can be set up on a physical server, but since I work with big data, I need a lot of other AWS services anyway, so I thought it was logical to keep everything in the same place.

Assuming all of this is within your budget, let’s see how to do the whole setup.

Prerequisites

Before starting to build the virtual machine, you first need to register with the following four services:

  • Amazon Web Services
  • DuckDNS
  • Dropbox
  • Let’s Encrypt

Of these, the only one where you need to enter your credit card is Amazon. DuckDNS is entirely free, while Dropbox gives you 2GB of free storage, which can be enough for many tasks. Let’s Encrypt is also free; it is used internally, when you build the image, to sign your certificate. Besides these, I also recommend a repository hosting service, like GitHub or Bitbucket, for storing your code, but it is not required for the setup.

To start, navigate to the GitHub BigDataDevKit repository.

Scroll down the page and copy the script shown in the image into your text editor of choice:

This script is needed to bootstrap the image. You have to edit it and provide values for its parameters. Be careful to change only the text within the quotes. Note that you cannot use characters like the quote itself, the backslash, or the dollar sign in the password unless you escape them. This is usually relevant only for the password; if you want to play it safe, avoid quotes, dollar signs, and backslashes altogether.

The PASSWORD parameter is a password you choose for accessing the virtual machine through the web interface. The EMAIL parameter is your email address, used when registering the SSL certificate. Providing your email is mandatory; it is the only requirement for getting a free SSL certificate from Let’s Encrypt.

To get the values for TOKEN and HOST, go to the DuckDNS site and log in. You need to choose an unused hostname; since DuckDNS is a shared free service, many names are already taken.

Look at the image to see where to copy the token and where to add your hostname. You have to click the “add domain” button to reserve the hostname for yourself.
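
Once you have all four values, the top of your edited script should look something like this (a sketch with placeholder values; the parameter names are the ones used in the kit’s script):

PASSWORD="choose-a-strong-password"  # web access password; avoid quotes, dollar signs, backslashes
EMAIL="you@example.com"              # used to request the Let's Encrypt certificate
TOKEN="your-duckdns-token"           # copied from the DuckDNS dashboard
HOST="yourhostname"                  # the DuckDNS subdomain you reserved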

Configuring your instance

Assuming you have all the parameters and have edited the script, you are ready to launch your instance. Log in to the Amazon Web Services management console, go to the EC2 Instances panel, and click “Launch Instance”.

In the first screen, you choose the image. The script is built around Amazon Linux, so that is the only option available. Select Amazon Linux, which luckily is the first option in the QuickStart list.

In the second screen, you choose the instance type. Given the number of services that will be running, you need at least 4GB of memory, so I recommend the t2.medium instance type. You could trim it down to a t2.small if you shut down some services, or even to a t2.micro if you want only the desktop.

In the third screen, click “Advanced Details” and paste the script you configured in the previous step. I also recommend enabling termination protection, so an accidental termination of the virtual machine does not cost you all your work.

The next step is to configure the storage. The default for an instance is 8GB, which is not enough to contain all the images we will build, so I recommend increasing it to 20GB. Also, while it is not strictly required, I suggest adding a second block device of at least 10GB. The script will mount the second block device as a data folder; you can snapshot its contents, terminate the instance, and later recreate it from the snapshot, recovering all your work.

Furthermore, a custom block device is not removed when you terminate the instance, so you have double protection against accidental removal of your data. To increase your safety even more, your data can be backed up automatically with Dropbox.

The fifth step is just giving the instance a name; pick your own. The sixth step offers a way to configure the firewall. By default, only SSH is open, but we also need HTTPS, so do not forget to add a rule opening it. You can open HTTPS to the world or, better, only to your IP address, so others cannot even reach your desktop and shell, which are in any case still protected by a password.
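
If you later prefer to manage these rules from the command line instead of the console, the AWS CLI can add the same HTTPS rule (a sketch; the security group ID and the IP address are placeholders for your own):

# Allow HTTPS (port 443) only from your own IP address.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 443 \
    --cidr 203.0.113.10/32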

Once done with this last configuration, you can launch the instance. You will notice that the first initialization can take quite a while (a few minutes), since the initialization script is running and performing some lengthy tasks, like generating your HTTPS certificate with Let’s Encrypt.

When the management console finally reports the instance as running, with no more “Initializing” status, you are ready to go.

Assuming all the parameters were correct, you can navigate to https://YOURHOST.duckdns.org.

Replace YOURHOST with the hostname you chose, and do not forget it is an HTTPS site, not HTTP, so your connection to the server is encrypted, but you have to write https:// in the URL. The site will also present a valid certificate obtained from Let’s Encrypt. If there are problems getting the certificate, the initialization script will generate a self-signed certificate instead; you will still be able to connect over an encrypted connection, but the browser will warn you that it is an unknown site and flag the connection as insecure. This should not happen, but you never know.

Assuming everything went fine, you will land on Butterfly, a web terminal. You can log in using the user app and the password you set in the setup script.

Once logged in, you have a bootstrapped virtual machine that also includes Docker and other goodies, like an Nginx frontend, Git, the Butterfly web terminal, and so on. You can now complete the setup by building the Docker images for your development environment.

Now, type the following commands:

git clone https://github.com/sciabarra/BigDataDevKit
cd BigDataDevKit
sh build.sh

The last command will also ask you to type a password for desktop access. Once done, it starts building all the images. Note that the build takes a considerable amount of time, around 10 minutes, but you can follow what is happening because everything the build does is shown on the screen.

Once the build is complete, you can also install Dropbox with the following command:

/app/.dropbox-dist/dropboxd

The system will show a link you have to click to enable Dropbox. Log into Dropbox, and you are done. Whatever you put in the Dropbox folder is automatically synced across all your Dropbox instances.

Once done, you can restart the virtual machine and access your environment at the URL https://YOURHOST.duckdns.org/vnc.html.

You can stop the machine when you are not working with it and restart it when you resume; the access URL stays the same. This way, you pay only for the time you actually use the virtual machine, plus a small monthly fee for the storage you occupy.
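
If you prefer the command line to the web console, the daily stop/start cycle can also be scripted with the AWS CLI (a sketch; the instance ID is a placeholder for your own):

# Stop the instance at the end of the day...
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# ...and start it again the next morning.
aws ec2 start-instances --instance-ids i-0123456789abcdef0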

Preserving your data

The following discussion requires some knowledge of how Docker and Amazon work. If you do not want to get into the details, just keep in mind this simple rule: in the virtual machine there is a folder, /app/Dropbox; whatever you place in /app/Dropbox is preserved, and everything else is disposable and can go away. To further improve safety, also store your precious code in a version control system.

Now, if you want to understand things better, read on. If you followed my directions when creating the virtual machine, it is protected against termination, so you cannot destroy it accidentally. If you explicitly decide to terminate it, the primary volume will be destroyed, and since it contains all the Docker images, they will be lost, including any changes you made.

However, since the folder /app/Dropbox is mounted as a Docker volume for the containers, it is not part of the Docker images; it is written outside of them. In the virtual machine, the folder /app is mounted on the Amazon volume you created, which is not destroyed even when you explicitly terminate the virtual machine. To remove that volume, you have to remove it explicitly.

Do not confuse Docker volumes, which are a logical Docker entity, with Amazon volumes, which are (somewhat) physical entities. What happens is that the /app/Dropbox Docker volume is placed inside the /app Amazon volume.
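
In Docker terms, the containers mount the host folder with the standard host:container bind syntax, roughly like this (a minimal sketch; the image name is illustrative):

# Bind-mount the host's /app/Dropbox (which lives on the Amazon volume)
# into the container at the same path; data written there survives
# the container being destroyed and rebuilt.
docker run -d -v /app/Dropbox:/app/Dropbox desktop-image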

The Amazon volume is not automatically destroyed when you terminate the virtual machine, so whatever is placed in it is preserved until you explicitly destroy the volume as well. Furthermore, whatever you put in the Docker volume is stored outside the container, so it is not destroyed when the container is. If you enabled Dropbox, as recommended, all your content is copied to the Dropbox servers, and to your hard disk if you sync Dropbox with your computer. Finally, as recommended, your source code should also live in a version control system.

So, if you keep your work in a version control system under the Dropbox folder, then to lose your data, all of the following must happen:

  • You explicitly terminate your virtual machine.
  • You explicitly remove the data volume of the virtual machine.
  • You explicitly remove the data from Dropbox, including history.
  • You explicitly remove the data from the version control system.

I hope your data are safe enough.

I keep a virtual machine for each project, and when I finish my work, I keep the unused virtual machines turned off. Of course, I have all my code on GitHub and backed up in Dropbox. Furthermore, when I stop working on a project, I take a snapshot of the Amazon block device before removing the virtual machine entirely. This way, whenever a project resumes, for example for maintenance, all I need to do is start a new virtual machine recovering the snapshot. All my data is back in place, and I can resume working.
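
The snapshot itself is a single CLI call (a sketch; the volume ID is a placeholder for your own data volume’s):

# Snapshot the data volume before terminating the instance.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "BigDataDevKit data for project X"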

Optimizing access

First, if you have direct internet access, not mediated by a proxy, you can use native SSH and VNC clients. Direct SSH access is important if you need to copy files in and out of the virtual machine, although for plain file sharing, Dropbox is the simpler alternative.
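
For example, copying a file into the preserved data folder with scp looks roughly like this (a sketch; ec2-user is the default Amazon Linux user, the key file is the one you selected at launch, and it assumes the folder’s permissions allow the write):

# Copy a local file into the /app/Dropbox folder on the instance.
scp -i mykey.pem notes.txt ec2-user@YOURHOST.duckdns.org:/app/Dropbox/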

The web-based VNC access is invaluable, but it can sometimes be slower than a native client. The VNC server in the virtual machine listens on port 5900, which you have to open explicitly because it is closed by default. If you do open it, I recommend opening it only to your IP address, because the internet is full of robots scanning for services they can hook into, and VNC is a frequent target.

Conclusion

This article shows how you can leverage modern cloud technology to implement an effective development environment. While a machine in the cloud cannot completely replace your working computer or laptop, in my experience it is good enough for development work, where the important thing is having access to the IDE, and with a current internet connection, it is fast enough to work with.

Being in the cloud, accessing and manipulating servers is much quicker than doing it locally. You can rapidly increase (or decrease) memory, fire up another environment, create an image, and so on. There are also plenty of other cloud services that I will not discuss here. You have a datacenter at your fingertips, and when you work on big data projects, you need robust services and a lot of space. That is what the cloud provides.