A couple of weeks ago, Google finally published a technical paper describing Borg, the cluster management system they have built over the last ten years or so and that runs all Google services.
There are several interesting concepts in the paper, one of them of course being that they run everything in containers. Whether they use Docker or not is unknown; some parts of their workloads probably still use LMCTFY (Let Me Contain That For You). What struck me is that they say they are not using full virtualization. It makes sense in terms of timeline, considering that Borg started before the advent of hardware-assisted virtualization. However, Google Compute Engine offers VMs as a service, so it is fair to wonder how they run their VMs. This reminded me of John Wilkes' talk at MesosCon 2014. He discussed scheduling in Borg (without naming it) and, about 23 minutes into the talk, mentions that they run VMs in containers.
Running VMs in containers makes sense when you think in terms of a cluster management system that deals with multiple types of workloads. You treat your IaaS (e.g., GCE) as just another workload, and contain it so that you can pack your servers tightly and maximize utilization. It also allows you to run other workloads on bare metal for performance.
Therefore let's assume that GCE is just another workload for Google and that it runs through Borg.
Borg laid out the principles for Kubernetes, the cluster management system designed for containerized workloads that Google open sourced in June 2014. So you are left asking:
"How can we run VMs in Kubernetes?"
This is where Rancher comes in to help us prototype something. Two weeks ago, Rancher announced RancherVM, essentially a startup script that creates KVM virtual machines inside Docker containers (calling it a script does not really do it justice...). It is available on GitHub and is super easy to try. I will spare you the details here and point you to the GitHub repository instead. The upshot is that you can build a Docker image that contains a KVM qcow2 image, and running the container starts the VM with the proper networking.
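To give you an idea of how such an image is put together, here is a minimal Dockerfile sketch. The base image tag and the /base_image path are assumptions on my part; check the RancherVM README for the exact conventions.

# Hypothetical sketch: package a qcow2 disk image on top of the RancherVM
# base image. Base image tag and /base_image path are assumptions.
FROM rancher/vm-base:0.0.1
COPY rancheros.qcow2 /base_image/rancheros.qcow2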
Privilege gotcha
With a Docker image handy to run a KVM instance, using Kubernetes to start this container is straightforward: create a Pod that launches the container. The only caveats are that the Docker host(s) that form your Kubernetes cluster need KVM installed, and that your containers need some level of privilege to access the KVM devices. While this can be tuned with Docker run parameters like --device and --cap-add, you can brute force it, in a very insecure manner, with --privileged. However, Kubernetes does not allow privileged containers by default (rightfully so), therefore you need to start your Kubernetes cluster (i.e., the API server and the kubelet) with the --allow_privileged=true option.
If you are new to Kubernetes, check out my previous post where I show you how to start a one-node Kubernetes "cluster" with Docker Compose. The only modifications from that post are that I am running this on a Docker host that also has KVM installed, that the Compose manifest passes --allow_privileged=true to the kubelet startup command, and that I mount a modified /etc/kubernetes/manifests/master.json as a volume. This lets me avoid tampering with the images from Google.
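For reference, the kubelet entry of that Compose manifest ends up looking roughly like the sketch below. The image tag and the other flags reflect the Kubernetes-in-Docker setup of that era as I remember it, so treat this as a hedged sketch rather than a copy-paste recipe; the parts that matter here are the --allow_privileged=true flag and the manifests volume.

kubelet:
  image: gcr.io/google_containers/hyperkube:v0.14.1
  net: host
  privileged: true
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    # mount a locally modified master.json instead of changing Google's image
    - /etc/kubernetes/manifests:/etc/kubernetes/manifests
  command: >
    /hyperkube kubelet
    --api_servers=http://localhost:8080
    --address=0.0.0.0
    --enable_server
    --config=/etc/kubernetes/manifests
    --allow_privileged=true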
Let's try it out
Build your RancherVM images:
$ git clone https://github.com/rancherio/vm.git
$ cd vm
$ make all
You will now have several RancherVM images:
$ sudo docker images
REPOSITORY TAG ...
rancher/vm-android 4.4 ...
rancher/vm-android latest ...
rancher/ranchervm 0.0.1 ...
rancher/ranchervm latest ...
rancher/vm-centos 7.1 ...
rancher/vm-centos latest ...
rancher/vm-ubuntu 14.04 ...
rancher/vm-ubuntu latest ...
rancher/vm-rancheros 0.3.0 ...
rancher/vm-rancheros latest ...
rancher/vm-base 0.0.1 ...
rancher/vm-base latest ...
Starting one of those will give you access to a KVM instance running in the container.
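If you want to try one outside of Kubernetes first, you can run the container directly with Docker. The flags below (KVM and TUN devices, NET_ADMIN capability, a host directory mounted at /ranchervm, and the RANCHER_VM environment variable) follow the RancherVM repository as I understand it; double check the README for the exact invocation.

$ sudo docker run -d -e RANCHER_VM=true \
    --cap-add NET_ADMIN \
    --device /dev/kvm:/dev/kvm \
    --device /dev/net/tun:/dev/net/tun \
    -v /tmp/ranchervm:/ranchervm \
    rancher/vm-rancheros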
I will skip the startup of the Kubernetes components; check my previous post. Once you have Kubernetes running you can list the pods (i.e., groups of containers and volumes). You will see that the Kubernetes master itself is running as a Pod.
$ ./kubectl get pods
POD IP CONTAINER(S) IMAGE(S) ...
nginx-127 controller-manager gcr.io/google_containers/hyperkube:v0.14.1 ...
apiserver gcr.io/google_containers/hyperkube:v0.14.1
scheduler gcr.io/google_containers/hyperkube:v0.14.1
Now let's define a RancherVM as a Kubernetes Pod. We do this in a YAML file:
apiVersion: v1beta2
kind: Pod
id: ranchervm
labels:
  name: vm
desiredState:
  manifest:
    version: v1beta2
    containers:
      - name: master
        image: rancher/vm-rancheros
        privileged: true
        volumeMounts:
          - name: ranchervm
            mountPath: /ranchervm
        env:
          - name: RANCHER_VM
            value: "true"
    volumes:
      - name: ranchervm
        source:
          hostDir:
            path: /tmp/ranchervm
To create the Pod, use the kubectl CLI:
$ ./kubectl create -f vm.yaml
pods/ranchervm
$ ./kubectl get pods
POD IP CONTAINER(S) IMAGE(S) ....
nginx-127 controller-manager gcr.io/google_containers/hyperkube:v0.14.1 ....
apiserver gcr.io/google_containers/hyperkube:v0.14.1
scheduler gcr.io/google_containers/hyperkube:v0.14.1
ranchervm 172.17.0.10 master rancher/vm-rancheros ....
The RancherVM image specified contains RancherOS. The container will start immediately, but of course the actual VM will take a couple more seconds to boot. Once it is up, you can ping it and ssh into the VM instance.
$ ping -c 1 172.17.0.10
PING 172.17.0.10 (172.17.0.10) 56(84) bytes of data.
64 bytes from 172.17.0.10: icmp_seq=1 ttl=64 time=0.725 ms
$ ssh rancher@172.17.0.10
...
[rancher@ranchervm ~]$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[rancher@ranchervm ~]$ sudo system-docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
229a22962a4d console:latest "/usr/sbin/entry.sh 2 minutes ago Up 2 minutes console
cfd06aa73192 userdocker:latest "/usr/sbin/entry.sh 2 minutes ago Up 2 minutes userdocker
448e03b18f93 udev:latest "/usr/sbin/entry.sh 2 minutes ago Up 2 minutes udev
ff929cddeda9 syslog:latest "/usr/sbin/entry.sh 2 minutes ago Up 2 minutes syslog
Amazing! I can feel you wondering what the heck is going on :)
You want to kill the VM? Just kill the pod:
$ ./kubectl delete pod ranchervm
Remember that a Pod is not necessarily a single container; it can hold several containers as well as volumes.
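Purely to illustrate that point, here is a hypothetical two-container Pod in the same v1beta2 format, pairing the VM container with a made-up busybox sidecar (the sidecar name, image, and command are invented for the example):

# Hypothetical sketch only: the "logger" sidecar is invented for illustration.
apiVersion: v1beta2
kind: Pod
id: ranchervm-sidecar
labels:
  name: vm
desiredState:
  manifest:
    version: v1beta2
    containers:
      - name: master
        image: rancher/vm-rancheros
        privileged: true
        env:
          - name: RANCHER_VM
            value: "true"
      - name: logger
        image: busybox
        command: ["sh", "-c", "while true; do date; sleep 60; done"]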
Let's go a step further, and scale the number of VMs by using a replication controller.
Using a Replication Controller to scale the VM
Kubernetes is quite nice: it builds on years of experience with fault tolerance at Google and provides mechanisms for keeping your services up, scaling them, and rolling out new versions. The Replication Controller is the primitive for managing the scale of your services.
Say you would like to automatically increase or decrease the number of VMs running in your datacenter: start them with a replication controller. This is defined in a YAML manifest like so:
id: ranchervm
kind: ReplicationController
apiVersion: v1beta2
desiredState:
  replicas: 1
  replicaSelector:
    name: ranchervm
  podTemplate:
    desiredState:
      manifest:
        version: v1beta2
        id: vm
        containers:
          - name: vm
            image: rancher/vm-rancheros
            privileged: true
            volumeMounts:
              - name: ranchervm
                mountPath: /ranchervm
            env:
              - name: RANCHER_VM
                value: "true"
        volumes:
          - name: ranchervm
            source:
              hostDir:
                path: /tmp/ranchervm
    labels:
      name: ranchervm
This manifest defines a Pod template (the one we created earlier) and sets the number of replicas; here we start with one. To launch it, use the kubectl binary again:
$ ./kubectl create -f vmrc.yaml
replicationControllers/ranchervm
$ ./kubectl get rc
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
ranchervm vm rancher/vm-rancheros name=ranchervm 1
If you list the pods, you will see that your container is running and hence your VM will start shortly.
$ ./kubectl get pods
POD IP CONTAINER(S) IMAGE(S) ...
nginx-127 controller-manager gcr.io/google_containers/hyperkube:v0.14.1 ...
apiserver gcr.io/google_containers/hyperkube:v0.14.1
scheduler gcr.io/google_containers/hyperkube:v0.14.1
ranchervm-16ncs 172.17.0.11 vm rancher/vm-rancheros ...
Why is this awesome? Because you can scale easily:
$ ./kubectl resize --replicas=2 rc ranchervm
resized
And Boom, two VMs:
$ ./kubectl get pods -l name=ranchervm
POD IP CONTAINER(S) IMAGE(S) ...
ranchervm-16ncs 172.17.0.11 vm rancher/vm-rancheros ...
ranchervm-279fu 172.17.0.12 vm rancher/vm-rancheros ...
Now of course, this little test is done on a single node. But if you had a real Kubernetes cluster, it would schedule these pods across the available nodes. From a networking standpoint, RancherVM can provide DHCP service or not, which means you could let Kubernetes assign the IP to the Pod and the VMs would be networked over whatever overlay is in place.
Now imagine that we had security groups via an OVS switch on all nodes in the cluster... we could have multi-tenancy with network isolation and full VM isolation, while still being able to run workloads in "traditional" containers. This has significant implications for the current IaaS space, and even for Mesos itself.
Your cloud as a containerized distributed workload, anyone?
For more recipes like these, check out the Docker cookbook.