Why Adopting Kubernetes for Application Portability Is Not a Good Idea

This post originally appeared on the Gartner Blog Network.

I often discuss with infrastructure and operations clients whether their organizations should adopt Kubernetes to make their applications portable. If you are asking yourself the same question, the answer is: no.

The full story, actually, is this: adopt Kubernetes for the many benefits it provides to application development and architecture, and get portability as a side effect. But do not make portability your primary driver for adopting the technology. This thesis is expanded in the document that Richard Watson, Alan Waite and I crafted during lockdown this spring: “Assessing Kubernetes for Hybrid and Multicloud Application Portability” (paywall).

Kubernetes or not, application portability always comes at a price that you must be willing to pay – the “portability tax”. Gartner’s advice is to make this decision application by application, based on the likelihood that it will be moved in the future and how fast that needs to happen. Non-portable applications can still be moved; it just takes more time to execute the transformation.

How likely is it that an application will change infrastructure provider during its lifespan?

Inquiries show that this likelihood is actually very low. Once deployed with a provider, applications tend to stay there. This is largely because data lakes are hard – and expensive – to port and therefore end up acting as centers of gravity. The figure below shows the Gartner pyramid of portability, which illustrates basic motivations (at the bottom) and more strategic ones (at the top) for designing portable applications.

For each of your applications, ask yourself why portability is important to you. Is it to guarantee survivability? To increase your negotiation leverage with the cloud provider? To mitigate vendor lock-in? The higher you go in the pyramid, the less likely it is that you’ll have those needs.

Kubernetes facilitates portability because it helps standardize our software development life cycle and, most importantly, our operating model. However, it also adds management overhead to our organization, forces us to engage with commercial vendors and requires us to completely rearchitect our applications. Implementing portability with Kubernetes also means avoiding any dependency that ties the application to the infrastructure provider, such as the cloud provider’s native services. Often, these are the very services whose capabilities drove us to the cloud in the first place.

In conclusion, the portability tax is high. Make sure to pay it only for applications that truly need it and that are likely to switch infrastructure provider at some point. For all the others, don’t choose Kubernetes on the basis of a universal portability principle, just because it “sounds right”. On the contrary, adopt Kubernetes for agility, scalability and for modernizing your application architectures.

More on this topic in “Assessing Kubernetes for Hybrid and Multicloud Application Portability” (paywall). Should you want to discuss more, feel free to schedule an inquiry call with myself or Alan Waite by emailing inquiry@gartner.com or through your Gartner representative.

Follow me on Twitter (@meinardi) or connect with me on LinkedIn for further updates on my research. Looking forward to talking to you!

The 3 reasons why Docker got it right

Containers have been around for a while. But why did they finally get their well-deserved popularity only with the rise of Docker? Was it just a matter of market maturity, or something else? Having worked at Joyent, I was lucky enough to be in the container business before Docker was even invented, and I would like to give you my take on that.

A brief history of containers

We hear this again and again in computer science: what we think was recently invented by some computing visionary typically has its roots decades in the past. It happened with hardware virtualisation (emulation), with the cloud’s client-server decentralization (mainframes) and, yes, also with containers.

If you, too, started hacking with Unix back in the early 90s, you’ll certainly remember chroot. How many times I used it to make sure my process wasn’t messing around with the main OS environment. And you’ll probably remember FreeBSD jails, which added the kernel-level isolation required to implement the very first OS-level virtualisation system.

Sun Microsystems also believed strongly in containers and developed what it called “zones”, definitely the most powerful and well-thought-out container system. But even though Sun believed in containers more than it did in hardware-level virtualisation, the market moved towards the latter, not because it was the right approach but simply because it allowed the guest OS to stay untouched. Unfortunately, Sun never got to see much of the results of zones, and nobody really knows what happened to them after the acquisition by Oracle. Luckily another company, Joyent, picked up the legacy of OpenSolaris with its SmartOS derivative. SmartOS is now the foundation of the Joyent Cloud, with an improved version of zones at its very core.

At the same time, yet another company, Parallels (now Odin), stewarded OpenVZ, a Linux open-source project for OS-level virtualisation. The commercial version was called Virtuozzo, and Parallels sold it as its virtualisation system of choice.

Since the late 2000s, Joyent and Parallels have been pioneering the container revolution, but nobody ever talked about them as much as everyone now talks about Docker. Let’s try to understand why.

Positioning of containers

The easy conclusion would be that the market just wasn’t ready yet. We all know how important timing is when releasing something new, and I’m sure this also played a role with containers. However, in my view, that’s not the main reason.

Let’s look at how these two companies were selling their container technology. Joyent made it all about performance and transparency: if you use a container instead of a virtual machine (i.e. hardware-level virtualised), you get an order-of-magnitude performance increase, as well as total transparency and visibility into the underlying hardware. That’s absolutely spot on and relevant. But apparently it wasn’t enough.

Parallels made it all about density. Parallels’ target market was hosting companies and VPS providers, those who sell a single server for something like four bucks a month. If you sell a container instead of a virtual machine, you can squeeze two or three times as many servers onto the same physical host, so you can keep your prices lower and attract more customers. Given that you’re not reserving resources for a specific container, higher density is a real advantage that can be achieved without affecting performance too much. Absolutely true, but again, it did not resonate loudly enough.

The need to lower the overhead

In the last few years, we have also witnessed a desperate need to lower overhead. Distributed systems caused server sprawl: thousands of under-utilised VMs running what we call microservices, each of them carrying the heavy baggage of a multi-process, multi-user full OS whose features are almost totally useless to them. Hence the research into lowering overhead, from ZeroVM (acquired by Rackspace) to Cloudius Systems, which tried to rewrite the Linux kernel, chopping off the features that weren’t really necessary to run single-process instances.

And then came Docker

Docker started as the delivery model for the infrastructure behind the dotCloud PaaS: it was using containers to deliver something else, namely application environments, with the agility and flexibility required to deploy, scale and orchestrate them. When Docker was spun off, it also added the ability to package those environments and ship them to a central repository. Bingo. It turned containers into a simple means to do something else. It wasn’t the container per se, it was what containers unlocked: the ability to package, ship and run isolated application environments in a fraction of a second.
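
To make that build-ship-run loop concrete, here is a minimal sketch using the Docker CLI; the image name, tag and port are invented for illustration:

# Package the application environment described by a Dockerfile
docker build -t example/webapp:1.0 .

# Ship it to a central repository (Docker Hub in this sketch)
docker push example/webapp:1.0

# Run it anywhere a Docker engine is available, in a fraction of a second
docker run -d -p 8080:8080 example/webapp:1.0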

And it was running on Linux. The most popular OS of all time.

Why Docker got it right

All of this made me think that there are three main reasons behind the success of Docker.

1. It used containers to unlock a totally new use case

The use cases that containers unlocked according to Joyent, Parallels and Docker were all different: performance of a virtual server in the case of Joyent, density of virtual servers in the case of Parallels, and application delivery in the case of Docker. They all make a lot of sense, but the first two were focused on delivering a virtual server; Docker moved on and used containers to deliver applications instead.

2. It did not try to compete against virtual machines

Joyent and Parallels tried to position containers against virtual machines: you could do something better by using a container instead of a virtual machine. And that was a tough sell. Trying to address the same use case that everybody already acknowledged as the job of a VM was hard. It was the right argument, but it would have taken much longer to establish itself.

Docker did not compete with VMs and, as a demonstration of that, most people actually run Docker inside VMs today… even if Bryan Cantrill (@bcantrill), CTO of Joyent, would have something to say about it! Docker runs either on bare metal or in a VM; it does not matter much when what you want to achieve is to build, package and run lightweight application environments for distributed systems.

3. It did not try to reinvent Unix but used Unix for what it was built for

Docker didn’t try to rewrite the Linux kernel. However, it fully achieved the objective of reducing overhead. Containers can be used to run a single process without the burden of carrying an entire OS. At the same time, the underlying host can make the best use of its multi-process capabilities to effectively manage hundreds of containers.

Don’t get me wrong. I absolutely believe in the superiority of containers over virtual machines, and I think both Joyent and Parallels did an amazing job spreading their benefits like no one else. However, I also recognise in Docker the unique ability to have made them shine much brighter than anyone ever has before.

In conclusion, co-opting the established worlds of virtual machines and Linux to gain the largest possible reach, while adding fundamental value to them, is the reason behind Docker’s success. At the same time, looking at containers from an orthogonal perspective, not as the goal but as a means to achieve something other than delivering a virtual server, is what put containers on everyone’s lips.

Docker: not just containers. Thoughts from DockerCon Europe

Developers. Developers. Developers. I guarantee this was the most spoken word at DockerCon Europe 2014, the hottest software conference around, which took place in Amsterdam last week. I was lucky to get a ticket (it sold out in a couple of days!) and be part of this amazing event which, despite a few complaints about it being too much of a “marketing love fest”, offered a lot of insight into market directions, trends and opportunities for software vendors.

So what is Docker? A container technology? No. Well, yes, but there is more to it. Despite being known as a container technology, Docker is mainly a tool for packaging, shipping and running applications. A piece of infrastructure has become a simple means to do something else, and it requires no infrastructure skills to consume. With containers now mainstream, the industry has completed a further step towards making developers the main drivers of IT infrastructure demand.

At DockerCon, however, Docker employees described the project as a “platform” whose goal is to make it easy to build and run distributed applications: a platform made of different components that are “included, but removable”. In fact, during one of the keynote sessions, Solomon Hykes (@solomonstre), creator of the Docker project, announced three new components that are now available alongside the well-known Docker engine:

  • Docker Machine
  • Docker Swarm
  • Docker Compose

As the community demanded, these three components have not been incorporated in the same binary as the container engine. But with this launch, Docker is now officially stepping into orchestration, clustering and scheduling.
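
As a rough sketch of how the three components are meant to be used together from the command line (exact flags and workflows have been changing across these early releases, so treat the following as illustrative only):

# Docker Machine: provision a Docker host, here with the VirtualBox driver
docker-machine create --driver virtualbox dev
eval "$(docker-machine env dev)"

# Docker Swarm: generate a cluster token using the swarm image
docker run swarm create

# Docker Compose: describe a multi-container application in docker-compose.yml
# and bring it all up with a single command
docker-compose up -d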

Apart from the keynote, many of the breakout sessions were run by Docker partners, showing lots of interesting projects and more building blocks for creative engineers. In other sessions, organizations like ING Bank, Société Générale and the BBC explained how they use Docker and what benefits it brings, including how it helps build their continuous delivery pipelines. Beyond adopting the required technology stack, continuous delivery was also described as a fundamental organizational change that companies eventually need to go through. To this point, my most popular tweet of the two days was a simple quote from Henk Kolk, Chief Architect at ING Bank Netherlands (@henkkolk).

Here’s my paraphrased version of Kolk’s session: break the silos, empower engineers, build small product development teams and ship decentralized microservices. Cultural and organizational change was described as being just as important as the revolution in software architecture or cloud adoption. There can’t be one without the other. So you’d better be ready, get educated and embrace it.

Docker Machine

The project that caught most of our attention at Flexiant was Docker Machine. It enables Docker to create machines in different clouds directly from the command line. My colleague Javi (@jpgriffo), author of krane.io, has been following it since it was a proposal, and during the announcement of Docker Machine we managed to send the very first pull request for the inclusion of a Flexiant Concerto driver into the project, ahead of VMware and GCE. If the Flexiant Concerto driver is merged over the next few days, Docker users will be able to go from “Zero to Docker” (as it was pitched by its author Ben Firshman – @bfirsh) in any cloud, with a single consistent driver. Exciting! We’re absolutely proud of this, and we believe we have much more to give to the Docker community, given our expertise in cloud orchestration. Be prepared for more pull requests to follow.
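
Once the driver is merged, the workflow should look roughly like this; the driver name and machine name below are hypothetical, since the pull request is still under review:

# Hypothetical: provision a Docker host on Flexiant Concerto via Docker Machine
docker-machine create --driver concerto concerto-dev
eval "$(docker-machine env concerto-dev)"
docker run -d nginx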

The Risk

Docker has been blowing minds since the first days of the famous video (21 months ago!). It makes so much sense that it has been adopted with a speed we’ve never seen in any open-source project before. Even those who do not understand it are trying to jump on the bandwagon just to leverage its brand and market traction. This doesn’t come without risks. With a large community, an ecosystem with important stakes and a commercial entity behind it (Docker, Inc.), there will be conflicts of interest, with “overstepping” onto the domain of the partners that helped make Docker what it is today. We’ve already seen this with CoreOS’s launch of Rocket a couple of days ago.

Docker, Inc. needs to drive revenue and, even though Solomon Hykes is clearly making a big effort to keep the governance of his baby impartial and honest, I’m sure it’s not going to be a painless process. Good luck, Solomon!

The Opportunity

High risks usually mean high potential returns. The return here can be high not just for Docker, Inc., but for the whole world of IT. Learning Docker and understanding its advantages can drive the development of applications in a totally different way. Not having to create a heavy, resource-wasting virtual machine (VM) for everything will boost the rise of microservices, distributed applications and, by extension, cloud adoption. With this comes scalability, flexibility, adaptability, innovation and progress. I don’t know whether Docker will still be such a protagonist in a year or two, but what I do know is that it will have fundamentally changed the way we build and deliver software.

This post originally appeared on Flexiant.com.

Why developers won’t go straight to the source

I’m so excited. Last Wednesday, Flexiant announced the acquisition of the Tapp technology platform and business. I met the guys behind it quite a while ago and I have never refrained from remarking how great their technology is (see here). I recognized a trend in their way of addressing the cloud management problem, and I’m so glad to be part of it right now.

Disclaimer: I am currently working for Flexiant as Vice President of Products. I endorsed this acquisition, I am fully behind the reasons for it and I am convinced of its potential. This is my personal blog; whatever you read here has not been agreed with my employer in advance and therefore represents my very personal opinion.

Right after the acquisition (read more about it here) there was tremendous noise on social networks and in the press. David Meyer (@superglaze) of GigaOm in particular wrote up a few interesting comments and captured the reasoning behind the deal well, but he also ended his article with an open question:

“This [the Tapp technology platform] would help such players [Service Providers] appeal to certain developers that are currently just heading straight for EC2 or Google.
 
Of course, this is ultimately the challenge for the likes of Flexiant – can anything stop those developers going straight to the source? That question remains unanswered.”

Well, I’d like to answer that question and explain why I’m convinced there is a lot of value for multi-cloud managers to add.

Much has been written these days about the business side of the acquisition, and I don’t have anything meaningful to add there. Instead, I would like to raise a few interesting points from a technology point of view (that’s my job, after all) and highlight the value that is maybe not so obvious at first sight.

Multi-cloud management

Multi-cloud management per se covers a very broad spectrum of meanings. Some multi-cloud managers focus on brokerage, and therefore primarily on getting you the best deal out there. Although this is a good example of how to provide “multi-cloud” value, I still wonder how they can actually find a way to compare apples with oranges. In fact, cloud infrastructure service offerings are so different and heterogeneous that being simply a cloud broker makes it extremely difficult to succeed, deliver real value and differentiate. So, point number one: Tapp isn’t a cloud brokerage technology platform.

Other multi-cloud managers deliver value by adding a management layer on top of existing cloud infrastructures. This management layer may be focused on specific verticals, like scaling Internet applications (e.g. RightScale) or providing enterprise governance (e.g. Enstratius, now Dell Multicloud Manager). By choosing a vertical, they can address specific requirements, cut away the unnecessary parts of the general-purpose cloud provider and enhance the user experience for very specific use cases. That’s indeed a fair approach, but it’s still not what Tapp is all about.

So why, when using Tapp, won’t developers “go straight to the source”? Well, first of all, let’s make clear that developers are already at the source. To use any multi-cloud manager you need an AWS account or a Rackspace account (or an account with any other supported provider), and you need to configure your API keys in order to enable communication with the cloud provider of your choice. So if someone is using your multi-cloud manager, it means they prefer it over the management layer provided by “the source”.

The cloud provider lock-in

One of the reasons behind Amazon’s success is the large portfolio of services it has rolled out. They are all services that end users can put together to build applications, letting developers focus on their core business logic without worrying too much about queuing, notifications, load balancing, scaling or monitoring. However, whenever you use tools like ELB, Route 53, CloudWatch or DynamoDB, you’re locking yourself into Amazon. The more you use multi-tenant proprietary services that exist only on a specific provider, the harder it becomes to migrate your application away.
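
To make the lock-in mechanism concrete, here is a small sketch of what an AWS-centric deployment script tends to look like; the resource names are invented, and every command below targets a service with no direct equivalent at another provider:

# Notifications through SNS
aws sns create-topic --name order-events

# A session store on DynamoDB
aws dynamodb create-table \
  --table-name sessions \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

# DNS zones managed in Route 53
aws route53 list-hosted-zones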

You may claim to be “happy” to be locked into a vendor that is actually solving your problems so well, but there are plenty of good reasons (“Why Cloud Lock-in is a Bad Idea”) to avoid vendor lock-in on principle. Many times, this is one of the first requirements of the enterprises that everyone is trying to attract into the cloud.

Deploying the complete application toolkit

Imagine there were a way to replicate those services on another cloud provider by building them from the ground up on top of some virtual servers. Imagine this could be done by a management layer, on demand, on your cloud infrastructure of choice. Imagine you could consume and control those services always through the same API. That would enable your application to be deployed in a consistent manner across multiple clouds, relying exclusively on the ability to spin up virtual servers, which every cloud infrastructure provider offers.

This is what Tapp is about. And the advantages of doing that are not trivial; they include:

1. Independence, consistency and compatibility

This is the obvious one. For instance, a user can click a button to deploy an application on Rackspace, and another button to deploy a DNS manager and a load balancer. These two would expose an API that is directly integrated into the control panel and therefore consumable as a service. Now, the exact same thing can also be done on Amazon, Azure, Joyent or any other supported provider, obtaining the exact same result. Cloud providers suddenly become compatible.

2. Extra geographical reach

Let’s say you like Joyent, but you want to deploy your application closer to a part of your user base that lives where Joyent doesn’t have a data center. Amazon has one there and, even though you don’t like its pricing, you’re ready to make an exception to gain a latency advantage for that part of your user base. If your application uses some of Joyent’s proprietary tools, it would be extremely difficult to replicate it on Amazon. If, instead, you could deploy the whole toolkit using just some EC2 instances, it all becomes possible.

3. Software-as-a-(single)-tenant

Multi-tenancy has been considered a key point of cloud computing, but I have started to believe that, as long as an end user can consume an application as a service, it hardly matters whether it’s multi-tenant or single-tenant.

If you can deploy a database in a few clicks and get your connector as a result, does it really matter whether that database is also hosting other customers? Actually, single-tenancy would become the preferred option1, as the user would no longer have to worry about isolation from other customers, noisy neighbors, et al. Tony Lucas (@tonylucas) wrote about this before on the Flexiant blog and I think he’s spot on: there is a “third” way, and that’s what I think is going mainstream.

The Tapp way

The Tapp technology platform was built to provide all of that: a large set of application-centric tools, features and functions2 that can be deployed across multiple clouds and consumed as a service.

Of course it’s not just about tools. It’s also about the application core, whatever that is. The Tapp technology also solves that consistency problem by pushing application deployment and configuration into Chef recipes, as opposed to cloud provider-specific OS images or templates3. Every time you run those recipes you get the same result, on any cloud provider. In fact, to deploy your application you just need the availability of vanilla OS images, like Ubuntu 14.04 or Windows 2012 R2, which, honestly, every cloud provider offers.

Until now, end users who wanted to deploy applications without feeling locked into a specific provider had only one way of doing it: DIY (“do it yourself”). They would have to maintain and operate OS images, load balancers, DNS servers, monitors, auto-scalers, etc. That’s a burden that, most of the time, they’re not ready to take on. They don’t want to spend time deploying services that end up being the same, every single time. Tapp takes that burden away from them. It deploys applications and service toolkits in an automated fashion and provides users with just the API to control them. And this API is consistent, independent of the chosen cloud provider. This is the key value that, I believe, will keep developers from going straight to the source.

1. Multi-tenancy would be the preferred option for the service provider because it translates into economies of scale. However, economies of scale often lead to cost optimisation and end-user price reductions, so it can be considered an indirect advantage for end customers as well.

2. Tapp features include: application blueprinting with Chef, geo-DNS management and load balancing, network load balancing, auto-scaling based on application performance, application monitoring, object storage and FDN (file delivery network).

3. It is worth mentioning that pushing application deployment into configuration management tools like Chef or Puppet significantly affects deployment time. That’s why it’s strongly advised to find the optimal balance between what is built into the OS image and what is left to the configuration management tool.

Why the developer cloud will be the only one

It has been a while since my last blog post. My new engagement with Flexiant is keeping me so busy with customers that I have plenty of reasons to think a lot but not enough time to write down what I’m thinking about. Customers are indeed one of the most interesting sources for people like me who always try to identify common patterns in behaviour and decisions (those are the “technology trends” of this blog’s previous title).

Yesterday, James Urquhart (@jamesurquhart) wrote about “traditional IT buyers” versus “developers” consuming cloud services, with reference to the different strategic approach taken by VMware in its vCloud Hybrid Service proposition. The mere fact that we have to use the word “traditional” when speaking about one of the most innovative sectors of our industry says it all about which of the two approaches will win.

Indeed, I concur with James when he says that VMware’s strategy will be successful over the next few years but will likely fail in the long term. Building on his brilliant observations, I still feel like adding something for all those companies that are trying (or want) to capture some cloud computing business by running after enterprise IT departments.

The IT infrastructure demand

First off, IT departments don’t want to move to the cloud. This is well known and has been said often enough, and it is mainly due to their self-preservation instinct. But they also can’t resist this revolution. So they’re interpreting the cloud opportunity merely as a shift in the location of their data centre, demanding the exact toolset they’ve been using in their on-premises deployments. On the other side, service providers are trying to listen to those customers and give them what they are asking for. Usually, nothing gets real, because the offer hardly ever meets the demand, mainly due to the actual unwillingness of the IT crowd to give up data centre control. And even if all their requirements seem to reveal the existence of a real market opportunity, I believe they’re actually driving a false demand.

Let’s take a look at the chart below. It’s far from accurate, as it’s not based on real numbers, and the lines are straight simply because I wanted to help visualise the trend of what I think is happening.

Of the overall demand for IT infrastructure, the part represented by the green area is generated by IT professionals, while the blue one is generated by developers or applications (yes, don’t forget that applications are now capable of driving IT infrastructure demand all by themselves, based on workload triggers).

Today we’re at the point in time where those two areas overlap, meaning that while there is still demand for infrastructure from IT departments, it is on a declining path. According to James, VMware really is going to win the biggest slice of that declining green area but will fall short because the green area is going to disappear. Most service providers evolving from their colo/hosting business really do seem to run after the same green demand, fighting for a shrinking market, while others are just making billions thanks to developers and their applications. And yet we hear traditional (ah! That word again) service providers dismissing that type of demand as simply “test and dev”.

Test and dev or software-consumable infrastructure?

We really need to move away from thinking that the Amazon cloud is good for test and dev but not for production. So many times I have heard providers object that Amazon is just for developers because it lacks built-in HA in the infrastructure. There is no need to recall how many transaction-sensitive companies are making billions on that cloud.

People who say that are missing the point. Developers like and use the Amazon cloud because it is software-consumable: their applications can spin up a server to handle extra capacity in the same way they handle their core business logic. They can also cope with infrastructure failure by replicating data stores, perhaps across different geographic sites, and managing failover in case of an outage of the underlying infrastructure. This is the real potential of IaaS: not just the OPEX vs CAPEX argument, not only the commoditisation of computing and storage resources, but the ability to fully automate infrastructure provisioning and configuration as part of any application’s business logic.
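
As a minimal sketch of what “software-consumable” means in practice, an application (or a script it triggers) can request and verify extra capacity directly through the provider’s API; the AMI ID, instance type and key name below are placeholders:

# React to a workload trigger by adding two more web servers
aws ec2 run-instances \
  --image-id ami-12345678 \
  --instance-type t2.micro \
  --count 2 \
  --key-name web-fleet

# Check that the new capacity is coming up
aws ec2 describe-instances --filters Name=instance-state-name,Values=pending,running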

Build and launch your developer cloud

Now that we know what developers like about the cloud, should we forecast that enterprise developers will stop deploying applications on-premises and finally do so in the cloud, thus becoming part of the blue demand? That’s what many people may think, but then I came across this from James Urquhart again:

I was asked to speak at the Insight Integrated Systems Real Cloud Summit in Long Beach, Calif. […] The RealCloud audience was primarily medium size businesses (between 500 and 10,000 employees), and I jumped at the chance to meet a segment of the IT industry with which I rarely interact.

About half way through my talk […] I came to a point I thought was very important to most software developers. On a whim, I asked this audience how many of them saw custom software development as a key part of their IT strategy. I expected about half the room of 100 or so to respond positively.

One hand went up at the back of the room. (It turns out that was someone from NASA’s Jet Propulsion Laboratory. Well, duh.)

Boom. Any discussion about why developers were bypassing IT to gain agility in addressing new models was immaterial here. The idea that Infrastructure as a Service and Platform as a Service were going to change the way software was going to be built and delivered just didn’t directly apply to these guys.

Well, that explains the success of packaged software so far. But does it mean that the demand for hosting packaged software will gradually move to the cloud? I don’t believe so. Instead, I believe that packaged software will simply be reinvented in the SaaS model. It has already been demonstrated how extremely successful and broadly adopted that model can be (Salesforce.com, Workday, etc.).

I believe enterprises will eventually move from in-house hosted packaged software directly to consuming SaaS, without passing through intermediate steps like running the same packaged software in some “enterprise” or “IT” IaaS cloud. Everyone’s waiting for enterprises to start moving their workloads, but eventually they will make one big jump all at once. Maybe even without letting their IT departments contribute to the choice (as it’s mainly a business decision).

In the end, if you’re a service provider aiming to finally bring enterprise workloads to the cloud, you’d probably do better to focus on standing up a software-consumable cloud (a.k.a. a “developer cloud”) and go after all those SaaS companies, small and big, to get your slice of the real cloud market. Stop thinking your HA cloud is better than “test and dev” clouds (like Amazon?), stop excluding web-scale companies (as SaaS companies are sometimes called) from your target market, and give them the importance they deserve. If you don’t, you may one day lose the entire enterprise IT business at once.

SmartOS improves Node.js debugging experience (part 2)

This blog post is part of a two-part series about debugging Node.js applications. The first part focuses on post-mortem debugging tools and practices; this second part illustrates how to debug latency bubbles in production using DTrace.

Debugging Node.js latency bubbles

Soft real-time systems

One thing that became clear with Node.js is that it is extremely good for the new breed of applications: Internet-facing, soft real-time systems.

A real-time system is one where the timeliness of the system is, at some level, part of its correctness. There is a clear distinction between hard real-time systems, where being late means failing, and soft real-time systems, where being late just means the system kind of “sucks”.

With the rise of mobile, social and HTML5, we’ve seen more and more of this new breed of applications – DIRTy systems (data-intensive real-time systems) – Internet-facing, real-time systems with a human in the loop. When humans are in the loop, the good news is that deadlines are soft (the system sucks but it doesn’t die; people just complain), but the bad news is that demand is typically non-linear.

Let’s imagine you’ve carefully built your real-time mobile application and suddenly a DJ from Cleveland tells all his listeners that they’ve got to go download your app and… boom! 100,000 people show up that same night, 400,000 more by the end of the week and 1 million by the end of the month. This happens, it has happened repeatedly, and it will happen again. We are seeing this trend accelerate, and the more computers end up in our pockets, the more we will have to cope with it.

And this is why it’s extremely difficult to deal with the challenge of scalability at the same time as the challenge of delivering data in real time.

Debugging latency with DTrace

How do you debug these systems when they go wrong? How do you debug the latency bubbles that constitute failure in these kinds of systems?

Bryan Cantrill (@bcantrill) has worked extensively on building real-time systems during his career, and debugging them has always been a challenge for him. So he developed DTrace to dynamically instrument those systems: to walk them while they’re running, grab timestamps at different parts of the stack and correlate them to figure out where the latency is coming from.

The question was: how could we take DTrace into Node.js?

As with interpreting core dumps, in interpreted environments it’s extremely difficult to figure out from the bottom of the system what is going on at the top of the stack. Bryan and his team had a bunch of ideas, and one of them was borrowed from other interpreted environments: instrument the actual VM wherever it performs a function call. It’s great and powerful (Erlang did a terrific job with it), but it is too fine-grained.

Eventually, they decided to add USDT (Userland Statically Defined Tracing) probes at certain points of interest, like HTTP requests, HTTP responses, GC and so on.

But how can we effectively use DTrace to debug our latency in Node.js? Let’s start by listing all the probes available for all my node processes by typing the following command in a SmartOS shell:

dtrace -l -n 'node*:::'

And we’ll get an output like this:

[Screenshot: dtrace listing of the available node probes]

Apart from the C++ name mangling, you can actually see the points of interest (USDT probes) named http-client-request, http-client-response, and so on.

Let’s go enable all of them so that we can see in real time what our node processes are doing.

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'node*:::{printf("%d does %s...\n", pid, probename)}' -q

On the left you can see the process IDs and on the right what they’re doing:

[Screenshot: live probe activity per node process]

Let’s try to isolate the incoming HTTP activity by instrumenting only the http-client-request:

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'http-client-request{printf("%d does a %s to %s on %s", pid, args[0]->method, args[0]->url, args[1]->remoteAddress)}' -q

And we get some more information out of it:

[Screenshot: http-client-request output showing pid, method, URL and remote address]

If we want to see the code actually executed upon HTTP requests, we can generate a stack trace whenever they occur by using the ustack() function:

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'http-client-request{printf("%s:\n", args[0]->method); ustack()}' -q

That prints out the stack backtrace:

[Screenshot: ustack() output with raw, unresolved frames]

We printed the actual method called, “PUT” (args[0]->method), followed by the stack trace of what was executed upon the request.

But now we’re back to the other problem: what the hell is this? Bryan and his team faced another challenge: how do you turn all of this into V8 frames from the context of the kernel?

And Dave Pacheco (@dapsays), who doesn’t know the definition of impossible (see part 1 of this blog post), solved it for the JavaScript environment. This is how: when V8 starts, it expresses in an intermediate representation how to take one of these frames and turn it into an actual string, and all of that is downloaded into the kernel when the virtual machine starts. Then, whenever a stack trace is generated, this time by the jstack() function, that mapping table is evaluated and the frames are turned into properly readable ones.

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'http-client-request{printf("%s:\n", args[0]->method); jstack()}' -x jstackstrsize=8k -q

Now we can see the actual JavaScript that was executed upon a GET:

[Screenshot: jstack() output with resolved JavaScript frames]

As you may have realized, this shines a very bright light on what was previously a total black hole. If you have a misbehaving Node.js program and you don’t have this kind of technology, you’re hosed.

During Node Summit back in January 2012, we heard practitioners talk about the big problems of Node.js, and it was all about production debuggability. This is something Joyent has invested a lot in with SmartOS, even if the truth is that we did it to debug our own problems (and that’s true of DTrace as well!).

The remaining challenge was that the USDT methodology was difficult to use from JavaScript. Fortunately, Chris Andrews developed the Node.js DTrace provider, which allows you to define your own probes (the “points of interest”) entirely in JavaScript.

All of the above has been available in Node.js since 0.6.7 and it’s there by default; you don’t have to do anything to enable it.

Visualizing latency

In terms of visualizing latency, another colleague from Joyent – Brendan Gregg (@brendangregg) – has done a terrific job. One of the most common problems is Node.js programs using too much CPU. Brendan hunts it down by profiling the CPU at regular intervals, taking stack traces, aggregating them by smashing the results together, re-sorting them and displaying them as a “flame graph”:

[Screenshot: CPU flame graph of a Node.js program]

The flame graph shows both JavaScript and C++ frames in a way that lets you easily identify where your program is spending most of its CPU time. It’s also good to know that all the tools to generate flame graphs are open source and on GitHub; you can already use them in production to find important bugs or latency bubbles throughout your Node.js code.
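
As a rough sketch of that workflow on SmartOS (assuming the FlameGraph scripts from Brendan Gregg’s GitHub repository are on your PATH; the sampling rate, duration and file names are arbitrary):

# Sample on-CPU stacks of node processes at 97 Hz for 60 seconds
dtrace -n 'profile-97 /execname == "node" && arg1/ { @[jstack(100, 8000)] = count(); } tick-60s { exit(0); }' -o stacks.out

# Fold the stacks and render the flame graph as an SVG
stackcollapse.pl stacks.out > stacks.folded
flamegraph.pl stacks.folded > node-flamegraph.svg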

Conclusion

SmartOS is Joyent’s foundation for the NodeStack, but Node.js runs everywhere. We at Joyent are not binding Node.js to a particular platform. We’re committed to investing further in SmartOS to make it the natural choice for your production Node.js environment, and we’re going to do that by giving you great technology that lets you understand your Node.js app in a way you can’t on any other platform.

SmartOS is an open-source project, and it can be consumed as a service on the Joyent Public Cloud, where all the above-mentioned tools are enabled by default.

But now I would like to hear from you: how do you debug your Node.js applications today? Do you consider debugging in production to be one of the biggest Node.js challenges?

End of part 2. You can watch all the NodeStack videos, including the Bryan Cantrill talk summarized here, by registering for free on the conference website.

SmartOS improves Node.js debugging experience (part 1)

This blog post is part of a two-part series about debugging Node.js applications. This first part focuses on post-mortem debugging tools and practices; the second part illustrates how to debug latency bubbles in production using DTrace.

Recently I watched one of the very first online conferences about the rise of a new software stack built for mobile and web applications. The new stack is NodeStack, and it comprises Node.js, MongoDB and SmartOS. It is intended to replace the now-surpassed LAMP stack, as modern applications have to deliver real-time responses at scale when interacting with an exploding number of mobile devices.

The NodeStack conference featured a fabulous talk by Bryan Cantrill (@bcantrill), Joyent’s SVP of Engineering, on why Joyent’s operating system, SmartOS, really makes a difference within the stack. His talk was so good that I decided to make my own contribution by translating it into this blog post. If debugging your Node.js apps in production sounds like a dream come true, read on!

SmartOS, the foundation of the NodeStack

Does the foundation really matter? It’s often very tempting to dismiss the foundation and concentrate on the appearance of things but, just as with buildings, the foundation is critical; it matters not so much when things are working as when things fail.

When your program fails, you need the foundation – the operating system – to really understand what happened. When your component has failed, it’s gone, and all that’s left is inside the operating system, like footprints of what used to be the component.

Sir Maurice Wilkes, the father of computing, built the first stored program computer back in 1949. The first programmer in history already realized that it wasn’t easy to get programs right.

[Screenshot: Maurice Wilkes on the difficulty of getting programs right]

SmartOS is Joyent’s open-source operating system. It is a derivative of illumos, the community-driven fork of OpenSolaris that was born when the project was made proprietary. It is backed by many former Sun Microsystems engineers, and it is built to be the operating system for the cloud. You may want to check out www.smartos.org for more information.

Debugging Node.js logic failures

First off, programs fail because of internal logic errors. A bug can cause them to die, exit improperly or end up in an infinite loop. To debug these kinds of failures you often need tight integration with the underlying OS.

A real use case

To give real examples, Bryan speaks from his own experience, since Joyent builds all the software that orchestrates its cloud using Node.js. In the past, Bryan and his team, including Ryan Dahl (the creator of Node.js) and Dave Pacheco (@dapsays), hit a black hole: a non-systematic infinite loop inside their application, right before deploying to production.

They were looking at the generated stack trace, which looked like this:

[Screenshot: stack trace with unreadable frames]

Obviously, you have no idea where you are in the code.

They eventually deployed the application to production and, even though they expected to see the bug immediately, they didn’t see it for months. Here lies the difference between an amateur and a professional: an amateur happily says “the bug just went away!”, while the professional knows the bug is still there and will strike at the weakest moment. In fact, the bug reappeared while a customer was watching a demo of the software.

Bryan and his team decided they would have to write something new to help debug the software.

mdb and v8.so

Historically, we have always looked at core dumps for post-mortem debugging. It’s a very old idea commonly used to debug operating systems, databases, web servers, etc. It is really great because it allows for asynchronous debugging: after a failure, you can restart your system immediately and debug it in parallel.

The problem is that this has not worked well in interpreted environments.

The challenge for Bryan and his team was to add support for post-mortem debugging of Node.js. Bryan thought it was basically impossible, because it implies being able to reconstruct the VM state. From the bottom of the system (the operating system), it is very difficult to determine what is happening further up the stack. This is reinforced by the fact that no one had done it satisfactorily so far: not Java, nor Python, Ruby, Erlang or PHP. Bryan thought it was an impossible problem to solve. Dave Pacheco proved him wrong.

Among the anecdotes in The Soul of a New Machine by Tracy Kidder, there is one about a college hire joining an engineering team. The senior engineers didn’t have time to look after him, so they gave him an impossible problem to solve (a simulator), just to make him kill some time. But he came back after a couple of months saying, “the simulator’s done”. To their surprise, the senior engineers realized they had never told him it was an impossible problem. He solved it because he didn’t know it was impossible.

In the same way, Dave Pacheco solved the problem of visualizing a stack trace in interpreted environments. The result is that we now have the v8.so dmod for mdb, which can be used to debug Node.js programs post-mortem.

Let’s take a look at how it works.

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~/dmod]# mdb corefile
> ::load v8.so
> ::jstack

After loading v8.so, the stack trace we have seen before looks like this, displaying all the actual JavaScript frames:

[Screenshot: ::jstack output with JavaScript frames pointing to heatmap.js]

Now it is definitely much easier to identify that the source of the problem is inside the heatmap.js file.

But Dave went one step further. With his dmod, we can also take an arbitrary object and see what the actual arguments are, printed out as JSON. Now, if you look at the following output, you will notice something a bit suspicious, considering that the pathology was an infinite loop.

[Screenshot: ::jsprint output showing the object’s arguments as JSON]

Note that “min” and “max” have the exact same value. heatmap.js shouldn’t be called with such parameter values but, at the same time, the function should be able to handle this situation without generating an infinite loop. Both the caller and the callee were fixed.

This is a concrete example of how to understand a production problem that couldn’t be debugged in any other way than with an effective post-mortem debugging tool.

Memory leaks

“Where is my memory going?” – with the broad adoption of Java in the mid 90s we saw the rise of garbage collection problems, and programmers still see their programs spending too much time doing GC. But that happens either because of actual garbage collection or because you’re not really generating any garbage at all. In the second case, it means you’ve got a semantic leak: a data structure you no longer care about still has a reference somewhere, so the GC can’t collect it. You end up focusing on GC as the cause of the problem when it’s just a symptom. And in JavaScript it’s very easy to keep implicit references that result in heap growth whose origin you can’t explain.

Walking memory to find the source of leaks is not an easy task, but Dave helped solve another impossible problem. Bryan and his team did it by scanning all of memory, looking for objects that satisfy all the constraints a proper JavaScript object should have.

The result is the ::findjsobjects mdb command, which scans the core file and prints out all the objects that are recognized and that can be visualized by piping their address into ::jsprint.

[Screenshot: ::findjsobjects output]

But to hunt down the source of our memory leak, we can go even further and print all the JavaScript objects that match a certain object property signature.

> fc431cd1::findjsobjects | ::jsprint

End of part 1. To be continued here.