The BulkTracker Outage

I have been running the BulkTracker web app for keeping track of pkgsrc bulk package build results since about 2015.

After running without problems since the start (!!), the BulkTracker app had its first outage in November of 2021. It turns out that the function that renders the home page returns a 500 if it gets an error from Datastore. The error that was returned was:

rpc error: code = ResourceExhausted desc = Quota exceeded.

After complaining on Twitter, I fixed the handler to return at least a degraded home page in case of Datastore errors. But why was there an error in the first place? According to the “Quotas” page on the Google Cloud Console, there are no Datastore-related quotas that can be exhausted – essentially, you can use as much as you need as long as you pay for it. I fixed the handler to return at least a degraded home page in case of Datastore errors.

But why was there an error in the first place? I filed an issue (bsiegert/BulkTracker#26) and paused ingestion for the time being.

After several days, I ended up finding a daily spending limit for App Engine. This setting is deprecated and thus kind of hidden in the UI. It turns out that the storage costs were eating almost the entire daily budget! As soon as one build was ingested in a day, the spending limit would be reached, and Cloud Datastore just fails all requests at that point. Not a good failure mode, if you ask me!

After deleting the spending limit, I watched the situation for a day, then I tentatively re-enabled ingestion of new bulk builds. It turns out that solved it! I have since deleted a bunch of old builds, so that there is not as much stored data.

Unfortunately, I never implemented automatic expiration of results, so they have been accumulating ever since. I built a small tool that deletes the detailed data for the n oldest builds in the database, and I notice it would run fairly slowly: deleting a set of 20,000 records would take about 3 minutes. This seems like a lot. The next post in this series will talk about how I made this faster.

Background: Datastore

As an App Engine application, using the App Engine datastore (a NoSQL database) was a logical choice at the time. A SQL database may have been more appropriate for the workload, as I later found out, but the pricing for Google Cloud SQL is prohibitive for a small application.

Small side note: Through my job I learned how App Engine Datastore was actually implemented, and it was horrifying.

Over the years, App Engine Datastore became first Cloud Datastore – an offering independent from App Engine that can be used with any Cloud application – then something called “Cloud Firestore in Datastore Mode”. Firestore is the Firebase NoSQL database; its API is very different from the Datastore one, but there is a shim layer that provides the Datastore API on top of it. The migration was transparent and flawlessly done by the Google Cloud folks – kudos.

Some Statistics

Most of the space in the BulkTracker datastore is used by indices, since all queries must use an index. So in early December, before I started deleting old data, the storage contained:

data for 8800 bulk builds
the oldest build was from June 2015
total size of the data itself was about 30 GiB in 165,000,000 records
total index size was another whopping 300 GiB

The Refinery, an Analogy for Distributed Systems

under cloud personal

Back when I was in Engineering school, my first-year internship happened in a refinery. In retrospect, this turned out to be extremely relevant for my current job in tech.

The subject of my internship was the optimization of an existing process. The unit had been planned on paper by an engineer on another continent, installed according to specs, and it turned out to be … not working so well.

Don’t believe Tech is unique

A refinery is a distributed system. There are specs and basically internal contracts on each sub-unit regarding the quantity it should process per day, what the requirements for inputs and the desired output characteristics are. Instead of queries, the inputs and outputs are, you know, oil and gas.

There are continuous and batch processes. Just like in tech, the interface between these is the subject of a lot of literature and ops knowledge.

In tech, services have availability and latency SLOs. In a refinery, there are input and output SLOs (plus specs like purity, sulfur content, water content, etc.).

In tech, there are error budgets. In a refinery, you have emission budgets as a limiting factor. You may only send x amount of NOx or SOx or CO2 into the air over the course of the day. You may only go over the target value for n hours per month, otherwise the company pays a fine. The water that leaves the grounds may only be so-and-so polluted and have at most y degrees of temperature, otherwise there is another fine. And so on.

And just like in tech, contractors do the darndest things, although in tech, you rarely get a truckload of methanol dumped into your waste water stream.

Safety and Postmortem culture

In a tech company, we might say that prod is on fire when there is an incident, but in a refinery, things might literally be on fire or explode.

I quickly learned to be distrustful of recently renovated spaces, because that means it may have been blown up recently. My intern office was recently renovated; at some point, they showed me the photos of the incident that had all but destroyed the building the previous year.

The oil and gas industry has had a strong safety culture for years. The postmortem culture that SRE is so proud of had been a thing in the industry for years. Thus, when there was a large fire in a refinery in Texas City (the refinery was run by a competitor!), we got hold of a copy of the postmortem analysis, so that we could learn from it and avoid making the same mistakes. There was a meeting where we went through it together, discussed and acknowledged various lessons.

I am not sure how blameless these postmortems are. The problem is that in a real incident, people can get hurt or killed. At some point, it involves responsibility in a legal sense. The director of a unit may be responsible for incidents resulting in damages to others, including in a criminal court. Fortunately (?) for tech, incidents usually do not have this kind of real-world impact.

Dev/Ops Divide

Maybe even more than in tech, there is a divide between “dev” (as in, process engineers and planners) and “ops” (actual operator crews who are literally opening and closing giant valves with their two hands).

As a budding chemical engineer, I interned in process engineering. I quickly found out that many process engineers don’t actually bother finding out how the units they planned perform day to day. On the other hand, the most helpful thing I could do was to sit with the ops in the control room, drink a lot of coffee, and listen to their mental models of the unit. They could perfectly explain it in terms of “when I open this valve too much, the temperature over here jumps up and I have to open this other valve to compensate”. But they cannot tell you why. The trick of the good engineer is to take what you know about physics, thermodynamics, etc. and translate this into an understanding of what actually happens.

Often, what actually happens is not what the planning says. In all cases, there are variables that you did not take into account (outside temperature anyone?).

In addition, some real-world boundary conditions are hard to express in correlation matrices and linear programming models. For example: “If you open this valve too much for too long, the bit in the middle will clog with corrosive white slime deposits”. What kind of coefficient is that?

Your Monitoring is lying

Just like in tech, your monitoring is lying to you! You could have a miscalibrated sensor, but what is more likely is that the sensor is placed in a way that it does not show you the true picture.

In our case, there was a laser measurement in a gas conduit where the laser sometimes passed through the stream of flowing gas and sometimes it didn’t. The wildly fluctuating output has no basis in the actual gas composition. In another part of the unit, a sensor was placed in such a way that it was disturbing the flow, so it was actually modifying the behavior of the system.

My two most important variables (the composition of the feed stock and the composition of ammonia) were not even available as measurements. In a refinery power plant, you burn whatever is left over at that moment. For gas, that may be pure hydrogen, pure butane or anything in between. And don’t get me started on the disgusting sludge they call heavy gas-oil.

As for the concentration of ammonia in the ammonia feed (yes, really), I asked for a manual measurement so I could have at least one data point. They told me it was impossible. I asked why. The answer was “when we open the storage drum, it catches fire. The contents are pyrophoric.” Yay.

Dependencies

The recommendation at the end of my internship was to switch off the unit and to tear it down. My manager agreed but it turned out to be impossible because other units had come to depend on this process to consume their input or output feeds.

So for all I know, the piece of crap unit I tried to optimize in the early 2000s is still there, making the surrounding air just a tiny bit dirtier.

Conclusion

Do not believe that tech is different from other, existing disciplines of engineering. There are other industries that have worked on distributed systems. Many software engineers I know are interested in the aerospace industry for a similar reason, which usually means binging on plane crash documentaries. But throughout many different industries, you can find solutions to many of the same issues that you have with your distributed cloud microservice mesh, or whatever.

Pkgsrc Buildbots

under pkgsrc Cloud NetBSD

After talking to Sijmen Mulder on IRC (thanks, TGV Wi-Fi!), I began thinking more about how you could automate the pkgsrc release engineers away.

The basic idea for a buildbot would be this:

Download and unpack latest pkgsrc.tar.gz for the stable branch.
Run the pullup script with the ticket number, then run whatever pullup script it outputs.
Figure out the package that this concerns (perhaps from filenames).
Go to the package in question, install its dependencies from binary packages.
Build (make package is probably enough, or perhaps also install?).
Upload build log to Cloud Storage.
Post an email to the pullup thread with status and a link to the log.

For extra points, do this in a fresh, ephemeral VM, triggered by an incoming mail.

You would also need a buildbot supervisor that receives mails (to know that it should build something) and that launches the VM. I know that Google App Engine could do it, as it can receive emails. But maybe Cloud Functions would be the way to go?

In any case, this would be a cool project for someone, maybe myself :)

Issues with Pull-up Ticket Tracking

This project is largely orthogonal to improvements in the pullup script. Right now, there are a number of issues with it that make it require manual intervention in many cases:

The tracker (req) doesn’t do MIME, so sometimes mails are encoded with base64 or quoted-printale. This breaks parsing the commit mails.
Sometimes, submitters of tickets insert mail-index.netbsd.org URLs instead of copies of the message.
Some pullup tickets include a patch instead of, or in addition to, a list of commits. For instance, this may happen when backporting a fix to an older release instead of pulling up a bigger update.
Sometimes, commit messages are truncated, or there are merge conflicts. This mostly happens when there has been a revbump before the change that is to be committed – in the majority of cases, the merge conflicts only concern PKGREVISION lines.

I am wondering how much we could gain, e.g. in terms of MIME support, from changing the request tracking software. admins@ uses RT, which has more features. Perhaps that could be brought to pullup tickets?

New Blog!

under Cloud

My new year’s resolution for 2018 has been to blog more. So I decided to create an actual blog!

It started with me closing my Amazon AWS account and writing about it. The posting was up as a Gist on Github, and I shared that URL. This does give a useful viewer and the ability to add comments, but it is hardly discoverable. Anyway, the post was somewhat widely circulated (the provocative title certainly helped) and even made it to Reddit.

After deciding to create a real blog, I started looking into how to do that:

Thinking about creating an actual blog (again). Blogger, GitHub pages, write something myself or ...?
Any tips?
— Photos of Kernel Panics (@bentsukun) 31. Dezember 2017

Now I finally have something up and running. The address is simple enough, and based on my Twitter handle:

https://bentsukun.ch/

In true Open Source tradition, the source code for the site is visible online at https://github.com/bsiegert/blog.

For now, I am quite happy with this setup.

The setup

Hugo with the purehugo theme,
Firebase Hosting,
domain at HostTech.

The site itself is created with the excellent Hugo, which has the added advantage of being written in Go :) The installation is as easy as

go get github.com/gohugoio/hugo

or by using the www/hugo package in pkgsrc. Hugo ends up creating a fully static web site, so simple static hosting is enough.

Google Cloud Storage supports hosting a static web site – that’s great, right? Well, not quite. There is no support for HTTPS in plain GCS. If you want HTTPS, apparently you need:

The GCS bucket that holds the files,
A Cloud Load Balancer instance in front
Cloud CDN for delivering content from the edge.

Or, you can use Firebase Hosting, also by Google. This is what I am doing. A simple firebase deploy uploads all the static files, and an SSL certificate is included in the offer.

I tried using Google Domains to register the domain but that service is not available in Switzerland, unfortunately. hosttech.ch to the rescue! It took me a while to figure out that (a) a DNS zone for the domain is included and (b) how to add entries to it.

To get Firebase Hosting to serve directly on the domain, it is necessary to add a google-site-verification TXT record into the DNS for it. And that’s it.

Leaving AWS

under NetBSD Cloud

Today, I deleted my Amazon AWS account.

I had been on AWS since about 2011. My usage was mainly for two things:

Saving large amounts of files (build logs and such) on S3;
Running NetBSD VMs on EC2.

EC2 is based on Xen, and NetBSD runs really well in PV (paravirtualized) mode on Xen. However, XSA-240 means that a malicious PV guest may crash (or even otherwise exploit) the hypervisor, with the recommended fix being to not run untrusted PV guests. Over night, Amazon disabled PV, making NetBSD VMs useless.

In general, EC2 has been moving away from Xen. The newer instance types already no longer supported PV; there are two higher-performance paravirtualized modes (PVH and PVHVM) that are preferred these days, and that NetBSD does not support. The newest machine types use a custom hypervisor based on KVM.

The way the PV change was rolled out highlighted another long-standing EC2 problem: instances would continue running until the server they ran on got rebooted, at which point they were migrated to a random machine. If the target machine had PV disabled, the VM simply did not come up again. I have had the same type of issue in the past, where your VM randomly landed on a “good” or a “bad” machine and did not come up on the bad one. There is no way (AFAIK) to constrain to a certain subset of servers, e.g. running a certain hypervisor version.

Also, of course, there was no warning or announcement, just that VMs stopped working all of a sudden. A bunch of people were completely caught by surprise when their service became unavailable. I hope you have monitoring!?

The Alternative

Which brings me to where I did take my workloads: Google Cloud Platform.

(This has nothing to do whatsoever with who my employer is. I pay for my GCP usage with my own money.)

These days, NetBSD (8+) runs great on Google Compute Engine. There is a script (that I created) to stage instances at https://github.com/google/netbsd-gce, though there are no official NetBSD images around. My S3 usage works equally well using Google Cloud Storage. And I have always been a fan of App Engine, particularly because of its great Go support. https://bulktracker.appspot.com/ runs on App Engine.

Conclusion

My general impression is: Features are roughly on par, prices on GCP are a bit cheaper, and the Google Cloud SDK and command-line tools are better. So rather than let old, unusable VM images continue to rot and pay Amazon 2$ a month for that bit of storage, I let go of that AWS account. Bye, Amazon.

benzblog

08 Jan 2022, 17:12

Background: Datastore

Some Statistics

04 Mar 2021, 14:06

Don’t believe Tech is unique

Safety and Postmortem culture

Dev/Ops Divide

Your Monitoring is lying

Dependencies

Conclusion

04 Feb 2019, 17:15

Issues with Pull-up Ticket Tracking

18 Jan 2018, 19:44

The setup

03 Dec 2017, 22:16

And done!

The Alternative

Conclusion