Scaled-up web software is hard!

The New York Times ran an article last Friday about the struggles people have had using all those vaccination-appointment web sites. The biggest complaint? The sites “crash” because they have too much traffic.

I work at a company that runs scaled-up web technology. We have a customer with a vast seasonal rush, with millions of users active at any time on rush days. We’ve been serving them for five years.

In each one of those five years, we and our customer have load-tested our stuff in preparation for that rush. We, and they, rent hundreds of servers for a few hours, and use them to hammer on our live systems. Every year this effort takes weeks of time and tens of thousands of dollars in server rental fees. The point is to “crash” our systems ahead of the rush. That way we can find and fix the weaknesses and bottlenecks. Every year we find and fix new weaknesses. We’re proud of the fact that the public doesn’t see the crashes.
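For readers who haven't seen load testing up close, here is a toy, single-machine sketch of the idea in Python. It is not our actual tooling. The names `run_load` and `fake_request` are made up for illustration, and a real test would swap `fake_request` for a genuine HTTP call and run from hundreds of rented machines rather than one laptop.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(request_fn, total_requests, concurrency):
    """Call request_fn total_requests times using `concurrency` worker threads.

    Returns counts of successes and failures plus the slowest observed
    latency in seconds. An exception from request_fn counts as a failure.
    """
    def timed_call(_):
        start = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    # The pool caps how many requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    successes = sum(1 for ok, _ in results if ok)
    return {
        "requests": len(results),
        "successes": successes,
        "failures": len(results) - successes,
        "max_latency_s": max((t for _, t in results), default=0.0),
    }


def fake_request():
    # Hypothetical stand-in for a real request against the system under test.
    time.sleep(0.001)


if __name__ == "__main__":
    print(run_load(fake_request, total_requests=200, concurrency=20))
```

In practice you keep raising the concurrency until failures appear or latency climbs, and that tells you where to go looking for the bottleneck.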

Government web sites, especially ones related to public health, have a harder problem to solve than we do. They have to be developed and rolled out in a hurry, so everybody can use them. They get day-one workloads far larger than our rush-day workloads. Now of course, their developers could take a month or two to load-test them and find their weaknesses. But you try telling the governor that the vaccine-reservation web site won’t be ready for another two months. I’m not telling her that.

Another thing is obvious. Throwing money at these scaled-up web site problems doesn’t help make them work better at first rollout. Deloitte and Microsoft charge a lot for this kind of work, but they can’t do it much faster than anybody else. Quick and dirty fixes like waiting rooms only introduce more complexity and more potential bottlenecks.

My point: the rollout of vaccination reservation web sites is not a colossal clusterf**k. It’s more-or-less normal. These huge-load web sites are like talking donkeys: we had best not complain that they work badly, but instead marvel that they work at all. Let’s have some empathy for the hardworking people deploying this stuff.

I do have one hindsight regret: Amazon Web Services should have started work on a reservation system the minute it became obvious that reservations would be needed. They should have done it in open source and free of charge for the whole world, and put their best people on it for the duration. Years of Black Friday e-commerce have given them deep experience with this sort of thing. They also have the necessary equipment.
