Samantha Schaevitz was in the home stretch of a fellowship at Huridocs, a human rights nonprofit, when she got the call. Schaevitz works on site reliability engineering at Google; site reliability engineers are the ones who keep the ship steady when things get choppy. And by February of this year, as large portions of Asia shut down in an attempt to slow the spread of the novel coronavirus, Google Meet found itself taking on water. The company needed Schaevitz back at work.
Google launched Meet in 2017 as an enterprise-focused alternative to its Hangouts chat service. (Google has been steadily phasing out Hangouts and pushing users to Meet and Chat, part of its forever-muddled messaging platform strategy.) As the coronavirus spread and more countries issued stay-at-home orders, people flocked to video chat services for work and to check in on family and friends. Google saw Meet undergo 30-fold growth in the early months of the pandemic; soon enough, the service was hosting up to 100 million meeting participants each day. That’s a lot.
Amid all the profound changes people have made in response to Covid-19, the infrastructure that undergirds the internet experienced a shift in usage patterns, too, as people traded office hours for home isolation. The companies that handle these systems have mostly been able to manage users’ new needs. “You essentially took the peak and extended it over a far longer period of the day,” says Ben Treynor Sloss, Google vice president of engineering. “The usage went way up, but it was mostly that the use looked more like peak most of the day, rather than that the peaks went up dramatically.” Some services, though, saw usage spike well beyond normal.
Google prepares for emergencies on a regular basis through its disaster and incident response tests, or DIRT. In these exercises, around 10,000 employees at a time will simulate handling some sort of crisis, ranging from a localized natural disaster to a Godzilla attack. The Covid-19 pandemic, though, turned out to exceed even the company’s most dramatic scenarios.
“We had typically simulated a regional-level event,” says Treynor Sloss. “We’d never done DIRT for a global-level event, in part, if I’m being honest, because it didn’t seem likely.” There was also a practical concern: Convincingly mocking up an incident with worldwide impact would risk degrading the experience of actual Google users, a cardinal sin in the world of DIRT.
All of which meant that Schaevitz, who led the incident response for Google Meet, and the teams involved had to figure things out on the fly, especially as it became clear that they were taking on far more new users than even their most ambitious early projections had anticipated.
“In the beginning, we started planning for a doubling of our footprint, which is already huge. That’s not the normal growth curve. We soon realized that wasn’t going to be enough,” says Schaevitz. “We kept trying to make progress on building more runway … so that we would have time to figure out a solution if things would arise on a longer time horizon rather than just every day waking up and being like, what’s newly on fire today?”
Complicating the challenge was that the Google engineers involved in the response were themselves working from home, spread across four offices in three countries. “All the people who worked on this—and this is a large number of teams—even the people working on it in the same place have actually never been in the room together since this started,” says Schaevitz, who is based in Zurich, Switzerland. On a technical level that proved manageable enough; as you might imagine, Google prioritizes web-based tools that can be accessed from anywhere. But coordinating the 24-hour-a-day operation remotely required setting up redundancies for more than just bandwidth. In a blog post detailing the response, Schaevitz described how everyone in an incident response role was assigned a “standby,” basically an understudy who could step in if the principal got sick or had to take time away. (An especially prudent measure during a global health crisis.)