March 23, 2022
How a handful of Airmen brought DevOps to USAF, then used it to save more than 123,000 lives.
Kessel Run was founded on a hypothesis that by bringing commercial DevOps practices to the Air Force, warfighters could get better software, faster, for less money. Our experience over the last four years has borne that out. Our critical change was putting the developers, the security operations, and IT operations together in a single team to make a DevSecOps [development, security, and operations] unit. Traditionally, dev responsibilities fall within Air Force Materiel Command and IT operations within the operational MAJCOMs; this split slows down real-time delivery. In August 2021, Kessel Run showed that developing software solutions in real time has real-world impact and helped save the lives of more than 123,000 people. We have the results from the experiment, and the data show clearly that the Kessel Run model, the DevSecOps unit, should be the standard for Air Force software factories.
It’s Aug. 24, 2021, and tension is in the air inside the Combined Air and Space Operations Center (CAOC) in Al Udeid Air Base, Qatar. There’s only a week left before the Taliban deadline to evacuate the remaining Americans, interpreters, and others who had helped the United States in Afghanistan over the two decades since 9/11. Many are young enough to only have vague memories of what life was like prior to the Americans’ arrival in 2001. It’s here, at the CAOC, where the airlift is being planned and managed.
The situation in Afghanistan is similarly tense. Around 6,500 people are at the airport waiting for a flight out of the country. A week earlier, desperation led people to chase after or hide in the wheel wells of departing C-17s. Horrifically, some fell to their deaths when the aircraft took off. Now, the crowds are drawing the attention of ISIS-K, which is planning an attack that will come only two days from now. The need for an orderly plan to evacuate as many people as possible is clear.
Back at the CAOC, the team of air planners is trying to use Kessel Run’s software to plan the missions that will ferry people out of Kabul on planes from the United States and many other countries. A proper plan is critical because the air traffic control at Hamid Karzai International Airport (HKIA) is not used to this level of traffic, so the planners must space out the arrivals and departures into very precise time slots.
However, the software isn’t working.
The team is going to the Slapshot website, but it’s not loading. This was a known risk. As a development Minimum Viable Product (MVP), Kessel Run’s applications were designed to accommodate the number of missions in steady-state operations. The changes needed to scale to over five times as many missions were in the backlog, but had been deferred for higher-priority work until next summer. Now, the evacuation has driven the mission count up 10 times in a matter of days. If this evacuation fails, thousands of people will be stranded as the Taliban take control of Kabul and the rest of Afghanistan. It’s a literal life-and-death situation.
In the midst of the flurry of action on the CAOC ops floor, a young government civilian calmly goes to his computer at the back of the room and submits a message to a team in the United States. It’s a similar message to others the team has sent before, but this time the stakes are much higher.
“We are experiencing intermittent loading issues with Slapshot. The exercise theater does not load. We need to call an outage so that we can fix the issue,” he said.
Back to October 2016—the beginning
Eric Schmidt, then executive chairman of Alphabet, Inc., Google’s parent company, served as a member of the Defense Innovation Board (DIB). The board worked to find ways innovation could address future challenges to DOD. As part of that work, the board went to the same Operations Center in Al Udeid where this story began.
The whiteboard on which tanker refueling operations were planned was like a game of Tetris. While one teammate entered data into an Excel spreadsheet, another moved magnetic pucks and laminated cards around the whiteboard. Kessel Run converted the manual system into an automated software program. USAF/courtesy
Famously, he saw Airmen planning refueling missions on a whiteboard with tape grids, magnetic pucks, and dry-erase marker lines connecting the pucks together to define the plan.
When he later asked the Air Operations Center (AOC) commander what his biggest concern was, the commander said: “Well, frankly, … I don’t want them to erase my whiteboard.”
Shocked that an eraser posed one of the biggest threats to the air war supporting operations across Iraq and Afghanistan, the board members pressed the team about why better tools didn’t exist. Schmidt asked if they even had modern software. He was told, “Yes, but it doesn’t work.” A failed modernization effort had left the AOC with nearly the same system that was originally developed in the 1990s, 20 years and several software lifetimes ago.
This effort to modernize the AOC 10.1 software was called the AOC 10.2 program. The Capability Development Document of 2006 formalized key functional requirements for the AOC and in that same year Lockheed Martin was awarded a $589 million contract in order to “standardize, modernize, sustain, and transform” the AOCs. Under traditional acquisitions, this was pre-Milestone B “risk-reduction” activity. In 2013, Northrop Grumman won the development award and began work on the 10.2 program. By the fall of 2016, Northrop was already three years behind schedule and estimated development costs had ballooned from $374 million to $745 million. It was a decade after the requirements were identified and no code was delivered to the field for use. This is the scene that the Defense Innovation Board walked into when they visited the CAOC.
Raj Shah, a managing partner at the Defense Innovation Unit (DIU) at the time, was with the Defense Innovation Board and called Col. Enrique Oti, an officer at DIU, that night to say he would commit $1 million of DIUx’s money. The goal was to get a new tanker planning tool and demonstrate that DevOps can deliver solutions faster than traditional waterfall and Joint Capabilities Integration and Development System (JCIDS) development.
Capt. Gary Olkowski demonstrates “Jigsaw,” the digital tanker planning tool built for the Combined Air Operations Center at Al Udeid Air Base, Qatar. Developed in just four months, Jigsaw paid for itself within six months of being deployed. USAF
The Air Force team, led by Oti, sat side by side with Airmen in the CAOC to design a warfighter-friendly tool. The resulting Jigsaw Tanker Planning Software turned an eight-hour task for six people into a three-hour activity. By April 2017, four months after work had started, the tanker planning tool was in use in Qatar.
Within six months, the Jigsaw application had essentially paid for itself. The efficiency it had created saved 400,000 to 500,000 pounds of fuel each week and required one less refueling aircraft. This saved the Air Force $750,000 to $1 million every week.
Remember, Raj only spent $1 million on the entire effort.
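As a rough back-of-the-envelope check on those figures, the short Python sketch below divides the $1 million investment by the reported weekly savings. The names are illustrative, and it assumes the savings accrued at a steady weekly rate, which the figures above do not state.

```python
# Back-of-the-envelope payback estimate from the figures quoted above.
# Assumption (not from the source): savings accrue at a constant weekly rate.
INVESTMENT_USD = 1_000_000
WEEKLY_SAVINGS_USD = (750_000, 1_000_000)  # reported low and high estimates


def payback_weeks(investment: float, weekly_savings: float) -> float:
    """Weeks of savings needed to recover the investment."""
    return investment / weekly_savings


slowest = payback_weeks(INVESTMENT_USD, WEEKLY_SAVINGS_USD[0])
fastest = payback_weeks(INVESTMENT_USD, WEEKLY_SAVINGS_USD[1])
print(f"Payback between {fastest:.1f} and {slowest:.1f} weeks")
```

Under that assumption, the tool would recover its cost in under two weeks of full operation, which makes "within six months" a conservative statement.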
It’s no wonder that the 10.2 program received a stop-work order on April 19, 2017, and was terminated in July. The Air Force needed to do things differently to avoid the same outcomes. There was a team—also an experiment—that could lead that new approach. This new team of coders was going to build on Jigsaw’s success and modernize all of the AOC’s software.
Jigsaw was combined with the Targeting and GEOINT program office (also using DevOps to modernize its tools, led by Capt. Bryon Kroger) and the government team that was sustaining the AOC 10.1 system within PEO Digital.
That team was named “Kessel Run,” both as an homage to the “Star Wars” smuggler Han Solo, who could bring outside things (like DevOps) to a bunch of rebels, and because Han made the Kessel Run in 12 parsecs, a hyperspace distance shorter than anyone else had done before. That was our mission: Shorten the time and distance it took to get to our destination.
Here are some of the ways that we close the gap.
Jigsaw began with our dev-teams sitting down with the users in the AOC to understand their value chain, and how they could be more effective and productive. We have continuous user interviews and track customer satisfaction scores. Getting software into users’ hands as soon as possible has led to our users coming up with new use cases. Today, we have our team members embedded at the 609th CAOC every day.
As we add new applications or features to our scope, we start with a discovery and framing session with our users. We don’t turn to requirements documents first or trust that documents written in 2006 represent the world as it is today. Instead, we work with the users to scope an MVP and then begin iterations of the build-test-learn cycle.
We map out their processes so we understand what the users need.
After that, we start designing the solution. We co-design with the users and start to map the data flows so we can see the interdependencies between applications and workflows.
Finally, the goal isn’t to provide a mock-up or a prototype, but to build an MVP that users can test and use to support operations. Having users test thin slices of the ultimate system starts the build-test-learn cycle and gives us constant feedback on our software to continuously learn what is working and where the gaps are.
This is very different from traditional acquisitions where only finalized systems are made available to users. It can create challenges since we release versions that we know don’t meet the entire set of needs, but can provide value that can grow over time. For example, our Kessel Run All Domain Operations Suite (KRADOS) in August had known issues around scaling for major operations.
More on that later.
Continuous delivery, in practice, means changes to the software are happening on a regular basis, multiple times a day, eventually adding up to major changes over time.
For example, Jigsaw has been used for every air refueling mission in the CAOC since December 2019 as a stand-alone application. Slapshot, the tool for planning the rest of the air missions, has also been used for every mission at the CAOC since December 2019. Again, it was used as a stand-alone application for over a year because we didn’t have the connection to a common data layer built yet. However, the integrated suite of 10 applications with a common data layer was released as an MVP in January 2021. It was accepted and has been used for planning the Master Air Attack Plan at the CAOC since May 2021.
Let’s dive a little deeper on how we make those changes to production software.
Continuous Innovation and Delivery
At Kessel Run, we have a different challenge from commercial software-as-a-service providers. We don’t have a single internet that we’re deploying to. We have 10 different environments because of users on unclassified, secret, and top secret networks—and the different variants of those networks for different coalition partners.
In order to manage deploying software to these regions, maintain version control, and reduce human touch points in the deployment process, we rely on automated continuous integration and continuous deployment pipelines.
That starts with our developer pipeline, which takes the application code from the developers’ workstations and puts it into the GitLab repository where we maintain our code. When the dev-team thinks the changes are ready to deploy to our staging environment, they push the code through the CI pipeline along with a deployment manifest. The security release pipeline is part of this release, which includes code scanning, vetting dependencies, and putting the artifacts into our Nexus repository. Once there, they are available in the staging environment for testing and verifying integration with other applications and services.
When those changes are ready to be promoted to production, the immutable images are moved from Nexus through our purpose-built deployment manager (RADD) into the production environments. Which Continuous Deployment pipeline we use depends on whether the deployment is going to our AWS unclassified cloud, on-prem Secret, or Top Secret cloud.
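The promotion flow described above can be sketched in a few lines of Python. This is an illustration only, not Kessel Run's pipeline: the `Artifact` class and the `scan`, `publish`, and `promote` functions are hypothetical, and the in-memory dictionaries merely stand in for the artifact repository and the environments. The point it shows is that the artifact reaching production is byte-for-byte the one that was scanned and staged.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the artifact cannot be modified after it is built
class Artifact:
    name: str
    content: bytes

    @property
    def digest(self) -> str:
        """SHA-256 of the content, used to identify the exact build."""
        return hashlib.sha256(self.content).hexdigest()


def scan(artifact: Artifact, banned: tuple[bytes, ...] = (b"PASSWORD=",)) -> bool:
    """Stand-in for the security checks: reject content containing a banned marker."""
    return not any(marker in artifact.content for marker in banned)


def publish(artifact: Artifact, repository: dict[str, Artifact]) -> str:
    """Store a scanned artifact in the repository, keyed by its digest."""
    if not scan(artifact):
        raise ValueError(f"{artifact.name} failed the security scan")
    repository[artifact.digest] = artifact
    return artifact.digest


def promote(digest: str, repository: dict[str, Artifact], environment: dict[str, str]) -> None:
    """Deploy the stored artifact to an environment without rebuilding it."""
    artifact = repository[digest]
    if artifact.digest != digest:  # guard against a tampered repository entry
        raise ValueError("artifact does not match its recorded digest")
    environment[artifact.name] = digest


# Hypothetical stand-ins for the artifact repository and two environments.
repository: dict[str, Artifact] = {}
staging: dict[str, str] = {}
production: dict[str, str] = {}

build = Artifact("example-app", b"print('hello')")
digest = publish(build, repository)
promote(digest, repository, staging)      # test and verify integration here
promote(digest, repository, production)   # same digest, same bytes
```

Keying everything on a content digest is one common way to make "immutable" checkable: staging and production can be compared by digest alone, with no rebuild between them.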
We use these pipelines multiple times each day. On average, we deploy code through a deployment pipeline once every 3.3 hours. From the time the dev-team is ready to deploy, on average, it is only eight hours before the changes are available in production environments. Much of that time is spent moving artifacts from unclassified networks up to classified, which still requires burning CDs and rescanning on both sides of the air gap. We hope to have a cross-domain diode that will take the human touch point out of the process. That should speed the deployment times further and help us get to the self-service deployment using full automation.
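To put the cadence figures in daily terms, the following Python sketch converts the averages quoted above. Only the 3.3-hour interval and the eight-hour lead time come from the text; the names are illustrative, and the results assume deployments continue at the same average rate around the clock (a steady-state assumption that the text does not make).

```python
HOURS_PER_DAY = 24
AVG_HOURS_BETWEEN_DEPLOYS = 3.3   # from the text
AVG_LEAD_TIME_HOURS = 8           # ready-to-deploy until live in production


def deploys_per_day(hours_between: float) -> float:
    """Average number of deployments in a 24-hour period."""
    return HOURS_PER_DAY / hours_between


rate = deploys_per_day(AVG_HOURS_BETWEEN_DEPLOYS)

# Little's law: items in a system = arrival rate x time each item spends in it.
in_flight = AVG_LEAD_TIME_HOURS / AVG_HOURS_BETWEEN_DEPLOYS

print(f"{rate:.1f} deployments per day")
print(f"about {in_flight:.1f} changes in the pipeline at any moment")
```

On those averages, roughly seven deployments land each day, with two or three changes typically somewhere between "ready" and "in production" at once.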
Focus on applications in-production
While many teams see the job as finished when code gets into production, we see that the job is only partially done. While we haven’t yet established service level objectives, or agreements with our users, a point of pride for Kessel Run is our ability to service apps in production, respond to issues, and never have the same outage twice.
Our teams provide security support and monitor applications to ensure they are available. When we have an issue, on average we have it resolved in under 120 minutes. After every outage, we conduct a no-fault retro to identify root causes and assign fixes to the backlog.
That process begins with a report in our Mattermost channel for outages. That brings us back to Aug. 24, 2021, when our liaison officer at the 609th submitted the outage report.
DevOps in Practice
It was 2:49 a.m. in Boston. Remember that we knew the production version of our applications couldn’t handle the growth in mission counts that we saw in the evacuation effort. Now the software was being asked to do exactly that.
Meanwhile, the crowds outside HKIA grew and the deadline to get everyone out wasn’t going to change just because we had an outage in production.
Our platform team noticed spikes in latency seven minutes after the call was initiated. Along with the LNOs, the on-site platform team started collecting data on the classified network to help pinpoint the problem when the dev-teams in Boston got into the office.
The evacuation from Hamid Karzai International Airport, Kabul, Afghanistan, rescued more than 123,000 people. On-the-fly software updates made that possible. Sgt. Isaiah Campbell/USMC
At 4:03 a.m. in Boston, the outage team began the response and began to work with the product manager to determine potential fixes. At 7:09 a.m., the dev-teams joined the outage call and confirmed the root cause. There was a setback at 8:28 a.m. when the apps completely crashed and the LNOs notified the center’s operations floor.
Still, only 12 minutes later, the platform team had cleared the bin files that had taken up all available disk space after the app lost connection to the SQL database. By 9:07 a.m., the team had doubled the number of compute instances available to Slapshot. At 10:25 a.m., the development team added a “theaters” feature to the production version of Slapshot that cut the number of missions displayed into smaller chunks.
That afternoon, at 1:44 p.m., additional compute instances were shifted to the 609th Slapshot, and it looked like the issues had been mitigated. At 4:06 p.m., our liaison officers confirmed with users in the 609th that the issue was resolved and got feedback on the new theater feature. They had positive feedback and the outage call ended.
The call ended only 12 hours and 3 minutes after the product manager was woken up at 4 a.m. to start the call. In those 12 hours, the team was able to shift compute and storage resources to United States Central Command’s apps to improve performance, fix the SQL database connection errors, clean out the bin files, and add new features to help slice the data and improve load times. Our dev-teams and IT ops teams worked together, from Boston and in Qatar, to identify the issues, propose solutions, and implement them in a single day.
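The timeline can be checked with a short Python sketch. The timestamps are the Boston times given above for Aug. 24, 2021; the dictionary keys are informal labels introduced here, not Kessel Run terminology.

```python
from datetime import datetime

DAY = (2021, 8, 24)  # Aug. 24, 2021, Boston local time

# Event labels are informal; the times are the ones quoted in the narrative.
events = {
    "outage reported": datetime(*DAY, 2, 49),
    "outage call started": datetime(*DAY, 4, 3),
    "root cause confirmed": datetime(*DAY, 7, 9),
    "apps crashed": datetime(*DAY, 8, 28),
    "compute doubled": datetime(*DAY, 9, 7),
    "theaters feature deployed": datetime(*DAY, 10, 25),
    "more compute shifted": datetime(*DAY, 13, 44),
    "resolution confirmed": datetime(*DAY, 16, 6),
}

# Length of the outage call itself.
call_length = events["resolution confirmed"] - events["outage call started"]
hours, remainder = divmod(int(call_length.total_seconds()), 3600)
print(f"Outage call: {hours} hours {remainder // 60} minutes")

# Total time from the first report to confirmed resolution.
since_report = events["resolution confirmed"] - events["outage reported"]
print(f"Report to resolution: {since_report}")
```

The first figure reproduces the 12 hours and 3 minutes quoted for the call; measured from the initial 2:49 a.m. report, the elapsed time is a little over 13 hours.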
The airlift was able to continue.
USAF set a record on Aug. 15, 2021, when it safely transported 823 Afghan refugees on a single C-17 flight. USAF/courtesy
Interpreters who had helped Americans and our partners were moved to safety. Women and girls fearful of a life under the Taliban were brought to safety where they could pursue their dreams. All American forces were out of Afghanistan by the Aug. 31 deadline.
To me, those 12 hours are the defining moment for Kessel Run. What started with Eric Schmidt’s disbelief in how planning was being done five years ago became an experiment to show a government-led DevOps team could deliver software better than traditional government acquisition. For comparison, five years is the same time it took 10.2 to go through “risk-reduction” and start a development contract. Since the DIB visited the CAOC, our software has been in users’ hands for all but the first six months of those five years. We add new features every week and have moved from stand-alone apps to an integrated suite. Kessel Run has shown that the full promise of DevOps is not something to see in the future; it’s happening now.
When lives depended on us, when the world challenged us, our DevOps Team delivered the software solutions our warfighters needed. In doing so, we demonstrated why DevOps—why the Kessel Run model—is an imperative for the Air Force.
Col. Brian Beachkofski is commander of the Air Force Life Cycle Management Center’s Detachment 12, an agile software development lab known as “Kessel Run.”