20 Jun 2010

Amazon: Distributed Architecture, Distributed Teams

I love this piece on Amazon from Leading Lean Software Development, by Mary and Tom Poppendieck...

“Amazon.com is all about growth. Back in the late 1990s, when the company was getting started, it didn’t worry much about scale. So Amazon’s systems grew into a massive front end and a massive back end – an architecture that was considered best practice at the time. But every holiday season, things nearly fell off a cliff. Around the year 2000, the company realized that it had hit a wall. Something different would have to be done.

So Amazon changed its architecture. There is no such thing, really, as a database anymore; there are only services that encapsulate both data and business logic. There is no direct database access from outside a service, and there’s no data sharing among the services. There are hundreds of services, and many application servers that aggregate data from services.

In order for this to work (=scale), services had to be decomposed into small, autonomous building blocks. Much like the Internet, there is no central control to fail and services run locally. They make decisions based solely on local information so they can keep on running no matter what is going on in the overall system. The reason for this distributed architecture is simple – CTO Werner Vogels says that ‘If you need to do something under high load with failures occurring and you need to reach agreement - you’re lost.’

And guess what. Because the architecture was distributed, the organization could also be distributed. Amazon found that each service could have its own autonomous team that does it all – customer interaction, decisions on what to develop, choice of tools, programming, testing, deployment, operations, support. Everything. There are no handovers. Services interact with other services through well-documented interfaces, such as an SLA with agreed-upon demand levels.

The size of a service team is no more people than can be fed with 2 pizzas – maybe eight people. Amazon.com has many, many 2-pizza teams, each completely owning a service, cradle to grave. If an architectural feature is too big for a 2-pizza team, Amazon’s bias is to break the feature into smaller pieces, because effectiveness of the team is as important as architectural consistency. Teams stay together over time – a minimum of two years is expected – and own long-term responsibility for everything about their service.

It took several years for Amazon.com to evolve to the new architecture and learn how to make it work. Dependencies still exist, of course, and the company has developed some well-honed dependency management tools. Configuration management is a challenge, as is testing. But these challenges are creating opportunities for novel solutions. By and large, the low-dependency architecture – both technical and organizational – works amazingly well.”
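
To make the architectural idea concrete, here is a minimal Python sketch of my own. It is not Amazon's code and it is not from the book; the names InventoryService, PricingService, ServiceUnavailable and product_page are all hypothetical. It only illustrates the two points the excerpt makes: each service keeps its data private behind its own interface, and an application server aggregates data from several services while deciding locally what to do when one of them cannot answer.

```python
from dataclasses import dataclass, field
from typing import Optional


class ServiceUnavailable(Exception):
    """Raised when a (hypothetical) service cannot answer a request."""


@dataclass
class InventoryService:
    """Owns stock data; callers only ever go through its methods."""

    # Private store: nothing outside this class reads or writes it directly.
    _stock: dict = field(default_factory=dict)
    # Toggle used here only to simulate an outage.
    available: bool = True

    def add_stock(self, sku: str, quantity: int) -> None:
        # Business logic lives next to the data it protects.
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._stock[sku] = self._stock.get(sku, 0) + quantity

    def in_stock(self, sku: str) -> bool:
        if not self.available:
            raise ServiceUnavailable("inventory")
        return self._stock.get(sku, 0) > 0


@dataclass
class PricingService:
    """Owns price data; shares nothing with InventoryService."""

    _prices_cents: dict = field(default_factory=dict)

    def set_price(self, sku: str, cents: int) -> None:
        if cents < 0:
            raise ValueError("price cannot be negative")
        self._prices_cents[sku] = cents

    def price_cents(self, sku: str) -> int:
        return self._prices_cents[sku]


def product_page(sku: str, inventory: InventoryService,
                 pricing: PricingService) -> dict:
    """Application-server role: aggregate data from services.

    The page is built from local information only. If the inventory
    service is down, the page is still rendered, with availability
    marked as unknown (None), rather than waiting for agreement.
    """
    in_stock: Optional[bool]
    try:
        in_stock = inventory.in_stock(sku)
    except ServiceUnavailable:
        in_stock = None
    return {
        "sku": sku,
        "price_cents": pricing.price_cents(sku),
        "in_stock": in_stock,
    }


if __name__ == "__main__":
    inventory = InventoryService()
    pricing = PricingService()
    inventory.add_stock("book-1", 3)
    pricing.set_price("book-1", 1999)

    print(product_page("book-1", inventory, pricing))

    inventory.available = False  # simulate an inventory outage
    print(product_page("book-1", inventory, pricing))
```

In a real system each class would be a separately deployed service behind a network interface, owned by its own team; the in-process classes here only stand in for that boundary.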