So your system gets a failure calling a downstream dependency. What do you do? Should you retry? How often? Should you wait? How long?

Those are all good questions, but they often miss the point. If you stop there, 'interesting' system behaviors can result, often leading to overwhelmed systems and downtime, or to poor performance and unresponsiveness.

Starting with real-world analogies, we'll tease out principles that can be applied in almost any situation.