When the sh!t hits the fan

WebOp's disaster survival guide

Avishai Ish-Shalom (@nukemberg)

"but it's been up for 2983 days"

"It hasn't failed yet"

"Vendor MTBF is gazillion years"

All systems break eventually

Downtime is bad, but...

Incidents are also an opportunity

Cut the B.S., we want uptime!

Just call Rambo Ops when something breaks

Reliability is a long term effort


Be realistic

  • Relax business requirements and goals
  • Prove your assumptions and definitions
  • Simplify the system

Practice and plan

  • Production visibility
  • Improve rarely used skills
  • Decision lines

Simulate disasters

Systems behave differently under stress

  • Different operation procedures
  • Some bugs/behaviours manifest only in abnormal conditions
  • Faulty fallback syndrome

Improve your processes

  • Common is better
  • Independent is better
  • Contained is better
  • Simple is better

Your system is only as reliable as its weakest part
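A rough way to see this: if every part must be up for the system to be up, and failures are assumed independent, the availabilities multiply, so the whole can never beat its weakest part. The figures below are made-up illustrations, not measurements.

```python
from math import prod


def series_availability(parts):
    """Availability of a chain in which every part must be up.

    Assumes independent failures, which real systems rarely honour,
    so treat the result as an optimistic upper bound.
    """
    return prod(parts)


# Hypothetical availabilities: load balancer, app tier, database.
parts = [0.999, 0.9995, 0.99]
total = series_availability(parts)

print(f"weakest part: {min(parts):.4f}, whole system: {total:.4f}")
```

The combined figure comes out slightly below the weakest part, which is why polishing an already strong component buys very little.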

Don't ignore the human factor

  • Standardize
  • Consolidate
  • Redundancy includes people
  • Manage team fatigue and wear

It ain't paranoia when they are really out to get you

Meanwhile, in real life

3AM, Nagios critical

Don't panic

Assign an incident manager

  • Take the pressure off
  • Watch the clock and decision lines
  • Communicate and keep information flowing
  • Maintain an incident log
  • Prepare the handoff
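A minimal sketch of what an incident manager's log could look like, with a decision line attached to the clock. The class name, fields and the 30 minute threshold are all hypothetical, chosen only for illustration; a shared document or chat bot can do the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentLog:
    started: datetime
    # Hypothetical decision line, e.g. "fail over if not fixed within 30 minutes".
    decision_after: timedelta
    entries: list = field(default_factory=list)

    def record(self, who, what, when=None):
        """Append a timestamped entry: who did or saw what."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, who, what))

    def decision_due(self, now=None):
        """True once the incident has run past the decision line."""
        now = now or datetime.now(timezone.utc)
        return now - self.started >= self.decision_after

    def handoff(self):
        """Plain-text summary for whoever takes over."""
        return "\n".join(
            f"{when:%H:%M} {who}: {what}" for when, who, what in self.entries
        )
```

Usage would be along the lines of `log.record("alice", "restarted web01")` after every action, with the manager checking `log.decision_due()` instead of relying on gut feeling about how long things have been broken.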

Never act without evidence

Assumptions are the mother of all f*ckups

Collect data

Data is needed for debugging, simulating and the post mortem
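One way to make collection cheap is to snapshot the state of a box before anyone touches it. The sketch below is an assumption about how that might be scripted, not a prescribed tool; the function name and the command list are hypothetical and should be swapped for whatever matters in your environment.

```python
import subprocess
import sys
from pathlib import Path


def snapshot(commands, out_dir):
    """Run each command and save its output as out_dir/<name>.txt.

    A failing or missing command is recorded rather than raised,
    so one broken tool does not stop the rest of the collection.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=10
            )
            body = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            body = f"failed to run {argv}: {exc}\n"
        path = out_dir / f"{name}.txt"
        path.write_text(body)
        saved.append(path)
    return saved


# Hypothetical command set for a Linux host; adjust to taste.
EXAMPLE_COMMANDS = {
    "uptime": ["uptime"],
    "disk": ["df", "-h"],
}
```

Calling `snapshot(EXAMPLE_COMMANDS, "incident-snapshot")` before the first restart keeps evidence that would otherwise be destroyed by the fix.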

Do one thing at a time

Communicate!

  • Chat services
  • Log the chat
  • No invitation needed

Dude, what just happened?

Post Mortem

Not "root cause": contributing factors

It's either my fault or no-one's fault

Reliability is a loan shark

Technical debt can be dangerous

Reliability is about trust

Peace and uptime to all