When the sh!t hits the fan
WebOp's disaster survival guide
Avishai Ish-Shalom (@nukemberg)
"But it's been up for 2983 days"
"It hasn't failed yet"
"Vendor MTBF is a gazillion years"
All systems break eventually
Incidents are also an opportunity
Cut the B.S., We want uptime!
Just call Rambo Ops when something breaks
Reliability is a long-term effort
Be realistic
- Relax business requirements and goals
- Prove your assumptions and definitions
- Simplify the system
Practice and plan
- Production visibility
- Improve rarely used skills
- Decision lines
Simulate disasters
Systems behave differently under stress
- Different operation procedures
- Some bugs/behaviours manifest only in abnormal conditions
- Faulty fallback syndrome
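A minimal sketch of a disaster simulation, reading "faulty fallback syndrome" as fallback paths that rarely run and therefore break exactly when they are needed (that reading is an assumption). The names `fetch_with_fallback`, `broken_primary` and `cached_copy` are hypothetical. The idea is to force the primary path to fail on purpose and check that the fallback really works:

```python
def fetch_with_fallback(primary, fallback):
    """Try the primary source; on any failure, use the fallback."""
    try:
        return primary()
    except Exception:
        return fallback()


def broken_primary():
    # Simulated outage: the primary dependency is deliberately broken.
    raise ConnectionError("simulated outage")


def cached_copy():
    # Hypothetical degraded answer served while the primary is down.
    return "stale-but-usable"


# Exercise the fallback path on purpose instead of waiting for a real outage.
result = fetch_with_fallback(broken_primary, cached_copy)
```

Running a check like this regularly keeps the fallback from rotting unnoticed.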
Improve your processes
- Common is better
- Independent is better
- Contained is better
- Simple is better
Your system is only as reliable as its weakest part
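A back-of-the-envelope sketch of why the weakest part dominates, assuming the system needs every part up and that parts fail independently (both are simplifying assumptions). The availability figures are made up for illustration, not measurements:

```python
from math import prod


def series_availability(parts):
    """Availability of a chain that needs every part up (independent failures assumed)."""
    return prod(parts)


# Hypothetical numbers: two solid components and one weak one.
parts = [0.9999, 0.9999, 0.99]
total = series_availability(parts)

# The chain can never be better than its weakest part.
weakest = min(parts)
downtime_hours_per_year = (1 - total) * 365 * 24
```

Under these assumptions, adding more nines to the strong components barely moves `total`; only fixing the weak one does.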
Don't ignore the human factor
- Standardize
- Consolidate
- Redundancy includes people
- Manage team fatigue and wear
It ain't paranoia when they are really out to get you
3AM, Nagios critical
Assign an incident manager
- Take the pressure off
- Watch the clock and decision lines
- Communicate and keep information flowing
- Maintain an incident log
- Prepare the handoff
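One possible shape for an incident log, as a rough sketch only: timestamped notes kept in one place and rendered in time order for the handoff. The class `IncidentLog` and its methods `note` and `handoff` are hypothetical names, not an existing tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    title: str
    entries: list = field(default_factory=list)

    def note(self, who, what, when=None):
        """Record who did or saw what; defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, who, what))

    def handoff(self):
        """Render the log in chronological order for the next person on call."""
        lines = [f"Incident: {self.title}"]
        for when, who, what in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{when:%Y-%m-%d %H:%M}Z {who}: {what}")
        return "\n".join(lines)
```

The incident manager would call `note(...)` for every observation and decision, then paste `handoff()` into the channel when passing the incident on.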
Never act without evidence
Assumptions are the mother of all f*ckups
Collect data
Data is needed for debugging, simulation and the post mortem
Communicate!
- Chat services
- Log the chat
- No invitation needed
Dude, what just happened?
Post Mortem
Not "root cause": contributing factors
It's either my fault or no one's fault
Reliability is a loan shark
Technical debt can be dangerous
Reliability is about trust