Instrumentation, Observability and Monitoring (IOM)
Terminology
Observability: the property of a system that lets you answer questions about it, both trivial and complex. How easily you can find those answers is a measure of how observable the system is.
Monitoring: observing the health and performance of a system over time.
Instrumentation: the code added inside a system to expose metrics about its internals (a type of white-box monitoring); see the sketch below.
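As a minimal sketch of what instrumentation can look like in practice, assuming the prometheus_client Python library (the metric names and the order-processing handler are hypothetical):

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for an order-processing service.
ORDERS_PROCESSED = Counter(
    "orders_processed_total", "Number of orders processed", ["status"]
)
ORDER_LATENCY = Histogram(
    "order_processing_seconds", "Time spent processing an order"
)

@ORDER_LATENCY.time()  # records how long each call takes
def process_order(order):
    try:
        # ... business logic would go here ...
        ORDERS_PROCESSED.labels(status="ok").inc()
    except Exception:
        ORDERS_PROCESSED.labels(status="error").inc()
        raise

if __name__ == "__main__":
    start_http_server(8000)  # exposes the metrics on http://localhost:8000/metrics
    process_order({"id": 1})
```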
Why IOM?
Analyzing long-term trends like
User growth over time
User time in the system
System Performance over time
Comparing behavior over time or across experiment groups
How much faster are my queries after upgrading to a new version of a library?
Cache hit/miss ratio after adding more nodes
Is my site slower than it was last week?
Alerting - Something is broken, and somebody needs to fix it right now!
Building dashboards - Answer basic questions about your service: latency, traffic, and errors.
Conducting ad hoc retrospective analysis - Our latency just shot up; what else happened around the same time?
How to observe?
The three pillars:
Logs - A text record describing an event that happened at a certain time.
Metrics - Named events with a numeric value and, usually, a low-cardinality set of dimensions; aggregated and stored for low-cost, fast retrieval.
Tracing - A series of events with a parent/child relationship. A trace generally tells the story of an entire user interaction and is displayed in a Gantt-chart-like view. This is usually achieved by propagating a trace ID across services (see the sketch below).
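A minimal sketch of the trace-ID idea, with a hypothetical x-trace-id header and service; real systems usually rely on a standard such as W3C Trace Context or an OpenTelemetry SDK rather than hand-rolled propagation:

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def handle_request(headers: dict) -> dict:
    # Reuse the caller's trace ID if present; otherwise start a new trace.
    trace_id = headers.get("x-trace-id", uuid.uuid4().hex)

    # The same trace ID is attached to every log line and every downstream
    # call, so all events of one user interaction can later be joined.
    log.info("checkout started trace_id=%s", trace_id)
    downstream_headers = {"x-trace-id": trace_id}
    # call_payment_service(downstream_headers)  # hypothetical downstream call

    log.info("checkout finished trace_id=%s", trace_id)
    return {"trace_id": trace_id}

if __name__ == "__main__":
    handle_request({})                        # starts a new trace
    handle_request({"x-trace-id": "abc123"})  # joins an existing trace
```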
Infrastructure Observability: Examples include
Lay of the land, e.g. hosts/regions; for serverless: memory, CPU, and invocation counts
Message bus stats (events published/captured, performance, limits, sizes, etc.)
Networking stats
DB stats
Cache stats
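As one concrete example of infrastructure instrumentation, cache stats can be exposed as plain hit/miss counters (a sketch using prometheus_client; the names are illustrative). The hit ratio itself is then derived at query time in the metrics backend rather than in application code:

```python
from prometheus_client import Counter

# The hit ratio is computed at query time, e.g.
# rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
CACHE_HITS = Counter("cache_hits_total", "Cache hits", ["cache"])
CACHE_MISSES = Counter("cache_misses_total", "Cache misses", ["cache"])

_store: dict = {}

def cached_get(key: str):
    """Toy cache lookup that records hits and misses."""
    if key in _store:
        CACHE_HITS.labels(cache="local").inc()
        return _store[key]
    CACHE_MISSES.labels(cache="local").inc()
    return None
```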
Service health and performance: Examples include
Service uptime/health
P90/P95 API latency and invocation counts
P90/P95 latency for capturing events, by event_type
Captured event counts by event_type
Performance breakdown by major blocks in the workflow (e.g. time spent on DB operations, dependency API calls, and the actual business logic); see the timing sketch after this list
P90/P95 for DB queries
Error trends
Input/Output to/from the service
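A sketch of the per-block performance breakdown, assuming prometheus_client and a hypothetical capture-event workflow; P90/P95 per stage are later computed from the histogram buckets by the metrics backend:

```python
import time
from contextlib import contextmanager

from prometheus_client import Histogram

# One histogram for the workflow, labeled by stage.
STAGE_SECONDS = Histogram(
    "capture_event_stage_seconds",
    "Time spent in each stage of the capture-event workflow",
    ["stage"],
)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_SECONDS.labels(stage=stage).observe(time.perf_counter() - start)

def capture_event(event):
    with timed("db"):
        pass  # save_to_db(event)         -- hypothetical
    with timed("dependency_api"):
        pass  # notify_downstream(event)  -- hypothetical
    with timed("business_logic"):
        pass  # apply_rules(event)        -- hypothetical
```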
Product-level observability: Examples include
80/20 rule for a product - most-used APIs / most-used features
Tracing workflows throughout the system
Customer Experience: Examples include
Customer UI experience - Time spent on page, clicks by user by workflow
Workflow/API performance from the end-user perspective.
UI Errors
Business-level observability: Examples include
User growth over time
User time (session duration) in the system over time
New Feature adoption rate
Best Practices
Focus on the what and the why: the "what's broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. E.g. a spike in 500 error codes is a symptom, possibly caused by a DB server refusing connections.
Don't be afraid to write extra code in your service to detect and expose possible causes.
Always think about the user impact: “How will these metrics show me the user impact?”
Make it useful, then actionable: When a monitor triggers an alarm, it should first and foremost be “useful”. Secondly, it should be “actionable”. There should be something you can do to resolve the alarm and also be a set of steps (post-resolution) to prevent that alarm from triggering again.
If the alarm isn’t actionable, then it just becomes noise.
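One way to keep an alarm actionable is to define it alongside its runbook. The sketch below is purely illustrative (the metric, threshold, and runbook URL are made up); in practice this usually lives in the alerting system's rule configuration rather than in application code:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    condition: str    # the symptom: what is broken, in user-visible terms
    runbook_url: str  # the actionable part: concrete steps to resolve it

# Fires on a user-facing symptom and points the on-call engineer at a
# runbook instead of leaving them to guess.
HIGH_ERROR_RATE = Alert(
    name="CheckoutHighErrorRate",
    condition="5xx responses > 2% of requests for 5 minutes",
    runbook_url="https://wiki.example.com/runbooks/checkout-5xx",  # hypothetical
)
```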
Having a consistent log format makes it easier to parse and aggregate information across multiple services; see the structured-logging sketch below.
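For example, a consistent JSON log format built on Python's standard logging module could look like the following; the field names and service name are just one possible convention:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit every log record as one JSON object with a fixed set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "checkout",  # illustrative service name
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The same field names across services make parsing and aggregation straightforward.
log.info("order captured", extra={"trace_id": "abc123"})
```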