Instrumentation, Observability and Monitoring (IOM)
Terminology
Observability: the property of a system that lets you answer questions about it, both trivial and complex. How easily you can find those answers is a measure of how observable the system is.
Monitoring: observing the health and performance of a system over time.
Instrumentation: the code added inside a system to expose metrics about its internals (a type of white-box monitoring); see the sketch below.
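As a minimal sketch of what instrumentation can look like in practice, assuming the prometheus_client Python library (the metric names and the order-processing handler are hypothetical):

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for an order-processing service.
ORDERS_PROCESSED = Counter(
    "orders_processed_total", "Number of orders processed", ["status"]
)
ORDER_LATENCY = Histogram(
    "order_processing_seconds", "Time spent processing an order"
)

@ORDER_LATENCY.time()  # records how long each call takes
def process_order(order):
    try:
        # ... business logic would go here ...
        ORDERS_PROCESSED.labels(status="ok").inc()
    except Exception:
        ORDERS_PROCESSED.labels(status="error").inc()
        raise

if __name__ == "__main__":
    start_http_server(8000)  # exposes the metrics on http://localhost:8000/metrics
    process_order({"id": 1})
```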
Why IOM?
Analyzing long-term trends like
User growth over time
User time in the system
System Performance over time
Comparing behavior over time or across experiment groups
How much faster are my queries after upgrading to a new version of a library?
Cache hit/miss ratio after adding more nodes
Is my site slower than it was last week?
Alerting - Something is broken, and somebody needs to fix it right now!
Building dashboards - Answer basic questions about your service: latency, traffic, and errors.
Conducting ad hoc retrospective analysis - Our latency just shot up; what else happened around the same time?
How to observe?
The three pillars:
Logs - A text record describing an event that happened at a certain time.
Metrics - Named events with a numeric value and, usually, a low-cardinality set of dimensions; aggregated and stored for low-cost, fast retrieval.
Tracing - A series of events with a parent/child relationship. A trace generally tells the story of an entire user interaction and is displayed in a Gantt-chart-like view. This is usually achieved by propagating a trace ID across services (see the sketch below).
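A minimal sketch of the trace-ID idea, with a hypothetical x-trace-id header and service; real systems usually rely on a standard such as W3C Trace Context or an OpenTelemetry SDK rather than hand-rolled propagation:

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def handle_request(headers: dict) -> dict:
    # Reuse the caller's trace ID if present; otherwise start a new trace.
    trace_id = headers.get("x-trace-id", uuid.uuid4().hex)

    # The same trace ID is attached to every log line and every downstream
    # call, so all events of one user interaction can later be joined.
    log.info("checkout started trace_id=%s", trace_id)
    downstream_headers = {"x-trace-id": trace_id}
    # call_payment_service(downstream_headers)  # hypothetical downstream call

    log.info("checkout finished trace_id=%s", trace_id)
    return {"trace_id": trace_id}

if __name__ == "__main__":
    handle_request({})                        # starts a new trace
    handle_request({"x-trace-id": "abc123"})  # joins an existing trace
```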
Infrastructure Observability: Examples include
Lay of the land, e.g. hosts/regions; for serverless: memory, CPU, and invocation counts
Message bus stats (events published/captured, performance, limits, sizes, etc.)
Networking stats
DB stats
Cache stats
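As one concrete example of infrastructure instrumentation, cache stats can be exposed as plain hit/miss counters (a sketch using prometheus_client; the names are illustrative). The hit ratio itself is then derived at query time in the metrics backend rather than in application code:

```python
from prometheus_client import Counter

# The hit ratio is computed at query time, e.g.
# rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
CACHE_HITS = Counter("cache_hits_total", "Cache hits", ["cache"])
CACHE_MISSES = Counter("cache_misses_total", "Cache misses", ["cache"])

_store: dict = {}

def cached_get(key: str):
    """Toy cache lookup that records hits and misses."""
    if key in _store:
        CACHE_HITS.labels(cache="local").inc()
        return _store[key]
    CACHE_MISSES.labels(cache="local").inc()
    return None
```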
Service health and performance: Examples include
Service uptime/health
P90/P95 API latency and invocation counts
P90/P95 latency for capturing events, by event_type
Captured event counts by event_type
Performance breakdown by major blocks in the workflow (e.g. time spent on DB operations, dependency API calls, and the actual business logic); see the timing sketch after this list
P90/P95 for DB queries
Error trends
Input/Output to/from the service
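A sketch of the per-block performance breakdown, assuming prometheus_client and a hypothetical capture-event workflow; P90/P95 per stage are later computed from the histogram buckets by the metrics backend:

```python
import time
from contextlib import contextmanager

from prometheus_client import Histogram

# One histogram for the workflow, labeled by stage.
STAGE_SECONDS = Histogram(
    "capture_event_stage_seconds",
    "Time spent in each stage of the capture-event workflow",
    ["stage"],
)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_SECONDS.labels(stage=stage).observe(time.perf_counter() - start)

def capture_event(event):
    with timed("db"):
        pass  # save_to_db(event)         -- hypothetical
    with timed("dependency_api"):
        pass  # notify_downstream(event)  -- hypothetical
    with timed("business_logic"):
        pass  # apply_rules(event)        -- hypothetical
```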
Product-level observability: Examples include
80/20 rule for a product - most-used APIs / most-used features
Tracing workflows throughout the system
Customer Experience: Examples include
Customer UI experience - Time spent on page, clicks by user by workflow
Workflow/API performance from the end-user perspective.
UI Errors
Business-level observability: Examples include
User growth over time
User time (session duration) in the system over time
New Feature adoption rate
Best Practices
Focus on the what and the why: the "what's broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. E.g. a spike in 500 error codes is a symptom, possibly caused by a DB server refusing connections.
Don't be afraid to write extra code in your service to detect and expose possible causes.
Always think about the user impact: “How will these metrics show me the user impact?”
Make it useful, then actionable: When a monitor triggers an alarm, it should first and foremost be “useful”. Secondly, it should be “actionable”. There should be something you can do to resolve the alarm and also be a set of steps (post-resolution) to prevent that alarm from triggering again.
If the alarm isn’t actionable, then it just becomes noise.
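One way to keep an alarm actionable is to define it alongside its runbook. The sketch below is purely illustrative (the metric, threshold, and runbook URL are made up); in practice this usually lives in the alerting system's rule configuration rather than in application code:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    condition: str    # the symptom: what is broken, in user-visible terms
    runbook_url: str  # the actionable part: concrete steps to resolve it

# Fires on a user-facing symptom and points the on-call engineer at a
# runbook instead of leaving them to guess.
HIGH_ERROR_RATE = Alert(
    name="CheckoutHighErrorRate",
    condition="5xx responses > 2% of requests for 5 minutes",
    runbook_url="https://wiki.example.com/runbooks/checkout-5xx",  # hypothetical
)
```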
Having a consistent log format makes it easier to parse and aggregate information across multiple services; see the structured-logging sketch below.
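For example, a consistent JSON log format built on Python's standard logging module could look like the following; the field names and service name are just one possible convention:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit every log record as one JSON object with a fixed set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "checkout",  # illustrative service name
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The same field names across services make parsing and aggregation straightforward.
log.info("order captured", extra={"trace_id": "abc123"})
```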