Instrumentation, Observability and Monitoring (IOM)

Terminology


Observability: the property of a system that lets you answer questions about it, both trivial and complex. How easily you can find those answers is a measure of how good the system’s observability is.


Monitoring: observing the quality of system performance over time.


Instrumentation: the metrics exposed by the internals of the system through code (a form of white-box monitoring).


Why IOM? 



  1. Analyzing long-term trends, such as

    1. User growth over time

    2. User time in the system

    3. System Performance over time

  2. Comparing over time or across experiment groups

    1. How much faster are my queries after upgrading to a new version of a library?

    2. Cache hit/miss ratio after adding more nodes

    3. Is my site slower than it was last week?

  3. Alerting - Something is broken, and somebody needs to fix it right now! 

  4. Building dashboards - Answer basic questions about your service - Latency, Traffic & Errors.

  5. Conducting ad hoc retrospective analysis - Our latency just shot up; what else happened around the same time?

How to observe?

The three pillars: 

  • Logs - A timestamped text record describing an event that happened at a certain time.

  • Metrics - Numeric values with a name and, typically, a low-cardinality set of dimensions, aggregated and stored for low-cost, fast retrieval.

  • Tracing - A series of events with a parent/child relationship. Generally this tells the story of an entire user interaction and is displayed in a Gantt-chart-like view. It is usually achieved by propagating a trace-id through every call (see the sketch below).

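To make the pillars concrete, here is a minimal Python sketch of what each one can look like inside a request handler. The service name, the X-Trace-Id header, and the metrics and process_order objects are illustrative placeholders, not a specific library:

```python
import logging
import time
import uuid

logger = logging.getLogger("checkout-service")   # illustrative service name


def handle_request(request, metrics, process_order):
    # Tracing: reuse the caller's trace-id if one was propagated, otherwise start a new trace.
    trace_id = request.headers.get("X-Trace-Id") or uuid.uuid4().hex

    start = time.perf_counter()
    order = process_order(request, trace_id)     # pass trace_id along to downstream calls
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Log: a timestamped record of one event, with enough context to debug that request.
    logger.info("order processed",
                extra={"trace_id": trace_id, "elapsed_ms": round(elapsed_ms, 1)})

    # Metric: a named value with low-cardinality tags, cheap to aggregate and query.
    metrics.timing("checkout.handle_request.latency_ms", elapsed_ms,
                   tags={"region": "us-east-1", "status": "ok"})
    return order
```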

What to Observe?




  1. Infrastructure observability: Examples include

    1. Lay of the land, e.g. hosts/regions; for serverless: memory/CPU/invocation counts

    2. Message bus stats (events published/captured, performance, limits, sizes, ...)

    3. Networking stats

    4. DB stats

    5. Cache stats

  2. Service health and performance: Examples include

    1. Service uptime/health

    2. P90/95 API performance, invocation count

    3. P90/95 performance for capturing events by event_type

    4. Captured Events count by event_type

    5. Performance breakdown by the major blocks in the workflow (e.g. time spent in DB operations, dependency API calls, and the actual business logic); see the timing sketch after this list

    6. P90/95 for DB Queries

    7. Error trends

    8. Input/Output to/from the service

  3. Product-level observability: Examples include

    1. 80/20 rule for a product - Most used APIs/Most used features

    2. Tracing workflows throughout the system

  4. Customer Experience: Examples include

    1. Customer UI experience - Time spent on page, clicks by user by workflow

    2. Workflow/API performance from the end user perspective.

    3. UI Errors

  5. Business level observability: Examples include

    1. User growth over time

    2. User time (session duration) in the system over time

    3. New Feature adoption rate

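As an illustration of the performance breakdown in 2.5, a small context manager can time each major block of the workflow and report the duration as a timing metric; the metrics backend then computes P90/P95 from those samples. The metrics client and the db, downstream, and apply_rules calls below are hypothetical stand-ins:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_block(name, metrics, tags=None):
    """Time a block of work and emit its duration as a timing metric."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        metrics.timing(f"service.block.{name}.latency_ms", elapsed_ms, tags=tags or {})


def capture_event(event, db, downstream, metrics):
    # The breakdown shows where a slow request actually spends its time.
    tags = {"event_type": event.type}
    with timed_block("db_write", metrics, tags):
        db.insert(event)                 # DB operation
    with timed_block("dependency_call", metrics, tags):
        downstream.notify(event)         # dependency API call
    with timed_block("business_logic", metrics, tags):
        apply_rules(event)               # hypothetical business-logic step
```
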
Best Practices

  • Focus on the “what” and the “why”. The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. E.g. a spike in 500 error codes is the symptom, possibly caused by a DB server refusing connections.

  • Don't be afraid to write extra code in your service to detect and expose possible causes (see the counter sketch after this list).

  • Always think about the user impact: “How will these metrics show me the user impact?”

  • Make it useful, then actionable: When a monitor triggers an alarm, it should first and foremost be “useful”. Secondly, it should be “actionable”. There should be something you can do to resolve the alarm, and also a set of steps (post-resolution) to prevent that alarm from triggering again.

  • If the alarm isn’t actionable, then it just becomes noise.

  • Having a consistent log format makes it easier to parse and aggregate information across multiple services (see the JSON logging sketch below).
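
For example, here is a sketch of exposing a possible cause directly from the code path where it occurs, so that a spike in 500s (the "what") can be read next to DB connection failures (the "why") on the same dashboard. The metrics client is a hypothetical stand-in, and sqlite3 stands in for whatever DB driver the service actually uses:

```python
import logging
import sqlite3   # stand-in for whatever DB driver the service actually uses

logger = logging.getLogger("checkout-service")


def save_order(order, metrics):
    """Write an order; count DB failures so the cause is visible, not just the resulting 500."""
    try:
        with sqlite3.connect("orders.db", timeout=1) as conn:
            conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                         (order["id"], order["total"]))
    except sqlite3.OperationalError:
        # The caller will surface a 500 (the symptom); this counter records the cause.
        metrics.increment("checkout.db.connection_errors", tags={"db": "orders"})
        logger.exception("db operation failed while saving order %s", order["id"])
        raise
```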

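As one way to get that consistency, the sketch below emits every log record as a single JSON object with a fixed set of fields, using only the standard library; the field names and the service name are illustrative:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with a fixed set of fields."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "checkout-service",                    # illustrative service name
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),    # set via `extra`, if present
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout-service").info(
    "order processed", extra={"trace_id": "abc123"})
```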