Learning IDP: Observability

This repository focuses on mastering Azure observability and monitoring services, using Python and the Azure SDK to build, manage, and automate monitoring infrastructure for Internal Development Platform (IDP) development.

🎯 Learning Objectives

By working through this repository, you will:

  1. Master Azure Monitor and Application Insights
  2. Implement Log Analytics and KQL queries
  3. Configure alerts and action groups
  4. Work with metrics and custom telemetry
  5. Implement distributed tracing
  6. Build monitoring dashboards
  7. Optimize observability for performance and cost

πŸ“š Prerequisites

  • Python 3.11 or higher
  • Azure subscription with monitoring access
  • Azure CLI installed and configured
  • Completed learning-idp-python-azure-sdk
  • Basic understanding of monitoring concepts
  • Git and GitHub account

πŸ—‚οΈ Directory Structure

learning-idp-observability/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ REFERENCES.md                      # Links to resources and related repos
β”œβ”€β”€ pyproject.toml                     # Python project configuration
β”œβ”€β”€ requirements.txt                   # Python dependencies
β”œβ”€β”€ requirements-dev.txt               # Development dependencies
β”œβ”€β”€ .python-version                    # Python version for pyenv
β”œβ”€β”€ .gitignore                         # Git ignore patterns
β”œβ”€β”€ .env.example                       # Environment variables template
β”‚
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ concepts/
β”‚   β”‚   β”œβ”€β”€ 01-observability-overview.md
β”‚   β”‚   β”œβ”€β”€ 02-three-pillars.md
β”‚   β”‚   β”œβ”€β”€ 03-azure-monitor.md
β”‚   β”‚   β”œβ”€β”€ 04-application-insights.md
β”‚   β”‚   β”œβ”€β”€ 05-log-analytics.md
β”‚   β”‚   └── 06-distributed-tracing.md
β”‚   β”œβ”€β”€ guides/
β”‚   β”‚   β”œβ”€β”€ getting-started.md
β”‚   β”‚   β”œβ”€β”€ log-analytics-setup.md
β”‚   β”‚   β”œβ”€β”€ app-insights-integration.md
β”‚   β”‚   β”œβ”€β”€ alert-configuration.md
β”‚   β”‚   └── dashboard-creation.md
β”‚   └── examples/
β”‚       β”œβ”€β”€ basic-monitoring.md
β”‚       β”œβ”€β”€ custom-metrics.md
β”‚       β”œβ”€β”€ distributed-tracing.md
β”‚       β”œβ”€β”€ kql-queries.md
β”‚       └── alerting-strategies.md
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚
β”‚   β”œβ”€β”€ core/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ authentication.py          # Azure authentication
β”‚   β”‚   β”œβ”€β”€ config.py                  # Configuration management
β”‚   β”‚   β”œβ”€β”€ exceptions.py              # Custom exceptions
β”‚   β”‚   └── logging_config.py          # Logging setup
β”‚   β”‚
β”‚   β”œβ”€β”€ azure_monitor/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ monitor_client.py          # Monitor operations
β”‚   β”‚   β”œβ”€β”€ metrics.py                 # Metrics management
β”‚   β”‚   β”œβ”€β”€ diagnostic_settings.py     # Diagnostic configuration
β”‚   β”‚   └── activity_logs.py           # Activity log queries
β”‚   β”‚
β”‚   β”œβ”€β”€ log_analytics/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ workspace_manager.py       # Workspace operations
β”‚   β”‚   β”œβ”€β”€ query_client.py            # KQL query execution
β”‚   β”‚   β”œβ”€β”€ saved_searches.py          # Saved search management
β”‚   β”‚   └── data_ingestion.py          # Custom data ingestion
β”‚   β”‚
β”‚   β”œβ”€β”€ application_insights/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app_insights_manager.py    # App Insights operations
β”‚   β”‚   β”œβ”€β”€ telemetry_client.py        # Telemetry collection
β”‚   β”‚   β”œβ”€β”€ availability_tests.py      # Availability monitoring
β”‚   β”‚   └── live_metrics.py            # Live metrics stream
β”‚   β”‚
β”‚   β”œβ”€β”€ alerts/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ alert_rules.py             # Alert rule management
β”‚   β”‚   β”œβ”€β”€ action_groups.py           # Action group configuration
β”‚   β”‚   β”œβ”€β”€ smart_detection.py         # Smart detection rules
β”‚   β”‚   └── notification_manager.py    # Notification handling
β”‚   β”‚
β”‚   β”œβ”€β”€ dashboards/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ dashboard_manager.py       # Dashboard operations
β”‚   β”‚   β”œβ”€β”€ workbook_manager.py        # Workbook management
β”‚   β”‚   β”œβ”€β”€ visualization.py           # Chart and visualization
β”‚   β”‚   └── template_manager.py        # Dashboard templates
β”‚   β”‚
β”‚   β”œβ”€β”€ distributed_tracing/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ tracer.py                  # Distributed tracer
β”‚   β”‚   β”œβ”€β”€ span_processor.py          # Span processing
β”‚   β”‚   β”œβ”€β”€ correlation.py             # Correlation handling
β”‚   β”‚   └── sampling.py                # Sampling strategies
β”‚   β”‚
β”‚   β”œβ”€β”€ custom_telemetry/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ metrics_collector.py       # Custom metrics
β”‚   β”‚   β”œβ”€β”€ event_tracker.py           # Event tracking
β”‚   β”‚   β”œβ”€β”€ dependency_tracker.py      # Dependency monitoring
β”‚   β”‚   └── performance_counter.py     # Performance counters
β”‚   β”‚
β”‚   └── integrations/
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ prometheus.py              # Prometheus integration
β”‚       β”œβ”€β”€ grafana.py                 # Grafana integration
β”‚       β”œβ”€β”€ opentelemetry.py           # OpenTelemetry
β”‚       └── datadog.py                 # DataDog integration
β”‚
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ 01_azure_monitor/
β”‚   β”‚   β”œβ”€β”€ 01_create_workspace.py
β”‚   β”‚   β”œβ”€β”€ 02_query_metrics.py
β”‚   β”‚   β”œβ”€β”€ 03_diagnostic_settings.py
β”‚   β”‚   β”œβ”€β”€ 04_activity_logs.py
β”‚   β”‚   └── 05_resource_health.py
β”‚   β”‚
β”‚   β”œβ”€β”€ 02_log_analytics/
β”‚   β”‚   β”œβ”€β”€ 01_workspace_setup.py
β”‚   β”‚   β”œβ”€β”€ 02_kql_queries.py
β”‚   β”‚   β”œβ”€β”€ 03_custom_logs.py
β”‚   β”‚   β”œβ”€β”€ 04_saved_searches.py
β”‚   β”‚   └── 05_data_export.py
β”‚   β”‚
β”‚   β”œβ”€β”€ 03_application_insights/
β”‚   β”‚   β”œβ”€β”€ 01_create_app_insights.py
β”‚   β”‚   β”œβ”€β”€ 02_telemetry_collection.py
β”‚   β”‚   β”œβ”€β”€ 03_availability_tests.py
β”‚   β”‚   β”œβ”€β”€ 04_custom_events.py
β”‚   β”‚   └── 05_performance_monitoring.py
β”‚   β”‚
β”‚   β”œβ”€β”€ 04_alerts/
β”‚   β”‚   β”œβ”€β”€ 01_metric_alerts.py
β”‚   β”‚   β”œβ”€β”€ 02_log_query_alerts.py
β”‚   β”‚   β”œβ”€β”€ 03_action_groups.py
β”‚   β”‚   β”œβ”€β”€ 04_smart_detection.py
β”‚   β”‚   └── 05_alert_processing.py
β”‚   β”‚
β”‚   β”œβ”€β”€ 05_dashboards/
β”‚   β”‚   β”œβ”€β”€ 01_create_dashboard.py
β”‚   β”‚   β”œβ”€β”€ 02_workbook_creation.py
β”‚   β”‚   β”œβ”€β”€ 03_custom_visualizations.py
β”‚   β”‚   β”œβ”€β”€ 04_dashboard_sharing.py
β”‚   β”‚   └── 05_automated_reports.py
β”‚   β”‚
β”‚   β”œβ”€β”€ 06_distributed_tracing/
β”‚   β”‚   β”œβ”€β”€ 01_basic_tracing.py
β”‚   β”‚   β”œβ”€β”€ 02_correlation_context.py
β”‚   β”‚   β”œβ”€β”€ 03_span_attributes.py
β”‚   β”‚   β”œβ”€β”€ 04_sampling_strategies.py
β”‚   β”‚   └── 05_trace_analysis.py
β”‚   β”‚
β”‚   β”œβ”€β”€ 07_custom_telemetry/
β”‚   β”‚   β”œβ”€β”€ 01_custom_metrics.py
β”‚   β”‚   β”œβ”€β”€ 02_custom_events.py
β”‚   β”‚   β”œβ”€β”€ 03_dependency_tracking.py
β”‚   β”‚   β”œβ”€β”€ 04_performance_counters.py
β”‚   β”‚   └── 05_business_metrics.py
β”‚   β”‚
β”‚   └── 08_integrations/
β”‚       β”œβ”€β”€ 01_opentelemetry_setup.py
β”‚       β”œβ”€β”€ 02_prometheus_exporter.py
β”‚       β”œβ”€β”€ 03_grafana_dashboard.py
β”‚       β”œβ”€β”€ 04_datadog_integration.py
β”‚       └── 05_hybrid_monitoring.py
β”‚
β”œβ”€β”€ templates/
β”‚   β”œβ”€β”€ dashboards/
β”‚   β”‚   β”œβ”€β”€ infrastructure_dashboard.json
β”‚   β”‚   β”œβ”€β”€ application_dashboard.json
β”‚   β”‚   └── security_dashboard.json
β”‚   β”œβ”€β”€ workbooks/
β”‚   β”‚   β”œβ”€β”€ performance_workbook.json
β”‚   β”‚   β”œβ”€β”€ error_analysis_workbook.json
β”‚   β”‚   └── usage_workbook.json
β”‚   β”œβ”€β”€ alerts/
β”‚   β”‚   β”œβ”€β”€ resource_health_alert.json
β”‚   β”‚   β”œβ”€β”€ performance_alert.json
β”‚   β”‚   └── availability_alert.json
β”‚   └── queries/
β”‚       β”œβ”€β”€ common_kql_queries.kql
β”‚       β”œβ”€β”€ security_queries.kql
β”‚       └── performance_queries.kql
β”‚
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_monitoring_basics.ipynb
β”‚   β”œβ”€β”€ 02_kql_mastery.ipynb
β”‚   β”œβ”€β”€ 03_telemetry_analysis.ipynb
β”‚   β”œβ”€β”€ 04_alert_tuning.ipynb
β”‚   └── 05_dashboard_design.ipynb
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ setup_monitoring.sh            # Setup script
β”‚   β”œβ”€β”€ export_logs.py                 # Log export utility
β”‚   β”œβ”€β”€ alert_summary.py               # Alert reporting
β”‚   └── cost_analysis.py               # Monitoring cost analysis
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ conftest.py
β”‚   β”œβ”€β”€ unit/
β”‚   β”‚   β”œβ”€β”€ test_monitor_client.py
β”‚   β”‚   β”œβ”€β”€ test_query_client.py
β”‚   β”‚   β”œβ”€β”€ test_telemetry_client.py
β”‚   β”‚   └── test_alert_rules.py
β”‚   └── integration/
β”‚       β”œβ”€β”€ test_log_analytics.py
β”‚       β”œβ”€β”€ test_app_insights.py
β”‚       β”œβ”€β”€ test_alert_workflow.py
β”‚       └── test_distributed_tracing.py
β”‚
└── .github/
    └── workflows/
        β”œβ”€β”€ monitoring-test.yml        # Test monitoring
        β”œβ”€β”€ alert-validation.yml       # Validate alerts
        └── dashboard-deploy.yml       # Deploy dashboards

πŸš€ Getting Started

1. Clone the Repository

git clone https://github.com/vanHeemstraSystems/learning-idp-observability.git
cd learning-idp-observability

2. Set Up Python Environment

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

3. Configure Azure Authentication

# Login to Azure
az login

# Set subscription
az account set --subscription "your-subscription-id"

# Create service principal with monitoring permissions
az ad sp create-for-rbac \
    --name "idp-monitoring-sp" \
    --role "Monitoring Contributor" \
    --scopes /subscriptions/{subscription-id}

# Configure environment variables
cp .env.example .env
# Edit .env with your credentials
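The variables expected in `.env` are not listed in this README; a minimal sketch, assuming the examples authenticate via the service-principal environment variables that azure-identity's `EnvironmentCredential` reads (the extra names below are conventions used for illustration — the actual names in `.env.example` may differ):

```shell
# Service principal credentials (read by azure-identity's EnvironmentCredential)
AZURE_TENANT_ID=<your-tenant-id>
AZURE_CLIENT_ID=<service-principal-app-id>
AZURE_CLIENT_SECRET=<service-principal-password>

# Conventional extras for the examples (names assumed, not from this repo)
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
APPLICATIONINSIGHTS_CONNECTION_STRING=<app-insights-connection-string>
```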

4. Run Your First Example

# Create Log Analytics workspace
python examples/01_azure_monitor/01_create_workspace.py

# Run a simple KQL query
python examples/02_log_analytics/02_kql_queries.py

# Set up Application Insights
python examples/03_application_insights/01_create_app_insights.py

πŸ“– Learning Path

Follow this recommended sequence:

Week 1: Monitoring Fundamentals

Day 1-2: Azure Monitor Basics

  1. Read docs/concepts/03-azure-monitor.md
  2. Complete examples in examples/01_azure_monitor/
  3. Practice querying metrics and logs

Day 3-5: Log Analytics

  1. Study docs/concepts/05-log-analytics.md
  2. Work through examples/02_log_analytics/
  3. Master KQL query language
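As KQL practice, it can help to see the anatomy of the queries used throughout this repository (source table, time filter, aggregation, ordering, limit). A small hypothetical Python helper — not part of any SDK — that assembles that shape:

```python
def build_summary_query(table: str, timespan: str, group_by: str, limit: int = 10) -> str:
    """Assemble a simple KQL aggregation query (illustrative helper)."""
    return (
        f"{table}\n"
        f"| where TimeGenerated > ago({timespan})\n"
        f"| summarize Count=count() by {group_by}\n"
        f"| order by Count desc\n"
        f"| limit {limit}"
    )

# Same structure as the AzureActivity query used later in this README
query = build_summary_query("AzureActivity", "1d", "ResourceGroup")
print(query)
```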

Day 6-7: Application Insights

  1. Read docs/concepts/04-application-insights.md
  2. Complete examples in examples/03_application_insights/
  3. Implement telemetry collection

Week 2: Alerting & Dashboards

Day 1-3: Alert Configuration

  1. Study docs/guides/alert-configuration.md
  2. Work through examples/04_alerts/
  3. Configure action groups and notifications

Day 4-7: Dashboard Creation

  1. Read docs/guides/dashboard-creation.md
  2. Complete examples in examples/05_dashboards/
  3. Build custom workbooks and visualizations

Week 3: Advanced Observability

Day 1-4: Distributed Tracing

  1. Study docs/concepts/06-distributed-tracing.md
  2. Work through examples/06_distributed_tracing/
  3. Implement end-to-end tracing

Day 5-7: Custom Telemetry

  1. Complete examples in examples/07_custom_telemetry/
  2. Implement business metrics
  3. Configure performance counters
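Before wiring custom metrics into Azure, it is worth understanding what a collector does with raw measurements: it pre-aggregates them per export interval rather than shipping every data point. A minimal in-process sketch (a hypothetical class, not the Azure SDK API):

```python
from collections import defaultdict
from statistics import mean

class MetricsCollector:
    """Aggregate raw measurements locally before export (illustrative only)."""

    def __init__(self):
        self._values = defaultdict(list)

    def record(self, name: str, value: float) -> None:
        """Record one raw measurement for a named metric."""
        self._values[name].append(value)

    def snapshot(self) -> dict:
        # One aggregate per metric: count/min/max/avg — the shape most
        # metric backends expect once per export interval.
        return {
            name: {
                "count": len(vals),
                "min": min(vals),
                "max": max(vals),
                "avg": mean(vals),
            }
            for name, vals in self._values.items()
        }

collector = MetricsCollector()
for ms in (120, 80, 250):
    collector.record("request_duration_ms", ms)
print(collector.snapshot()["request_duration_ms"]["avg"])  # 150
```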

Week 4: Integration & Production

Day 1-3: Tool Integration

  1. Work through examples/08_integrations/
  2. Integrate with Prometheus/Grafana
  3. Set up OpenTelemetry

Day 4-7: Production Readiness

  1. Optimize alert rules
  2. Implement cost management
  3. Build comprehensive monitoring
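Log ingestion is usually the dominant monitoring cost, so cost management starts with a back-of-the-envelope estimate. A pure-arithmetic sketch (the ~$2.30/GB Pay-As-You-Go rate is an assumption — check current Azure pricing for your region):

```python
def estimate_monthly_ingestion_cost(gb_per_day: float,
                                    price_per_gb: float = 2.30,
                                    free_gb_per_day: float = 0.0) -> float:
    """Estimate monthly Log Analytics ingestion cost in USD (30-day month).

    price_per_gb is an assumed Pay-As-You-Go rate, not an official figure.
    """
    billable_per_day = max(gb_per_day - free_gb_per_day, 0.0)
    return round(billable_per_day * price_per_gb * 30, 2)

# 5 GB/day at the assumed $2.30/GB rate
print(estimate_monthly_ingestion_cost(5))  # 345.0
```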

πŸ”‘ Key Azure SDK Packages

Monitoring Services

# Azure Monitor
azure-mgmt-monitor>=6.0.0           # Monitor management
azure-monitor-query>=1.3.0          # Query operations
azure-monitor-ingestion>=1.0.0      # Data ingestion

# Application Insights
azure-applicationinsights>=0.1.1    # App Insights query API
opencensus-ext-azure>=1.1.13        # OpenCensus integration
opentelemetry-sdk>=1.21.0           # OpenTelemetry

# Supporting Libraries
azure-identity>=1.15.0              # Authentication
azure-core>=1.29.0                  # Core functionality

πŸ’‘ Common Operations Examples

Create Log Analytics Workspace

from azure.identity import DefaultAzureCredential
from azure.mgmt.loganalytics import LogAnalyticsManagementClient
from azure.mgmt.loganalytics.models import Workspace, WorkspaceSku

credential = DefaultAzureCredential()
log_client = LogAnalyticsManagementClient(credential, subscription_id)

# Create workspace
workspace_params = Workspace(
    location='westeurope',
    sku=WorkspaceSku(name='PerGB2018'),
    retention_in_days=30,
    tags={
        'environment': 'production',
        'project': 'idp-monitoring'
    }
)

workspace = log_client.workspaces.begin_create_or_update(
    'my-rg',
    'my-workspace',
    workspace_params
).result()

print(f"Created workspace: {workspace.name}")
print(f"Workspace ID: {workspace.customer_id}")

Query Logs with KQL

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus
from datetime import timedelta

credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)

# KQL query
query = """
AzureActivity
| where TimeGenerated > ago(1d)
| where OperationNameValue contains "write"
| summarize Count=count() by ResourceGroup, OperationNameValue
| order by Count desc
| limit 10
"""

# Execute query
response = logs_client.query_workspace(
    workspace_id=workspace_id,
    query=query,
    timespan=timedelta(days=1)
)

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        print(f"\nTable: {table.name}")
        print(f"Columns: {[col.name for col in table.columns]}")
        
        for row in table.rows:
            print(row)
else:
    # Partial success: some rows returned alongside an error
    print(f"Partial results; error: {response.partial_error}")

Configure Application Insights

from azure.identity import DefaultAzureCredential
from azure.mgmt.applicationinsights import ApplicationInsightsManagementClient
from azure.mgmt.applicationinsights.models import ApplicationInsightsComponent

credential = DefaultAzureCredential()
app_insights_client = ApplicationInsightsManagementClient(credential, subscription_id)

# Create Application Insights
app_insights_params = ApplicationInsightsComponent(
    location='westeurope',
    kind='web',
    application_type='web',
    workspace_resource_id=workspace.id,
    ingestion_mode='LogAnalytics',
    tags={
        'application': 'my-app',
        'environment': 'production'
    }
)

app_insights = app_insights_client.components.create_or_update(
    'my-rg',
    'my-app-insights',
    app_insights_params
)

print(f"Created App Insights: {app_insights.name}")
print(f"Instrumentation Key: {app_insights.instrumentation_key}")
print(f"Connection String: {app_insights.connection_string}")

Send Custom Telemetry

from opencensus.ext.azure.log_exporter import AzureLogHandler
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.tracer import Tracer
from opencensus.trace.samplers import ProbabilitySampler
import logging

# Configure logging with Application Insights
logger = logging.getLogger(__name__)
logger.addHandler(AzureLogHandler(
    connection_string=connection_string
))

# Configure tracing
tracer = Tracer(
    exporter=AzureExporter(connection_string=connection_string),
    sampler=ProbabilitySampler(rate=1.0)
)

# Send custom event
logger.info('User login', extra={
    'custom_dimensions': {
        'user_id': 'user123',
        'login_method': 'oauth',
        'ip_address': '192.168.1.1'
    }
})

# Create trace
with tracer.span(name='process_order') as span:
    span.add_attribute('order_id', '12345')
    span.add_attribute('amount', 99.99)
    
    # Process order logic here
    logger.info('Order processed successfully')

Create Metric Alert

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
    MetricAlertAction
)

credential = DefaultAzureCredential()
monitor_client = MonitorManagementClient(credential, subscription_id)

# Define metric alert
alert = MetricAlertResource(
    location='global',
    description='Alert when CPU usage exceeds 80%',
    severity=2,
    enabled=True,
    scopes=[vm_resource_id],
    evaluation_frequency='PT5M',
    window_size='PT15M',
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name='HighCPU',
                metric_name='Percentage CPU',
                metric_namespace='Microsoft.Compute/virtualMachines',
                operator='GreaterThan',
                threshold=80,
                time_aggregation='Average'
            )
        ]
    ),
    actions=[
        MetricAlertAction(
            action_group_id=action_group_id
        )
    ],
    tags={
        'severity': 'high',
        'team': 'operations'
    }
)

# Create alert
alert_rule = monitor_client.metric_alerts.create_or_update(
    'my-rg',
    'high-cpu-alert',
    alert
)

print(f"Created alert: {alert_rule.name}")

Create Dashboard

from azure.identity import DefaultAzureCredential
from azure.mgmt.portal import Portal
from azure.mgmt.portal.models import Dashboard

credential = DefaultAzureCredential()
portal_client = Portal(credential, subscription_id)

# Define dashboard
dashboard_properties = {
    "lenses": {
        "0": {
            "order": 0,
            "parts": {
                "0": {
                    "position": {
                        "x": 0,
                        "y": 0,
                        "colSpan": 6,
                        "rowSpan": 4
                    },
                    "metadata": {
                        "type": "Extension/HubsExtension/PartType/MonitorChartPart",
                        "settings": {
                            "content": {
                                "chartType": "Line",
                                "metrics": [{
                                    "resourceId": vm_resource_id,
                                    "name": "Percentage CPU"
                                }]
                            }
                        }
                    }
                }
            }
        }
    }
}

dashboard = Dashboard(
    location='westeurope',
    tags={'environment': 'production'},
    lenses=dashboard_properties['lenses']
)

# Create dashboard
created_dashboard = portal_client.dashboards.create_or_update(
    'my-rg',
    'my-dashboard',
    dashboard
)

print(f"Created dashboard: {created_dashboard.name}")

Implement Distributed Tracing

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

# Set up tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Azure Monitor exporter
exporter = AzureMonitorTraceExporter(
    connection_string=connection_string
)

# Add span processor
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(exporter)
)

# Create spans
with tracer.start_as_current_span("parent-operation") as parent_span:
    parent_span.set_attribute("user.id", "user123")
    
    # Child operation 1
    with tracer.start_as_current_span("database-query") as db_span:
        db_span.set_attribute("db.system", "postgresql")
        db_span.set_attribute("db.operation", "SELECT")
        # Database query logic
    
    # Child operation 2
    with tracer.start_as_current_span("external-api-call") as api_span:
        api_span.set_attribute("http.method", "GET")
        api_span.set_attribute("http.url", "https://api.example.com")
        # API call logic

print("Distributed trace sent to Application Insights")

🎯 Best Practices

1. Use Structured Logging

import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Log with structured data
logger.info("user_action",
    user_id="user123",
    action="login",
    ip_address="192.168.1.1",
    success=True
)

2. Implement Sampling

from opencensus.trace.samplers import ProbabilitySampler

# Sample 10% of traces in production
sampler = ProbabilitySampler(rate=0.1)

tracer = Tracer(
    exporter=AzureExporter(connection_string=connection_string),
    sampler=sampler
)

3. Use Log Levels Appropriately

# Critical: System-wide failures
logger.critical("Database connection failed")

# Error: Recoverable errors
logger.error("Failed to process payment", order_id=12345)

# Warning: Potential issues
logger.warning("API response slow", duration_ms=5000)

# Info: Business events
logger.info("Order completed", order_id=12345, amount=99.99)

# Debug: Detailed debugging
logger.debug("Cache hit", key="user:123")

4. Optimize Query Performance

# Use summarize for aggregations
query = """
requests
| where timestamp > ago(1h)
| summarize Count=count(), AvgDuration=avg(duration) by bin(timestamp, 5m)
"""

# Project only needed columns
query = """
requests
| project timestamp, name, duration, resultCode
| where duration > 1000
"""

# Use materialized views for frequent queries

πŸ”§ Development Tools

Monitoring Tools

# Install monitoring libraries
pip install opencensus-ext-azure
pip install opentelemetry-sdk
pip install prometheus-client

# KQL tools
pip install kqlmagic  # For Jupyter notebooks

# Testing
pip install pytest
pip install pytest-mock

Query Development

# Use Log Analytics in Azure Portal for query development
# Export queries to files for version control

πŸ“Š Observability Architecture

Three Pillars of Observability

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           LOGS                       β”‚
β”‚   - Structured logging               β”‚
β”‚   - Log aggregation                  β”‚
β”‚   - Search and analysis              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           METRICS                    β”‚
β”‚   - Time-series data                 β”‚
β”‚   - Dashboards                       β”‚
β”‚   - Alerting                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           TRACES                     β”‚
β”‚   - Distributed tracing              β”‚
β”‚   - Request flow                     β”‚
β”‚   - Performance analysis             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Monitoring Layers

Layer 4: Business Metrics
  - User engagement
  - Revenue metrics
  - Conversion rates

Layer 3: Application Metrics
  - Response times
  - Error rates
  - Dependencies

Layer 2: Infrastructure Metrics
  - CPU, Memory, Disk
  - Network traffic
  - Resource utilization

Layer 1: Platform Metrics
  - Azure service health
  - Resource availability
  - Service limits

πŸ”— Related Repositories

See REFERENCES.md for links to related repositories in the learning-internal-development-platform series.

🀝 Contributing

This is a personal learning repository, but suggestions and improvements are welcome!

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Ensure all tests pass
  5. Submit a pull request

πŸ“„ License

This project is for educational purposes. See LICENSE file for details.

πŸ“§ Contact

Willem van Heemstra


Last updated: December 18, 2025

Part of the learning-internal-development-platform series
