This repository focuses on mastering Azure observability and monitoring services, using Python and the Azure SDK to build, manage, and automate monitoring infrastructure for Internal Development Platform (IDP) development.
By working through this repository, you will:
- Master Azure Monitor and Application Insights
- Implement Log Analytics and KQL queries
- Configure alerts and action groups
- Work with metrics and custom telemetry
- Implement distributed tracing
- Build monitoring dashboards
- Optimize observability for performance and cost
- Python 3.11 or higher
- Azure subscription with monitoring access
- Azure CLI installed and configured
- Completed learning-idp-python-azure-sdk
- Basic understanding of monitoring concepts
- Git and GitHub account
```
learning-idp-observability/
├── README.md                       # This file
├── REFERENCES.md                   # Links to resources and related repos
├── pyproject.toml                  # Python project configuration
├── requirements.txt                # Python dependencies
├── requirements-dev.txt            # Development dependencies
├── .python-version                 # Python version for pyenv
├── .gitignore                      # Git ignore patterns
├── .env.example                    # Environment variables template
│
├── docs/
│   ├── concepts/
│   │   ├── 01-observability-overview.md
│   │   ├── 02-three-pillars.md
│   │   ├── 03-azure-monitor.md
│   │   ├── 04-application-insights.md
│   │   ├── 05-log-analytics.md
│   │   └── 06-distributed-tracing.md
│   ├── guides/
│   │   ├── getting-started.md
│   │   ├── log-analytics-setup.md
│   │   ├── app-insights-integration.md
│   │   ├── alert-configuration.md
│   │   └── dashboard-creation.md
│   └── examples/
│       ├── basic-monitoring.md
│       ├── custom-metrics.md
│       ├── distributed-tracing.md
│       ├── kql-queries.md
│       └── alerting-strategies.md
│
├── src/
│   ├── __init__.py
│   │
│   ├── core/
│   │   ├── __init__.py
│   │   ├── authentication.py       # Azure authentication
│   │   ├── config.py               # Configuration management
│   │   ├── exceptions.py           # Custom exceptions
│   │   └── logging_config.py       # Logging setup
│   │
│   ├── azure_monitor/
│   │   ├── __init__.py
│   │   ├── monitor_client.py       # Monitor operations
│   │   ├── metrics.py              # Metrics management
│   │   ├── diagnostic_settings.py  # Diagnostic configuration
│   │   └── activity_logs.py        # Activity log queries
│   │
│   ├── log_analytics/
│   │   ├── __init__.py
│   │   ├── workspace_manager.py    # Workspace operations
│   │   ├── query_client.py         # KQL query execution
│   │   ├── saved_searches.py       # Saved search management
│   │   └── data_ingestion.py       # Custom data ingestion
│   │
│   ├── application_insights/
│   │   ├── __init__.py
│   │   ├── app_insights_manager.py # App Insights operations
│   │   ├── telemetry_client.py     # Telemetry collection
│   │   ├── availability_tests.py   # Availability monitoring
│   │   └── live_metrics.py         # Live metrics stream
│   │
│   ├── alerts/
│   │   ├── __init__.py
│   │   ├── alert_rules.py          # Alert rule management
│   │   ├── action_groups.py        # Action group configuration
│   │   ├── smart_detection.py      # Smart detection rules
│   │   └── notification_manager.py # Notification handling
│   │
│   ├── dashboards/
│   │   ├── __init__.py
│   │   ├── dashboard_manager.py    # Dashboard operations
│   │   ├── workbook_manager.py     # Workbook management
│   │   ├── visualization.py        # Chart and visualization
│   │   └── template_manager.py     # Dashboard templates
│   │
│   ├── distributed_tracing/
│   │   ├── __init__.py
│   │   ├── tracer.py               # Distributed tracer
│   │   ├── span_processor.py       # Span processing
│   │   ├── correlation.py          # Correlation handling
│   │   └── sampling.py             # Sampling strategies
│   │
│   ├── custom_telemetry/
│   │   ├── __init__.py
│   │   ├── metrics_collector.py    # Custom metrics
│   │   ├── event_tracker.py        # Event tracking
│   │   ├── dependency_tracker.py   # Dependency monitoring
│   │   └── performance_counter.py  # Performance counters
│   │
│   └── integrations/
│       ├── __init__.py
│       ├── prometheus.py           # Prometheus integration
│       ├── grafana.py              # Grafana integration
│       ├── opentelemetry.py        # OpenTelemetry
│       └── datadog.py              # DataDog integration
│
├── examples/
│   ├── 01_azure_monitor/
│   │   ├── 01_create_workspace.py
│   │   ├── 02_query_metrics.py
│   │   ├── 03_diagnostic_settings.py
│   │   ├── 04_activity_logs.py
│   │   └── 05_resource_health.py
│   │
│   ├── 02_log_analytics/
│   │   ├── 01_workspace_setup.py
│   │   ├── 02_kql_queries.py
│   │   ├── 03_custom_logs.py
│   │   ├── 04_saved_searches.py
│   │   └── 05_data_export.py
│   │
│   ├── 03_application_insights/
│   │   ├── 01_create_app_insights.py
│   │   ├── 02_telemetry_collection.py
│   │   ├── 03_availability_tests.py
│   │   ├── 04_custom_events.py
│   │   └── 05_performance_monitoring.py
│   │
│   ├── 04_alerts/
│   │   ├── 01_metric_alerts.py
│   │   ├── 02_log_query_alerts.py
│   │   ├── 03_action_groups.py
│   │   ├── 04_smart_detection.py
│   │   └── 05_alert_processing.py
│   │
│   ├── 05_dashboards/
│   │   ├── 01_create_dashboard.py
│   │   ├── 02_workbook_creation.py
│   │   ├── 03_custom_visualizations.py
│   │   ├── 04_dashboard_sharing.py
│   │   └── 05_automated_reports.py
│   │
│   ├── 06_distributed_tracing/
│   │   ├── 01_basic_tracing.py
│   │   ├── 02_correlation_context.py
│   │   ├── 03_span_attributes.py
│   │   ├── 04_sampling_strategies.py
│   │   └── 05_trace_analysis.py
│   │
│   ├── 07_custom_telemetry/
│   │   ├── 01_custom_metrics.py
│   │   ├── 02_custom_events.py
│   │   ├── 03_dependency_tracking.py
│   │   ├── 04_performance_counters.py
│   │   └── 05_business_metrics.py
│   │
│   └── 08_integrations/
│       ├── 01_opentelemetry_setup.py
│       ├── 02_prometheus_exporter.py
│       ├── 03_grafana_dashboard.py
│       ├── 04_datadog_integration.py
│       └── 05_hybrid_monitoring.py
│
├── templates/
│   ├── dashboards/
│   │   ├── infrastructure_dashboard.json
│   │   ├── application_dashboard.json
│   │   └── security_dashboard.json
│   ├── workbooks/
│   │   ├── performance_workbook.json
│   │   ├── error_analysis_workbook.json
│   │   └── usage_workbook.json
│   ├── alerts/
│   │   ├── resource_health_alert.json
│   │   ├── performance_alert.json
│   │   └── availability_alert.json
│   └── queries/
│       ├── common_kql_queries.kql
│       ├── security_queries.kql
│       └── performance_queries.kql
│
├── notebooks/
│   ├── 01_monitoring_basics.ipynb
│   ├── 02_kql_mastery.ipynb
│   ├── 03_telemetry_analysis.ipynb
│   ├── 04_alert_tuning.ipynb
│   └── 05_dashboard_design.ipynb
│
├── scripts/
│   ├── setup_monitoring.sh         # Setup script
│   ├── export_logs.py              # Log export utility
│   ├── alert_summary.py            # Alert reporting
│   └── cost_analysis.py            # Monitoring cost analysis
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   ├── unit/
│   │   ├── test_monitor_client.py
│   │   ├── test_query_client.py
│   │   ├── test_telemetry_client.py
│   │   └── test_alert_rules.py
│   └── integration/
│       ├── test_log_analytics.py
│       ├── test_app_insights.py
│       ├── test_alert_workflow.py
│       └── test_distributed_tracing.py
│
└── .github/
    └── workflows/
        ├── monitoring-test.yml     # Test monitoring
        ├── alert-validation.yml    # Validate alerts
        └── dashboard-deploy.yml    # Deploy dashboards
```
git clone https://github.com/vanHeemstraSystems/learning-idp-observability.git
cd learning-idp-observability

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Login to Azure
az login

# Set subscription
az account set --subscription "your-subscription-id"

# Create service principal with monitoring permissions
az ad sp create-for-rbac \
  --name "idp-monitoring-sp" \
  --role "Monitoring Contributor" \
  --scopes /subscriptions/{subscription-id}

# Configure environment variables
cp .env.example .env
# Edit .env with your credentials

# Create Log Analytics workspace
python examples/01_azure_monitor/01_create_workspace.py

# Run a simple KQL query
python examples/02_log_analytics/02_kql_queries.py

# Set up Application Insights
python examples/03_application_insights/01_create_app_insights.py

Follow this recommended sequence:
Day 1-2: Azure Monitor Basics
- Read docs/concepts/03-azure-monitor.md
- Complete examples in examples/01_azure_monitor/
- Practice querying metrics and logs

Day 3-5: Log Analytics
- Study docs/concepts/05-log-analytics.md
- Work through examples/02_log_analytics/
- Master the KQL query language

Day 6-7: Application Insights
- Read docs/concepts/04-application-insights.md
- Complete examples in examples/03_application_insights/
- Implement telemetry collection

Day 1-3: Alert Configuration
- Study docs/guides/alert-configuration.md
- Work through examples/04_alerts/
- Configure action groups and notifications

Day 4-7: Dashboard Creation
- Read docs/guides/dashboard-creation.md
- Complete examples in examples/05_dashboards/
- Build custom workbooks and visualizations

Day 1-4: Distributed Tracing
- Study docs/concepts/06-distributed-tracing.md
- Work through examples/06_distributed_tracing/
- Implement end-to-end tracing

Day 5-7: Custom Telemetry
- Complete examples in examples/07_custom_telemetry/
- Implement business metrics
- Configure performance counters

Day 1-3: Tool Integration
- Work through examples/08_integrations/
- Integrate with Prometheus/Grafana
- Set up OpenTelemetry

Day 4-7: Production Readiness
- Optimize alert rules
- Implement cost management
- Build comprehensive monitoring
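The cost-management step above typically starts from the Log Analytics `Usage` table. As a sketch, here is a commonly used billable-ingestion query wrapped in a small helper; the helper name and the idea of templating the lookback window are illustrative, not part of this repository:

```python
# Standard Log Analytics schema: the Usage table reports ingested volume
# (Quantity, in MB) per data type; IsBillable separates free from paid data.
INGESTION_BY_TABLE = """
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize BillableGB = sum(Quantity) / 1024.0 by DataType
| order by BillableGB desc
"""

def build_cost_query(days: int = 30) -> str:
    """Return the ingestion-cost query for an arbitrary lookback window."""
    return INGESTION_BY_TABLE.replace("ago(30d)", f"ago({days}d)")

# The resulting string would be executed with LogsQueryClient.query_workspace(),
# the same pattern as the KQL example later in this README.
print(build_cost_query(7))
```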
# Azure Monitor
azure-mgmt-monitor>=6.0.0 # Monitor management
azure-monitor-query>=1.3.0 # Query operations
azure-monitor-ingestion>=1.0.0 # Data ingestion
# Application Insights
azure-applicationinsights>=0.1.1 # App Insights query API
opencensus-ext-azure>=1.1.13 # OpenCensus integration
opentelemetry-sdk>=1.21.0 # OpenTelemetry
# Supporting Libraries
azure-identity>=1.15.0 # Authentication
azure-core>=1.29.0                 # Core functionality

from azure.identity import DefaultAzureCredential
from azure.mgmt.loganalytics import LogAnalyticsManagementClient
from azure.mgmt.loganalytics.models import Workspace, WorkspaceSku

credential = DefaultAzureCredential()
log_client = LogAnalyticsManagementClient(credential, subscription_id)

# Create workspace
workspace_params = Workspace(
    location='westeurope',
    sku=WorkspaceSku(name='PerGB2018'),
    retention_in_days=30,
    tags={
        'environment': 'production',
        'project': 'idp-monitoring'
    }
)

workspace = log_client.workspaces.begin_create_or_update(
    'my-rg',
    'my-workspace',
    workspace_params
).result()

print(f"Created workspace: {workspace.name}")
print(f"Workspace ID: {workspace.customer_id}")

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus
from datetime import timedelta

credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)

# KQL query
query = """
AzureActivity
| where TimeGenerated > ago(1d)
| where OperationNameValue contains "write"
| summarize Count=count() by ResourceGroup, OperationNameValue
| order by Count desc
| limit 10
"""

# Execute query
response = logs_client.query_workspace(
    workspace_id=workspace_id,
    query=query,
    timespan=timedelta(days=1)
)

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        print(f"\nTable: {table.name}")
        print(f"Columns: {[col.name for col in table.columns]}")
        for row in table.rows:
            print(row)
else:
    print(f"Query failed: {response.partial_error}")

from azure.identity import DefaultAzureCredential
from azure.mgmt.applicationinsights import ApplicationInsightsManagementClient
from azure.mgmt.applicationinsights.models import ApplicationInsightsComponent

credential = DefaultAzureCredential()
app_insights_client = ApplicationInsightsManagementClient(credential, subscription_id)

# Create Application Insights
app_insights_params = ApplicationInsightsComponent(
    location='westeurope',
    kind='web',
    application_type='web',
    workspace_resource_id=workspace.id,
    ingestion_mode='LogAnalytics',
    tags={
        'application': 'my-app',
        'environment': 'production'
    }
)

app_insights = app_insights_client.components.create_or_update(
    'my-rg',
    'my-app-insights',
    app_insights_params
)

print(f"Created App Insights: {app_insights.name}")
print(f"Instrumentation Key: {app_insights.instrumentation_key}")
print(f"Connection String: {app_insights.connection_string}")

from opencensus.ext.azure.log_exporter import AzureLogHandler
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.tracer import Tracer
from opencensus.trace.samplers import ProbabilitySampler
import logging

# Configure logging with Application Insights
logger = logging.getLogger(__name__)
logger.addHandler(AzureLogHandler(
    connection_string=connection_string
))

# Configure tracing
tracer = Tracer(
    exporter=AzureExporter(connection_string=connection_string),
    sampler=ProbabilitySampler(rate=1.0)
)

# Send custom event
logger.info('User login', extra={
    'custom_dimensions': {
        'user_id': 'user123',
        'login_method': 'oauth',
        'ip_address': '192.168.1.1'
    }
})

# Create trace
with tracer.span(name='process_order') as span:
    span.add_attribute('order_id', '12345')
    span.add_attribute('amount', 99.99)
    # Process order logic here
    logger.info('Order processed successfully')

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
    MetricAlertAction
)

credential = DefaultAzureCredential()
monitor_client = MonitorManagementClient(credential, subscription_id)

# Define metric alert
alert = MetricAlertResource(
    location='global',
    description='Alert when CPU usage exceeds 80%',
    severity=2,
    enabled=True,
    scopes=[vm_resource_id],
    evaluation_frequency='PT5M',
    window_size='PT15M',
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name='HighCPU',
                metric_name='Percentage CPU',
                metric_namespace='Microsoft.Compute/virtualMachines',
                operator='GreaterThan',
                threshold=80,
                time_aggregation='Average'
            )
        ]
    ),
    actions=[
        MetricAlertAction(
            action_group_id=action_group_id
        )
    ],
    tags={
        'severity': 'high',
        'team': 'operations'
    }
)

# Create alert
alert_rule = monitor_client.metric_alerts.create_or_update(
    'my-rg',
    'high-cpu-alert',
    alert
)

print(f"Created alert: {alert_rule.name}")

from azure.identity import DefaultAzureCredential
from azure.mgmt.portal import Portal
from azure.mgmt.portal.models import Dashboard

credential = DefaultAzureCredential()
portal_client = Portal(credential, subscription_id)

# Define dashboard
dashboard_properties = {
    "lenses": {
        "0": {
            "order": 0,
            "parts": {
                "0": {
                    "position": {
                        "x": 0,
                        "y": 0,
                        "colSpan": 6,
                        "rowSpan": 4
                    },
                    "metadata": {
                        "type": "Extension/HubsExtension/PartType/MonitorChartPart",
                        "settings": {
                            "content": {
                                "chartType": "Line",
                                "metrics": [{
                                    "resourceId": vm_resource_id,
                                    "name": "Percentage CPU"
                                }]
                            }
                        }
                    }
                }
            }
        }
    }
}

dashboard = Dashboard(
    location='westeurope',
    tags={'environment': 'production'},
    properties=dashboard_properties
)

# Create dashboard
created_dashboard = portal_client.dashboards.create_or_update(
    'my-rg',
    'my-dashboard',
    dashboard
)

print(f"Created dashboard: {created_dashboard.name}")

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

# Set up tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Azure Monitor exporter
exporter = AzureMonitorTraceExporter(
    connection_string=connection_string
)

# Add span processor
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(exporter)
)

# Create spans
with tracer.start_as_current_span("parent-operation") as parent_span:
    parent_span.set_attribute("user.id", "user123")

    # Child operation 1
    with tracer.start_as_current_span("database-query") as db_span:
        db_span.set_attribute("db.system", "postgresql")
        db_span.set_attribute("db.operation", "SELECT")
        ...  # Database query logic

    # Child operation 2
    with tracer.start_as_current_span("external-api-call") as api_span:
        api_span.set_attribute("http.method", "GET")
        api_span.set_attribute("http.url", "https://api.example.com")
        ...  # API call logic

print("Distributed trace sent to Application Insights")

import structlog
# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Log with structured data
logger.info("user_action",
    user_id="user123",
    action="login",
    ip_address="192.168.1.1",
    success=True
)

from opencensus.trace.samplers import ProbabilitySampler
# Sample 10% of traces in production
sampler = ProbabilitySampler(rate=0.1)
tracer = Tracer(
    exporter=AzureExporter(connection_string=connection_string),
    sampler=sampler
)

# Critical: System-wide failures
logger.critical("Database connection failed")

# Error: Recoverable errors
logger.error("Failed to process payment", order_id=12345)

# Warning: Potential issues
logger.warning("API response slow", duration_ms=5000)

# Info: Business events
logger.info("Order completed", order_id=12345, amount=99.99)

# Debug: Detailed debugging
logger.debug("Cache hit", key="user:123")

# Use summarize for aggregations
query = """
requests
| where timestamp > ago(1h)
| summarize Count=count(), AvgDuration=avg(duration) by bin(timestamp, 5m)
"""
# Project only needed columns
query = """
requests
| project timestamp, name, duration, resultCode
| where duration > 1000
"""
# Use materialized views for frequent queries# Install monitoring libraries
pip install opencensus-ext-azure
pip install opentelemetry-sdk
pip install prometheus-client
# KQL tools
pip install kqlmagic # For Jupyter notebooks
# Testing
pip install pytest
pip install pytest-mock

# Use Log Analytics in Azure Portal for query development
# Export queries to files for version control

```
┌─────────────────────────────────────┐
│  LOGS                               │
│  - Structured logging               │
│  - Log aggregation                  │
│  - Search and analysis              │
└──────────────────┬──────────────────┘
                   │
┌──────────────────▼──────────────────┐
│  METRICS                            │
│  - Time-series data                 │
│  - Dashboards                       │
│  - Alerting                         │
└──────────────────┬──────────────────┘
                   │
┌──────────────────▼──────────────────┐
│  TRACES                             │
│  - Distributed tracing              │
│  - Request flow                     │
│  - Performance analysis             │
└─────────────────────────────────────┘
```
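The three pillars become most valuable when they share a correlation ID, so a single request can be followed across logs, metrics, and traces. A minimal, dependency-free sketch of that idea — the `handle_request` helper and the record shapes are hypothetical, not an Azure or OpenTelemetry API:

```python
import logging
import time
import uuid

def handle_request(name: str) -> dict:
    """Hypothetical handler that emits one record per pillar for a request."""
    trace_id = uuid.uuid4().hex  # shared correlation ID across all three pillars
    start = time.perf_counter()
    # ... request work would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000

    log_record = {"message": f"handled {name}", "trace_id": trace_id}        # LOGS
    metric_sample = {"name": "request_duration_ms", "value": duration_ms,
                     "trace_id": trace_id}                                   # METRICS
    span = {"name": name, "trace_id": trace_id, "duration_ms": duration_ms}  # TRACES

    logging.getLogger(__name__).info(log_record["message"],
                                     extra={"trace_id": trace_id})
    return {"log": log_record, "metric": metric_sample, "span": span}

records = handle_request("GET /orders")
# All three records carry the same trace_id, which is what lets a backend
# such as Application Insights join them during analysis.
assert len({r["trace_id"] for r in records.values()}) == 1
```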
Layer 4: Business Metrics
- User engagement
- Revenue metrics
- Conversion rates
Layer 3: Application Metrics
- Response times
- Error rates
- Dependencies
Layer 2: Infrastructure Metrics
- CPU, Memory, Disk
- Network traffic
- Resource utilization
Layer 1: Platform Metrics
- Azure service health
- Resource availability
- Service limits
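Layer 4 metrics are usually derived from tracked events rather than read from any Azure API — for instance, a conversion rate computed from custom events before being emitted as a custom metric. A small sketch, with illustrative event names:

```python
from collections import Counter

def conversion_rate(events: list[str]) -> float:
    """Fraction of 'checkout_started' events that led to 'order_completed'."""
    counts = Counter(events)
    started = counts["checkout_started"]
    return counts["order_completed"] / started if started else 0.0

events = ["checkout_started", "order_completed",
          "checkout_started", "checkout_abandoned",
          "checkout_started", "order_completed"]
rate = conversion_rate(events)
print(f"conversion_rate={rate:.2f}")  # 2 of 3 checkouts completed -> 0.67
# The resulting value would then be sent as a custom metric, e.g. with the
# custom-metrics example under examples/07_custom_telemetry/.
```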
- learning-internal-development-platform - Main overview
- learning-idp-python-azure-sdk - Azure SDK fundamentals
- learning-idp-azure-security - Security monitoring
- learning-idp-cicd-pipelines - Pipeline monitoring
- learning-idp-platform-engineering - Platform observability
This is a personal learning repository, but suggestions and improvements are welcome!
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass
- Submit a pull request
This project is for educational purposes. See LICENSE file for details.
Willem van Heemstra
- GitHub: @vanHeemstraSystems
- LinkedIn: Willem van Heemstra
Last updated: December 18, 2025

Part of the learning-internal-development-platform series