This repository focuses on mastering Azure observability and monitoring services, using Python and the Azure SDK to build, manage, and automate monitoring infrastructure for Internal Development Platform (IDP) development.
By working through this repository, you will:
- Master Azure Monitor and Application Insights
- Implement Log Analytics and KQL queries
- Configure alerts and action groups
- Work with metrics and custom telemetry
- Implement distributed tracing
- Build monitoring dashboards
- Optimize observability for performance and cost
- Python 3.11 or higher
- Azure subscription with monitoring access
- Azure CLI installed and configured
- Completed learning-idp-python-azure-sdk
- Basic understanding of monitoring concepts
- Git and GitHub account
```
learning-idp-observability/
├── README.md                       # This file
├── REFERENCES.md                   # Links to resources and related repos
├── pyproject.toml                  # Python project configuration
├── requirements.txt                # Python dependencies
├── requirements-dev.txt            # Development dependencies
├── .python-version                 # Python version for pyenv
├── .gitignore                      # Git ignore patterns
├── .env.example                    # Environment variables template
│
├── docs/
│   ├── concepts/
│   │   ├── 01-observability-overview.md
│   │   ├── 02-three-pillars.md
│   │   ├── 03-azure-monitor.md
│   │   ├── 04-application-insights.md
│   │   ├── 05-log-analytics.md
│   │   └── 06-distributed-tracing.md
│   ├── guides/
│   │   ├── getting-started.md
│   │   ├── log-analytics-setup.md
│   │   ├── app-insights-integration.md
│   │   ├── alert-configuration.md
│   │   └── dashboard-creation.md
│   └── examples/
│       ├── basic-monitoring.md
│       ├── custom-metrics.md
│       ├── distributed-tracing.md
│       ├── kql-queries.md
│       └── alerting-strategies.md
│
├── src/
│   ├── __init__.py
│   │
│   ├── core/
│   │   ├── __init__.py
│   │   ├── authentication.py       # Azure authentication
│   │   ├── config.py               # Configuration management
│   │   ├── exceptions.py           # Custom exceptions
│   │   └── logging_config.py       # Logging setup
│   │
│   ├── azure_monitor/
│   │   ├── __init__.py
│   │   ├── monitor_client.py       # Monitor operations
│   │   ├── metrics.py              # Metrics management
│   │   ├── diagnostic_settings.py  # Diagnostic configuration
│   │   └── activity_logs.py        # Activity log queries
│   │
│   ├── log_analytics/
│   │   ├── __init__.py
│   │   ├── workspace_manager.py    # Workspace operations
│   │   ├── query_client.py         # KQL query execution
│   │   ├── saved_searches.py       # Saved search management
│   │   └── data_ingestion.py       # Custom data ingestion
│   │
│   ├── application_insights/
│   │   ├── __init__.py
│   │   ├── app_insights_manager.py # App Insights operations
│   │   ├── telemetry_client.py     # Telemetry collection
│   │   ├── availability_tests.py   # Availability monitoring
│   │   └── live_metrics.py         # Live metrics stream
│   │
│   ├── alerts/
│   │   ├── __init__.py
│   │   ├── alert_rules.py          # Alert rule management
│   │   ├── action_groups.py        # Action group configuration
│   │   ├── smart_detection.py      # Smart detection rules
│   │   └── notification_manager.py # Notification handling
│   │
│   ├── dashboards/
│   │   ├── __init__.py
│   │   ├── dashboard_manager.py    # Dashboard operations
│   │   ├── workbook_manager.py     # Workbook management
│   │   ├── visualization.py        # Chart and visualization
│   │   └── template_manager.py     # Dashboard templates
│   │
│   ├── distributed_tracing/
│   │   ├── __init__.py
│   │   ├── tracer.py               # Distributed tracer
│   │   ├── span_processor.py       # Span processing
│   │   ├── correlation.py          # Correlation handling
│   │   └── sampling.py             # Sampling strategies
│   │
│   ├── custom_telemetry/
│   │   ├── __init__.py
│   │   ├── metrics_collector.py    # Custom metrics
│   │   ├── event_tracker.py        # Event tracking
│   │   ├── dependency_tracker.py   # Dependency monitoring
│   │   └── performance_counter.py  # Performance counters
│   │
│   └── integrations/
│       ├── __init__.py
│       ├── prometheus.py           # Prometheus integration
│       ├── grafana.py              # Grafana integration
│       ├── opentelemetry.py        # OpenTelemetry
│       └── datadog.py              # DataDog integration
│
├── examples/
│   ├── 01_azure_monitor/
│   │   ├── 01_create_workspace.py
│   │   ├── 02_query_metrics.py
│   │   ├── 03_diagnostic_settings.py
│   │   ├── 04_activity_logs.py
│   │   └── 05_resource_health.py
│   │
│   ├── 02_log_analytics/
│   │   ├── 01_workspace_setup.py
│   │   ├── 02_kql_queries.py
│   │   ├── 03_custom_logs.py
│   │   ├── 04_saved_searches.py
│   │   └── 05_data_export.py
│   │
│   ├── 03_application_insights/
│   │   ├── 01_create_app_insights.py
│   │   ├── 02_telemetry_collection.py
│   │   ├── 03_availability_tests.py
│   │   ├── 04_custom_events.py
│   │   └── 05_performance_monitoring.py
│   │
│   ├── 04_alerts/
│   │   ├── 01_metric_alerts.py
│   │   ├── 02_log_query_alerts.py
│   │   ├── 03_action_groups.py
│   │   ├── 04_smart_detection.py
│   │   └── 05_alert_processing.py
│   │
│   ├── 05_dashboards/
│   │   ├── 01_create_dashboard.py
│   │   ├── 02_workbook_creation.py
│   │   ├── 03_custom_visualizations.py
│   │   ├── 04_dashboard_sharing.py
│   │   └── 05_automated_reports.py
│   │
│   ├── 06_distributed_tracing/
│   │   ├── 01_basic_tracing.py
│   │   ├── 02_correlation_context.py
│   │   ├── 03_span_attributes.py
│   │   ├── 04_sampling_strategies.py
│   │   └── 05_trace_analysis.py
│   │
│   ├── 07_custom_telemetry/
│   │   ├── 01_custom_metrics.py
│   │   ├── 02_custom_events.py
│   │   ├── 03_dependency_tracking.py
│   │   ├── 04_performance_counters.py
│   │   └── 05_business_metrics.py
│   │
│   └── 08_integrations/
│       ├── 01_opentelemetry_setup.py
│       ├── 02_prometheus_exporter.py
│       ├── 03_grafana_dashboard.py
│       ├── 04_datadog_integration.py
│       └── 05_hybrid_monitoring.py
│
├── templates/
│   ├── dashboards/
│   │   ├── infrastructure_dashboard.json
│   │   ├── application_dashboard.json
│   │   └── security_dashboard.json
│   ├── workbooks/
│   │   ├── performance_workbook.json
│   │   ├── error_analysis_workbook.json
│   │   └── usage_workbook.json
│   ├── alerts/
│   │   ├── resource_health_alert.json
│   │   ├── performance_alert.json
│   │   └── availability_alert.json
│   └── queries/
│       ├── common_kql_queries.kql
│       ├── security_queries.kql
│       └── performance_queries.kql
│
├── notebooks/
│   ├── 01_monitoring_basics.ipynb
│   ├── 02_kql_mastery.ipynb
│   ├── 03_telemetry_analysis.ipynb
│   ├── 04_alert_tuning.ipynb
│   └── 05_dashboard_design.ipynb
│
├── scripts/
│   ├── setup_monitoring.sh         # Setup script
│   ├── export_logs.py              # Log export utility
│   ├── alert_summary.py            # Alert reporting
│   └── cost_analysis.py            # Monitoring cost analysis
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   ├── unit/
│   │   ├── test_monitor_client.py
│   │   ├── test_query_client.py
│   │   ├── test_telemetry_client.py
│   │   └── test_alert_rules.py
│   └── integration/
│       ├── test_log_analytics.py
│       ├── test_app_insights.py
│       ├── test_alert_workflow.py
│       └── test_distributed_tracing.py
│
└── .github/
    └── workflows/
        ├── monitoring-test.yml     # Test monitoring
        ├── alert-validation.yml    # Validate alerts
        └── dashboard-deploy.yml    # Deploy dashboards
```
git clone https://github.com/vanHeemstraSystems/learning-idp-observability.git
cd learning-idp-observability

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Login to Azure
az login

# Set subscription
az account set --subscription "your-subscription-id"

# Create service principal with monitoring permissions
az ad sp create-for-rbac \
  --name "idp-monitoring-sp" \
  --role "Monitoring Contributor" \
  --scopes /subscriptions/{subscription-id}

# Configure environment variables
cp .env.example .env
# Edit .env with your credentials

# Create Log Analytics workspace
python examples/01_azure_monitor/01_create_workspace.py

# Run a simple KQL query
python examples/02_log_analytics/02_kql_queries.py

# Set up Application Insights
python examples/03_application_insights/01_create_app_insights.py

Follow this recommended sequence:
Day 1-2: Azure Monitor Basics
- Read docs/concepts/03-azure-monitor.md
- Complete examples in examples/01_azure_monitor/
- Practice querying metrics and logs

Day 3-5: Log Analytics
- Study docs/concepts/05-log-analytics.md
- Work through examples/02_log_analytics/
- Master the KQL query language

Day 6-7: Application Insights
- Read docs/concepts/04-application-insights.md
- Complete examples in examples/03_application_insights/
- Implement telemetry collection

Day 1-3: Alert Configuration
- Study docs/guides/alert-configuration.md
- Work through examples/04_alerts/
- Configure action groups and notifications

Day 4-7: Dashboard Creation
- Read docs/guides/dashboard-creation.md
- Complete examples in examples/05_dashboards/
- Build custom workbooks and visualizations

Day 1-4: Distributed Tracing
- Study docs/concepts/06-distributed-tracing.md
- Work through examples/06_distributed_tracing/
- Implement end-to-end tracing

Day 5-7: Custom Telemetry
- Complete examples in examples/07_custom_telemetry/
- Implement business metrics
- Configure performance counters

Day 1-3: Tool Integration
- Work through examples/08_integrations/
- Integrate with Prometheus/Grafana
- Set up OpenTelemetry

Day 4-7: Production Readiness
- Optimize alert rules
- Implement cost management
- Build comprehensive monitoring
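The cost-management step above typically starts from the Log Analytics `Usage` table. As a sketch, here is a commonly used billable-ingestion query wrapped in a small helper; the helper name and the idea of templating the lookback window are illustrative, not part of this repository:

```python
# Standard Log Analytics schema: the Usage table reports ingested volume
# (Quantity, in MB) per data type; IsBillable separates free from paid data.
INGESTION_BY_TABLE = """
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize BillableGB = sum(Quantity) / 1024.0 by DataType
| order by BillableGB desc
"""

def build_cost_query(days: int = 30) -> str:
    """Return the ingestion-cost query for an arbitrary lookback window."""
    return INGESTION_BY_TABLE.replace("ago(30d)", f"ago({days}d)")

# The resulting string would be executed with LogsQueryClient.query_workspace(),
# the same pattern as the KQL example later in this README.
print(build_cost_query(7))
```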
# Azure Monitor
azure-mgmt-monitor>=6.0.0 # Monitor management
azure-monitor-query>=1.3.0 # Query operations
azure-monitor-ingestion>=1.0.0 # Data ingestion
# Application Insights
azure-applicationinsights>=0.1.1 # App Insights query API
opencensus-ext-azure>=1.1.13 # OpenCensus integration
opentelemetry-sdk>=1.21.0 # OpenTelemetry
# Supporting Libraries
azure-identity>=1.15.0 # Authentication
azure-core>=1.29.0                 # Core functionality

from azure.identity import DefaultAzureCredential
from azure.mgmt.loganalytics import LogAnalyticsManagementClient
from azure.mgmt.loganalytics.models import Workspace, WorkspaceSku

credential = DefaultAzureCredential()
log_client = LogAnalyticsManagementClient(credential, subscription_id)

# Create workspace
workspace_params = Workspace(
    location='westeurope',
    sku=WorkspaceSku(name='PerGB2018'),
    retention_in_days=30,
    tags={
        'environment': 'production',
        'project': 'idp-monitoring'
    }
)

workspace = log_client.workspaces.begin_create_or_update(
    'my-rg',
    'my-workspace',
    workspace_params
).result()

print(f"Created workspace: {workspace.name}")
print(f"Workspace ID: {workspace.customer_id}")

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus
from datetime import timedelta

credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)

# KQL query
query = """
AzureActivity
| where TimeGenerated > ago(1d)
| where OperationNameValue contains "write"
| summarize Count=count() by ResourceGroup, OperationNameValue
| order by Count desc
| limit 10
"""

# Execute query
response = logs_client.query_workspace(
    workspace_id=workspace_id,
    query=query,
    timespan=timedelta(days=1)
)

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        print(f"\nTable: {table.name}")
        print(f"Columns: {[col.name for col in table.columns]}")
        for row in table.rows:
            print(row)
else:
    print(f"Query failed: {response.partial_error}")

from azure.identity import DefaultAzureCredential
from azure.mgmt.applicationinsights import ApplicationInsightsManagementClient
from azure.mgmt.applicationinsights.models import ApplicationInsightsComponent

credential = DefaultAzureCredential()
app_insights_client = ApplicationInsightsManagementClient(credential, subscription_id)

# Create Application Insights
app_insights_params = ApplicationInsightsComponent(
    location='westeurope',
    kind='web',
    application_type='web',
    workspace_resource_id=workspace.id,
    ingestion_mode='LogAnalytics',
    tags={
        'application': 'my-app',
        'environment': 'production'
    }
)

app_insights = app_insights_client.components.create_or_update(
    'my-rg',
    'my-app-insights',
    app_insights_params
)

print(f"Created App Insights: {app_insights.name}")
print(f"Instrumentation Key: {app_insights.instrumentation_key}")
print(f"Connection String: {app_insights.connection_string}")

from opencensus.ext.azure.log_exporter import AzureLogHandler
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.tracer import Tracer
from opencensus.trace.samplers import ProbabilitySampler
import logging

# Configure logging with Application Insights
logger = logging.getLogger(__name__)
logger.addHandler(AzureLogHandler(
    connection_string=connection_string
))

# Configure tracing
tracer = Tracer(
    exporter=AzureExporter(connection_string=connection_string),
    sampler=ProbabilitySampler(rate=1.0)
)

# Send custom event
logger.info('User login', extra={
    'custom_dimensions': {
        'user_id': 'user123',
        'login_method': 'oauth',
        'ip_address': '192.168.1.1'
    }
})

# Create trace
with tracer.span(name='process_order') as span:
    span.add_attribute('order_id', '12345')
    span.add_attribute('amount', 99.99)
    # Process order logic here
    logger.info('Order processed successfully')

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
    MetricAlertAction
)

credential = DefaultAzureCredential()
monitor_client = MonitorManagementClient(credential, subscription_id)

# Define metric alert
alert = MetricAlertResource(
    location='global',
    description='Alert when CPU usage exceeds 80%',
    severity=2,
    enabled=True,
    scopes=[vm_resource_id],
    evaluation_frequency='PT5M',
    window_size='PT15M',
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name='HighCPU',
                metric_name='Percentage CPU',
                metric_namespace='Microsoft.Compute/virtualMachines',
                operator='GreaterThan',
                threshold=80,
                time_aggregation='Average'
            )
        ]
    ),
    actions=[
        MetricAlertAction(
            action_group_id=action_group_id
        )
    ],
    tags={
        'severity': 'high',
        'team': 'operations'
    }
)

# Create alert
alert_rule = monitor_client.metric_alerts.create_or_update(
    'my-rg',
    'high-cpu-alert',
    alert
)

print(f"Created alert: {alert_rule.name}")

from azure.identity import DefaultAzureCredential
from azure.mgmt.portal import Portal
from azure.mgmt.portal.models import Dashboard

credential = DefaultAzureCredential()
portal_client = Portal(credential, subscription_id)

# Define dashboard
dashboard_properties = {
    "lenses": {
        "0": {
            "order": 0,
            "parts": {
                "0": {
                    "position": {
                        "x": 0,
                        "y": 0,
                        "colSpan": 6,
                        "rowSpan": 4
                    },
                    "metadata": {
                        "type": "Extension/HubsExtension/PartType/MonitorChartPart",
                        "settings": {
                            "content": {
                                "chartType": "Line",
                                "metrics": [{
                                    "resourceId": vm_resource_id,
                                    "name": "Percentage CPU"
                                }]
                            }
                        }
                    }
                }
            }
        }
    }
}

dashboard = Dashboard(
    location='westeurope',
    tags={'environment': 'production'},
    properties=dashboard_properties
)

# Create dashboard
created_dashboard = portal_client.dashboards.create_or_update(
    'my-rg',
    'my-dashboard',
    dashboard
)

print(f"Created dashboard: {created_dashboard.name}")

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

# Set up tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Azure Monitor exporter
exporter = AzureMonitorTraceExporter(
    connection_string=connection_string
)

# Add span processor
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(exporter)
)

# Create spans
with tracer.start_as_current_span("parent-operation") as parent_span:
    parent_span.set_attribute("user.id", "user123")

    # Child operation 1
    with tracer.start_as_current_span("database-query") as db_span:
        db_span.set_attribute("db.system", "postgresql")
        db_span.set_attribute("db.operation", "SELECT")
        ...  # Database query logic

    # Child operation 2
    with tracer.start_as_current_span("external-api-call") as api_span:
        api_span.set_attribute("http.method", "GET")
        api_span.set_attribute("http.url", "https://api.example.com")
        ...  # API call logic

print("Distributed trace sent to Application Insights")

import structlog
# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Log with structured data
logger.info("user_action",
    user_id="user123",
    action="login",
    ip_address="192.168.1.1",
    success=True
)

from opencensus.trace.samplers import ProbabilitySampler
# Sample 10% of traces in production
sampler = ProbabilitySampler(rate=0.1)
tracer = Tracer(
    exporter=AzureExporter(connection_string=connection_string),
    sampler=sampler
)

# Critical: System-wide failures
logger.critical("Database connection failed")

# Error: Recoverable errors
logger.error("Failed to process payment", order_id=12345)

# Warning: Potential issues
logger.warning("API response slow", duration_ms=5000)

# Info: Business events
logger.info("Order completed", order_id=12345, amount=99.99)

# Debug: Detailed debugging
logger.debug("Cache hit", key="user:123")

# Use summarize for aggregations
query = """
requests
| where timestamp > ago(1h)
| summarize Count=count(), AvgDuration=avg(duration) by bin(timestamp, 5m)
"""
# Project only needed columns
query = """
requests
| project timestamp, name, duration, resultCode
| where duration > 1000
"""
# Use materialized views for frequent queries# Install monitoring libraries
pip install opencensus-ext-azure
pip install opentelemetry-sdk
pip install prometheus-client
# KQL tools
pip install kqlmagic # For Jupyter notebooks
# Testing
pip install pytest
pip install pytest-mock

# Use Log Analytics in Azure Portal for query development
# Export queries to files for version control

```
┌─────────────────────────────────────┐
│  LOGS                               │
│  - Structured logging               │
│  - Log aggregation                  │
│  - Search and analysis              │
└──────────────────┬──────────────────┘
                   │
┌──────────────────▼──────────────────┐
│  METRICS                            │
│  - Time-series data                 │
│  - Dashboards                       │
│  - Alerting                         │
└──────────────────┬──────────────────┘
                   │
┌──────────────────▼──────────────────┐
│  TRACES                             │
│  - Distributed tracing              │
│  - Request flow                     │
│  - Performance analysis             │
└─────────────────────────────────────┘
```
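The three pillars become most valuable when they share a correlation ID, so a single request can be followed across logs, metrics, and traces. A minimal, dependency-free sketch of that idea — the `handle_request` helper and the record shapes are hypothetical, not an Azure or OpenTelemetry API:

```python
import logging
import time
import uuid

def handle_request(name: str) -> dict:
    """Hypothetical handler that emits one record per pillar for a request."""
    trace_id = uuid.uuid4().hex  # shared correlation ID across all three pillars
    start = time.perf_counter()
    # ... request work would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000

    log_record = {"message": f"handled {name}", "trace_id": trace_id}        # LOGS
    metric_sample = {"name": "request_duration_ms", "value": duration_ms,
                     "trace_id": trace_id}                                   # METRICS
    span = {"name": name, "trace_id": trace_id, "duration_ms": duration_ms}  # TRACES

    logging.getLogger(__name__).info(log_record["message"],
                                     extra={"trace_id": trace_id})
    return {"log": log_record, "metric": metric_sample, "span": span}

records = handle_request("GET /orders")
# All three records carry the same trace_id, which is what lets a backend
# such as Application Insights join them during analysis.
assert len({r["trace_id"] for r in records.values()}) == 1
```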
Layer 4: Business Metrics
- User engagement
- Revenue metrics
- Conversion rates
Layer 3: Application Metrics
- Response times
- Error rates
- Dependencies
Layer 2: Infrastructure Metrics
- CPU, Memory, Disk
- Network traffic
- Resource utilization
Layer 1: Platform Metrics
- Azure service health
- Resource availability
- Service limits
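Layer 4 metrics are usually derived from tracked events rather than read from any Azure API — for instance, a conversion rate computed from custom events before being emitted as a custom metric. A small sketch, with illustrative event names:

```python
from collections import Counter

def conversion_rate(events: list[str]) -> float:
    """Fraction of 'checkout_started' events that led to 'order_completed'."""
    counts = Counter(events)
    started = counts["checkout_started"]
    return counts["order_completed"] / started if started else 0.0

events = ["checkout_started", "order_completed",
          "checkout_started", "checkout_abandoned",
          "checkout_started", "order_completed"]
rate = conversion_rate(events)
print(f"conversion_rate={rate:.2f}")  # 2 of 3 checkouts completed -> 0.67
# The resulting value would then be sent as a custom metric, e.g. with the
# custom-metrics example under examples/07_custom_telemetry/.
```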
- learning-internal-development-platform - Main overview
- learning-idp-python-azure-sdk - Azure SDK fundamentals
- learning-idp-azure-security - Security monitoring
- learning-idp-cicd-pipelines - Pipeline monitoring
- learning-idp-platform-engineering - Platform observability
This is a personal learning repository, but suggestions and improvements are welcome!
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass
- Submit a pull request
This project is for educational purposes. See LICENSE file for details.
Willem van Heemstra
- GitHub: @vanHeemstraSystems
- LinkedIn: Willem van Heemstra
Last updated: December 18, 2025

Part of the learning-internal-development-platform series