Azure Kubernetes Service (AKS) Complete Guide

Introduction to Azure Kubernetes Service (AKS)

Azure Kubernetes Service (AKS) is Microsoft's managed Kubernetes offering that simplifies deploying, managing, and scaling containerized applications using Kubernetes. It handles critical tasks like health monitoring, maintenance, and security patching, allowing DevOps teams to focus on applications rather than infrastructure.

Key Features:

  • Managed Control Plane: Azure-managed control plane (free on the Free tier; the Standard tier adds an uptime SLA for a fee)
  • Azure AD Integration: Native Azure Active Directory integration
  • Azure Monitor: Built-in monitoring and logging
  • Virtual Nodes: Serverless Kubernetes with ACI
  • Network Policies: Calico or Azure Network Policy enforcement
  • Auto-scaling: Cluster and pod auto-scaling
  • Security: Azure Policy, RBAC, and managed identities

AKS Architecture

Core Components:

AKS Architecture:
├── Control Plane (Managed by Azure)
│   ├── API Server
│   ├── Scheduler
│   ├── Controller Manager
│   └── etcd (Managed)
├── Node Pools
│   ├── System Node Pool (Critical system pods)
│   ├── User Node Pool (Application workloads)
│   └── Spot Node Pool (Cost-optimized)
└── Azure Services Integration
    ├── Azure Container Registry (ACR)
    ├── Azure Monitor
    ├── Azure Active Directory
    └── Azure Virtual Network

# High Availability Options:
- Availability Zones (Spread nodes across up to 3 zones)
- Region Pairs (Disaster recovery across regions)
- Multiple Node Pools (Different VM types per workload)
                

Networking Models:

  • Kubenet: Basic networking; Azure creates and manages the VNET. Best for simple deployments with basic networking requirements.
  • Azure CNI: Advanced networking; pods receive IP addresses directly from the VNET. Best for enterprise environments, network policies, and integration with existing VNETs.
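A sizing note for Azure CNI: each node consumes one VNET IP for itself plus one per pod (the default --max-pods is 30), and Azure reserves 5 addresses per subnet. A quick back-of-the-envelope check (the node and pod counts are illustrative, not a sizing tool):

```shell
#!/bin/bash
# Rough Azure CNI subnet sizing: one IP per node, one per pod
# (default --max-pods is 30), plus Azure's 5 reserved IPs per subnet.
required_ips() {
  local nodes=$1 max_pods=$2
  echo $(( nodes * (max_pods + 1) + 5 ))
}

# Smallest prefix length whose address count covers the requirement.
min_prefix() {
  local need=$1 prefix=30
  while [ $(( 1 << (32 - prefix) )) -lt "$need" ]; do
    prefix=$(( prefix - 1 ))
  done
  echo "$prefix"
}

required_ips 10 30                    # prints: 315
min_prefix "$(required_ips 10 30)"    # prints: 23 (a /23 holds 512 IPs)
```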

Cluster Creation & Management

Creating AKS Cluster with Azure CLI:

#!/bin/bash
# Variables
RESOURCE_GROUP="aks-rg"
CLUSTER_NAME="aks-cluster"
LOCATION="eastus"
NODE_COUNT=3
NODE_SIZE="Standard_DS2_v2"

# Create Resource Group
az group create --name $RESOURCE_GROUP --location $LOCATION

# Create AKS Cluster with advanced features
az aks create \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --node-count $NODE_COUNT \
    --node-vm-size $NODE_SIZE \
    --enable-addons monitoring \
    --enable-managed-identity \
    --network-plugin azure \
    --network-policy calico \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 10 \
    --nodepool-name systempool \
    --nodepool-tags "Environment=Production" \
    --enable-private-cluster \
    --outbound-type loadBalancer \
    --load-balancer-sku standard \
    --generate-ssh-keys

# Get credentials
az aks get-credentials \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --overwrite-existing

# Verify cluster
kubectl get nodes
kubectl cluster-info
                

Terraform AKS Configuration:

# main.tf
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "aks" {
  name     = "aks-rg"
  location = "East US"
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-cluster"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = "aks-cluster"
  kubernetes_version  = "1.26.3"

  default_node_pool {
    name                = "systempool"
    node_count          = 3
    vm_size             = "Standard_DS2_v2"
    vnet_subnet_id      = azurerm_subnet.aks.id
    enable_auto_scaling = true
    min_count           = 1
    max_count           = 10
    os_disk_size_gb     = 128
    type                = "VirtualMachineScaleSets"
    node_labels = {
      "role" = "system"
    }
    tags = {
      Environment = "Production"
    }
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin    = "azure"
    network_policy    = "calico"
    load_balancer_sku = "standard"
    service_cidr      = "10.0.0.0/16"
    dns_service_ip    = "10.0.0.10"
  }

  # azurerm 3.x removed the legacy addon_profile block; add-ons are
  # now configured via top-level arguments and blocks
  azure_policy_enabled = true

  oms_agent {
    log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
  }

  azure_active_directory_role_based_access_control {
    managed                = true
    azure_rbac_enabled     = true
    admin_group_object_ids = [var.aks_admin_group_id]
  }

  auto_scaler_profile {
    balance_similar_node_groups      = true
    expander                         = "priority"
    max_graceful_termination_sec     = 600
    scale_down_delay_after_add       = "10m"
    scale_down_unneeded_time         = "10m"
    scale_down_unready_time          = "20m"
    scale_down_utilization_threshold = 0.5
  }

  tags = {
    Environment = "Production"
  }
}

# Additional user node pool
resource "azurerm_kubernetes_cluster_node_pool" "user" {
  name                  = "userpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D4s_v3"
  node_count            = 2
  min_count             = 2
  max_count             = 5
  enable_auto_scaling   = true
  vnet_subnet_id        = azurerm_subnet.aks.id
  node_labels = {
    "role" = "user"
  }
  node_taints = [
    "app=user:NoSchedule"
  ]
  tags = {
    Workload = "Application"
  }
}
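The auto_scaler_profile above sets scale_down_utilization_threshold = 0.5: roughly, a node becomes a scale-down candidate when the sum of its pods' requests stays below 50% of its allocatable capacity for scale_down_unneeded_time. A sketch of that test (the millicore figures are illustrative):

```shell
#!/bin/bash
# Sketch of the cluster-autoscaler scale-down test configured above:
# a node is a removal candidate when requested CPU / allocatable CPU
# stays below scale_down_utilization_threshold (50% here).
is_scale_down_candidate() {
  local requested_m=$1 allocatable_m=$2 threshold_pct=$3
  # integer math: requested*100 vs allocatable*threshold%
  [ $(( requested_m * 100 )) -lt $(( allocatable_m * threshold_pct )) ]
}

is_scale_down_candidate 800 2000 50 && echo "candidate"   # 40% < 50%
is_scale_down_candidate 1200 2000 50 || echo "keep"       # 60% >= 50%
```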
                

Networking Configuration

Advanced Networking Setup:

# Azure CNI with Custom VNET
az aks create \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --node-count 3 \
    --network-plugin azure \
    --network-policy azure \
    --vnet-subnet-id /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet} \
    --dns-service-ip 10.0.0.10 \
    --service-cidr 10.0.0.0/16 \
    --load-balancer-sku standard \
    --outbound-type userDefinedRouting \
    --enable-private-cluster

# Application Gateway Ingress Controller (AGIC)
helm repo add application-gateway-kubernetes-ingress https://appgwingress.blob.core.windows.net/ingress-azure-helm-package/
helm repo update

helm install ingress-azure \
    application-gateway-kubernetes-ingress/ingress-azure \
    --set appgw.name=applicationgateway \
    --set appgw.resourceGroup=aks-rg \
    --set appgw.subscriptionId=$SUBSCRIPTION_ID \
    --set appgw.shared=false \
    --set armAuth.type=aadPodIdentity \
    --set armAuth.identityResourceId=$IDENTITY_RESOURCE_ID \
    --set armAuth.identityClientId=$IDENTITY_CLIENT_ID \
    --set rbac.enabled=true \
    --namespace kube-system

# Ingress configuration
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agic-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
  tls:
  - hosts:
    - app.example.com
    secretName: tls-secret
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80
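One constraint worth validating before running the create command above: --dns-service-ip must fall inside --service-cidr (and AKS disallows the first address in the range, which the platform reserves). A local sanity check:

```shell
#!/bin/bash
# Check that a candidate --dns-service-ip sits inside --service-cidr.
ip_to_int() {
  local IFS=. ; set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

in_cidr() {
  local ip=$1 cidr=$2
  local base=${cidr%/*} prefix=${cidr#*/}
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$base") & mask )) ]
}

in_cidr 10.0.0.10 10.0.0.0/16 && echo "dns-service-ip is inside the service CIDR"
in_cidr 192.168.0.10 10.0.0.0/16 || echo "outside the service CIDR"
```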
                

Network Policies:

# Azure Network Policy (using Azure NPM)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8
    ports:
    - protocol: TCP
      port: 443

# Calico GlobalNetworkPolicy: allow only the listed ports; traffic
# not matched by an allow rule is implicitly denied
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny
spec:
  selector: all()
  types:
  - Ingress
  - Egress
  ingress:
  - action: Allow
    protocol: TCP
    destination:
      ports:
      - 80
      - 443
  egress:
  - action: Allow
    protocol: TCP
    destination:
      ports:
      - 53
      - 443
                

Storage Management

Azure Disk Storage Classes:

# Storage Class for managed disks
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
  kind: Managed
  cachingMode: ReadOnly
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

# Storage Class for Azure Files
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
  protocol: smb
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - nosharesock

# Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-azure-disk
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 100Gi

# StatefulSet with Azure Disk
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgresql
  replicas: 3
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:13
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 50Gi
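Each volumeClaimTemplate produces one PVC per replica, named <template>-<statefulset>-<ordinal>, so the manifest above yields three claims:

```shell
#!/bin/bash
# Enumerate the PVC names a StatefulSet's volumeClaimTemplates create.
pvc_names() {
  local template=$1 sts=$2 replicas=$3 i
  for i in $(seq 0 $(( replicas - 1 ))); do
    echo "${template}-${sts}-${i}"
  done
}

pvc_names data postgresql 3
# data-postgresql-0
# data-postgresql-1
# data-postgresql-2
```

These PVCs outlive pod rescheduling, which is why each replica keeps its own disk.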
                

Azure Blob CSI Driver:

# Install Blob CSI Driver
helm repo add blob-csi-driver https://raw.githubusercontent.com/kubernetes-sigs/blob-csi-driver/master/charts
helm install blob-csi-driver blob-csi-driver/blob-csi-driver \
    --namespace kube-system

# Storage Class for Blob Storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: blob-csi
provisioner: blob.csi.azure.com
parameters:
  skuName: Standard_LRS
reclaimPolicy: Delete
volumeBindingMode: Immediate

# Persistent Volume with static provisioning
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-blob
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: blob-csi
  csi:
    driver: blob.csi.azure.com
    readOnly: false
    volumeHandle: unique-volumeid
    volumeAttributes:
      containerName: mycontainer
    nodeStageSecretRef:
      name: azure-secret
      namespace: default
                

Security Best Practices

Azure AD Integration:

# Enable Azure AD integration
az aks update \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --enable-aad \
    --aad-admin-group-object-ids $ADMIN_GROUP_ID \
    --aad-tenant-id $TENANT_ID

# Kubernetes Role Binding with Azure AD Groups
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aks-admins-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: "aks-admins-group-id"

# Pod Identity with AAD Pod Identity (deprecated; prefer Microsoft
# Entra Workload ID for new clusters)
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentity
metadata:
  name: app-identity
  namespace: default
spec:
  type: 0  # User-assigned managed identity
  resourceID: /subscriptions/{sub-id}/resourcegroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity}
  clientID: {client-id}

apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentityBinding
metadata:
  name: app-identity-binding
  namespace: default
spec:
  azureIdentity: app-identity
  selector: app
                

Azure Policy for Kubernetes:

# Enable Azure Policy
az aks enable-addons \
    --addons azure-policy \
    --name $CLUSTER_NAME \
    --resource-group $RESOURCE_GROUP

# Built-in policy definitions
az policy assignment create \
    --name 'aks-policy' \
    --display-name 'AKS Security Policy' \
    --scope /subscriptions/{sub-id}/resourceGroups/{rg} \
    --policy-set-definition '/providers/Microsoft.Authorization/policySetDefinitions/42b8ef37-b724-4e24-bbc8-7a7708edfe00'

# Custom Gatekeeper policies
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("You must provide labels: %v", [missing])
        }

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-environment-label
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    labels:
      - "environment"
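The rego in the ConstraintTemplate computes missing := required - provided over label keys. The same set difference, sketched in shell for intuition (label names are illustrative):

```shell
#!/bin/bash
# Mirror of the Gatekeeper check: which required label keys are
# absent from the keys a pod actually provides?
missing_labels() {
  local required=$1 provided=$2 missing="" key
  for key in $required; do
    case " $provided " in
      *" $key "*) ;;                   # key present, no violation
      *) missing="$missing $key" ;;    # key missing -> violation
    esac
  done
  echo "${missing# }"
}

missing_labels "environment team" "team app"   # prints: environment
```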
                

Security Context and Pod Security Standards:

# Pod Security Admission (Kubernetes 1.25+)
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

# Pod Security Context
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: myapp:latest
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          privileged: false
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
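The requests and limits above also determine the pod's QoS class, which drives eviction order under node pressure. A simplified sketch over one container's CPU (m) and memory (Mi) values (it ignores Kubernetes' defaulting of requests to limits):

```shell
#!/bin/bash
# QoS class: Guaranteed (requests == limits), Burstable (some
# requests set), BestEffort (none). 0 means "not set" here.
qos_class() {
  local req_cpu=$1 lim_cpu=$2 req_mem=$3 lim_mem=$4
  if [ "$req_cpu" -eq 0 ] && [ "$req_mem" -eq 0 ]; then
    echo BestEffort
  elif [ "$req_cpu" -eq "$lim_cpu" ] && [ "$req_mem" -eq "$lim_mem" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 250 500 64 128    # the secure-app values -> Burstable
qos_class 500 500 128 128   # requests == limits    -> Guaranteed
```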
                

Scaling Strategies

Cluster Auto-scaling:

# Enable cluster auto-scaler during creation
az aks create \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --node-count 3 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 10

# Update existing cluster
az aks update \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 15

# Node pool scaling
az aks nodepool update \
    --resource-group $RESOURCE_GROUP \
    --cluster-name $CLUSTER_NAME \
    --name userpool \
    --enable-cluster-autoscaler \
    --min-count 2 \
    --max-count 20

# Horizontal Pod Autoscaler with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
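Under the hood, the HPA computes desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue) per metric, takes the largest result, and clamps it to the min/max bounds (2 and 20 above). A sketch of that arithmetic:

```shell
#!/bin/bash
# desired = ceil(current * metric / target), clamped to [min, max].
desired_replicas() {
  local current=$1 metric=$2 target=$3 min=$4 max=$5
  local desired=$(( (current * metric + target - 1) / target ))  # ceil
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

desired_replicas 4 90 70 2 20   # 4 pods at 90% vs 70% target -> 6
desired_replicas 4 10 70 2 20   # would shrink to 1, clamped  -> 2
```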
                

Virtual Nodes (Serverless):

# Enable Virtual Nodes
az aks enable-addons \
    --addons virtual-node \
    --name $CLUSTER_NAME \
    --resource-group $RESOURCE_GROUP \
    --subnet-name aci-subnet

# Deploy to Virtual Nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: serverless-app
spec:
  replicas: 100  # Can scale to hundreds instantly
  selector:
    matchLabels:
      app: serverless-app
  template:
    metadata:
      labels:
        app: serverless-app
    spec:
      nodeSelector:
        kubernetes.io/role: agent
        beta.kubernetes.io/os: linux
        type: virtual-kubelet
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      - key: azure.com/aci
        effect: NoSchedule
      containers:
      - name: serverless-app
        image: myapp:latest
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
                

CI/CD Integration

GitHub Actions for AKS:

# .github/workflows/deploy.yml
name: Deploy to AKS

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: myacr.azurecr.io
  IMAGE_NAME: myapp
  CLUSTER_NAME: aks-cluster
  RESOURCE_GROUP: aks-rg

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v2
    
    - name: Login to Azure Container Registry
      uses: azure/docker-login@v1
      with:
        login-server: ${{ env.REGISTRY }}
        username: ${{ secrets.ACR_USERNAME }}
        password: ${{ secrets.ACR_PASSWORD }}
    
    - name: Build and push
      uses: docker/build-push-action@v4
      with:
        push: true
        tags: |
          ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
          ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
    
    - name: Login to Azure
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    
    - name: Get AKS credentials
      uses: azure/aks-set-context@v3
      with:
        resource-group: ${{ env.RESOURCE_GROUP }}
        cluster-name: ${{ env.CLUSTER_NAME }}
    
    - name: Deploy to AKS
      run: |
        # Update image in deployment
        kubectl set image deployment/myapp \
          myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
        
        # Wait for rollout
        kubectl rollout status deployment/myapp --timeout=300s
        
        # Run tests
        kubectl run test --rm -i --restart=Never \
          --image=alpine/curl:latest \
          -- curl -f http://myapp-service/health
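The single curl in the smoke test above fails the job on one transient error. A small retry wrapper is more forgiving; the command under test is passed in as arguments, so this sketch runs without a cluster (the health URL in the comment is the hypothetical one from the workflow):

```shell
#!/bin/bash
# Retry a command up to N times with a short pause between attempts.
retry() {
  local attempts=$1; shift
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# In the pipeline: retry 5 curl -f http://myapp-service/health
retry 3 true  && echo "healthy"
retry 2 false || echo "gave up after retries"
```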
                

Azure DevOps Pipeline:

# azure-pipelines.yml
trigger:
- main

variables:
  azureSubscription: 'Azure-Service-Connection'
  aksCluster: 'aks-cluster'
  resourceGroup: 'aks-rg'
  containerRegistry: 'myacr.azurecr.io'
  imageRepository: 'myapp'
  dockerfilePath: '$(Build.SourcesDirectory)/Dockerfile'
  tag: '$(Build.BuildId)'

stages:
- stage: Build
  displayName: Build and push stage
  jobs:
  - job: Build
    displayName: Build
    pool:
      vmImage: ubuntu-latest
    steps:
    - task: Docker@2
      displayName: Build and push an image
      inputs:
        command: buildAndPush
        repository: $(imageRepository)
        dockerfile: $(dockerfilePath)
        containerRegistry: $(containerRegistry)
        tags: |
          $(tag)
          latest

- stage: Deploy
  displayName: Deploy stage
  dependsOn: Build
  jobs:
  - deployment: Deploy
    displayName: Deploy
    environment: 'production'
    pool:
      vmImage: ubuntu-latest
    strategy:
      runOnce:
        deploy:
          steps:
          - task: KubernetesManifest@0
            displayName: Deploy to Kubernetes
            inputs:
              action: deploy
              kubernetesServiceConnection: $(aksCluster)
              namespace: 'default'
              manifests: |
                $(Build.SourcesDirectory)/manifests/deployment.yaml
                $(Build.SourcesDirectory)/manifests/service.yaml
                $(Build.SourcesDirectory)/manifests/ingress.yaml
              containers: |
                $(containerRegistry)/$(imageRepository):$(tag)
          
          - task: Kubernetes@1
            displayName: Verify deployment
            inputs:
              connectionType: Kubernetes Service Connection
              kubernetesServiceEndpoint: $(aksCluster)
              command: rollout
              arguments: status deployment/myapp --timeout=300s
                

Monitoring & Logging

Azure Monitor for Containers:

# Enable Azure Monitor during cluster creation
az aks create \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --enable-addons monitoring \
    --workspace-resource-id /subscriptions/{sub-id}/resourcegroups/{rg}/providers/microsoft.operationalinsights/workspaces/{workspace}

# Or enable on existing cluster
az aks enable-addons \
    --addons monitoring \
    --name $CLUSTER_NAME \
    --resource-group $RESOURCE_GROUP \
    --workspace-resource-id $WORKSPACE_ID

# Prometheus metrics scraping
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  resources:
    requests:
      memory: 400Mi
  enableRemoteWriteReceiver: true

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-service-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: web
    interval: 30s
    path: /metrics
                

Log Analytics Queries:

// Kusto Query Language (KQL) examples

// Container logs
ContainerLog
| where ContainerName == "myapp"
| project TimeGenerated, LogEntry, Computer
| order by TimeGenerated desc

// Performance metrics
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName

// Node status
KubeNodeInventory
| where ClusterName == "aks-cluster"
| project TimeGenerated, Computer, Status, Labels

// Pod status
KubePodInventory
| where ClusterName == "aks-cluster"
| where Namespace == "production"
| project TimeGenerated, Name, Status, PodIp

// Alert rules
// Alert on sustained high container CPU; cpuUsageNanoCores is in
// nanocores, so the threshold must use the same unit
let threshold = 500000000; // 500 millicores, expressed in nanocores
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
| where AvgCPU > threshold
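cpuUsageNanoCores is reported in nanocores, while pod requests and limits are usually written in millicores; dividing by 1,000,000 converts between them:

```shell
#!/bin/bash
# 1 millicore = 1,000,000 nanocores.
nanocores_to_millicores() {
  echo $(( $1 / 1000000 ))
}

nanocores_to_millicores 250000000   # prints: 250 (i.e. "250m")
```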
                

GitOps with Flux v2

Flux v2 Installation:

# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash

# Bootstrap Flux
flux bootstrap github \
    --owner=myorg \
    --repository=gitops-aks \
    --branch=main \
    --path=./clusters/production \
    --personal

# Create source for Helm repository
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bitnami
  namespace: flux-system
spec:
  interval: 30m
  url: https://charts.bitnami.com/bitnami

# Create Helm release
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: redis
  namespace: production
spec:
  interval: 5m
  chart:
    spec:
      chart: redis
      version: '14.0.0'
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: flux-system
  values:
    architecture: standalone
    auth:
      enabled: false
    master:
      persistence:
        enabled: true
        size: 8Gi

# Kustomization for application deployment
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/production
  prune: true
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: myapp
    namespace: production
                

Disaster Recovery

Cluster Backup with Velero:

# Install Velero with Azure plugin
velero install \
    --provider azure \
    --plugins velero/velero-plugin-for-microsoft-azure:v1.5.0 \
    --bucket velero-backups \
    --secret-file ./credentials-velero \
    --backup-location-config resourceGroup=$BACKUP_RG,storageAccount=$STORAGE_ACCOUNT \
    --snapshot-location-config apiTimeout=5m,resourceGroup=$BACKUP_RG

# Schedule backups
velero schedule create daily-backup \
    --schedule="0 1 * * *" \
    --include-namespaces production

# On-demand backup
velero backup create manual-backup \
    --include-namespaces production \
    --selector app=critical

# Restore from backup
velero restore create --from-backup daily-backup-20240115-010000

# Cross-region backup
velero install \
    --provider azure \
    --bucket velero-backups-dr \
    --secret-file ./credentials-velero \
    --backup-location-config resourceGroup=$DR_RG,storageAccount=$DR_STORAGE_ACCOUNT \
    --snapshot-location-config apiTimeout=5m,resourceGroup=$DR_RG
                

Multi-region AKS Deployment:

# Primary region (East US)
az aks create \
    --resource-group aks-primary-rg \
    --name aks-primary \
    --location eastus \
    --node-count 3 \
    --enable-addons monitoring \
    --generate-ssh-keys

# Secondary region (West US)
az aks create \
    --resource-group aks-secondary-rg \
    --name aks-secondary \
    --location westus \
    --node-count 2 \
    --enable-addons monitoring \
    --generate-ssh-keys

# Global Traffic Manager for failover
az network traffic-manager profile create \
    --name aks-global \
    --resource-group global-rg \
    --routing-method Priority \
    --unique-dns-name aks-global-$RANDOM

# Add primary endpoint
az network traffic-manager endpoint create \
    --name aks-primary \
    --profile-name aks-global \
    --resource-group global-rg \
    --type externalEndpoints \
    --target $(az aks show -g aks-primary-rg -n aks-primary --query addonProfiles.httpApplicationRouting.config.HTTPApplicationRoutingZoneName -o tsv) \
    --priority 1 \
    --weight 1

# Add secondary endpoint
az network traffic-manager endpoint create \
    --name aks-secondary \
    --profile-name aks-global \
    --resource-group global-rg \
    --type externalEndpoints \
    --target $(az aks show -g aks-secondary-rg -n aks-secondary --query addonProfiles.httpApplicationRouting.config.HTTPApplicationRoutingZoneName -o tsv) \
    --priority 2 \
    --weight 1
                

Cost Optimization

Spot Node Pools:

# Create spot node pool
az aks nodepool add \
    --resource-group $RESOURCE_GROUP \
    --cluster-name $CLUSTER_NAME \
    --name spotpool \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 10 \
    --node-count 1 \
    --node-vm-size Standard_DS2_v2 \
    --labels cost=spot \
    --taints sku=spot:NoSchedule

# Pod with spot toleration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: spot-app
  template:
    metadata:
      labels:
        app: spot-app
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      nodeSelector:
        cost: spot
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
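For a rough sense of what a spot pool saves, compare monthly node cost at on-demand versus spot rates. The hourly prices here are illustrative assumptions, not real Azure rates; spot discounts vary by region, VM size, and time:

```shell
#!/bin/bash
# Back-of-the-envelope monthly cost, in cents (~730 hours/month).
monthly_cost_cents() {
  local nodes=$1 cents_per_hour=$2
  echo $(( nodes * cents_per_hour * 730 ))
}

on_demand=$(monthly_cost_cents 5 20)   # 5 nodes at an assumed $0.20/h
spot=$(monthly_cost_cents 5 6)         # same nodes at an assumed $0.06/h
echo "savings: $(( (on_demand - spot) * 100 / on_demand ))%"   # prints: savings: 70%
```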
                

Resource Optimization:

# Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 1
        memory: 512Mi
      controlledResources: ["cpu", "memory"]

# Resource quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"

# Limit ranges
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container
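The quota admission check is simple arithmetic: a pod is rejected if its requests would push the namespace total past the hard limit (requests.cpu: "10" above is 10000m). The usage figures below are illustrative:

```shell
#!/bin/bash
# Does a new pod's CPU request fit under the remaining quota headroom?
fits_quota() {
  local used_m=$1 pod_m=$2 hard_m=$3
  [ $(( used_m + pod_m )) -le "$hard_m" ]
}

fits_quota 9500 250 10000 && echo "pod admitted"     # 9750m <= 10000m
fits_quota 9900 250 10000 || echo "quota exceeded"   # 10150m > 10000m
```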
                

Best Practices

Operational Excellence:

  • Use Managed Identity: Always use system-assigned managed identities
  • Enable Azure Policy: Enforce compliance and security policies
  • Implement Network Policies: Use Calico or Azure Network Policies
  • Use Private Clusters: For production workloads, enable private clusters
  • Implement Pod Security Standards: Use Pod Security Admission
  • Enable Azure Defender: For security monitoring and threat detection

Performance Optimization:

  • Right-size Nodes: Use appropriate VM sizes for workloads
  • Use Multiple Node Pools: Separate system and user workloads
  • Implement HPA and VPA: Auto-scale based on demand
  • Use Cluster Autoscaler: Scale nodes based on pod requirements
  • Optimize Container Images: Use multi-stage builds and slim images

Cost Management:

  • Use Spot Instances: For fault-tolerant workloads
  • Implement Resource Quotas: Prevent resource over-provisioning
  • Right-size Resources: Regularly review and adjust resource requests/limits
  • Use Azure Cost Management: Monitor and optimize spending
  • Clean Up Resources: Regularly delete unused resources

Essential Commands Reference

# Cluster management
az aks create              # Create AKS cluster
az aks update              # Update cluster configuration
az aks delete              # Delete cluster
az aks get-credentials     # Get kubeconfig
az aks upgrade             # Upgrade Kubernetes version

# Node pool management
az aks nodepool list       # List node pools
az aks nodepool add        # Add node pool
az aks nodepool update     # Update node pool
az aks nodepool delete     # Delete node pool
az aks nodepool scale      # Scale node pool

# Monitoring and troubleshooting
az aks browse              # Open Kubernetes dashboard (retired on newer versions; use the Azure portal)
az aks show                # Show cluster details
az aks get-upgrades        # Get available upgrades
kubectl get events         # View cluster events
kubectl describe node      # Get node details

# Maintenance
az aks maintenanceconfiguration show
az aks maintenanceconfiguration add
az aks maintenanceconfiguration delete

# Security
az aks check-acr           # Check ACR integration
az aks get-versions        # Get available versions
az aks enable-addons       # Enable addons (monitoring, etc.)
az aks disable-addons      # Disable addons
        

© 2025 Azure Kubernetes Service DevOps Guide. All rights reserved.
