Azure Kubernetes Service (AKS) Complete Guide
Introduction to Azure Kubernetes Service (AKS)
Azure Kubernetes Service (AKS) is Microsoft's managed Kubernetes offering that simplifies deploying, managing, and scaling containerized applications using Kubernetes. It handles critical tasks like health monitoring, maintenance, and security patching, allowing DevOps teams to focus on applications rather than infrastructure.
Key Features:
- Managed Control Plane: Azure runs and patches the control plane (API server, etcd); free on the Free tier
- Azure AD Integration: Native Azure Active Directory integration
- Azure Monitor: Built-in monitoring and logging
- Virtual Nodes: Serverless burst capacity via Azure Container Instances (ACI)
- Network Policies: Calico or Azure network policy enforcement
- Auto-scaling: Cluster and pod auto-scaling
- Security: Azure Policy, RBAC, and managed identities
AKS Architecture
Core Components:
AKS Architecture:
├── Control Plane (Managed by Azure)
│   ├── API Server
│   ├── Scheduler
│   ├── Controller Manager
│   └── etcd (Managed)
├── Node Pools
│   ├── System Node Pool (Critical system pods)
│   ├── User Node Pool (Application workloads)
│   └── Spot Node Pool (Cost-optimized)
└── Azure Services Integration
    ├── Azure Container Registry (ACR)
    ├── Azure Monitor
    ├── Azure Active Directory
    └── Azure Virtual Network
High Availability Options:
- Availability Zones (Spread nodes across 3 zones)
- Region Pairs (Disaster recovery across regions)
- Multiple Node Pools (Different VM types per workload)
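The zone option above can be sketched in Terraform, the configuration style used later in this guide (pool name hypothetical; `zones` is the azurerm argument that spreads a node pool's VMs across availability zones):

```hcl
# Sketch: zone-redundant node pool (hypothetical names; assumes an
# azurerm_kubernetes_cluster.aks resource defined elsewhere).
resource "azurerm_kubernetes_cluster_node_pool" "zonal" {
  name                  = "zonalpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_DS2_v2"
  node_count            = 3
  zones                 = ["1", "2", "3"] # roughly one node per zone
}
```

Not every region exposes three zones, so verify zone support for the chosen VM size before pinning zone numbers.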
Networking Models:
| Model | Description | Use Case |
|---|---|---|
| Kubenet | Basic networking; nodes get VNet IPs, pods are NAT'd behind them from a separate pod CIDR | Simple deployments, IP-constrained VNets |
| Azure CNI | Advanced networking, pods get VNET IPs | Enterprise, network policies, existing VNET |
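The Azure CNI row has a concrete planning consequence: every pod consumes a VNet IP up front, so the subnet must be sized for nodes × (max pods per node + 1), plus the 5 addresses Azure reserves per subnet. A quick back-of-the-envelope check (illustrative helper, not an Azure API):

```python
import math

def required_subnet_prefix(node_count: int, max_pods_per_node: int) -> int:
    """Smallest subnet prefix that fits every node NIC plus every pod IP.

    Azure reserves 5 addresses per subnet; with Azure CNI each node
    consumes 1 IP for itself and max_pods_per_node IPs for its pods.
    """
    ips_needed = node_count * (1 + max_pods_per_node) + 5
    # Smallest power of two >= ips_needed gives the host-bit count.
    host_bits = math.ceil(math.log2(ips_needed))
    return 32 - host_bits

# 10 nodes at the Azure CNI default of 30 pods each fit in a /23.
print(required_subnet_prefix(10, 30))  # → 23
```

This is why Azure CNI clusters on small subnets run out of addresses long before they run out of compute.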
Cluster Creation & Management
Creating AKS Cluster with Azure CLI:
#!/bin/bash
# Variables
RESOURCE_GROUP="aks-rg"
CLUSTER_NAME="aks-cluster"
LOCATION="eastus"
NODE_COUNT=3
NODE_SIZE="Standard_DS2_v2"
# Create Resource Group
az group create --name $RESOURCE_GROUP --location $LOCATION
# Create AKS Cluster with advanced features
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count $NODE_COUNT \
--node-vm-size $NODE_SIZE \
--enable-addons monitoring \
--enable-managed-identity \
--network-plugin azure \
--network-policy calico \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10 \
--nodepool-name systempool \
--nodepool-tags "Environment=Production" \
--enable-private-cluster \
--outbound-type loadBalancer \
--load-balancer-sku standard \
--generate-ssh-keys
# Get credentials
az aks get-credentials \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--overwrite-existing
# Verify cluster
kubectl get nodes
kubectl cluster-info
Terraform AKS Configuration:
# main.tf
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "aks" {
  name     = "aks-rg"
  location = "East US"
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-cluster"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = "aks-cluster"
  kubernetes_version  = "1.26.3" # pin to a currently supported AKS version

  default_node_pool {
    name                = "systempool"
    node_count          = 3
    vm_size             = "Standard_DS2_v2"
    vnet_subnet_id      = azurerm_subnet.aks.id
    enable_auto_scaling = true
    min_count           = 1
    max_count           = 10
    os_disk_size_gb     = 128
    type                = "VirtualMachineScaleSets"
    node_labels = {
      "role" = "system"
    }
    tags = {
      Environment = "Production"
    }
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin    = "azure"
    network_policy    = "calico"
    load_balancer_sku = "standard"
    service_cidr      = "10.0.0.0/16"
    dns_service_ip    = "10.0.0.10"
  }

  # azurerm 3.x removed the addon_profile block; add-ons are configured
  # with top-level arguments and blocks instead.
  azure_policy_enabled = true

  oms_agent {
    log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
  }

  azure_active_directory_role_based_access_control {
    managed                = true
    azure_rbac_enabled     = true
    admin_group_object_ids = [var.aks_admin_group_id]
  }

  auto_scaler_profile {
    balance_similar_node_groups      = true
    expander                         = "priority"
    max_graceful_termination_sec     = 600
    scale_down_delay_after_add       = "10m"
    scale_down_unneeded              = "10m"
    scale_down_unready               = "20m"
    scale_down_utilization_threshold = "0.5"
  }

  tags = {
    Environment = "Production"
  }
}

# Additional user node pool
resource "azurerm_kubernetes_cluster_node_pool" "user" {
  name                  = "userpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D4s_v3"
  node_count            = 2
  min_count             = 2
  max_count             = 5
  enable_auto_scaling   = true
  vnet_subnet_id        = azurerm_subnet.aks.id
  node_labels = {
    "role" = "user"
  }
  node_taints = [
    "app=user:NoSchedule"
  ]
  tags = {
    Workload = "Application"
  }
}
Networking Configuration
Advanced Networking Setup:
# Azure CNI with Custom VNET
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 3 \
--network-plugin azure \
--network-policy azure \
--vnet-subnet-id /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet} \
--dns-service-ip 10.0.0.10 \
--service-cidr 10.0.0.0/16 \
--load-balancer-sku standard \
--outbound-type userDefinedRouting \
--enable-private-cluster
# Application Gateway Ingress Controller (AGIC)
helm repo add application-gateway-kubernetes-ingress https://appgwingress.blob.core.windows.net/ingress-azure-helm-package/
helm repo update
helm install ingress-azure \
application-gateway-kubernetes-ingress/ingress-azure \
--set appgw.name=applicationgateway \
--set appgw.resourceGroup=aks-rg \
--set appgw.subscriptionId=$SUBSCRIPTION_ID \
--set appgw.shared=false \
--set armAuth.type=aadPodIdentity \
--set armAuth.identityResourceId=$IDENTITY_RESOURCE_ID \
--set armAuth.identityClientId=$IDENTITY_CLIENT_ID \
--set rbac.enabled=true \
--namespace kube-system
# Ingress configuration
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agic-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: tls-secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
Network Policies:
# Azure Network Policy (using Azure NPM)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
      ports:
        - protocol: TCP
          port: 443
---
# Calico Network Policy
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny
spec:
  selector: all()
  types:
    - Ingress
    - Egress
  ingress:
    - action: Allow
      protocol: TCP
      destination:
        ports:
          - 80
          - 443
  egress:
    - action: Allow
      protocol: TCP
      destination:
        ports:
          - 53
          - 443
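Alongside the allow rules above, a common baseline is a namespaced default-deny, which the allow policies then punch holes through. A minimal sketch (namespace name illustrative):

```yaml
# Sketch: deny all ingress and egress for pods in a namespace
# unless another NetworkPolicy explicitly allows the traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {} # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Apply this first, then layer the app-specific allow policies on top.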
Storage Management
Azure Disk Storage Classes:
# Storage Class for managed disks
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
  kind: Managed
  cachingMode: ReadOnly
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
# Storage Class for Azure Files
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
  protocol: smb
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - nosharesock
---
# Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-azure-disk
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 100Gi
# StatefulSet with Azure Disk
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgresql
  replicas: 3
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
        - name: postgresql
          image: postgres:13
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: managed-premium
        resources:
          requests:
            storage: 50Gi
Azure Blob CSI Driver:
# Install Blob CSI Driver
helm repo add blob-csi-driver https://raw.githubusercontent.com/kubernetes-sigs/blob-csi-driver/master/charts
helm install blob-csi-driver blob-csi-driver/blob-csi-driver \
--namespace kube-system
# Storage Class for Blob Storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: blob-csi
provisioner: blob.csi.azure.com
parameters:
  skuName: Standard_LRS
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Persistent Volume with static provisioning
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-blob
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: blob-csi
  csi:
    driver: blob.csi.azure.com
    readOnly: false
    volumeHandle: unique-volumeid
    volumeAttributes:
      containerName: mycontainer
    nodeStageSecretRef:
      name: azure-secret
      namespace: default
Security Best Practices
Azure AD Integration:
# Enable Azure AD integration
az aks update \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--enable-aad \
--aad-admin-group-object-ids $ADMIN_GROUP_ID \
--aad-tenant-id $TENANT_ID
# Kubernetes Role Binding with Azure AD Groups
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aks-admins-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: "aks-admins-group-id"
---
# Pod Identity with AAD Pod Identity
# (Note: AAD Pod Identity is deprecated in favor of Azure AD Workload Identity)
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentity
metadata:
  name: app-identity
  namespace: default
spec:
  type: 0 # User-assigned managed identity
  resourceID: /subscriptions/{sub-id}/resourcegroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity}
  clientID: {client-id}
---
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentityBinding
metadata:
  name: app-identity-binding
  namespace: default
spec:
  azureIdentity: app-identity
  selector: app
Azure Policy for Kubernetes:
# Enable Azure Policy
az aks enable-addons \
--addons azure-policy \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP
# Built-in policy definitions
az policy assignment create \
--name 'aks-policy' \
--display-name 'AKS Security Policy' \
--scope /subscriptions/{sub-id}/resourceGroups/{rg} \
--policy-set-definition '/providers/Microsoft.Authorization/policySetDefinitions/42b8ef37-b724-4e24-bbc8-7a7708edfe00'
# Custom Gatekeeper policies
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("You must provide labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-environment-label
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    labels:
      - "environment"
Security Context and Pod Security Standards:
# Pod Security Admission (Kubernetes 1.25+)
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Pod Security Context
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
spec:
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: myapp:latest
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            privileged: false
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"
Scaling Strategies
Cluster Auto-scaling:
# Enable cluster auto-scaler during creation
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 3 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
# Update existing cluster
az aks update \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 15
# Node pool scaling
az aks nodepool update \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name userpool \
--enable-cluster-autoscaler \
--min-count 2 \
--max-count 20
# Horizontal Pod Autoscaler with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: packets-per-second # requires a custom metrics adapter (e.g., Prometheus Adapter)
        target:
          type: AverageValue
          averageValue: 1k
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
Virtual Nodes (Serverless):
# Enable Virtual Nodes
az aks enable-addons \
--addons virtual-node \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--subnet-name aci-subnet
# Deploy to Virtual Nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: serverless-app
spec:
  replicas: 100 # Can scale to hundreds instantly
  selector:
    matchLabels:
      app: serverless-app
  template:
    metadata:
      labels:
        app: serverless-app
    spec:
      nodeSelector:
        kubernetes.io/role: agent
        beta.kubernetes.io/os: linux
        type: virtual-kubelet
      tolerations:
        - key: virtual-kubelet.io/provider
          operator: Exists
        - key: azure.com/aci
          effect: NoSchedule
      containers:
        - name: serverless-app
          image: myapp:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
CI/CD Integration
GitHub Actions for AKS:
# .github/workflows/deploy.yml
name: Deploy to AKS
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
env:
  REGISTRY: myacr.azurecr.io
  IMAGE_NAME: myapp
  CLUSTER_NAME: aks-cluster
  RESOURCE_GROUP: aks-rg
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to Azure Container Registry
        uses: azure/docker-login@v1
        with:
          login-server: ${{ env.REGISTRY }}
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
      - name: Login to Azure
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Get AKS credentials
        uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ env.RESOURCE_GROUP }}
          cluster-name: ${{ env.CLUSTER_NAME }}
      - name: Deploy to AKS
        run: |
          # Update image in deployment
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          # Wait for rollout
          kubectl rollout status deployment/myapp --timeout=300s
          # Run a smoke test (--command overrides the image entrypoint)
          kubectl run test --rm -i --restart=Never \
            --image=alpine/curl:latest \
            --command -- curl -f http://myapp-service/health
Azure DevOps Pipeline:
# azure-pipelines.yml
trigger:
  - main
variables:
  azureSubscription: 'Azure-Service-Connection'
  aksCluster: 'aks-cluster'
  resourceGroup: 'aks-rg'
  containerRegistry: 'myacr.azurecr.io'
  imageRepository: 'myapp'
  dockerfilePath: '$(Build.SourcesDirectory)/Dockerfile'
  tag: '$(Build.BuildId)'
stages:
  - stage: Build
    displayName: Build and push stage
    jobs:
      - job: Build
        displayName: Build
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: Docker@2
            displayName: Build and push an image
            inputs:
              command: buildAndPush
              repository: $(imageRepository)
              dockerfile: $(dockerfilePath)
              containerRegistry: $(containerRegistry)
              tags: |
                $(tag)
                latest
  - stage: Deploy
    displayName: Deploy stage
    dependsOn: Build
    jobs:
      - deployment: Deploy
        displayName: Deploy
        environment: 'production'
        pool:
          vmImage: ubuntu-latest
        strategy:
          runOnce:
            deploy:
              steps:
                - task: KubernetesManifest@0
                  displayName: Deploy to Kubernetes
                  inputs:
                    action: deploy
                    kubernetesServiceConnection: $(aksCluster)
                    namespace: 'default'
                    manifests: |
                      $(Build.SourcesDirectory)/manifests/deployment.yaml
                      $(Build.SourcesDirectory)/manifests/service.yaml
                      $(Build.SourcesDirectory)/manifests/ingress.yaml
                    containers: |
                      $(containerRegistry)/$(imageRepository):$(tag)
                - task: Kubernetes@1
                  displayName: Verify deployment
                  inputs:
                    connectionType: Kubernetes Service Connection
                    kubernetesServiceEndpoint: $(aksCluster)
                    command: rollout
                    arguments: status deployment/myapp --timeout=300s
Monitoring & Logging
Azure Monitor for Containers:
# Enable Azure Monitor during cluster creation
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--enable-addons monitoring \
--workspace-resource-id /subscriptions/{sub-id}/resourcegroups/{rg}/providers/microsoft.operationalinsights/workspaces/{workspace}
# Or enable on existing cluster
az aks enable-addons \
--addons monitoring \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--workspace-resource-id $WORKSPACE_ID
# Prometheus metrics scraping
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  resources:
    requests:
      memory: 400Mi
  enableRemoteWriteReceiver: true
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-service-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: web
      interval: 30s
      path: /metrics
Log Analytics Queries:
// Kusto Query Language (KQL) examples
// Container logs
ContainerLog
| where ContainerName == "myapp"
| project TimeGenerated, LogEntry, Computer
| order by TimeGenerated desc
// Performance metrics
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
// Node status
KubeNodeInventory
| where ClusterName == "aks-cluster"
| project TimeGenerated, Computer, Status, Labels
// Pod status
KubePodInventory
| where ClusterName == "aks-cluster"
| where Namespace == "production"
| project TimeGenerated, Name, Status, PodIp
// Alert rules
// Create alert for high CPU usage. cpuUsageNanoCores is reported in
// nanocores, so express the threshold in the same unit:
// 0.8 cores = 800,000,000 nanocores.
let threshold = 800000000;
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
| where AvgCPU > threshold
GitOps with Flux v2
Flux v2 Installation:
# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash
# Bootstrap Flux
flux bootstrap github \
--owner=myorg \
--repository=gitops-aks \
--branch=main \
--path=./clusters/production \
--personal
# Create source for Helm repository
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bitnami
  namespace: flux-system
spec:
  interval: 30m
  url: https://charts.bitnami.com/bitnami
---
# Create Helm release
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: redis
  namespace: production
spec:
  interval: 5m
  chart:
    spec:
      chart: redis
      version: '14.0.0'
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: flux-system
  values:
    architecture: standalone
    auth:
      enabled: false
    master:
      persistence:
        enabled: true
        size: 8Gi
---
# Kustomization for application deployment
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/production
  prune: true
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: myapp
      namespace: production
Disaster Recovery
Cluster Backup with Velero:
# Install Velero with Azure plugin
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.5.0 \
--bucket velero-backups \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=$BACKUP_RG,storageAccount=$STORAGE_ACCOUNT \
--snapshot-location-config apiTimeout=5m,resourceGroup=$BACKUP_RG
# Schedule backups
velero schedule create daily-backup \
--schedule="0 1 * * *" \
--include-namespaces production
# On-demand backup
velero backup create manual-backup \
--include-namespaces production \
--selector app=critical
# Restore from backup
velero restore create --from-backup daily-backup-20240115-010000
# Cross-region backup
velero install \
--provider azure \
--bucket velero-backups-dr \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=$DR_RG,storageAccount=$DR_STORAGE_ACCOUNT \
--snapshot-location-config apiTimeout=5m,resourceGroup=$DR_RG
Multi-region AKS Deployment:
# Primary region (East US)
az aks create \
--resource-group aks-primary-rg \
--name aks-primary \
--location eastus \
--node-count 3 \
--enable-addons monitoring \
--generate-ssh-keys
# Secondary region (West US)
az aks create \
--resource-group aks-secondary-rg \
--name aks-secondary \
--location westus \
--node-count 2 \
--enable-addons monitoring \
--generate-ssh-keys
# Global Traffic Manager for failover
az network traffic-manager profile create \
--name aks-global \
--resource-group global-rg \
--routing-method Priority \
--unique-dns-name aks-global-$RANDOM
# Add primary endpoint
az network traffic-manager endpoint create \
--name aks-primary \
--profile-name aks-global \
--resource-group global-rg \
--type externalEndpoints \
--target $(az aks show -g aks-primary-rg -n aks-primary --query addonProfiles.httpApplicationRouting.config.HTTPApplicationRoutingZoneName -o tsv) \
--priority 1 \
--weight 1
# Add secondary endpoint
az network traffic-manager endpoint create \
--name aks-secondary \
--profile-name aks-global \
--resource-group global-rg \
--type externalEndpoints \
--target $(az aks show -g aks-secondary-rg -n aks-secondary --query addonProfiles.httpApplicationRouting.config.HTTPApplicationRoutingZoneName -o tsv) \
--priority 2 \
--weight 1
Cost Optimization
Spot Node Pools:
# Create spot node pool
az aks nodepool add \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name spotpool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--min-count 0 \
--max-count 10 \
--node-count 1 \
--node-vm-size Standard_DS2_v2 \
--labels cost=spot \
--taints sku=spot:NoSchedule
# Pod with spot toleration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: spot-app
  template:
    metadata:
      labels:
        app: spot-app
    spec:
      tolerations:
        - key: "sku"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
        # AKS adds this taint to every Spot pool automatically
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      nodeSelector:
        cost: spot
      containers:
        - name: app
          image: myapp:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
Resource Optimization:
# Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 50Mi
        maxAllowed:
          cpu: 1
          memory: 512Mi
        controlledResources: ["cpu", "memory"]
---
# Resource quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
# Limit ranges
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
    - default:
        memory: 512Mi
      defaultRequest:
        memory: 256Mi
      type: Container
Best Practices
Operational Excellence:
- Use Managed Identity: Prefer managed identities (system- or user-assigned) over service principals
- Enable Azure Policy: Enforce compliance and security policies
- Implement Network Policies: Use Calico or Azure Network Policies
- Use Private Clusters: For production workloads, enable private clusters
- Implement Pod Security Standards: Use Pod Security Admission
- Enable Azure Defender: For security monitoring and threat detection
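A minimal sketch of the last point in the Terraform style used earlier (assumes the `azurerm_kubernetes_cluster` resource and Log Analytics workspace defined above; `microsoft_defender` is the azurerm 3.x block for Defender for Containers):

```hcl
# Inside the azurerm_kubernetes_cluster "aks" resource (sketch):
microsoft_defender {
  log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
}
```

Findings then surface in Microsoft Defender for Cloud alongside the cluster's other security recommendations.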
Performance Optimization:
- Right-size Nodes: Use appropriate VM sizes for workloads
- Use Multiple Node Pools: Separate system and user workloads
- Implement HPA and VPA: Auto-scale based on demand
- Use Cluster Autoscaler: Scale nodes based on pod requirements
- Optimize Container Images: Use multi-stage builds and slim images
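The image-optimization point can be sketched as a multi-stage Dockerfile (hypothetical Go app): the compiler and source stay in the first stage, and only the static binary ships in the final image.

```dockerfile
# Stage 1: build with the full toolchain (hypothetical Go app).
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Stage 2: ship only the binary on a minimal, non-root base.
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
USER nonroot
ENTRYPOINT ["/app"]
```

Smaller images pull faster onto new nodes, which directly speeds up scale-out events.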
Cost Management:
- Use Spot Instances: For fault-tolerant workloads
- Implement Resource Quotas: Prevent resource over-provisioning
- Right-size Resources: Regularly review and adjust resource requests/limits
- Use Azure Cost Management: Monitor and optimize spending
- Clean Up Resources: Regularly delete unused resources