# Production Deployment Guide

Complete guide for deploying the AI Coding Tools: **two independent systems** that can be deployed together or separately.

> [!NOTE]
> **Key:** The Orchestrator (port 5001) and Agentic Team (port 5002) are independent services. Each has its own config, adapters, and UI. Deploy both or either one.

## Deployment Architecture

```mermaid
graph TD
    subgraph "Load Balancer / Ingress"
        LB[NGINX / Application Gateway]
    end

    subgraph "Orchestrator Service :5001"
        O_UI[orchestrator/ui/app.py]
        O_ENGINE[orchestrator/core/engine.py]
        O_ADAPT[orchestrator/adapters/]
    end

    subgraph "Agentic Team Service :5002"
        A_UI[agentic_team/ui/app.py]
        A_ENGINE[agentic_team/engine.py]
        A_ADAPT[agentic_team/adapters/]
    end

    subgraph "Monitoring"
        PROM[Prometheus :9090]
        GRAF[Grafana :3000]
    end

    LB -->|/orchestrator| O_UI
    LB -->|/agentic-team| A_UI
    O_UI --> PROM
    A_UI --> PROM
    PROM --> GRAF
```

## CI/CD Pipeline

```mermaid
flowchart LR
    A[Push to main] --> B[Lint + Type Check]
    B --> C[Run 314 Tests]
    C --> D[Build Docker Image]
    D --> E[Push to Registry]
    E --> F{Deploy Strategy}
    F -->|Blue/Green| G[Switch Traffic]
    F -->|Canary| H[Progressive Rollout]
    F -->|Direct| I[Rolling Update]
```

## Quick Start

```bash
# Docker Compose (both services)
docker compose up --build -d
# Orchestrator: http://localhost:5001
# Agentic Team: http://localhost:5002

# Kubernetes
kubectl apply -f deployment/kubernetes/deployment.yaml
kubectl apply -f deployment/kubernetes/service.yaml
```

## Docker Compose Service Topology

The `docker-compose.yml` defines the complete local/staging deployment. All services share volumes but run independently.

```mermaid
graph TD
    subgraph "Docker Compose Stack"
        direction TB
        subgraph "Application Services"
            ORCH["orchestrator-ui
Image: ai-orchestrator:latest
Port 5001
orchestrator/ui/app.py"] AGENT["agentic-team-ui
Image: ai-orchestrator:latest
Port 5002
agentic_team/ui/app.py"] end subgraph "Monitoring Profile" PROM["prometheus
Port 9091
Scrape interval: 15s"] GRAF["grafana
Port 3000
Dashboard UI"] end end subgraph "MCP Server (optional)" MCP["mcp_server/server.py
Port 8000
FastMCP 3.x"] end subgraph "Shared Volumes" V_OUT["output/"] V_WORK["workspace/"] V_LOG["logs/"] V_SESS["sessions/"] V_CONF_O["orchestrator/config/"] V_CONF_A["agentic_team/config/"] end ORCH --> V_OUT & V_WORK & V_LOG & V_SESS & V_CONF_O AGENT --> V_OUT & V_WORK & V_LOG & V_SESS & V_CONF_A PROM -->|"GET /metrics"| ORCH PROM -->|"GET /metrics"| AGENT GRAF --> PROM MCP -->|in-process| ORCH MCP -->|in-process| AGENT style ORCH fill:#2b6cb0,stroke:#2c5282,color:#fff style AGENT fill:#276749,stroke:#22543d,color:#fff style PROM fill:#c05621,stroke:#9c4221,color:#fff style GRAF fill:#6b46c1,stroke:#553c9a,color:#fff style MCP fill:#9b2c2c,stroke:#742a2a,color:#fff ``` ## Kubernetes Pod and Service Architecture When deployed to Kubernetes, each system runs as a separate Deployment with its own Service, behind a shared Ingress. ```mermaid graph TD subgraph "Kubernetes Cluster" ING["Ingress Controller
NGINX / App Gateway"] subgraph "Namespace: ai-coding-tools" subgraph "Orchestrator Deployment" O_POD1["Pod: orchestrator-blue-1
orchestrator/ui/app.py :5001"] O_POD2["Pod: orchestrator-blue-2
orchestrator/ui/app.py :5001"] O_POD3["Pod: orchestrator-blue-3
orchestrator/ui/app.py :5001"] end O_SVC["Service: orchestrator-svc
ClusterIP :5001"] subgraph "Agentic Team Deployment" A_POD1["Pod: agentic-team-1
agentic_team/ui/app.py :5002"] A_POD2["Pod: agentic-team-2
agentic_team/ui/app.py :5002"] end A_SVC["Service: agentic-team-svc
ClusterIP :5002"] CM["ConfigMap
agents.yaml"] SEC["Secret
API keys"] PVC["PVC
workspace + sessions"] end end ING -->|"/orchestrator"| O_SVC ING -->|"/agentic-team"| A_SVC O_SVC --> O_POD1 & O_POD2 & O_POD3 A_SVC --> A_POD1 & A_POD2 CM -.->|mount| O_POD1 & O_POD2 & O_POD3 & A_POD1 & A_POD2 SEC -.->|mount| O_POD1 & O_POD2 & O_POD3 & A_POD1 & A_POD2 PVC -.->|mount| O_POD1 & O_POD2 & O_POD3 & A_POD1 & A_POD2 style ING fill:#553c9a,stroke:#44337a,color:#fff style O_SVC fill:#2b6cb0,stroke:#2c5282,color:#fff style A_SVC fill:#276749,stroke:#22543d,color:#fff ``` ## Azure Deployment This deployment guide covers a complete production-ready Azure infrastructure stack featuring: - ✅ **Azure Kubernetes Service (AKS)** - Managed Kubernetes with auto-scaling - ✅ **Azure Container Registry (ACR)** - Private Docker registry with geo-replication - ✅ **Azure Application Gateway** - Layer 7 load balancer with WAF - ✅ **Azure Front Door** - Global load balancer with CDN - ✅ **Azure Key Vault** - Secrets and certificate management - ✅ **Azure Redis Cache** - Premium Redis for caching - ✅ **Azure Monitor** - Comprehensive monitoring and alerting - ✅ **Application Insights** - APM and distributed tracing - ✅ **Azure Files** - Persistent storage for workspaces - ✅ **Terraform** - Infrastructure as Code - ✅ **Blue-Green & Canary Deployments** - Zero-downtime releases ## Architecture ### High-Level Architecture ``` Internet | v ┌──────────────────────┐ │ Azure Front Door │ ← Global Load Balancer + CDN │ (Premium Tier) │ └──────────────────────┘ | v ┌──────────────────────┐ │ Application Gateway │ ← Regional Load Balancer + WAF │ (WAF_v2) │ └──────────────────────┘ | ┌───────────────────┴───────────────────┐ v v ┌─────────────────────┐ ┌─────────────────────┐ │ Primary Region │ │ Secondary Region │ │ (East US) │ │ (West US 2) │ │ │ │ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ │ AKS │ │ │ │ AKS │ │ │ │ (v1.28) │ │ │ │ (Standby) │ │ │ │ │ │ │ │ │ │ │ │ ┌────────┐ │ │ │ │ ┌────────┐ │ │ │ │ │ Blue │ │ │ │ │ │ Blue │ │ │ │ │ │ Pods │ │ │ │ │ │ Pods │ │ │ │ │ 
└────────┘ │ │ │ │ └────────┘ │ │ │ │ ┌────────┐ │ │ │ │ │ │ │ │ │ Green │ │ │ │ └──────────────┘ │ │ │ │ Pods │ │ │ │ │ │ │ └────────┘ │ │ │ ┌──────────────┐ │ │ └──────────────┘ │ │ │ ACR │ │ │ │ │ │ (Replica) │ │ │ ┌──────────────┐ │ │ └──────────────┘ │ │ │ ACR │───┼─────────────────────────────────────┘ │ │ (Premium) │ │ Geo-Replication │ └──────────────┘ │ │ │ │ ┌──────────────┐ │ │ │ Key Vault │ │ │ │ (Premium) │ │ │ └──────────────┘ │ │ │ │ ┌──────────────┐ │ │ │ Redis Cache │ │ │ │ (Premium) │ │ │ └──────────────┘ │ │ │ │ ┌──────────────┐ │ │ │ Azure Files │ │ │ │ (Premium) │ │ │ └──────────────┘ │ │ │ │ ┌──────────────┐ │ │ │ Monitor │ │ │ │ + App Insights│ │ │ └──────────────┘ │ └─────────────────────┘ ``` ### Network Architecture ``` Azure Virtual Network (10.0.0.0/16) │ ├── AKS Subnet (10.0.1.0/24) │ ├── System Node Pool (3-20 nodes) │ └── User Node Pool (3-20 nodes) │ ├── Application Gateway Subnet (10.0.2.0/24) │ └── Application Gateway v2 │ └── Private Endpoints Subnet (10.0.3.0/24) ├── ACR Private Endpoint ├── Key Vault Private Endpoint └── Storage Private Endpoint ``` ### Runtime Profiles The orchestrator now supports three deployment/runtime profiles: 1. **Cloud-only** - Use CLI-backed cloud agents only (`type: cli`) - Keep local agents disabled - `settings.offline.enabled: false` 2. **Hybrid (recommended for resilience)** - Use cloud agents for primary steps - Configure local agents (`type: ollama` / `type: llamacpp`) as fallback - Enable fallback routing in `settings.fallback` 3. 
**Offline/local-only** - Enable local agents only - Run with `--offline` in CLI jobs or set `settings.offline.enabled: true` - Ensure local model endpoints are reachable from the runtime network namespace ### Local Backend Connectivity in Kubernetes If using local model backends in cluster deployments, expose them as internal services: - `ollama..svc.cluster.local:11434` - `..svc.cluster.local:8080` Then set agent endpoints accordingly in `orchestrator/config/agents.yaml`. ## Azure Services ### 1. Azure Kubernetes Service (AKS) **Configuration:** - **Version**: 1.28 (latest stable) - **Node Pools**: - System: 3-20 nodes (Standard_D4s_v3) - User: 3-20 nodes (Standard_D4s_v3) - **Networking**: Azure CNI with Calico network policy - **Auto-scaling**: Enabled (cluster and pod level) - **Azure Policy**: Enabled for governance - **Monitoring**: Azure Monitor + Container Insights - **Security**: Managed identity, Azure RBAC, Key Vault integration **Features:** - Multi-zone deployment for high availability - Automatic OS and security patching - Built-in secrets management via Key Vault - Network policies for micro-segmentation - Pod security policies enforcement ### 2. Azure Container Registry (ACR) **Configuration:** - **SKU**: Premium - **Features**: - Geo-replication (East US → West US 2) - Container scanning (Microsoft Defender) - Content trust (signed images) - Image retention policies (30 days) - Network restrictions (VNet integration) - Private endpoints **Benefits:** - Low-latency pulls from any region - Automatic image replication - Vulnerability scanning - Immutable image tags ### 3. Azure Application Gateway **Configuration:** - **SKU**: WAF_v2 - **Features**: - Web Application Firewall (OWASP 3.1) - SSL/TLS termination - URL-based routing - Session affinity - Health probes - Auto-scaling (2-10 instances) **Security:** - DDoS protection - Bot protection - IP allow/deny lists - Rate limiting - Custom WAF rules ### 4. 
Azure Front Door **Configuration:** - **SKU**: Premium - **Features**: - Global HTTP/HTTPS load balancing - CDN capabilities - SSL offloading - URL-based routing - Session affinity - Health probes - Private Link to backend **Benefits:** - Reduced latency (edge caching) - Increased availability (multi-region) - DDoS protection - Bot protection - Web Application Firewall ### 5. Azure Key Vault **Configuration:** - **SKU**: Premium (HSM-backed) - **Features**: - Secrets management - Certificate management - Key management with HSM - Soft delete (90 days) - Purge protection - Network restrictions - Private endpoint - RBAC access control **Integration:** - AKS CSI driver for secrets - Automatic rotation (2-minute interval) - Audit logging to Azure Monitor ### 6. Azure Redis Cache **Configuration:** - **SKU**: Premium P1 (6 GB) - **Features**: - Data persistence - Geo-replication - VNet integration - TLS 1.2 enforcement - Patch scheduling (Sunday 2 AM UTC) - Maxmemory policy: allkeys-lru **Use Cases:** - Session caching - API response caching - Task queue management - Rate limiting ### 7. Azure Monitor + Application Insights **Configuration:** - **Log Analytics Workspace**: 30-day retention - **Application Insights**: Transaction tracking - **Container Insights**: Pod and node metrics - **Metrics**: - CPU, memory, disk, network - Request rates, latency, errors - Custom application metrics - Kubernetes events **Alerting:** - High CPU/memory usage - Pod crash loops - Failed deployments - Slow response times - High error rates ### 8. 
Azure Files (Premium)

**Configuration:**
- **Tier**: Premium (SSD-backed)
- **Shares**:
  - `workspace`: 100 GB (AI agent workspaces)
  - `sessions`: 50 GB (user sessions)
- **Features**:
  - SMB 3.0 protocol
  - Encryption at rest and in transit
  - Snapshots for backup
  - Integration with AKS via CSI driver

## Prerequisites

### Tools Required

```bash
# Azure CLI (v2.54.0+)
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# Terraform (v1.5.0+)
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# kubectl (v1.28+)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Helm (v3.12+)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```

### Azure Subscription Requirements

- **Subscription Type**: Pay-As-You-Go or Enterprise Agreement
- **Required Permissions**:
  - Owner or Contributor role on subscription
  - Ability to create service principals
  - Ability to create resources in East US and West US 2
- **Resource Quotas** (verify compute with `az vm list-usage --location eastus` and networking with `az network list-usages --location eastus`):
  - vCPU quota: 64+ cores per region
  - Public IP addresses: 10+ per region
  - Standard Load Balancers: 5+ per region
  - Network Interfaces: 100+ per region

### Cost Estimate

**Monthly cost breakdown (USD):**

| Service | Configuration | Est. Cost |
|---------|---------------|-----------|
| AKS (compute) | 6x D4s_v3 nodes | ~$750 |
| AKS (management) | Free | $0 |
| ACR Premium | With geo-replication | $250 |
| Application Gateway | WAF_v2, 2 instances | $250 |
| Azure Front Door | Premium tier | $300 |
| Key Vault Premium | + operations | $50 |
| Redis Cache P1 | 6 GB Premium | $300 |
| Azure Files Premium | 150 GB | $200 |
| Log Analytics | 30-day retention | $150 |
| Bandwidth | Outbound transfer | $100 |
| **Total** | | **~$2,350/month** |

*Costs vary based on usage. Use Azure Pricing Calculator for accurate estimates.*

## Quick Start

### One-Command Deployment

```bash
# Clone repository
git clone https://github.com/hoangsonww/AI-Agents-Orchestrator.git
cd AI-Agents-Orchestrator

# Run deployment script
chmod +x deployment/azure/scripts/deploy.sh
./deployment/azure/scripts/deploy.sh
```

The script will:
1. ✅ Verify prerequisites
2. ✅ Log in to Azure
3. ✅ Create Terraform backend
4. ✅ Deploy infrastructure with Terraform
5. ✅ Configure kubectl for AKS
6. ✅ Install NGINX Ingress Controller
7. ✅ Install cert-manager for TLS
8. ✅ Install Prometheus + Grafana
9. ✅ Apply Kubernetes manifests
10. ✅ Verify deployment

**Duration**: ~30 minutes

## Detailed Setup

### Step 1: Azure Login

```bash
# Log in to Azure
az login

# Set subscription
az account set --subscription "YOUR_SUBSCRIPTION_ID"

# Verify
az account show
```

### Step 2: Create Service Principal (for CI/CD)

```bash
# Create service principal
az ad sp create-for-rbac \
  --name "ai-orchestrator-sp" \
  --role Contributor \
  --scopes /subscriptions/YOUR_SUBSCRIPTION_ID \
  --sdk-auth

# Save output (contains clientId, clientSecret, tenantId, subscriptionId)
```

### Step 3: Deploy Infrastructure with Terraform

```bash
cd deployment/azure/terraform

# Initialize Terraform
terraform init

# Review plan
terraform plan \
  -var="environment=production" \
  -var="location=eastus" \
  -out=tfplan

# Apply
terraform apply tfplan

# Save outputs
terraform output -json > ../outputs.json
```

### Step 4: Configure kubectl

```bash
# Get AKS credentials
az aks get-credentials \
  --resource-group ai-orchestrator-production-rg \
  --name ai-orchestrator-production-aks \
  --overwrite-existing

# Verify
kubectl cluster-info
kubectl get nodes
```

### Step 5: Install Core Components

```bash
# Install NGINX Ingress
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz

# Install cert-manager
helm repo add jetstack https://charts.jetstack.io
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true

# Install Prometheus stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.enabled=true
```

### Step 6: Deploy Application

```bash
# Create namespace
kubectl create namespace ai-orchestrator

# Apply manifests
kubectl apply -f deployment/kubernetes/ -n ai-orchestrator

# Wait for deployment
kubectl rollout status deployment/ai-orchestrator-blue -n ai-orchestrator
```

### Step 6.1: Configure Agent Runtime Mode

After deploying manifests, ensure `orchestrator/config/agents.yaml` (or the ConfigMap-mounted equivalent) matches your runtime profile.

**Hybrid example (cloud primary + local fallback):**

```yaml
agents:
  claude:
    type: cli
    command: claude
    enabled: true
  local-instruct:
    type: ollama
    endpoint: http://ollama.ai-orchestrator.svc.cluster.local:11434
    model: mistral:7b-instruct
    offline: true
    enabled: true

workflows:
  hybrid:
    steps:
      - agent: claude
        role: reviewer
        fallback: local-instruct

settings:
  fallback:
    enabled: true
    map:
      claude: local-instruct
```

Apply the config and restart pods after changes:

```bash
kubectl apply -f deployment/kubernetes/configmap.yaml -n ai-orchestrator
kubectl rollout restart deployment/ai-orchestrator-blue -n ai-orchestrator
```

### Step 7: Build and Push Docker Image

```bash
# Log in to ACR (registry names cannot contain spaces)
az acr login --name aiorchestratorproductionacr

# Build and push
az acr build \
  --registry aiorchestratorproductionacr \
  --image ai-orchestrator:latest \
  --image ai-orchestrator:$(git rev-parse --short HEAD) \
  .
```

### Step 8: Configure DNS

```bash
# Get Application Gateway public IP
az network public-ip show \
  --resource-group ai-orchestrator-production-rg \
  --name ai-orchestrator-production-appgw-pip \
  --query ipAddress -o tsv

# Create A record in your DNS provider pointing to this IP
# Example: ai-orchestrator.yourdomain.com -> 20.85.123.45
```

### Step 9: Configure TLS

```bash
# Create ClusterIssuer for Let's Encrypt
kubectl apply -f - <>K8s: Deploy new image to Green
    K8s->>Green: Rolling update to v2
    Green-->>K8s: All pods Ready
    Ops->>Green: Run smoke tests
    Green-->>Ops: Tests pass
    Ops->>Svc: Patch selector: version=green
    Note over Svc: Traffic switches instantly
    Svc->>Green: All traffic now to Green
    Blue-->>K8s: (idle, still running)
    alt Issues detected
        Ops->>Svc: Patch selector: version=blue
        Note over Svc: Instant rollback (< 10s)
    else Stable after 30 min
        Ops->>K8s: Scale Blue to 0
    end
```

**Process:**

1. **Deploy to Green environment**:
   ```bash
   # Update green deployment
   kubectl set image deployment/ai-orchestrator-green \
     ai-orchestrator=aiorchestratorproductionacr.azurecr.io/ai-orchestrator:v2.0.0 \
     -n ai-orchestrator

   # Scale up green
   kubectl scale deployment/ai-orchestrator-green --replicas=3 -n ai-orchestrator

   # Wait for readiness
   kubectl rollout status deployment/ai-orchestrator-green -n ai-orchestrator
   ```

2. **Run smoke tests**:
   ```bash
   # Test green pods directly (-f makes curl exit non-zero on HTTP errors)
   GREEN_POD=$(kubectl get pod -n ai-orchestrator -l version=green -o jsonpath='{.items[0].metadata.name}')
   kubectl exec -n ai-orchestrator "$GREEN_POD" -- curl -sf http://localhost:5001/health
   ```

3. **Switch traffic**:
   ```bash
   # Option 1: Automated script
   ./deployment/scripts/blue-green-switch.sh blue green

   # Option 2: Manual
   kubectl patch service ai-orchestrator-service -n ai-orchestrator \
     -p '{"spec":{"selector":{"version":"green"}}}'
   ```

4. **Monitor and verify**:
   ```bash
   # Watch metrics in Azure Monitor
   # Check logs
   kubectl logs -n ai-orchestrator -l version=green --tail=100 -f

   # If issues, instant rollback:
   kubectl patch service ai-orchestrator-service -n ai-orchestrator \
     -p '{"spec":{"selector":{"version":"blue"}}}'
   ```

5. **Scale down blue**:
   ```bash
   # After 30 minutes of stable operation
   kubectl scale deployment/ai-orchestrator-blue --replicas=0 -n ai-orchestrator
   ```

**Rollback Time**: < 10 seconds

### Canary Deployment

**Progressive Rollout Stages:**

1. **10% Traffic** (5 minutes):
   ```bash
   # 9 stable + 1 canary pod = 10% of pods behind the Service
   kubectl scale deployment/ai-orchestrator-stable --replicas=9 -n ai-orchestrator
   kubectl scale deployment/ai-orchestrator-canary --replicas=1 -n ai-orchestrator
   ```

2. **25% Traffic** (10 minutes):
   ```bash
   kubectl scale deployment/ai-orchestrator-stable --replicas=3 -n ai-orchestrator
   kubectl scale deployment/ai-orchestrator-canary --replicas=1 -n ai-orchestrator
   ```

3. **50% Traffic** (15 minutes):
   ```bash
   kubectl scale deployment/ai-orchestrator-stable --replicas=1 -n ai-orchestrator
   kubectl scale deployment/ai-orchestrator-canary --replicas=1 -n ai-orchestrator
   ```

4. **100% Traffic** (10 minutes):
   ```bash
   kubectl scale deployment/ai-orchestrator-stable --replicas=0 -n ai-orchestrator
   kubectl scale deployment/ai-orchestrator-canary --replicas=3 -n ai-orchestrator
   ```

**Automated with script**:
```bash
./deployment/scripts/canary-rollout.sh
```

**Rollback Time**: < 30 seconds (automatic on metrics degradation)

### Rolling Update

Standard Kubernetes rolling update:

```bash
# Update image
kubectl set image deployment/ai-orchestrator \
  ai-orchestrator=aiorchestratorproductionacr.azurecr.io/ai-orchestrator:v2.0.0 \
  -n ai-orchestrator

# Monitor
kubectl rollout status deployment/ai-orchestrator -n ai-orchestrator

# Rollback if needed
kubectl rollout undo deployment/ai-orchestrator -n ai-orchestrator
```

## Monitoring & Observability

### Health Check Flow

Both services expose `/health` and `/ready` endpoints.
Kubernetes liveness and readiness probes, load balancers, and monitoring stacks all rely on these endpoints. ```mermaid flowchart TD subgraph "Health Check Sources" K8S_LP["Kubernetes
Liveness Probe"] K8S_RP["Kubernetes
Readiness Probe"] LB["Load Balancer
Health Probe"] PROM["Prometheus
Scrape Target"] end subgraph "Orchestrator :5001" O_HEALTH["/health"] O_READY["/ready"] O_METRICS["/metrics"] end subgraph "Agentic Team :5002" A_HEALTH["/health"] A_READY["/ready"] end K8S_LP -->|"GET every 10s"| O_HEALTH K8S_LP -->|"GET every 10s"| A_HEALTH K8S_RP -->|"GET every 5s"| O_READY K8S_RP -->|"GET every 5s"| A_READY LB -->|"GET every 30s"| O_HEALTH LB -->|"GET every 30s"| A_HEALTH PROM -->|"GET every 15s"| O_METRICS O_HEALTH --> |"200 OK"| PASS_O{Healthy} O_HEALTH --> |"503"| FAIL_O[Restart Pod] A_HEALTH --> |"200 OK"| PASS_A{Healthy} A_HEALTH --> |"503"| FAIL_A[Restart Pod] style PASS_O fill:#276749,stroke:#22543d,color:#fff style PASS_A fill:#276749,stroke:#22543d,color:#fff style FAIL_O fill:#9b2c2c,stroke:#742a2a,color:#fff style FAIL_A fill:#9b2c2c,stroke:#742a2a,color:#fff ``` ### Azure Monitor **Access Azure Monitor**: ```bash # Open in browser az monitor metrics list \ --resource $(az aks show -g ai-orchestrator-production-rg -n ai-orchestrator-production-aks --query id -o tsv) \ --metric-names "node_cpu_usage_percentage" ``` **Key Metrics**: - CPU usage (nodes and pods) - Memory usage - Disk I/O - Network traffic - Request rates - Error rates - Latency (p50, p95, p99) ### Application Insights **View in Azure Portal**: 1. Navigate to Application Insights resource 2. View Application Map (service dependencies) 3. Check Performance blade (slow requests) 4. Review Failures blade (exceptions) 5. 
Analyze Live Metrics (real-time telemetry)

**Query with KQL**:
```kusto
// Top 10 slowest requests
requests
| where timestamp > ago(1h)
| summarize avg(duration) by operation_Name
| top 10 by avg_duration desc

// Error rate by hour
requests
| where timestamp > ago(24h)
| summarize total = count(), errors = countif(success == false) by bin(timestamp, 1h)
| project timestamp, error_rate = (errors * 100.0) / total
```

### Grafana Dashboards

**Access Grafana**:
```bash
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# Open http://localhost:3000
# Username: admin, Password: admin123
```

**Import Dashboards**:
- Kubernetes Cluster Monitoring (ID: 7249)
- Node Exporter Full (ID: 1860)
- Kubernetes Pod Monitoring (ID: 6417)
- NGINX Ingress Controller (ID: 9614)

### Log Analytics

**Query Logs**:
```kusto
// Container logs
ContainerLog
| where Namespace == "ai-orchestrator"
| where TimeGenerated > ago(1h)
| project TimeGenerated, Computer, ContainerID, LogEntry
| order by TimeGenerated desc

// Performance metrics
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SContainer"
| summarize avg(CounterValue) by CounterName, bin(TimeGenerated, 5m)
```

### Alerts

**Configure Alerts**:

1. **High CPU Alert**:
   ```bash
   # --action takes the action group resource ID
   az monitor metrics alert create \
     --name "aks-high-cpu" \
     --resource-group ai-orchestrator-production-rg \
     --scopes $(az aks show -g ai-orchestrator-production-rg -n ai-orchestrator-production-aks --query id -o tsv) \
     --condition "avg node_cpu_usage_percentage > 80" \
     --window-size 5m \
     --evaluation-frequency 1m \
     --action /subscriptions/.../actionGroups/...
   ```

2. 
**Pod Crash Alert**: ```bash az monitor metrics alert create \ --name "pod-crash-loop" \ --resource-group ai-orchestrator-production-rg \ --scopes $(az aks show -g ai-orchestrator-production-rg -n ai-orchestrator-production-aks --query id -o tsv) \ --condition "avg kube_pod_container_status_restarts_total > 5" \ --window-size 15m \ --evaluation-frequency 5m ``` ## Security ### Network Security **Network Security Groups (NSGs)**: ```bash # AKS subnet - restrict inbound az network nsg rule create \ --resource-group ai-orchestrator-production-rg \ --nsg-name aks-nsg \ --name allow-https \ --priority 100 \ --source-address-prefixes Internet \ --destination-port-ranges 443 \ --access Allow \ --protocol Tcp ``` **Azure Firewall** (optional for advanced scenarios): ```bash # Create Azure Firewall for egress filtering az network firewall create \ --name ai-orchestrator-firewall \ --resource-group ai-orchestrator-production-rg \ --location eastus ``` ### Identity & Access Management **Azure AD Integration**: ```bash # Enable Azure AD integration for AKS az aks update \ --resource-group ai-orchestrator-production-rg \ --name ai-orchestrator-production-aks \ --enable-azure-rbac \ --enable-aad ``` **Managed Identity**: ```bash # AKS uses system-assigned managed identity by default # Grant permissions to managed identity az role assignment create \ --assignee $(az aks show -g ai-orchestrator-production-rg -n ai-orchestrator-production-aks --query identityProfile.kubeletidentity.objectId -o tsv) \ --role "Key Vault Secrets User" \ --scope $(az keyvault show -n aiorchestrator productionkv --query id -o tsv) ``` ### Secrets Management **Key Vault Integration**: ```bash # Create secret in Key Vault az keyvault secret set \ --vault-name aiorchestrator productionkv \ --name openai-api-key \ --value "sk-..." 
# Use secret in pod via CSI driver kubectl apply -f - < -n ai-orchestrator # Check events kubectl get events -n ai-orchestrator --sort-by='.lastTimestamp' # Check logs kubectl logs -n ai-orchestrator --previous ``` **Common Causes**: - Insufficient node resources → Scale up node pool - Image pull errors → Check ACR authentication - Config errors → Verify ConfigMap and Secrets - Health check failures → Adjust probe settings #### 2. High Latency **Diagnosis**: ```bash # Check Application Insights az monitor app-insights query \ --app ai-orchestrator-production-ai \ --analytics-query "requests | summarize avg(duration) by bin(timestamp, 5m)" # Check HPA status kubectl get hpa -n ai-orchestrator # Check resource usage kubectl top pods -n ai-orchestrator ``` **Solutions**: - Scale up pods: `kubectl scale deployment/ai-orchestrator-blue --replicas=10` - Enable Redis caching - Optimize database queries - Enable CDN caching #### 3. Certificate Issues **Symptoms**: - HTTPS not working - Browser certificate warnings **Diagnosis**: ```bash # Check cert-manager logs kubectl logs -n cert-manager deployment/cert-manager # Check certificate status kubectl describe certificate -n ai-orchestrator # Check certificate-request kubectl get certificaterequest -n ai-orchestrator ``` **Solutions**: ```bash # Delete and recreate certificate kubectl delete certificate ai-orchestrator-tls -n ai-orchestrator kubectl apply -f ingress.yaml # Check Let's Encrypt rate limits # Wait 1 hour if rate limited ``` #### 4. 
Local Model Fallback Not Triggering **Symptoms**: - Cloud step fails, but no fallback agent executes - Workflow logs show primary agent errors only **Diagnosis**: ```bash # Confirm fallback settings are present in running config kubectl exec -n ai-orchestrator deploy/ai-orchestrator-blue -- \ sh -lc "grep -n \"fallback\" -n /app/orchestrator/config/agents.yaml || true" # Verify local endpoint connectivity from pod kubectl exec -n ai-orchestrator deploy/ai-orchestrator-blue -- \ sh -lc "curl -sf http://ollama.ai-orchestrator.svc.cluster.local:11434/api/tags | head" # Check orchestrator logs for fallback decisions kubectl logs -n ai-orchestrator -l app=ai-orchestrator | grep -i fallback ``` **Common Causes**: - `settings.fallback.enabled` is false - No mapping for the primary agent in `settings.fallback.map` - Step-level `fallback` points to a disabled/unavailable agent - Local endpoint DNS/service not reachable from orchestrator pod **Solutions**: ```bash # Enable local agent(s) and fallback in config, then roll deployment kubectl apply -f deployment/kubernetes/configmap.yaml -n ai-orchestrator kubectl rollout restart deployment/ai-orchestrator-blue -n ai-orchestrator # Verify local endpoint from pod after restart kubectl exec -n ai-orchestrator deploy/ai-orchestrator-blue -- \ sh -lc "curl -sf http://ollama.ai-orchestrator.svc.cluster.local:11434/api/tags | head" ``` #### 5. 
Node Pool Issues **Diagnosis**: ```bash # Check node status kubectl get nodes # Check node conditions kubectl describe node # Check cluster autoscaler logs kubectl logs -n kube-system deployment/cluster-autoscaler ``` **Solutions**: ```bash # Manually scale node pool az aks nodepool scale \ --resource-group ai-orchestrator-production-rg \ --cluster-name ai-orchestrator-production-aks \ --name user \ --node-count 5 # Restart node (drain + delete) kubectl drain --ignore-daemonsets --delete-emptydir-data kubectl delete node ``` ### Debugging Commands ```bash # Get all resources kubectl get all -n ai-orchestrator # Describe deployment kubectl describe deployment ai-orchestrator-blue -n ai-orchestrator # Check rollout history kubectl rollout history deployment/ai-orchestrator-blue -n ai-orchestrator # Get pod logs kubectl logs -f deployment/ai-orchestrator-blue -n ai-orchestrator # Execute command in pod kubectl exec -it -n ai-orchestrator -- /bin/bash # Port-forward for debugging kubectl port-forward svc/ai-orchestrator-service 5001:5001 -n ai-orchestrator # Check network policies kubectl get networkpolicies -n ai-orchestrator # View events kubectl get events --sort-by='.lastTimestamp' -n ai-orchestrator ``` ### Azure Support **Create Support Ticket**: ```bash az support tickets create \ --ticket-name "AKS-Issue-$(date +%Y%m%d)" \ --title "AKS cluster issues" \ --description "Pods not starting in AKS cluster" \ --severity moderate \ --problem-classification-id "/subscriptions/.../providers/Microsoft.Support/services/..." ``` ## CI/CD Integration ### Pipeline Stages The following diagram shows the complete CI/CD pipeline from code push through production deployment. ```mermaid flowchart TD PUSH["git push to main/develop"] --> LINT["Stage 1: Lint + Format
flake8, pylint, black, isort"] LINT --> TYPE["Stage 2: Type Check
mypy"] TYPE --> TEST["Stage 3: Test
314 tests (pytest)"] TEST --> SEC["Stage 4: Security Scan
bandit + safety"] SEC --> BUILD["Stage 5: Docker Build
Multi-stage image"] BUILD --> PUSH_IMG["Stage 6: Push to ACR
Tag: git SHA + latest"] PUSH_IMG --> BRANCH{Branch?} BRANCH -->|develop| STAGING["Deploy to Staging
Rolling update"] BRANCH -->|main| PROD_CHOICE{Strategy} PROD_CHOICE -->|Blue-Green| BG["Deploy to Green
Smoke test
Switch traffic"] PROD_CHOICE -->|Canary| CAN["10% -> 25% -> 50% -> 100%
Auto-rollback on degradation"] PROD_CHOICE -->|Rolling| ROLL["Rolling update
maxSurge: 1
maxUnavailable: 0"] STAGING --> VERIFY_S["Verify health + metrics"] BG --> VERIFY_P["Monitor 30 min"] CAN --> VERIFY_P ROLL --> VERIFY_P style PUSH fill:#2b6cb0,stroke:#2c5282,color:#fff style TEST fill:#276749,stroke:#22543d,color:#fff style SEC fill:#9b2c2c,stroke:#742a2a,color:#fff style BUILD fill:#553c9a,stroke:#44337a,color:#fff ``` ### Canary Rollout Stages ```mermaid flowchart LR S1["10% canary
5 min observe"] --> CHECK1{Metrics OK?} CHECK1 -->|Yes| S2["25% canary
10 min observe"] CHECK1 -->|No| ROLLBACK["Instant rollback
to stable"] S2 --> CHECK2{Metrics OK?} CHECK2 -->|Yes| S3["50% canary
15 min observe"] CHECK2 -->|No| ROLLBACK S3 --> CHECK3{Metrics OK?} CHECK3 -->|Yes| S4["100% canary
Promote to stable"]
    CHECK3 -->|No| ROLLBACK

    style S4 fill:#276749,stroke:#22543d,color:#fff
    style ROLLBACK fill:#9b2c2c,stroke:#742a2a,color:#fff
```

### Azure DevOps Pipeline

Create `.azure-pipelines.yml`:

```yaml
trigger:
  branches:
    include:
      - main
      - develop

variables:
  azureSubscription: 'your-service-connection'
  resourceGroup: 'ai-orchestrator-production-rg'
  aksCluster: 'ai-orchestrator-production-aks'
  acrName: 'aiorchestratorproductionacr'  # ACR names must be alphanumeric, no spaces
  imageRepository: 'ai-orchestrator'
  imageTag: '$(Build.BuildId)'

stages:
  - stage: Build
    jobs:
      - job: BuildAndPush
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: Docker@2
            displayName: Build and push image
            inputs:
              # Docker@2 expects the name of a Docker registry service connection here
              containerRegistry: '$(acrName)'
              repository: '$(imageRepository)'
              command: 'buildAndPush'
              Dockerfile: '**/Dockerfile'
              tags: |
                $(imageTag)
                latest

  - stage: DeployStaging
    # A custom condition replaces the default, so succeeded() is stated explicitly
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/develop'))
    jobs:
      - deployment: DeployToStaging
        environment: 'staging'
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    scriptType: 'bash'
                    scriptLocation: 'inlineScript'
                    inlineScript: |
                      az aks get-credentials -g $(resourceGroup) -n $(aksCluster)
                      kubectl set image deployment/ai-orchestrator-blue \
                        ai-orchestrator=$(acrName).azurecr.io/$(imageRepository):$(imageTag) \
                        -n ai-orchestrator

  - stage: DeployProduction
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployToProduction
        environment: 'production'
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: Deploy to Green
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    scriptType: 'bash'
                    scriptLocation: 'inlineScript'
                    inlineScript: |
                      az aks get-credentials -g $(resourceGroup) -n $(aksCluster)
                      kubectl set image deployment/ai-orchestrator-green \
                        ai-orchestrator=$(acrName).azurecr.io/$(imageRepository):$(imageTag) \
                        -n ai-orchestrator
                      kubectl scale deployment/ai-orchestrator-green \
                        --replicas=3 -n ai-orchestrator

      # ManualValidation@0 runs only in an agentless (server) job.
      # The job names ApproveTrafficSwitch and SwitchTraffic are illustrative.
      - job: ApproveTrafficSwitch
        dependsOn: DeployToProduction
        pool: server
        steps:
          - task: ManualValidation@0
            displayName: 'Approve traffic switch'
            inputs:
              notifyUsers: 'admin@example.com'
              instructions: 'Verify green environment and approve traffic switch'

      - deployment: SwitchTraffic
        dependsOn: ApproveTrafficSwitch
        environment: 'production'
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: Switch Traffic
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    scriptType: 'bash'
                    scriptLocation: 'inlineScript'
                    inlineScript: |
                      # This job may run on a fresh agent, so fetch credentials again
                      az aks get-credentials -g $(resourceGroup) -n $(aksCluster)
                      kubectl patch service ai-orchestrator-service \
                        -n ai-orchestrator \
                        -p '{"spec":{"selector":{"version":"green"}}}'
                      sleep 30
                      kubectl scale deployment/ai-orchestrator-blue \
                        --replicas=0 -n ai-orchestrator
```

### GitHub Actions

Pipelines for other CI systems are already configured in `Jenkinsfile`, `.gitlab-ci.yml`, and `.circleci/config.yml`. For Azure-specific GitHub Actions, add `.github/workflows/azure-deploy.yml`:

```yaml
name: Azure Deploy

on:
  push:
    branches: [ main, develop ]

env:
  AZURE_RESOURCE_GROUP: ai-orchestrator-production-rg
  AKS_CLUSTER: ai-orchestrator-production-aks
  ACR_NAME: aiorchestratorproductionacr  # ACR names must be alphanumeric, no spaces

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: ACR Login
        run: az acr login --name ${{ env.ACR_NAME }}

      - name: Build and Push
        run: |
          az acr build \
            --registry ${{ env.ACR_NAME }} \
            --image ai-orchestrator:${{ github.sha }} \
            --image ai-orchestrator:latest \
            .

      - name: Get AKS Credentials
        run: |
          az aks get-credentials \
            --resource-group ${{ env.AZURE_RESOURCE_GROUP }} \
            --name ${{ env.AKS_CLUSTER }}

      - name: Deploy to AKS
        run: |
          kubectl set image deployment/ai-orchestrator-blue \
            ai-orchestrator=${{ env.ACR_NAME }}.azurecr.io/ai-orchestrator:${{ github.sha }} \
            -n ai-orchestrator
          kubectl rollout status deployment/ai-orchestrator-blue -n ai-orchestrator
```

---

## Additional Resources

- [Azure Kubernetes Service Documentation](https://docs.microsoft.com/azure/aks/)
- [Azure Container Registry Documentation](https://docs.microsoft.com/azure/container-registry/)
- [Azure Monitor Documentation](https://docs.microsoft.com/azure/azure-monitor/)
- [Terraform Azure Provider](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs)
- [Project Architecture](./ARCHITECTURE.md)
- [Offline Mode Guide](./docs/OFFLINE_MODE.md)
- [General Deployment Guide](./deployment/DEPLOYMENT.md)

## Support

For Azure-specific issues:

1. Check [Azure Status](https://status.azure.com/)
2. Review [Azure Service Health](https://portal.azure.com/#blade/Microsoft_Azure_Health/AzureHealthBrowseBlade/serviceIssues)
3. Check logs in Azure Monitor
4. Open support ticket: `az support tickets create`

For application issues:

1. Check application logs: `kubectl logs -n ai-orchestrator -l app=ai-orchestrator`
2. Review metrics in Grafana
3. Check Application Insights
4. Open GitHub issue

---
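
The first application check under Support can be wrapped in a small triage script. The sketch below is one possible shape, not part of the project: the `triage` and `run` function names and the `DRY_RUN` variable are hypothetical, and the namespace and label simply reuse the values from the log command above. `DRY_RUN` defaults to `1`, so the script only prints the `kubectl` commands it would run.

```shell
#!/usr/bin/env bash
# Triage sketch for the application checks in the Support section.
# Assumptions: NAMESPACE and APP_LABEL mirror the values used in this guide;
# DRY_RUN, run, and triage are illustrative names, not project tooling.
set -euo pipefail

NAMESPACE="${NAMESPACE:-ai-orchestrator}"
APP_LABEL="${APP_LABEL:-app=ai-orchestrator}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

triage() {
  # Pod status and node placement for the application
  run kubectl get pods -n "$NAMESPACE" -l "$APP_LABEL" -o wide
  # Last 100 log lines from each matching pod
  run kubectl logs -n "$NAMESPACE" -l "$APP_LABEL" --tail=100
  # Namespace events ordered by last-seen time
  run kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp
}

triage
```

To execute the commands against a cluster instead of printing them, run the script with `DRY_RUN=0` after `kubectl` is pointed at the right context.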