Upgrading host machines or moving workloads to specialized machine types is a routine administrative task in enterprise Kubernetes environments. On Google Kubernetes Engine (GKE), completing these migrations without application downtime requires careful orchestration of node pools, traffic management, and pod evictions. This step-by-step tutorial shows how to gracefully migrate containerized workloads from a default node pool to a newly provisioned, higher-memory node pool without user-facing downtime, provided the application runs multiple replicas.
⚡ Key Takeaways
- Seamless Node Upgrades: Learn to shift running containers to new VM instances without causing packet loss or service disruptions.
- Cordon vs. Drain: Master the two essential Kubernetes node commands that control pod scheduling and eviction safety.
- Optimize Resources: Understand how to add high-performance node pools to GKE clusters and retire legacy pools safely.
- Automated Rescheduling: See how Kubernetes controllers recreate evicted pods and how the scheduler places them only on schedulable nodes.
Why Migrate Workloads to a New GKE Node Pool?
In a production Google Kubernetes Engine (GKE) environment, application workloads rarely remain static. Over time, your compute demands may evolve and call for more CPU cores, GPU accelerators, or high-memory virtual machines. Creating a separate node pool lets you introduce virtual machines with distinct hardware profiles, different service accounts or access scopes, or newer operating system images into your existing cluster. To keep active user requests from failing during the shift, administrators use Kubernetes' native scheduling controls to drain the old nodes gracefully, so that evicted pods are recreated on the new node pool while the remaining replicas keep serving traffic.
Prerequisites
To successfully follow along with this hands-on guide, you will need:
- An active Google Cloud Platform (GCP) account.
- A project configured with billing enabled.
- An IAM identity with the Owner, Editor, or Kubernetes Engine Admin role.
- Google Cloud SDK installed locally, or access to the GCP Cloud Shell.
Step 1: Enable the Kubernetes Engine API
Before provisioning any managed resources in Google Cloud, you must ensure the corresponding APIs are enabled. Search for the 'Kubernetes Engine API' in your GCP Console search bar and click 'Enable'. Alternatively, you can enable it instantly via the Cloud Shell using the gcloud command line tool:
gcloud services enable container.googleapis.com
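To confirm the API is actually enabled before continuing, a small check like the following may help. This is a minimal sketch: the function name api_enabled is hypothetical, and it assumes gcloud is pointed at the intended project.

```shell
# Hypothetical helper: succeeds (exit 0) only if the given API appears
# in the active project's list of enabled services.
api_enabled() (
  api="$1"
  gcloud services list --enabled --format="value(config.name)" | grep -qx "$api"
)
```

For example, `api_enabled container.googleapis.com && echo "Kubernetes Engine API is enabled"`.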
Step 2: Launch and Authenticate Cloud Shell
Google Cloud Shell provides a pre-configured terminal environment equipped with the necessary CLI utilities, including gcloud and kubectl. Click the Cloud Shell icon in the top-right corner of the GCP console. Cloud Shell sessions are normally already authenticated as the signed-in console user, so the following command is mainly needed when you work from a local Google Cloud SDK installation or want to switch accounts:
gcloud auth login
Step 3: Set Your Target Project, Region, and Zone
To prevent resource creation errors, explicitly declare your active project ID and set the default compute region and zone. Replace [PROJECT_ID] with your actual GCP project identifier:
gcloud config set project [PROJECT_ID]
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a
Verify that your settings are correctly applied by checking the configuration list: gcloud config list.
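If you would rather have a script fail early than inspect the list by eye, the check can be sketched as below. The function name check_gcloud_config is hypothetical; it only relies on gcloud config get-value printing nothing (or "(unset)") for a missing property.

```shell
# Hypothetical helper: prints each required gcloud property and
# returns non-zero if any of them is unset.
check_gcloud_config() (
  status=0
  for prop in project compute/region compute/zone; do
    value="$(gcloud config get-value "$prop" 2>/dev/null)"
    if [ -z "$value" ] || [ "$value" = "(unset)" ]; then
      echo "missing: $prop"
      status=1
    else
      echo "$prop = $value"
    fi
  done
  exit "$status"
)
```

Running `check_gcloud_config` before Step 4 surfaces a missing zone or project before any resources are created.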
Step 4: Provision Your Initial GKE Cluster
Start by creating a demo cluster consisting of three standard virtual machines. This cluster will act as the source hosting environment for our initial application workload. The following command creates a cluster named demo-cluster with three worker nodes running on standard e2-medium instances:
gcloud container clusters create demo-cluster --num-nodes=3 --machine-type=e2-medium
This process generally takes 5 to 10 minutes to bootstrap the control plane and provision the compute nodes.
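Before deploying anything, it can be useful to confirm that kubectl is pointed at the new cluster and that every node has reached the Ready state. The sketch below assumes the default zone from Step 3 is still set; the function name wait_for_nodes and the 300-second default timeout are illustrative choices, not part of the original tutorial.

```shell
# Hypothetical helper: fetches kubeconfig credentials for the cluster,
# then blocks until every node reports the Ready condition.
wait_for_nodes() (
  cluster="$1"
  timeout="${2:-300s}"
  gcloud container clusters get-credentials "$cluster" || exit 1
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout"
)
```

Usage: `wait_for_nodes demo-cluster`.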
Step 5: Clone the Application Demo Repository
Retrieve the sample Kubernetes deployment manifest files from our source code repository. Run the following git commands in your terminal to fetch the resources and switch to the project directory:
git clone https://github.com/ShivaniG04/KubernetesWorkloadMigration.git
cd KubernetesWorkloadMigration
Step 6: Deploy the Replicated Application to the Cluster
Deploy the replicated sample web application to your active GKE cluster. Apply the deployment manifest utilizing kubectl:
kubectl apply -f node-pools-deployment.yaml
Confirm that the pods are running and note their host node mappings by appending the wide output flag:
kubectl get pods -o wide
You will observe that all active application pods are currently distributed across the three nodes of your default node pool.
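To make the distribution easier to read than the wide table, the pod-to-node mapping can be summarized as a count per node. This is a minimal sketch that looks only at the current namespace; pods_per_node is a hypothetical name.

```shell
# Hypothetical helper: prints "<count> <node>" for each node that hosts
# pods in the current namespace.
pods_per_node() {
  kubectl get pods --no-headers -o custom-columns=NODE:.spec.nodeName \
    | sort | uniq -c | awk '{print $1, $2}'
}
```

Running `pods_per_node` now and again after Step 8 gives a quick before-and-after comparison.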
Step 7: Provision the High-Performance Target Node Pool
Now, we will introduce a new, higher-memory node pool named high-mem-pool into our cluster. This pool uses e2-highmem-2 instances, which provide more memory per vCPU than the e2-medium instances in the default pool. Execute the pool creation command:
gcloud container node-pools create high-mem-pool --cluster=demo-cluster --machine-type=e2-highmem-2 --num-nodes=3
Once created, list the node pools to verify both pools are attached to your cluster: gcloud container node-pools list --cluster=demo-cluster. Running kubectl get nodes will now display six active nodes in your cluster.
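GKE labels every node with the name of its pool under cloud.google.com/gke-nodepool, the same label the cordon and drain loops in Step 8 select on. A quick way to check that each pool contributes the expected three nodes might look like this (count_pool_nodes is a hypothetical helper name):

```shell
# Hypothetical helper: prints how many nodes carry the given
# GKE node pool label value.
count_pool_nodes() {
  kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" -o name | wc -l | tr -d ' '
}
```

For example, `count_pool_nodes default-pool` and `count_pool_nodes high-mem-pool` should each print 3 at this point in the tutorial.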
Step 8: Migrate Workloads to the New Pool (Cordon & Drain)
To safely migrate the workload, we must systematically evacuate the default nodes. This is a two-step process: cordoning and draining.
First, cordon all nodes belonging to the default-pool to mark them as unschedulable. This prevents Kubernetes from placing any new pods on these nodes:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl cordon "$node"
done
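Before draining, it is worth confirming that every node in the pool was actually cordoned. Cordoning sets the node's spec.unschedulable field to true, so one possible check is to list any node in the pool where that field is not set. The function name uncordoned_nodes is hypothetical.

```shell
# Hypothetical helper: prints the names of nodes in the pool that are
# NOT yet cordoned. .spec.unschedulable is empty on schedulable nodes
# and "true" once a node has been cordoned.
uncordoned_nodes() {
  kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.unschedulable}{"\n"}{end}' \
    | awk '$2 != "true" {print $1}'
}
```

An empty result from `uncordoned_nodes default-pool` means the whole pool is cordoned.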
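Next, drain the cordoned nodes. Draining evicts the active pods gracefully; their controllers then recreate them, and the scheduler places the replacements on the available nodes in the high-mem-pool. Note that --force also deletes pods that are not managed by a controller (these are not recreated), and --grace-period=10 overrides each pod's own termination grace period with 10 seconds: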
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl drain --force --ignore-daemonsets --delete-emptydir-data --grace-period=10 "$node"
done
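The loop above drains the nodes back to back. A more cautious variant, sketched here under assumptions, drains one node at a time and waits for the Deployment to report all replicas available before touching the next node. The function name drain_pool_gradually, the Deployment name passed to it, and the 120-second timeout are all hypothetical. The sketch also omits --force and --grace-period, so each pod keeps its own termination grace period and the drain stops if it finds pods without a controller.

```shell
# Hypothetical helper: drains the pool's nodes one at a time and waits
# for the Deployment to become fully available again before moving on.
# Stops at the first failure.
drain_pool_gradually() (
  pool="$1"        # node pool label value, e.g. default-pool
  deployment="$2"  # hypothetical Deployment name; substitute your own
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=$pool" -o name); do
    kubectl drain --ignore-daemonsets --delete-emptydir-data "$node" || exit 1
    kubectl rollout status "deployment/$deployment" --timeout=120s || exit 1
  done
)
```

Usage, with a placeholder Deployment name: `drain_pool_gradually default-pool my-deployment`. The trade-off is a slower migration in exchange for never having more than one node's worth of pods in flight.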
Verify that your pods have rescheduled to the high-memory nodes by running: kubectl get pods -o wide. The NODE column should now list only high-mem-pool nodes. If the Deployment runs multiple replicas spread across nodes, the application stays available throughout the move.
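The same verification can be scripted by asking, for each node in the old pool, whether any pods in the current namespace are still bound to it. This is a sketch only: pods_on_pool is a hypothetical name, and it does not look at other namespaces, where DaemonSet pods are expected to remain.

```shell
# Hypothetical helper: prints "<pod> <node>" for every pod in the
# current namespace that still runs on a node of the given pool.
pods_on_pool() (
  pool="$1"
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=$pool" -o name); do
    kubectl get pods --no-headers --field-selector "spec.nodeName=${node#node/}" \
      -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName
  done
)
```

An empty result from `pods_on_pool default-pool` indicates the application pods have left the old pool.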
Step 9: Retire and Delete the Legacy Node Pool
Once all application pods have successfully rescheduled and are verified healthy on the new node pool, it is safe to decommission the legacy hardware. Deleting the old pool releases the compute resources and helps optimize your GCP spending:
gcloud container node-pools delete default-pool --cluster=demo-cluster
Confirm the deletion by listing the remaining node pools. Your cluster should now run exclusively on the high-memory pool.
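One way to script that confirmation is to check that the old pool's name no longer appears in the list. The helper name pool_deleted is hypothetical, and the sketch has a known limitation: if the gcloud call itself fails and prints nothing, the function also reports success, so treat it as a convenience check rather than a guarantee.

```shell
# Hypothetical helper: succeeds only if the named node pool is absent
# from the cluster's node pool list.
pool_deleted() (
  cluster="$1"
  pool="$2"
  ! gcloud container node-pools list --cluster="$cluster" --format="value(name)" \
    | grep -qx "$pool"
)
```

Usage: `pool_deleted demo-cluster default-pool && echo "default-pool is gone"`.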
Step 10: Clean Up Your Cloud Infrastructure
To avoid incurring ongoing charges on your GCP account, delete the resources once your testing is complete. Load balancers and persistent disks created by workloads can outlive the cluster, so check for leftovers afterward. Delete the GKE cluster, which removes the control plane and all node virtual machines:
gcloud container clusters delete demo-cluster
Quick Comparison: Cordon vs. Drain Operations
| Operation | Primary Action | Impact on Existing Pods | Impact on New Pods |
|---|---|---|---|
| Cordon | Marks node as unschedulable | Existing pods continue running undisturbed | New pods are blocked from scheduling |
| Drain | Cordons the node, then evicts its pods | Pods are gracefully terminated and recreated elsewhere by their controllers | New pods are blocked from scheduling |
❓ Frequently Asked Questions
Does cordoning a node immediately stop the running containers?
No. Cordoning only sets the node's spec.unschedulable field to mark it as unschedulable. Any containers currently running on the node continue to execute undisturbed until the node is explicitly drained or the pods are deleted.
Why must we include the --ignore-daemonsets flag when draining nodes?
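A DaemonSet is a controller that runs one pod on every node (or every matching node) in the cluster, typically for logging agents or monitoring tools. Its pods cannot be moved to another node, and the DaemonSet controller would recreate them on the same node even while it is cordoned. Without the --ignore-daemonsets flag, kubectl drain refuses to proceed when it finds DaemonSet-managed pods; with the flag, it leaves those pods in place and evicts everything else.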
How do we ensure pods do not experience downtime during migration?
To prevent downtime, ensure your Deployment runs two or more replicas, ideally spread across different nodes, and configure a PodDisruptionBudget. kubectl drain evicts pods through the Eviction API, which honors the budget, so Kubernetes keeps a minimum number of healthy pods available to handle user traffic while a node is drained.
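As an illustration, a PodDisruptionBudget can be generated and applied from the shell. This is a sketch under assumptions: the render_pdb helper is hypothetical, and it assumes your pods carry an app label, which may not match the labels used in the sample repository's manifest.

```shell
# Hypothetical helper: prints a PodDisruptionBudget manifest that keeps
# at least $2 pods with the label app=$1 available during evictions.
render_pdb() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1-pdb
spec:
  minAvailable: $2
  selector:
    matchLabels:
      app: $1
EOF
}
```

For example, `render_pdb web 2 | kubectl apply -f -` would create a budget named web-pdb. Keep minAvailable below the replica count; otherwise no pod can ever be evicted and the drain will block.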
Can we automate node pool migration dynamically?
Yes, in part. GKE node auto-upgrades cordon, drain, and replace nodes automatically, and can be restricted to maintenance windows. Node auto-provisioning creates and removes node pools based on the requirements of pending pods. Moving workloads to a node pool with a machine type you choose, as in this tutorial, remains a manual or scripted task.
🎯 Conclusion
Migrating GKE workloads to a new node pool is a core operational skill that lets your infrastructure keep pace with changing application demands. By understanding the distinction between cordoning and draining, and by running multiple replicas behind a PodDisruptionBudget, you can move running pods between machine types with little or no impact on clients. Incorporate these practices into your regular infrastructure updates to keep your cluster resilient and highly available.