Migrating massive on-premises datasets to the cloud is a complex engineering task that often poses challenges regarding transfer speed, data integrity, and network reliability. Amazon Web Services provides AWS DataSync, a fully managed online data transfer service that simplifies, automates, and accelerates data replication to AWS storage services. This step-by-step tutorial walks you through the practical process of configuring an AWS DataSync agent and setting up a secure, automated data synchronization pipeline from an on-premises NFS server directly to an Amazon S3 bucket.
⚡ Key Takeaways
- According to AWS, DataSync can transfer data up to 10 times faster than open-source tools like rsync or rclone by using a proprietary network protocol.
- An on-premises DataSync Agent must be deployed on a virtual machine (or EC2 instance for simulation) with a minimum of 4 vCPUs and 16 GiB RAM.
- DataSync preserves file metadata, directory structures, and system permissions seamlessly between source and destination endpoints.
- All transferred data is encrypted in transit using TLS, and integrity checks are executed automatically during and after the transfer process.
Why AWS DataSync is the Optimal Choice for Cloud Migration
Standard data transfer scripts (such as aws s3 sync) run into performance bottlenecks when handling millions of small files or operating over high-latency links. AWS DataSync works around these limitations with a multi-threaded, proprietary data transfer protocol designed to make full use of the available network bandwidth. DataSync retries failed connections automatically, verifies data integrity as it transfers, and integrates natively with Amazon CloudWatch for monitoring. This reduces custom scripting effort, letting engineers focus on data strategy rather than connection monitoring.
Understanding the DataSync Architecture
Before launching the configuration process, it is important to understand the four primary components of an AWS DataSync deployment:
- DataSync Agent: A virtual appliance deployed in your local datacenter environment that reads from your source file systems and securely streams the data to AWS.
- Source Location: The configuration endpoint representing the source system, which can include NFS shares, SMB shares, HDFS clusters, or self-managed object storage.
- Destination Location: The AWS storage endpoint representing your target service, such as Amazon S3, Amazon EFS, or Amazon FSx.
- Task: The logical execution job that binds the source location, destination location, and transfer parameters (e.g., bandwidth limits, scheduling, and validation criteria) together.
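Once a deployment exists, the components above can be inspected from the AWS CLI. The following is a minimal sketch, assuming the CLI is configured with credentials and a default Region; the function name datasync_inventory is hypothetical and only wraps the list-agents, list-locations, and list-tasks subcommands.

```shell
# Hypothetical helper: print the agents, locations, and tasks in the current Region.
datasync_inventory() {
  echo "== Agents =="
  aws datasync list-agents \
    --query 'Agents[].[Name,Status,AgentArn]' --output text
  echo "== Locations =="
  aws datasync list-locations \
    --query 'Locations[].[LocationUri,LocationArn]' --output text
  echo "== Tasks =="
  aws datasync list-tasks \
    --query 'Tasks[].[Name,Status,TaskArn]' --output text
}

# Example: datasync_inventory
```

Source and destination locations appear in the same list; the URI scheme (for example nfs:// or s3://) tells them apart.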
13 Steps to Sync On-Premises Data to Amazon S3
To simulate an on-premises datacenter inside AWS, we will deploy an Amazon EC2 instance to act as our local NFS server, alongside an EC2 instance hosting our DataSync Agent.
Step 1: Retrieve the Latest DataSync AMI ID
To begin, retrieve the latest, officially validated AWS DataSync Amazon Machine Image (AMI) ID for your target AWS Region. You can query the AWS Systems Manager (SSM) Parameter Store via the AWS Command Line Interface (CLI) by running the following command:
aws ssm get-parameter --name /aws/service/datasync/ami --region us-east-1
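The command above returns a full JSON document. If you only need the AMI ID for later steps, a small wrapper like the one below can extract it; this is a sketch, and the function name datasync_ami_id is hypothetical. It relies only on the standard --query and --output options of the AWS CLI.

```shell
# Hypothetical helper: print only the DataSync AMI ID for a given Region.
# Defaults to us-east-1 when no Region is passed.
datasync_ami_id() {
  region="${1:-us-east-1}"
  aws ssm get-parameter \
    --name /aws/service/datasync/ami \
    --region "$region" \
    --query 'Parameter.Value' \
    --output text
}

# Example: AMI_ID="$(datasync_ami_id us-east-1)"
```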
Step 2: Launch the DataSync Agent Instance
Launch an EC2 instance using the AMI ID retrieved in the previous step. Provision a host with at least 4 vCPUs and 16 GiB of RAM; a t2.xlarge instance meets this and is adequate for the simulation, while production agents should be sized against the current DataSync agent requirements. Assign a public IP address to the instance, and ensure the security group permits inbound traffic on port 80 (HTTP) only from your administrative workstation, which is where the activation key request originates.
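If you prefer the CLI to the console for this step, the launch could look roughly like the sketch below. The function name launch_datasync_agent is hypothetical, and the AMI, subnet, and security group IDs are values you supply; key pair and tagging options are left out for brevity.

```shell
# Hypothetical helper: launch the agent instance and print its instance ID.
# Arguments: AMI ID, subnet ID, security group ID (all supplied by you).
launch_datasync_agent() {
  ami_id="$1"; subnet_id="$2"; sg_id="$3"
  aws ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type t2.xlarge \
    --subnet-id "$subnet_id" \
    --security-group-ids "$sg_id" \
    --associate-public-ip-address \
    --query 'Instances[0].InstanceId' \
    --output text
}

# Example: launch_datasync_agent "$AMI_ID" subnet-EXAMPLE sg-EXAMPLE
```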
Step 3: Access the AWS DataSync Console
Sign in to the AWS Management Console, navigate to the search bar, type DataSync, and click on the service to open the landing page. In the left navigation menu, click on Agents, and then click on Create agent to initiate the connection wizard.
Step 4: Retrieve the Agent Activation Key
Under the Hypervisor section, choose Amazon EC2. For the Service Endpoint, select Public endpoints. In the Agent Address field, input the public IP address of the running DataSync Agent instance you deployed in Step 2. Click the Get key button. Your browser then sends an HTTP request to the agent on port 80 to fetch the activation key, which is why the security group must allow port 80 from your workstation.
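The same key can be requested without the console. The sketch below assumes a public service endpoint and the query-string format described in the DataSync agent activation documentation; the function names are hypothetical, and the request must come from a host that the agent's security group allows on port 80.

```shell
# Build the activation URL for an agent that uses a public service endpoint.
# Arguments: agent IP address, activation Region.
activation_url() {
  agent_ip="$1"; region="$2"
  printf 'http://%s/?gatewayType=SYNC&activationRegion=%s&no_redirect\n' \
    "$agent_ip" "$region"
}

# Fetch the activation key; prints the key on success.
get_activation_key() {
  curl --silent --fail "$(activation_url "$1" "$2")"
}

# Example (203.0.113.10 is a placeholder address):
#   ACTIVATION_KEY="$(get_activation_key 203.0.113.10 us-east-1)"
```

Activation keys are short-lived, so it is usually best to request one immediately before registering the agent.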
Step 5: Finalize Agent Creation
Once the activation key retrieval is successful, provide a descriptive name for your agent (e.g., on-premise-nfs-agent). Click the Create agent button. Within seconds, your agent will establish a connection to AWS and display a status of Online, indicating it is ready to execute data transfer tasks.
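For a scripted setup, registering the agent with an activation key you already hold might look like this sketch. The function name register_datasync_agent is hypothetical; the default agent name simply reuses the example name from this step.

```shell
# Hypothetical helper: register the agent and print its ARN.
# Arguments: activation key, optional agent name.
register_datasync_agent() {
  activation_key="$1"; agent_name="${2:-on-premise-nfs-agent}"
  aws datasync create-agent \
    --activation-key "$activation_key" \
    --agent-name "$agent_name" \
    --query 'AgentArn' \
    --output text
}

# Example:
#   AGENT_ARN="$(register_datasync_agent "$ACTIVATION_KEY")"
#   aws datasync list-agents --query 'Agents[].[Name,Status]' --output text
```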
Step 6: Deploy the Simulated On-Premises Server
To simulate your local datacenter file system, deploy a standard Linux EC2 instance (a cost-effective t2.micro is sufficient). This host will act as your local file server and store the data payload that you wish to replicate to Amazon S3.
Step 7: Configure Network Security Group Rules
Ensure the security group attached to your simulated on-premises server allows inbound traffic on port 2049 (NFS) specifically from the security group of your DataSync Agent. This allows the agent to read file systems without exposing ports to the public internet.
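As a sketch, the rule described above can be added from the CLI by referencing the agent's security group as the traffic source. The function name allow_nfs_from_agent is hypothetical, and both group IDs are placeholders you supply.

```shell
# Hypothetical helper: allow NFS (TCP 2049) into the file server's security
# group, but only from members of the agent's security group.
# Arguments: NFS server security group ID, agent security group ID.
allow_nfs_from_agent() {
  nfs_sg="$1"; agent_sg="$2"
  aws ec2 authorize-security-group-ingress \
    --group-id "$nfs_sg" \
    --protocol tcp \
    --port 2049 \
    --source-group "$agent_sg"
}

# Example: allow_nfs_from_agent sg-NFS-EXAMPLE sg-AGENT-EXAMPLE
```

Referencing a security group instead of an IP range keeps the rule valid even if the agent instance is replaced and receives a new private address.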
Step 8: Install and Configure the NFS Daemon
Establish an SSH connection to your simulated on-premises server. Install the NFS utility package and start the NFS server daemon by running the following commands in your terminal:
sudo yum update -y
sudo yum install nfs-utils -y
sudo systemctl enable --now nfs-server
Step 9: Create and Share the Source Directory
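Create a directory named /test to store your test files and, inside it, generate a mock text file named sampletext.txt. Then add the directory to your NFS exports file and reload the exports. Note that the entry below uses *, which exports the share to any host that can reach the server; this is tolerable here only because the security group from Step 7 limits port 2049 to the agent. To share the directory with the DataSync Agent alone, replace * with the agent's private IP address. Run the following commands: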
sudo mkdir /test
sudo chown -R nobody:nobody /test
echo "This is a sample file for AWS DataSync replication." | sudo tee /test/sampletext.txt
echo "/test *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -arv
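If you want the export limited to a single client rather than every host, a small helper can generate the entry. This is a sketch: the function name exports_entry is hypothetical, and 10.0.1.25 stands in for your agent's private IP address.

```shell
# Hypothetical helper: print an /etc/exports entry limited to one client.
# Arguments: exported path, client IP address.
exports_entry() {
  share_path="$1"; client_ip="$2"
  printf '%s %s(rw,sync,no_subtree_check,no_root_squash)\n' \
    "$share_path" "$client_ip"
}

# Example (10.0.1.25 is a placeholder for the agent's private IP):
#   exports_entry /test 10.0.1.25 | sudo tee -a /etc/exports
#   sudo exportfs -arv
#   showmount -e localhost   # lists the exports the server is offering
```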
Step 10: Configure Source and Destination Endpoints
Return to the AWS DataSync Console. Click on Tasks, and then click Create task. Configure the source location with the following parameters:
- Location Type: Network File System (NFS)
- Agent: Select the agent you activated in Step 5
- NFS Server: Enter the private IP address of your simulated NFS server
- Mount Path: Enter /test
Click Next, and configure your destination location by selecting Amazon S3, choosing your target S3 bucket, and defining your desired S3 folder prefix.
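The two locations can also be created from the CLI. The sketch below assumes you already have the agent ARN, the bucket ARN, and an IAM role ARN that DataSync can assume to reach the bucket; the function names are hypothetical.

```shell
# Hypothetical helper: register the NFS share as a source location.
# Arguments: NFS server address, agent ARN.
create_nfs_location() {
  nfs_host="$1"; agent_arn="$2"
  aws datasync create-location-nfs \
    --server-hostname "$nfs_host" \
    --subdirectory /test \
    --on-prem-config "AgentArns=$agent_arn" \
    --query 'LocationArn' \
    --output text
}

# Hypothetical helper: register the S3 bucket as a destination location.
# Arguments: bucket ARN, bucket access role ARN, optional folder prefix.
create_s3_location() {
  bucket_arn="$1"; role_arn="$2"; prefix="${3:-/}"
  aws datasync create-location-s3 \
    --s3-bucket-arn "$bucket_arn" \
    --subdirectory "$prefix" \
    --s3-config "BucketAccessRoleArn=$role_arn" \
    --query 'LocationArn' \
    --output text
}

# Example:
#   SRC_ARN="$(create_nfs_location 10.0.1.50 "$AGENT_ARN")"
#   DST_ARN="$(create_s3_location arn:aws:s3:::EXAMPLE-BUCKET "$ROLE_ARN" /nfs-data)"
```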
Step 11: Configure Task Logging and Permissions
Provide a clear task name (e.g., nfs-to-s3-sync). Under the execution configurations, leave the data verification and transfer parameters at their default values. For the IAM role that DataSync uses to access the bucket (the console may present this setting alongside the S3 destination options rather than the task settings), choose the Autogenerate option, which creates an IAM role granting DataSync permission to write files into your destination S3 bucket.
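If you create the role yourself instead of using Autogenerate, it needs a trust policy that lets the DataSync service assume it. The following is a minimal sketch; the function names are hypothetical, and the role still needs a separate permissions policy for the target bucket, which is not shown here.

```shell
# Print a trust policy that allows the DataSync service to assume the role.
datasync_trust_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "datasync.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Hypothetical helper: create the role and print its ARN.
# Argument: role name of your choosing.
create_datasync_role() {
  role_name="$1"
  aws iam create-role \
    --role-name "$role_name" \
    --assume-role-policy-document "$(datasync_trust_policy)" \
    --query 'Role.Arn' \
    --output text
}

# Example: ROLE_ARN="$(create_datasync_role datasync-s3-access-example)"
```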
Step 12: Wait for Task Deployment to Complete
Click Next, review your configuration parameters on the summary page, and then click Create task. Monitor the task dashboard and wait a moment until the task status changes from Creating to Available.
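A CLI equivalent of creating the task and checking its status might look like the sketch below. The function names are hypothetical, the task name reuses the example from Step 11, and the location ARNs are assumed to come from locations you created earlier.

```shell
# Hypothetical helper: create the task and print its ARN.
# Arguments: source location ARN, destination location ARN.
create_sync_task() {
  src_arn="$1"; dst_arn="$2"
  aws datasync create-task \
    --source-location-arn "$src_arn" \
    --destination-location-arn "$dst_arn" \
    --name nfs-to-s3-sync \
    --query 'TaskArn' \
    --output text
}

# Hypothetical helper: print the task status (for example CREATING or AVAILABLE).
task_status() {
  aws datasync describe-task \
    --task-arn "$1" \
    --query 'Status' \
    --output text
}

# Example:
#   TASK_ARN="$(create_sync_task "$SRC_ARN" "$DST_ARN")"
#   task_status "$TASK_ARN"
```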
Step 13: Execute and Verify the DataSync Task
Click the Start button on the task details page and select Start with defaults. AWS DataSync will scan your source directory, determine what needs to be copied, establish a secure connection, and begin streaming the files. Once the status displays Success, open your target bucket in the Amazon S3 console to verify that sampletext.txt has been synchronized with its directory structure intact.
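The start-and-wait cycle can be scripted as well. This is a sketch under the assumption that polling every few seconds is acceptable; the function name run_task_and_wait and its default interval and poll limit are illustrative choices, not DataSync settings.

```shell
# Hypothetical helper: start a task execution and poll until it finishes.
# Arguments: task ARN, optional poll interval in seconds, optional max polls.
run_task_and_wait() {
  task_arn="$1"; interval="${2:-15}"; max_polls="${3:-240}"
  exec_arn="$(aws datasync start-task-execution --task-arn "$task_arn" \
    --query 'TaskExecutionArn' --output text)" || return 1
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    status="$(aws datasync describe-task-execution \
      --task-execution-arn "$exec_arn" \
      --query 'Status' --output text)" || return 1
    echo "status: $status"
    case "$status" in
      SUCCESS) return 0 ;;
      ERROR)   return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done
  echo "gave up after $max_polls polls" >&2
  return 1
}

# Example (bucket and prefix are placeholders):
#   run_task_and_wait "$TASK_ARN" && aws s3 ls "s3://EXAMPLE-BUCKET/nfs-data/" --recursive
```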
Quick Comparison: AWS Online Data Transfer Options
| Feature | AWS DataSync | AWS Storage Gateway | AWS Transfer Family |
|---|---|---|---|
| Primary Use Case | One-time or recurring bulk migrations and synchronization. | Hybrid storage caching; seamless local file access. | FTP, SFTP, and FTPS client integrations. |
| Performance Optimization | Proprietary network acceleration protocol. | Local cache disk read/write optimization. | Standard file transfer protocols over SSH/SSL. |
| Protocol Support | NFS, SMB, HDFS, Object Storage (S3 API). | NFS, SMB, iSCSI volume mappings. | SFTP, FTPS, FTP, AS2. |
| Integrity Checking | Automatic, end-to-end checksum verification. | Implicit storage layer consistency validations. | Client-managed verification models. |
❓ Frequently Asked Questions
Can AWS DataSync sync data between other public clouds and Amazon S3?
Yes. AWS DataSync can copy data from alternative public cloud storage providers, such as Google Cloud Storage (using the S3 compatible API) or Microsoft Azure Files (using the SMB protocol). You can deploy the DataSync agent on an Amazon EC2 instance or within your alternative cloud environment to orchestrate the migration.
Does AWS DataSync encrypt data during the synchronization process?
Absolutely. AWS DataSync ensures that all data transferred between the local datacenter agent and AWS storage services is encrypted in transit using Transport Layer Security (TLS). Furthermore, the data written to Amazon S3 is encrypted at rest using default S3 managed keys (SSE-S3) or your custom KMS keys.
How does AWS DataSync handle data validation?
AWS DataSync calculates and records a checksum for every file at the source location and compares it to the checksum of the copied file at the destination. You can configure tasks to verify only transferred data, verify the entire dataset upon task completion, or disable verification entirely to speed up transfers of non-critical data.
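The three verification choices described above correspond to the VerifyMode task option. The sketch below shows one way to switch modes on an existing task; the function name set_verify_mode is hypothetical, and the up-front validation is a convenience of this example rather than something the CLI requires.

```shell
# Hypothetical helper: change how a task verifies data.
# Arguments: task ARN, verify mode.
#   ONLY_FILES_TRANSFERRED    verify only the data that was transferred
#   POINT_IN_TIME_CONSISTENT  verify the entire dataset at the end of the task
#   NONE                      skip the additional verification step
set_verify_mode() {
  task_arn="$1"; mode="$2"
  case "$mode" in
    ONLY_FILES_TRANSFERRED|POINT_IN_TIME_CONSISTENT|NONE) ;;
    *) echo "unsupported verify mode: $mode" >&2; return 2 ;;
  esac
  aws datasync update-task \
    --task-arn "$task_arn" \
    --options "VerifyMode=$mode"
}

# Example: set_verify_mode "$TASK_ARN" ONLY_FILES_TRANSFERRED
```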
How does AWS DataSync charge for data transfers?
AWS DataSync is priced using a simple usage-based billing model. You are charged a flat rate per gigabyte (GB) of data transferred from your source system to AWS. There are no licensing fees, and you only pay for what you actually transfer. Standard AWS data transfer and storage costs apply separately.
🎯 Conclusion
AWS DataSync is a secure and efficient tool that takes much of the manual work out of cloud data migration. By deploying an agent, defining source and destination locations, and creating a replication task, you can sync large on-premises file structures to Amazon S3 with built-in encryption and integrity verification. If you currently rely on fragile, custom-made transfer scripts, DataSync is a practical replacement that keeps your migrations repeatable and easier to monitor as your data grows.