We run infrastructure on both AWS Global and AWS China, and we need to transfer objects between them. Because of the GFW, packet loss is frequent when data is transferred into or out of China. We therefore send all data over DX (Direct Connect), and we built this solution to copy S3 objects from global to China automatically, using SQS (Simple Queue Service) and an S3 VPC (Virtual Private Cloud) endpoint to improve robustness.
- With S3 Events, a notification is sent to SQS automatically whenever an object is created or deleted.
- A small cluster of long-running EC2 instances long-polls the SQS queue and runs the transfer script as soon as messages arrive. Each instance uses multiprocessing to maximize its utilization.
- On receiving messages from SQS, an EC2 instance runs the script to download objects from global S3 by byte range and upload them to China S3 with multipart upload. We choose the part size carefully so that memory use stays bounded even when the maximum number of processes is running. Objects are held in memory rather than written to disk, so disk I/O is not a bottleneck.
- The EC2 cluster scales in and out automatically based on the size of the SQS queue.
- After an object is uploaded, the transfer script calls the SQS API to delete the corresponding message from the queue. If the transfer fails, the message is not deleted, so it becomes visible again after the visibility timeout and is picked up by another EC2 worker. This is how SQS lets us retry unsuccessful tasks.
- The global and China VPCs are connected through a DX gateway, and their routes are propagated via BGP (Border Gateway Protocol), so EC2 instances in the two VPCs can reach each other directly.
- Instead of using an HTTP proxy such as Squid to transfer objects, we use an S3 interface endpoint in the China VPC to carry the S3 traffic, which improves the reliability of the solution. A private hosted zone attached to the global VPC resolves the S3 domain (s3.cn-north-1.amazonaws.com.cn) to the IP address of the S3 interface endpoint.
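
The ranged download and multipart upload described above depend on how an object is split into parts. The following is a minimal sketch of that planning step, not the actual transfer script. The function names, the 16 MiB default part size, and the process count in the usage note are assumptions for illustration. The constants are the documented S3 multipart limits (parts of at least 5 MiB except the last, at most 10,000 parts). Each planned range would be passed as the `Range` of a GetObject request and each part number to the matching UploadPart call.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per multipart upload


def choose_part_size(object_size, preferred=16 * MIB):
    """Return a part size that respects the S3 multipart limits.

    `preferred` is an assumed default, not a value taken from the solution.
    """
    part_size = max(preferred, MIN_PART_SIZE)
    # Smallest part size that keeps the object within MAX_PARTS (integer ceiling).
    needed = -(-object_size // MAX_PARTS)
    if needed > part_size:
        # Round up to a whole number of MiB.
        part_size = -(-needed // MIB) * MIB
    return part_size


def plan_parts(object_size, part_size):
    """Return (part_number, range_header) pairs that cover the whole object.

    Empty objects produce no parts and are not handled by this sketch.
    """
    parts = []
    start = 0
    part_number = 1
    while start < object_size:
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + part_size, object_size) - 1
        parts.append((part_number, f"bytes={start}-{end}"))
        start = end + 1
        part_number += 1
    return parts


def peak_buffer_bytes(processes, part_size):
    """Upper bound on buffered object data if each process holds one part."""
    return processes * part_size
```

With an assumed 8 worker processes and 16 MiB parts, `peak_buffer_bytes(8, 16 * MIB)` gives 128 MiB of buffered object data per instance, which is the kind of bound the part size is chosen to keep.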
We use SQS and the VPC endpoint to improve reliability. Multiprocessing and autoscaling balance performance against cost. Lambda is not used in this solution because of its runtime and memory limits; working within them would require complicated logic to make it run well in all scenarios.
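
As an illustration of the first step in the flow, the sketch below turns the body of one SQS message carrying an S3 event notification into a list of tasks. It is a hedged example rather than the actual transfer script: the function name and the `"copy"` / `"delete"` labels are hypothetical, and what the real script does on a delete event is not stated here. The field names follow the S3 event notification message format, in which object keys are URL-encoded.

```python
import json
from urllib.parse import unquote_plus


def parse_s3_event(body):
    """Turn one SQS message body into a list of (action, bucket, key) tuples."""
    event = json.loads(body)
    tasks = []
    # The s3:TestEvent message sent when a notification is first configured
    # has no "Records" field, so it yields an empty task list.
    for record in event.get("Records", []):
        name = record.get("eventName", "")
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded, with spaces encoded as "+".
        key = unquote_plus(record["s3"]["object"]["key"])
        if name.startswith("ObjectCreated:"):
            tasks.append(("copy", bucket, key))      # hypothetical action label
        elif name.startswith("ObjectRemoved:"):
            tasks.append(("delete", bucket, key))    # hypothetical action label
    return tasks
```

A worker would delete the SQS message only after every task returned for it has succeeded, so that a failed transfer is retried once the visibility timeout expires.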