Uploading Large Files Up to 5TB to Amazon S3 using Boto3 in Python
By abdulmumin yaqeen
on January 31, 2024
Amazon Simple Storage Service (S3) is a widely used cloud storage service that allows users to store and retrieve any amount of data at any time. Uploading large files, especially those approaching the terabyte scale, can be challenging. Boto3, the AWS SDK for Python, provides a powerful and flexible way to interact with S3, including handling large file uploads through its multipart upload feature.
Prerequisites
Before we begin, make sure you have the following:
- AWS Account: You need an AWS account with appropriate permissions to access S3.
- Boto3 Installation: Install Boto3 by running `pip install boto3` in your terminal.
- AWS Credentials: Set up your AWS credentials, either by configuring the AWS CLI (`aws configure`) or directly within your script.
Why Multipart?
- A single PUT request, including one made through a presigned URL, is limited to 5GB per object, whereas multipart uploads can handle objects up to 5TB.
- Multipart uploads are efficient for large files, especially when parts are uploaded in parallel to speed up the transfer.
- Presigned URLs can also be used for multipart uploads, but each part then needs its own signed URL and HTTP request, which adds overhead compared with letting Boto3 manage the parts.
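The 5TB ceiling interacts with two other S3 limits: a multipart upload can have at most 10,000 parts, and every part except the last must be between 5 MiB and 5 GiB (AWS states these limits in binary units). The following sketch works out the smallest part size that fits a given file. The function name `min_part_size` is illustrative and not part of Boto3.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MAX_PARTS = 10_000          # S3 limit on parts per multipart upload
MIN_PART_SIZE = 5 * MIB     # S3 minimum part size (the last part may be smaller)
MAX_PART_SIZE = 5 * GIB     # S3 maximum part size
MAX_OBJECT_SIZE = 5 * TIB   # S3 maximum object size


def min_part_size(file_size: int) -> int:
    """Return the smallest part size (bytes) that keeps the upload within 10,000 parts."""
    if file_size > MAX_OBJECT_SIZE:
        raise ValueError("S3 objects cannot exceed 5 TiB")
    # Integer ceiling division: bytes per part if the file is split into 10,000 parts
    needed = (file_size + MAX_PARTS - 1) // MAX_PARTS
    return max(MIN_PART_SIZE, needed)


for size in (1 * GIB, 500 * GIB, 5 * TIB):
    print(f"{size / GIB:8.0f} GiB file -> parts of at least {min_part_size(size) / MIB:.2f} MiB")
```

For example, with 50 MiB parts, 10,000 parts cover only about 488 GiB, so a file close to 5TB needs parts of roughly 525 MiB or more.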
Benefits
S3 Multipart Upload is beneficial for handling large files efficiently. Here are key reasons to use it:
- Efficiency for Large Files: Splits large files into smaller parts for better handling.
- Resilience to Failures: Reduces the risk of failure by allowing resumption from the point of interruption.
- Parallel Uploads: Speeds up uploads by enabling parallel uploading of file parts.
- Optimal for Unstable Connections: Minimizes the impact of network failures by retrying only the failed parts.
- Support for Transfer Acceleration: Compatible with S3 Transfer Acceleration for faster uploads.
- SDK Support: AWS SDKs offer built-in support, simplifying implementation.
- Concurrency Control: Allows control over the number of parallel uploads.
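To make the splitting and retry behaviour above concrete, here is a minimal sketch of a multipart upload driven by hand with the low-level client calls `create_multipart_upload`, `upload_part`, `complete_multipart_upload`, and `abort_multipart_upload`. The helper name `multipart_upload`, the retry count, and the sequential loop are assumptions made for illustration; Boto3's transfer utilities take care of this (and of parallelism) for you.

```python
def multipart_upload(s3_client, path, bucket, key,
                     part_size=50 * 1024 * 1024, max_retries=3):
    """Upload a non-empty file in parts, retrying only the part that failed.

    `s3_client` is assumed to be a Boto3 S3 client (boto3.client('s3')).
    Returns the number of parts uploaded.
    """
    upload_id = s3_client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        with open(path, "rb") as f:
            part_number = 1
            while True:
                chunk = f.read(part_size)
                if not chunk:
                    break
                for attempt in range(1, max_retries + 1):
                    try:
                        response = s3_client.upload_part(
                            Bucket=bucket, Key=key, UploadId=upload_id,
                            PartNumber=part_number, Body=chunk,
                        )
                        break  # this part succeeded, move on to the next one
                    except Exception:
                        # Simplification: a real script would catch botocore's
                        # ClientError and connection errors specifically.
                        if attempt == max_retries:
                            raise
                # S3 needs each part's number and ETag to assemble the object
                parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
                part_number += 1
        s3_client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so that already-uploaded parts do not keep accruing storage charges
        s3_client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return len(parts)
```

Only the failing part is re-sent on an error; parts that already succeeded stay on S3 until the upload is completed or aborted.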
Writing the Python Script
Let's create a Python script that utilizes Boto3 to upload a large file to S3 in a multipart fashion.
```python
import boto3
from boto3.s3.transfer import TransferConfig

# Set your AWS credentials and region
aws_access_key_id = 'YOUR_ACCESS_KEY_ID'
aws_secret_access_key = 'YOUR_SECRET_ACCESS_KEY'
region_name = 'YOUR_REGION'

# Set your S3 bucket and object key
bucket_name = 'YOUR_BUCKET_NAME'
object_key = 'your-prefix/your-large-file.tar.gz'

# Specify the local file to upload
local_file_path = 'path/to/your-large-file.tar.gz'

# Set the desired part size and number of threads
part_size_mb = 50  # You can adjust this based on your requirements
num_threads = 10

# Create an S3 client
s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
)

# Create a TransferConfig object.
# multipart_threshold is the file size above which multipart upload is used;
# multipart_chunksize is the size of each uploaded part.
transfer_config = TransferConfig(
    multipart_threshold=part_size_mb * 1024 * 1024,
    multipart_chunksize=part_size_mb * 1024 * 1024,
    max_concurrency=num_threads,
)

# Create an S3 transfer manager
transfer_manager = boto3.s3.transfer.TransferManager(s3, config=transfer_config)

try:
    # Upload the file using multipart upload (returns a future)
    upload = transfer_manager.upload(local_file_path, bucket_name, object_key)
    # Wait for the upload to complete; result() re-raises any upload error
    upload.result()
    print(f"File uploaded successfully to {bucket_name}/{object_key}")
except Exception as e:
    print(f"Error uploading file: {e}")
finally:
    # Clean up resources
    transfer_manager.shutdown()
```
Understanding the Script
Let's break down the key components of the script:
- AWS Credentials and Configuration: Set your AWS credentials (access key and secret key) and the AWS region where your S3 bucket is located.

```python
# Set your AWS credentials and region
aws_access_key_id = 'YOUR_ACCESS_KEY_ID'
aws_secret_access_key = 'YOUR_SECRET_ACCESS_KEY'
region_name = 'YOUR_REGION'
```

- S3 Bucket and Object Key: Define the target S3 bucket and the object key (path) under which the file will be stored.

```python
# Set your S3 bucket and object key
bucket_name = 'YOUR_BUCKET_NAME'
object_key = 'your-prefix/your-large-file.tar.gz'
```

- Local File Path: Specify the local path of the large file you want to upload.

```python
# Specify the local file to upload
local_file_path = 'path/to/your-large-file.tar.gz'
```

- Part Size and Concurrency: Determine the part size in megabytes and the number of threads to use during the multipart upload. Adjust these values based on your network conditions and requirements.

```python
# Set the desired part size and number of threads
part_size_mb = 50  # You can adjust this based on your requirements
num_threads = 10
```

- Creating S3 Client and TransferConfig: Initialize the Boto3 S3 client and create a TransferConfig object with the specified multipart settings. `multipart_threshold` is the file size above which a multipart upload is used, `multipart_chunksize` is the size of each part, and `max_concurrency` is the number of threads.

```python
# Create an S3 client
s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
)

# Create a TransferConfig object
transfer_config = TransferConfig(
    multipart_threshold=part_size_mb * 1024 * 1024,
    multipart_chunksize=part_size_mb * 1024 * 1024,
    max_concurrency=num_threads,
)
```

- Multipart Upload: Use the TransferManager to initiate a multipart upload of the specified file to the S3 bucket. The `upload` call returns a future, and the manager automatically divides the file into parts and manages the upload process.

```python
# Create an S3 transfer manager and start the upload
transfer_manager = boto3.s3.transfer.TransferManager(s3, config=transfer_config)
upload = transfer_manager.upload(local_file_path, bucket_name, object_key)
```

- Wait for Upload Completion: Wait for the multipart upload to complete before proceeding. Calling `result()` on the returned future blocks until all parts are uploaded and assembled in the S3 bucket, and raises an exception if the upload failed.

```python
# Wait for the upload to complete
upload.result()
```

- Clean Up: Finally, clean up resources by shutting down the TransferManager.

```python
# Clean up resources
transfer_manager.shutdown()
```
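Boto3 also exposes the same managed transfer through the client's `upload_file` method, which accepts a `Config` (a TransferConfig) and a `Callback` that is called with the number of bytes sent in each chunk. As an alternative to driving the TransferManager directly, a sketch with progress reporting might look like the following. `ProgressPercentage` and `upload_with_progress` are illustrative names, and the client and config are assumed to be created as in the script above.

```python
import os
import sys
import threading


class ProgressPercentage:
    """Callable that accumulates the byte counts reported by Boto3 and prints progress."""

    def __init__(self, filename):
        self._filename = filename
        self._size = os.path.getsize(filename)
        self._seen_so_far = 0
        # Callbacks can arrive from several worker threads at once
        self._lock = threading.Lock()

    @property
    def bytes_seen(self):
        return self._seen_so_far

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100 if self._size else 100.0
            sys.stdout.write(
                f"\r{self._filename}: {self._seen_so_far} / {self._size} bytes ({percentage:.2f}%)"
            )
            sys.stdout.flush()


def upload_with_progress(s3_client, local_path, bucket, key, transfer_config=None):
    """Upload via the client's upload_file, reporting progress as bytes are sent."""
    progress = ProgressPercentage(local_path)
    s3_client.upload_file(local_path, bucket, key,
                          Config=transfer_config, Callback=progress)
    return progress
```

With the variables from the script, the call would be `upload_with_progress(s3, local_file_path, bucket_name, object_key, transfer_config)`. Because `upload_file` blocks until the transfer finishes and raises on failure, no separate wait or shutdown step is needed in this variant.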
Running the Script
To run the script:
- Save the script to a file (e.g., `upload_to_s3.py`).
- Open a terminal and navigate to the script's directory.
- Run the script using the command `python upload_to_s3.py`.
Ensure that the AWS credentials have the necessary permissions to perform S3 uploads.
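As a rough guide to what the necessary permissions look like for multipart uploads, the sketch below builds an IAM policy document as a Python dictionary. The bucket name is the placeholder from the script, and the exact set of actions you need may differ (for example, if the bucket uses KMS encryption), so treat it as a starting point rather than a definitive policy.

```python
import json

bucket_name = "YOUR_BUCKET_NAME"  # placeholder, as in the upload script

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Object-level actions: s3:PutObject covers initiating the upload,
            # uploading parts, and completing it
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts",
            ],
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        },
        {
            # Bucket-level action: list uploads that are still in progress
            "Effect": "Allow",
            "Action": ["s3:ListBucketMultipartUploads"],
            "Resource": f"arn:aws:s3:::{bucket_name}",
        },
    ],
}

print(json.dumps(policy, indent=2))
```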
Conclusion
Uploading large files to Amazon S3 using Boto3 in Python becomes a manageable task with the multipart upload feature. By breaking down the file into smaller parts and uploading them concurrently, you can efficiently transfer large datasets to S3. Adjusting parameters such as part size and concurrency allows you to optimize the upload process based on your specific requirements. Incorporating this approach into your workflow facilitates the seamless transfer of large files to the cloud, unlocking the full potential of Amazon S3 for scalable and reliable storage.