
Uploading Large Files Up to 5TB to Amazon S3 using Boto3 in Python


By abdulmumin yaqeen

on January 31, 2024




Amazon Simple Storage Service (S3) is a widely used cloud storage service that allows users to store and retrieve any amount of data at any time. Uploading large files, especially those approaching the terabyte scale, can be challenging: a single PUT request is limited to 5 GB, while an S3 object can be as large as 5 TB. Boto3, the AWS SDK for Python, provides a powerful and flexible way to interact with S3, including handling large file uploads through its multipart upload feature.

Prerequisites

Before we begin, make sure you have the following:

  1. AWS Account: You need an AWS account with appropriate permissions to access S3.

  2. Boto3 Installation: Install Boto3 by running pip install boto3 in your terminal.

  3. AWS Credentials: Set up your AWS credentials, either by configuring the AWS CLI (aws configure) or directly within your script. A quick way to verify that Boto3 can find them is shown below.
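
If you want to confirm that Boto3 can actually see your credentials before uploading anything, a minimal sanity check (a sketch assuming the default credential chain is configured, e.g. via aws configure or environment variables) is to ask STS who you are:

import boto3

# Uses the default credential chain: environment variables,
# ~/.aws/credentials, or an attached IAM role
sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(identity['Account'], identity['Arn'])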

Why Multipart?

Benefits

S3 Multipart Upload is beneficial for handling large files efficiently. Here are key reasons to use it:

  1. Efficiency for Large Files: Splits large files into smaller parts for better handling.

  2. Resilience to Failures: Reduces the risk of failure by allowing resumption from the point of interruption.

  3. Parallel Uploads: Speeds up uploads by enabling parallel uploading of file parts.

  4. Optimal for Unstable Connections: Minimizes the impact of network failures by retrying only the failed parts.

  5. Support for Transfer Acceleration: Compatible with S3 Transfer Acceleration for faster uploads (see the sketch after this list).

  6. SDK Support: AWS SDKs offer built-in support, simplifying implementation.

  7. Concurrency Control: Allows control over the number of parallel uploads.
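
On the Transfer Acceleration point: acceleration is a property of the bucket, and once it is enabled the client only needs to be told to use the accelerate endpoint. The snippet below is a sketch, not part of the upload script; the bucket and region values are placeholders.

import boto3
from botocore.config import Config

# One-time bucket setup (shown commented out for reference):
# boto3.client('s3').put_bucket_accelerate_configuration(
#     Bucket='YOUR_BUCKET_NAME',
#     AccelerateConfiguration={'Status': 'Enabled'},
# )

# Client that routes requests through the S3 accelerate endpoint
accelerated_s3 = boto3.client(
    's3',
    region_name='YOUR_REGION',
    config=Config(s3={'use_accelerate_endpoint': True}),
)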

Writing the Python Script

Let's create a Python script that utilizes Boto3 to upload a large file to S3 in a multipart fashion.

import boto3
from boto3.s3.transfer import TransferConfig, create_transfer_manager

# Set your AWS credentials and region
aws_access_key_id = 'YOUR_ACCESS_KEY_ID'
aws_secret_access_key = 'YOUR_SECRET_ACCESS_KEY'
region_name = 'YOUR_REGION'

# Set your S3 bucket and object key
bucket_name = 'YOUR_BUCKET_NAME'
object_key = 'your-prefix/your-large-file.tar.gz'

# Specify the local file to upload
local_file_path = 'path/to/your-large-file.tar.gz'

# Set the desired part size and number of threads
part_size_mb = 50  # You can adjust this based on your requirements
num_threads = 10

# Create an S3 client
s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
)

# Create a TransferConfig object: files larger than multipart_threshold are
# split into parts of multipart_chunksize bytes and uploaded by up to
# max_concurrency threads in parallel
transfer_config = TransferConfig(
    multipart_threshold=part_size_mb * 1024 * 1024,
    multipart_chunksize=part_size_mb * 1024 * 1024,
    max_concurrency=num_threads,
)

# Create an S3 transfer manager
transfer_manager = create_transfer_manager(s3, transfer_config)

try:
    # Upload the file using multipart upload
    upload = transfer_manager.upload(local_file_path, bucket_name, object_key)

    # Wait for the upload to complete (raises an exception if any part failed)
    upload.result()

    print(f"File uploaded successfully to {bucket_name}/{object_key}")
except Exception as e:
    print(f"Error uploading file: {e}")
finally:
    # Clean up resources
    transfer_manager.shutdown()

Understanding the Script

Let's break down the key components of the script:

  1. AWS Credentials and Configuration: Set your AWS credentials (access key and secret key) and the AWS region where your S3 bucket is located.

    # Set your AWS credentials and region
    aws_access_key_id = 'YOUR_ACCESS_KEY_ID'
    aws_secret_access_key = 'YOUR_SECRET_ACCESS_KEY'
    region_name = 'YOUR_REGION'
  2. S3 Bucket and Object Key: Define the target S3 bucket and the object key (path) under which the file will be stored.

    # Set your S3 bucket and object key
    bucket_name = 'YOUR_BUCKET_NAME'
    object_key = 'your-prefix/your-large-file.tar.gz'
  3. Local File Path: Specify the local path of the large file you want to upload.

    # Specify the local file to upload
    local_file_path = 'path/to/your-large-file.tar.gz'
  4. Part Size and Concurrency: Determine the part size in megabytes and the number of threads to use during the multipart upload. Adjust these values based on your network conditions and requirements.

    # Set the desired part size and number of threads
    part_size_mb = 50  # You can adjust this based on your requirements
    num_threads = 10
  5. Creating the S3 Client: Initialize the Boto3 S3 client with your credentials and region.

    # Create an S3 client
    s3 = boto3.client(
        's3',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region_name,
    )
  6. Multipart Upload Configuration: Create a TransferConfig object that defines when a multipart upload is used (multipart_threshold), how large each part is (multipart_chunksize), and how many parts are uploaded in parallel (max_concurrency), then build a transfer manager from it. The transfer manager handles splitting the file into parts and managing the upload process.

    # Create a TransferConfig object and an S3 transfer manager
    transfer_config = TransferConfig(
        multipart_threshold=part_size_mb * 1024 * 1024,
        multipart_chunksize=part_size_mb * 1024 * 1024,
        max_concurrency=num_threads,
    )
    transfer_manager = create_transfer_manager(s3, transfer_config)
  7. Wait for Upload Completion: transfer_manager.upload() starts the multipart upload and returns a future. Calling result() on that future blocks until every part has been uploaded and assembled into the final object in S3, and raises an exception if any part failed.

    # Upload the file and wait for the upload to complete
    upload = transfer_manager.upload(local_file_path, bucket_name, object_key)
    upload.result()
  8. Clean Up: Finally, clean up resources by shutting down the transfer manager, which stops its worker threads.

    # Clean up resources
    transfer_manager.shutdown()
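
For multi-gigabyte or terabyte uploads it also helps to see progress while the transfer runs. The transfer manager accepts a list of subscribers; the ProgressSubscriber class below is an illustrative sketch (not part of the script above), built on BaseSubscriber from s3transfer, the library Boto3 uses under the hood:

import os
import sys
import threading

from s3transfer.subscribers import BaseSubscriber

class ProgressSubscriber(BaseSubscriber):
    """Prints a running percentage as parts of the file are uploaded."""

    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()  # on_progress is called from worker threads

    def on_progress(self, future, bytes_transferred, **kwargs):
        with self._lock:
            self._seen_so_far += bytes_transferred
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(f"\r{self._filename}: {percentage:.2f}% uploaded")
            sys.stdout.flush()

# Attach it when starting the upload:
# upload = transfer_manager.upload(local_file_path, bucket_name, object_key,
#                                  subscribers=[ProgressSubscriber(local_file_path)])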

Running the Script

To run the script:

  1. Save the script to a file (e.g., upload_to_s3.py).
  2. Open a terminal and navigate to the script's directory.
  3. Run the script using the command python upload_to_s3.py.

Ensure that the AWS credentials you use have the permissions needed to upload to the bucket; at a minimum, s3:PutObject on the target bucket and prefix.
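
Once the script reports success, you can confirm the object actually landed in the bucket without downloading it. The check below is a sketch using head_object, reusing the placeholder bucket, key, and region from the script:

import boto3

s3 = boto3.client('s3', region_name='YOUR_REGION')

# head_object returns the object's metadata without downloading its contents
response = s3.head_object(
    Bucket='YOUR_BUCKET_NAME',
    Key='your-prefix/your-large-file.tar.gz',
)
print("Size in bytes:", response['ContentLength'])
print("ETag:", response['ETag'])  # multipart ETags end in "-<number of parts>"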

Conclusion

Uploading large files to Amazon S3 using Boto3 in Python becomes a manageable task with the multipart upload feature. By breaking down the file into smaller parts and uploading them concurrently, you can efficiently transfer large datasets to S3. Adjusting parameters such as part size and concurrency allows you to optimize the upload process based on your specific requirements. Incorporating this approach into your workflow facilitates the seamless transfer of large files to the cloud, unlocking the full potential of Amazon S3 for scalable and reliable storage.
