Amazon S3 and Glacier
The last few years have seen a tremendous upsurge in the number of companies using cloud services, and the clear winner among providers is Amazon Web Services (AWS). AWS is the first choice for many cloud-based business solutions and holds a market share well ahead of its closest competitors.
This is a transition period in which large, established firms are moving to cloud solutions. For startups and mid-sized companies that are already familiar with the cloud, it is a time for designing solutions on top of the existing AWS services.
We at Factweavers Technologies fall into the second category. We have architected, designed and implemented several solutions: some include AWS services alongside other components, while others are purely serverless and consist only of AWS services.
In our experience of architecting AWS-based solutions for both big and small clients, we have identified some common areas where, irrespective of the nature of the client, misconceptions or a lack of clarity led to wrong design patterns.
One such common issue is the choice between Amazon S3 and Glacier for storage. In this blog, I explain the distinctions to keep in mind when choosing between Amazon S3 and Glacier in your AWS architecture.
It is worth revisiting the basic properties of a storage system whenever a solution involves storing data.
Data availability is one of the prime factors when considering a storage solution. Availability means the data can be accessed, within the required performance levels, even in the most disastrous scenarios.
The burden of this falls mainly on the hardware, which must be able to deliver the data while meeting the performance criteria.
Durability of a data storage system is its ability to protect the data from being lost or corrupted over long periods. The typical gold standard for durability is eleven 9s (99.999999999%).
Durability is essentially different from availability because the two address different problems. A solution that focuses on availability does not necessarily meet high-end durability standards.
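To make the two measures concrete, here is a small back-of-the-envelope sketch. It reads eleven 9s of durability as the yearly survival probability of a single object, which is an assumption made for illustration, and the 99.99% availability figure is a hypothetical value, not one taken from this article. The function names are ours.

```python
from fractions import Fraction

# Eleven 9s of durability, kept as an exact fraction to avoid float rounding.
DURABILITY = 1 - Fraction(1, 10**11)  # 99.999999999%


def expected_annual_losses(object_count, durability=DURABILITY):
    """Expected number of objects lost per year.

    Assumption: durability is the probability that one object
    survives one year, independently of the others.
    """
    return object_count * (1 - durability)


def yearly_downtime_minutes(availability_percent):
    """Minutes per 365-day year the data may be unreachable."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return unavailable * 365 * 24 * 60


losses = expected_annual_losses(10_000_000)
print("Expected objects lost per year:", float(losses))
print("Years per lost object:", 1 / losses)
# 99.99 is a hypothetical availability figure used only as an example.
print("Downtime minutes per year:", float(yearly_downtime_minutes("99.99")))
```

Under these assumptions, ten million stored objects work out to one expected loss every 10,000 years, while 99.99% availability still allows a little under an hour of unreachable data per year. The first number is about whether the data survives; the second is about whether you can reach it right now.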
Amazon Simple Storage Service (S3)
Amazon Simple Storage Service, or simply S3, launched in 2006, is a data storage service used to store and retrieve data from the cloud. S3 took the internet world by storm and has become almost synonymous with cloud data storage.
The primary reason behind the popularity of S3 is that it significantly brought down the cost of data storage for every segment of users, from individuals to enterprises.
S3 provides a host of features such as encryption, versioning and scheduling, which make it a very versatile data storage solution.
Although S3 is a storage service, it is designed with availability in mind. This means data saved in S3 can be retrieved quickly, and retrieval requests in S3 are considerably cheaper than in archive-oriented solutions such as Glacier.
S3 is used as a data storage option in a wide variety of cases. The most common use cases are listed below:
1. One of the most common use cases of Amazon S3 is the large-scale backup of files and other data in organisations.
3. Backups of primary data stores such as Elasticsearch. In the case of Elasticsearch, there are options to take snapshots and push them to S3.
4. Another important use is in conjunction with CloudFront to host static websites.
5. Since it supports encryption, individuals and firms alike use it for storing important configuration files.
Amazon Glacier
Amazon Glacier is younger than Amazon S3 and aims at long-term, durable data storage. Durable here means that the data stored in Glacier is not intended to be retrieved frequently but is to be kept for long periods of time. In effect, data in Glacier moves very slowly compared to data in S3, hence the name Glacier.
By design and concept, Glacier gives more weight to data retention time than to reducing retrieval latency. This is reflected in the pricing too: storing data in Glacier is much cheaper than in S3, whereas retrieval requests from Glacier are priced considerably higher.
For these reasons, data stored in Glacier should be static, meaning it should not change. Storing active or frequently changing data in Glacier can result in higher costs.
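The trade-off between cheap storage and expensive retrieval can be sketched with a tiny cost model. The prices below are placeholders invented for illustration and are NOT real AWS pricing, and the tier and function names are our own; only the shape of the comparison matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTier:
    name: str
    storage_per_gb_month: float  # cost of keeping 1 GB for one month
    retrieval_per_gb: float      # cost of reading 1 GB back


# Placeholder prices for illustration only, NOT real AWS pricing.
HOT = StorageTier("hot (S3-like)", storage_per_gb_month=0.025, retrieval_per_gb=0.0)
COLD = StorageTier("cold (Glacier-like)", storage_per_gb_month=0.005, retrieval_per_gb=0.03)


def monthly_cost(tier, stored_gb, retrieved_gb):
    """Monthly bill for a tier: storage cost plus retrieval cost."""
    return tier.storage_per_gb_month * stored_gb + tier.retrieval_per_gb * retrieved_gb


def cheaper_tier(stored_gb, retrieved_gb):
    """Return whichever tier costs less for this access pattern."""
    return min((HOT, COLD), key=lambda tier: monthly_cost(tier, stored_gb, retrieved_gb))


# Rarely read data favours the cold tier, heavily read data the hot tier.
print(cheaper_tier(stored_gb=1000, retrieved_gb=10).name)
print(cheaper_tier(stored_gb=1000, retrieved_gb=1000).name)
```

With these made-up numbers, 1000 GB that is almost never read is cheaper in the cold tier, while the same 1000 GB read back in full every month is cheaper in the hot tier. Plug in current prices from the AWS pricing pages before drawing any real conclusion.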
The basic unit of data storage in Glacier is called an archive. Each archive has an auto-generated key which can later be used to retrieve the data.
1. The most common use case for Glacier is log data. The relevance of log data decreases with time, so old logs are almost entirely archive material.
2. Long-term data storage for enterprises, where requirements or recommendations dictate that collected data be stored or archived for years even though it is not used.
3. In some cases, the source files used for processing and extracting information must be kept long term for various reasons. Here too the data is cold, so Glacier is the preferred choice.
S3 and Glacier in conjunction
So far we have seen the separate use cases of S3 and Glacier. Let us now explore a solution where they are used together.
Having been in the log parsing and analytics industry for quite a while, Factweavers has seen an often-repeated pattern that clearly illustrates the difference between the need for an S3-based solution and the need for a Glacier-based one.
Most clients who approach us for a log parsing and analytics solution using the Elasticsearch-Logstash-Kibana stack share one common scenario. Most logging information, irrespective of the use case, is time-based data. Our strategy is to implement the solution so that the data gets pushed into time-based indices.
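As a rough illustration of time-based indices, the sketch below maps an event timestamp to a daily index name. The `logs` prefix and the `YYYY.MM.dd` suffix are assumptions chosen for the example, and the function name is ours; your own naming scheme may differ.

```python
from datetime import datetime, timezone


def index_name_for(timestamp, prefix="logs"):
    """Return the daily index an event with this timestamp belongs to.

    Assumption: indices are named '<prefix>-YYYY.MM.DD' in UTC.
    Pass a timezone-aware datetime so the UTC conversion is unambiguous.
    """
    utc_time = timestamp.astimezone(timezone.utc)
    return f"{prefix}-{utc_time:%Y.%m.%d}"


event_time = datetime(2024, 1, 15, 23, 59, tzinfo=timezone.utc)
print(index_name_for(event_time))
```

Because every event lands in the index for its own day, a whole day of old data can later be backed up or dropped as one unit, which is what the rest of this strategy relies on.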
This is because, in nearly all logging solutions, the significance or relevance of the data goes down as it gets old, and the older indices are rarely used. What we do here is take snapshots of these indices and push them to S3 as an initial backup.
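A minimal sketch of the snapshot step could look like the following. It only builds the request for Elasticsearch's create-snapshot API and does not send it. It assumes daily indices named like `logs-2024.01.15` and an S3-backed snapshot repository already registered under the hypothetical name `s3_backup`; the helper names are ours.

```python
import json
from datetime import date, datetime, timedelta


def index_date(index_name):
    """Parse the date suffix of a daily index named like 'logs-2024.01.15'."""
    suffix = index_name.rsplit("-", 1)[1]
    return datetime.strptime(suffix, "%Y.%m.%d").date()


def indices_older_than(index_names, days, today):
    """Return the indices whose date is more than `days` days before `today`."""
    cutoff = today - timedelta(days=days)
    return sorted(name for name in index_names if index_date(name) < cutoff)


def snapshot_request(repository, snapshot_name, indices):
    """Build the path and JSON body for a create-snapshot request (PUT)."""
    path = f"/_snapshot/{repository}/{snapshot_name}"
    body = {"indices": ",".join(indices), "include_global_state": False}
    return path, json.dumps(body)


names = ["logs-2024.01.01", "logs-2024.02.20", "logs-2024.03.01"]
old = indices_older_than(names, days=30, today=date(2024, 3, 1))
# "s3_backup" is a hypothetical repository name.
path, body = snapshot_request("s3_backup", "snap-2024.03.01", old)
print("PUT", path)
print(body)
```

The 30-day threshold is only an example. In practice the cut-off depends on how long the application still queries its older indices.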
The data in the S3 bucket is kept there for quite some time (usually 3-6 months, depending on the application), after which the bucket is configured to move it into Glacier. This strategy makes use of the best of both worlds and has proved very effective.
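One way to express the move from S3 to Glacier is an S3 lifecycle rule. The sketch below only builds the rule as a plain dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The `snapshots/` prefix, the rule ID and the 90-day figure are assumptions for the example, not values from our deployments.

```python
def glacier_transition_rule(prefix, days_in_s3, rule_id="archive-old-snapshots"):
    """Build a lifecycle configuration that moves objects under `prefix`
    to the GLACIER storage class after `days_in_s3` days.

    The rule ID and prefix are hypothetical names chosen for illustration.
    """
    if days_in_s3 < 1:
        raise ValueError("days_in_s3 must be at least 1")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_in_s3, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Keep snapshots in S3 for roughly three months, then archive them.
config = glacier_transition_rule("snapshots/", days_in_s3=90)
print(config)
```

Once such a rule is attached to the bucket, the transition happens on the S3 side, so no separate job is needed to copy the data across.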
S3 or Glacier – Points to consider
The important takeaway from our discussion is a few basic recommendations for choosing the best data storage option, listed below:
1. If the data is hot or changes frequently, S3 is the choice you should make.
2. If the solution requires low-latency data retrieval, S3 is the recommended choice.
3. If the data is subject to little or no change and the retention time is long, the option you should go for is Glacier.
4. In solutions where you want both short-term storage with low latency and, later, long retention with little or no change in the data, as we saw in the Elasticsearch log management scenario, you should consider a hybrid solution.
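The recommendations above can be condensed into a small helper. This is only a sketch of the rule of thumb, not a sizing tool: the function and parameter names are ours, and falling back to S3 when none of the conditions apply is an assumption, since the list does not cover that case.

```python
def recommend_storage(hot_or_changing, needs_low_latency, long_retention_later):
    """Map the recommendations to 'S3', 'Glacier' or 'hybrid'."""
    needs_s3_now = hot_or_changing or needs_low_latency
    if needs_s3_now and long_retention_later:
        # Point 4: fast access first, long retention afterwards.
        return "hybrid"
    if needs_s3_now:
        # Points 1 and 2: hot data or low-latency retrieval.
        return "S3"
    if long_retention_later:
        # Point 3: static data kept for a long time.
        return "Glacier"
    # Assumption: default to S3 when nothing else applies.
    return "S3"


# The Elasticsearch log scenario: quick restores at first, years of retention later.
print(recommend_storage(hot_or_changing=False, needs_low_latency=True, long_retention_later=True))
```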
In this blog, I have explained how Amazon S3 and Glacier differ from each other, the best use cases for each on its own, and where they can be used in conjunction to provide better cloud solutions.