Reddit Down

April 27, 2011 by USA Post 

As reported in SYS-CON and elsewhere, the Amazon cloud crashed, taking sites such as Reddit, Foursquare, Quora, Hootsuite, Indaba, GroupMe, and Scvngr, along with a few others, down with it.

As reported, various parts of the Amazon cloud portfolio, such as EC2, Elastic Block Store (EBS), Relational Database Service (RDS), Elastic Beanstalk, CloudFormation, and Elastic MapReduce, were all affected.

Amazon has so far provided the following explanation of the outage:

“A networking event triggered a large amount of re-mirroring of EBS [Elastic Block Store] volumes … This re-mirroring created a shortage of capacity … which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes.”

Although this problem has probably been resolved by now, it has had a significant impact on cloud adoption by large enterprises. However, traditional high-availability best practices remain just as valid in the cloud, and this incident should not be seen as a failure of the cloud itself; it says more about how applications are built. The following best practices help cloud applications stay available on top of the high-availability hosting offered by a cloud provider such as Amazon.

Ensure application-driven scalability
Amazon provides components such as Auto Scaling, Elastic Load Balancing, and CloudWatch. These help scalability by monitoring resource utilization and automatically allocating new instances.

However, this works only if the application is aware of its own usage and scales accordingly.

One pattern for this is a routing server, where attributes of the request, such as user type, geography, or type of transaction, determine the destination that processes the request, and load is balanced accordingly.

Making these data-aware routing rules configurable without restarting servers goes a long way toward redirecting traffic to specific servers when regions or availability zones go down for unforeseen reasons. It also allows the scalability rules to be altered dynamically in catastrophic situations, so that high-priority transactions can still be served while low-priority transactions are put on hold.
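
As a rough illustration of the routing idea above, the following Python sketch shows a router whose rules can be replaced at runtime. Everything in it is an assumption made for illustration: the names RoutingRules and Router, the zone labels, and the numeric priority scheme do not come from Amazon or from any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingRules:
    """Routing configuration; every name and value here is illustrative."""
    zone_by_geography: dict                      # e.g. {"us": "zone-a"}
    unavailable_zones: set = field(default_factory=set)
    fallback_zone: str = "zone-b"
    min_priority: int = 0                        # lower priorities are held


class Router:
    def __init__(self, rules):
        self.rules = rules

    def reload(self, rules):
        # Swap the rule set in place; the routing server keeps running.
        self.rules = rules

    def route(self, geography, priority):
        rules = self.rules
        if priority < rules.min_priority:
            return "ON_HOLD"
        zone = rules.zone_by_geography.get(geography, rules.fallback_zone)
        if zone in rules.unavailable_zones:
            zone = rules.fallback_zone
        if zone in rules.unavailable_zones:
            return "ON_HOLD"                     # nowhere healthy to send it
        return zone


router = Router(RoutingRules({"us": "zone-a", "eu": "zone-b"}))
normal = router.route("us", priority=1)

# Operators push new rules after zone-a fails: reroute and shed low priority.
router.reload(RoutingRules({"us": "zone-a", "eu": "zone-b"},
                           unavailable_zones={"zone-a"}, min_priority=5))
print(normal, router.route("us", priority=9), router.route("us", priority=1))
```

In a real deployment the rules would more likely be loaded from a shared configuration store than constructed in code, but the point is the same: the routing decision reads the current rules on every request, so changing them does not require a restart.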

Stay Disconnected
A typical application consists of several software and hardware components, and it is best to decouple them so that each layer interacts with the next asynchronously.

While some applications, such as banking, stock trading, and online reservations, require real-time, connected behavior, most applications today can still take advantage of a disconnected architecture.

With reliable messaging and a request/response framework, end users are never aware that their request has been queued; instead, they feel that the request has been handled and receive a satisfactory response. This ensures that even if some physical servers or software components go down, the end user is not affected.
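
The queued request/response idea can be sketched as follows. This is a minimal in-memory illustration, not a reliable messaging system: the standard-library queue stands in for a durable message broker, and the RequestGateway class and its method names are hypothetical.

```python
import queue
import uuid


class RequestGateway:
    """Accepts requests immediately and processes them later.

    The in-memory queue is a stand-in for a durable broker (assumption).
    """

    def __init__(self):
        self.pending = queue.Queue()
        self.results = {}

    def submit(self, payload):
        # The user gets an acknowledgement right away, even if no back-end
        # worker is currently running.
        ticket = str(uuid.uuid4())
        self.pending.put((ticket, payload))
        return {"ticket": ticket, "status": "accepted"}

    def drain(self, handler):
        """Process everything queued so far; returns the number handled."""
        processed = 0
        while True:
            try:
                ticket, payload = self.pending.get_nowait()
            except queue.Empty:
                return processed
            self.results[ticket] = handler(payload)
            processed += 1

    def status(self, ticket):
        return self.results.get(ticket, "pending")


gateway = RequestGateway()
ack = gateway.submit({"order_id": 42})        # back end may be down here
print(ack["status"], gateway.status(ack["ticket"]))

# Later, when a worker is available again, the backlog is processed.
gateway.drain(lambda payload: f"order {payload['order_id']} confirmed")
print(gateway.status(ack["ticket"]))
```

The user-facing call returns as soon as the request is queued, so a back-end outage delays processing but does not surface as an error to the end user.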

Keep transactions small
The best way to ensure transparent failure recovery is to keep transactions as small as possible, with each one completing a meaningful logical step in the overall process from the end user's perspective.

Recall the legacy applications of an earlier era that accepted transaction data across multiple pages of fields and had a single SAVE button: if anything went wrong, the end user lost all the data and had to re-enter it. This must be avoided at all costs. Systems should be designed as a combination of smaller logical steps that are loosely bound together.
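
One way to picture this is a multi-page form that saves each page as its own small transaction. The sketch below assumes a dict-like persistent store and uses hypothetical names (StepwiseForm, save_step, resume_point); it only illustrates the principle.

```python
class StepwiseForm:
    """Saves each logical step separately instead of one big SAVE."""

    def __init__(self, steps, store):
        self.steps = list(steps)   # ordered step names
        self.store = store         # dict-like persistent store (assumption)

    def save_step(self, session_id, step, data):
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        # Each call is a small, complete transaction from the user's view.
        self.store.setdefault(session_id, {})[step] = dict(data)

    def resume_point(self, session_id):
        """Return the first step not yet saved, or None if all are done."""
        saved = self.store.get(session_id, {})
        for step in self.steps:
            if step not in saved:
                return step
        return None


store = {}   # stands in for a database that survives a server failure
form = StepwiseForm(["contact", "address", "payment"], store)
form.save_step("session-1", "contact", {"name": "Ada"})

# A new server instance picks up the same store after a failure.
recovered = StepwiseForm(["contact", "address", "payment"], store)
print(recovered.resume_point("session-1"))
```

After the simulated failure, the user resumes at the address step and the contact details entered earlier are still there.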

VEET user input
In a disconnected environment, end users are not available to correct data entry errors or supply additional information, so systems are more fault tolerant when the user is asked to enter a minimum of data and the VEET (Validate, Enrich, Extract, Transform) pattern is applied to that input.

Validate: Once transaction entries have been validated and accepted, they flow as meaningful information through the system components without any need to correct the data.

Extract: Never ask for information that can be derived; this avoids mistakes in data that is already known.

Enrich: Derive information from existing information so that the user need not enter it. For example, if the user enters a ZIP code, the city, state, and other details can be retrieved automatically.

Transform: Convert the data from one form into another form that is meaningful for the flow through the system.

The above steps ensure that we can recover gracefully from failures, which will be transparent to the user.
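
The four VEET steps can be sketched as a small pipeline. The field names, the ZIP lookup table, and the output shape below are all assumptions chosen for illustration; a real system would use its own validation rules and reference data.

```python
# Hypothetical lookup table; a real system would query a reference service.
ZIP_DIRECTORY = {"10001": ("New York", "NY"), "94105": ("San Francisco", "CA")}


def validate(entry):
    """Reject bad input up front so later stages never need corrections."""
    zip_code = str(entry.get("zip", "")).strip()
    if len(zip_code) != 5 or not zip_code.isdigit():
        raise ValueError("zip must be five digits")
    quantity = int(entry["quantity"])
    unit_price = float(entry["unit_price"])
    if quantity <= 0 or unit_price < 0:
        raise ValueError("quantity and unit_price must be positive")
    return {"zip": zip_code, "quantity": quantity, "unit_price": unit_price}


def extract(record):
    """Derive the total instead of asking the user to type it."""
    total = round(record["quantity"] * record["unit_price"], 2)
    return {**record, "total": total}


def enrich(record):
    """Fill in city and state from the ZIP code."""
    city, state = ZIP_DIRECTORY.get(record["zip"], (None, None))
    return {**record, "city": city, "state": state}


def transform(record):
    """Convert to the shape downstream components expect (assumed here)."""
    if record["city"]:
        ship_to = f"{record['city']}, {record['state']} {record['zip']}"
    else:
        ship_to = record["zip"]
    return {"ship_to": ship_to, "amount_cents": int(round(record["total"] * 100))}


def veet(entry):
    return transform(enrich(extract(validate(entry))))


print(veet({"zip": "10001", "quantity": "3", "unit_price": "19.99"}))
```

The user types only three values; everything else is derived, so a request that has been queued for later processing carries no data that needs a human to fix it.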

Back up data at the lowest granularity for recovery
Storage mechanisms such as Amazon EBS (Amazon Elastic Block Store) have built-in fault tolerance, in that volumes are automatically replicated. This is a nice feature. But because the data is backed up as raw volumes, we should also consider the ability to recover quickly and move on after a disaster.

Database instances take some time to recover pending transactions or roll them back; proper backup mechanisms can help you recover from this situation quickly.

The following options may be considered in order to quickly recover from a disaster.

Alternate write mechanism: Log shipping, a standby database, or simply mirroring data to other availability zones is one of the best mechanisms for keeping databases in sync and recovering quickly when a zone is unavailable.

Implicit raw volume backup: Most cloud platforms provide this out of the box, but the intelligence to quickly restore raw volumes with automated scripts must be in place.
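
The alternate write mechanism above can be pictured with a toy log-shipping model. The classes ZoneDatabase and LogShipper are hypothetical and entirely in-memory; real log shipping is a database feature, and this sketch only shows the idea of replaying a write log on a standby in another zone.

```python
class ZoneDatabase:
    """A toy key-value database living in one availability zone."""

    def __init__(self, zone):
        self.zone = zone
        self.data = {}
        self.applied = 0          # number of log entries applied so far

    def apply(self, entry):
        key, value = entry
        self.data[key] = value
        self.applied += 1


class LogShipper:
    """Writes go to the primary and are logged for later shipping."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.log = []

    def write(self, key, value):
        entry = (key, value)
        self.log.append(entry)
        self.primary.apply(entry)

    def ship(self):
        """Send the standby every entry it has not applied yet."""
        shipped = 0
        for entry in self.log[self.standby.applied:]:
            self.standby.apply(entry)
            shipped += 1
        return shipped


primary = ZoneDatabase("zone-a")
standby = ZoneDatabase("zone-b")
shipper = LogShipper(primary, standby)

shipper.write("order-1", "paid")
shipper.write("order-2", "pending")
shipper.ship()

# If zone-a becomes unavailable, zone-b already holds the same records.
print(standby.data == primary.data)
```

Because the standby is kept current at the level of individual records, failing over does not have to wait for a raw volume to be restored and the database on it to finish recovery.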

Share Nothing
From the Amazon incident, it is clear that despite the best availability mechanisms adopted by a cloud provider, on rare occasions a few availability zones can still be hit by a disaster.

In these scenarios, however, we want to make sure that not all users are affected, only a minimal number. This can be achieved by adopting the “Shared Nothing” pattern, in which tenants are logically and physically separated within the cloud ecosystem.

This ensures that a failure in part of the infrastructure will not affect everyone in the system.
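
A small sketch can make the effect of tenant separation concrete. The TenantDirectory class, the tenant names, and the stack labels below are made up for illustration; the sketch assumes each tenant is pinned to exactly one independent stack of servers and storage.

```python
class TenantDirectory:
    """Maps each tenant to its own independent stack (illustrative names)."""

    def __init__(self, placement):
        self.placement = dict(placement)      # tenant -> stack

    def stack_for(self, tenant):
        return self.placement[tenant]

    def affected_tenants(self, failed_stacks):
        """Only tenants placed on a failed stack are impacted."""
        return sorted(tenant for tenant, stack in self.placement.items()
                      if stack in failed_stacks)


directory = TenantDirectory({
    "acme": "stack-zone-a",
    "globex": "stack-zone-b",
    "initech": "stack-zone-c",
})

# A disaster in one zone takes out one stack, not the whole system.
print(directory.affected_tenants({"stack-zone-a"}))
```

With nothing shared between stacks, the blast radius of a zone failure is limited to the tenants placed in that zone, while the others continue unaffected.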

The Amazon event is a wake-up call about how the cloud should be used. The cloud is not an automatic switch that guarantees all the fault tolerance a system needs. Rather, the incident has reinforced the strong fundamental principles with which applications should be built to be resilient. It cannot be seen as a failure of the cloud platform itself, although there is much room for improvement to avoid such situations in the future.


Disclaimer: The views expressed on this site are that of the authors and not necessarily that of U.S.S.POST.

