
Amazon SQS Gotchas

There’s nothing wrong with SQS, but nobody’s perfect either.

In a previous post, I covered the basics of Message Queuing and Amazon's implementation of it: SQS (Simple Queue Service).

Amazon has built SQS with three leading principles in mind: Simplicity, Scalability and Redundancy.

In order to achieve exactly that (and achieve it they have), some concessions and unorthodox design decisions were made, creating a few gotchas that we need to keep in mind when working with SQS.

Some technical background before we dive in

In order to provide the so-called unlimited scalability for a given Queue (represented by the unlimited number of messages that can be placed in it), the operation of the Queue service is divided between a number of servers.

This is done by grouping the messages in the Queue into Blocks, each composed of a certain number of messages, with each Block handled by a separate server.

Note that in order to also achieve Redundancy, a given Block is actually duplicated across multiple servers in multiple AWS Availability Zones, so that no single server failure will result in a loss of messages.

Anyone who has ever tried scaling software on a massive scale (and massive is really the word that comes to mind when thinking of AWS) knows that the #1 challenge, and as such the bane of scalability, is synchronization.

Having multiple servers/services/applications/pieces trying to coordinate and synchronize their actions by definition involves one component spending part of its time waiting for another (the more components, the greater the wait to work ratio) - and down goes the utilization.

That’s why AWS has bent some rules and made concessions as far as synchronization goes in order to enable the massive scalability of their offerings - SQS included (for example: you can have unlimited number of messages in a single SQS Queue).

But these amazing qualities come at a price - and below I will describe what exactly that price may be and how we can live with it.

At least one delivery

Every Message inserted into the queue is duplicated to multiple servers to ensure redundancy. Let's envision a scenario where a server holding a given message fails, and before the server returns to normal operation, a duplicate of the message is successfully received from an alternate server.

In a strictly synchronized environment, the returning server would identify that the message it holds has actually already been delivered and should now be deleted for consistency with the remainder of the servers operating the logical Queue.

However, as previously mentioned, SQS does not offer strict synchronization, and thus the above scenario is likely to result in a re-delivery of the same message - hence the term "at least one delivery".

Usually this event is not overly problematic (especially as it very rarely occurs) - it's just a possibility that needs to be considered and properly handled by the logic of the receiving application (either by knowing there's no harm in a duplicate receive or by placing an applicative fail-safe).
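To make that concrete, here's a minimal sketch of such an applicative fail-safe - a receiver that skips messages it has already processed. It uses the boto3 Python SDK (not something the original examples assume), the queue URL is hypothetical, and the in-memory "seen" set stands in for whatever shared store a real application would use:

    # Sketch only: duplicate-tolerant SQS consumer (boto3 assumed; queue URL illustrative)
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

    seen_ids = set()  # illustrative; a real app would use a shared/durable store

    def process(body):
        print("processing:", body)

    def consume_once():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        for msg in resp.get("Messages", []):
            if msg["MessageId"] in seen_ids:
                # Already handled - a duplicate delivery, so just delete it again.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
                continue
            process(msg["Body"])
            seen_ids.add(msg["MessageId"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])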

Check, recheck and check again

We’ve already established that a Queue is dispersed across multiple servers, so whenever a client performs an action against a Queue, the SQS mechanism redirects it to one of the actual servers operating the Queue.

Again, using simplicity as the enabler for scalability - this SQS load-balancing algorithm does not guarantee that a client request will be redirected to a server that actually contains messages. It is entirely possible to be redirected to one empty server and get a response of "no messages in Queue" while there are still messages in the Queue, simply residing on a different server than the one to which the client was just redirected.

The easy (and only) fix for this behavior is to simply have the client re-check the Queue again and again, even if it reports it is empty - this way we ensure that the numerous queries will be load balanced across all of the servers operating the Queue and will eventually reach the waiting messages, regardless of their current host server.
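A minimal sketch of that re-checking loop follows. Again it assumes boto3; the queue URL, the retry threshold and the one-second pause are arbitrary values chosen purely for illustration:

    # Sketch only: keep polling past empty responses, since "empty" may just mean
    # we hit a server that holds no messages right now.
    import time
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

    EMPTY_POLLS_BEFORE_GIVING_UP = 5  # arbitrary threshold for the example

    def drain_queue():
        empty_polls = 0
        while empty_polls < EMPTY_POLLS_BEFORE_GIVING_UP:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
            messages = resp.get("Messages", [])
            if not messages:
                empty_polls += 1   # might just be an empty server - check again
                time.sleep(1)
                continue
            empty_polls = 0
            for msg in messages:
                print("got:", msg["Body"])
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])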

Not FIFO

Revisiting the SQS Block architecture and the imperfect load-balancing algorithm fronting it, we run into another potential pitfall: contrary to their namesake, SQS Queues are not FIFO (First-In-First-Out).

It is entirely possible (and even quite common) for the AWS scalability mechanism, which redirects messages to servers, to redirect two subsequent messages - which we will label Message A and Message B - to two separate physical servers (which we will also label Server A and Server B).

Continuing our scenario, that same load balancing and scalability mechanism may redirect a client to receive messages first from Server B and then from Server A - resulting in the delivery of the messages in scrambled order.

In plain English: there’s no guarantee that messages will be delivered in the same order in which they were sent - if that’s a problem, then SQS is not a viable platform and other MQ platforms should be considered instead!

8K Message Size Limit

A single message in SQS is limited to 8K in size - plain and simple.

If there is a requirement to transfer more than that to the recipient, we can do one of two things:

  1. Place the data that would otherwise have been sent as a message as a single file in some staging area (S3 is a great choice to this end) and just send its location to the recipient - see the sketch after this list (this still retains the real benefit of dispersing the responsibility to process these files among the recipients).
  2. Simply use a different Message Queueing mechanism (I’ve briefly discussed these in a previous article).
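Here is a minimal sketch of option 1 - staging the oversized payload in S3 and passing only a small pointer through SQS. It assumes boto3, and the bucket name, object key and queue URL are all hypothetical:

    # Sketch only: send a pointer through SQS, keep the big payload in S3.
    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    BUCKET = "my-staging-bucket"                                              # hypothetical
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"   # hypothetical

    def send_large_payload(key, payload_bytes):
        # Upload the oversized payload to the staging area...
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload_bytes)
        # ...and send only a small pointer message through SQS.
        pointer = json.dumps({"bucket": BUCKET, "key": key})
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=pointer)

    def receive_large_payload():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
        for msg in resp.get("Messages", []):
            pointer = json.loads(msg["Body"])
            obj = s3.get_object(Bucket=pointer["bucket"], Key=pointer["key"])
            data = obj["Body"].read()
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            return data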

4 Days Message Retention

This one is actually a feature - most MQ administrators would kill for an SLA guarantee that unread messages will be retained in the Queue for 4 days before being purged.

Keep in mind that the proper usage of SQS is as a pipeline, not an archive, and that under normal circumstances messages should never be left unread for so long (no queuing of messages in the beginning of the month pending end-of-month processing).

Conclusion

SQS is everything we’d expect it to be, being an AWS product with the word Simple in its name.

It’s readily available, simple to use, scalable and incredibly cheap (a bargain at $0.000001 per API call, plus regular data charges which are waived for calls made from EC2 instances).

As discussed in a previous article, using Message Queuing is usually a good idea. This is even more so when designing elastic systems and to that end SQS usually does the job well - just remember to steer away from potential pitfalls and everything will work like clockwork.