Some time ago, we developed a mobile app that manages users’ contacts. The app takes an entry from the user’s contact book, then organizes, labels and updates it with the latest information available.
The process contains operations such as:
Checking if the input is valid: does it have a full name and the fields required for the matching process? Data standardization and potential cleanup.
Matching data against 3rd-party services. Finding the latest contact information available about that person.
Determining if the returned result is valid and usable. If the match is not exact, use Jaro-Winkler distance to determine a fuzzy match.
When there is no match, try comparing it to the nickname table from the DynamoDB database.
Labeling of old, new and confirmed data and integration into one response.
Depending on the input, this can result in using a lot of processing power and time.
The whole process would take approximately 5 seconds to return a result to the client. Keep in mind we’re talking about 1 item, a contact from a contact book.
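To make the fuzzy-matching step from the list above more concrete, here is a minimal, self-contained sketch of Jaro-Winkler similarity (the distance is simply one minus this value). It is not the app’s actual code: the function names, the name normalization and the 0.9 threshold are assumptions chosen for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def is_fuzzy_match(name_a: str, name_b: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: the 0.9 threshold is an assumption, not the app's value."""
    return jaro_winkler(name_a.strip().lower(), name_b.strip().lower()) >= threshold
```

With a check like this, a slightly misspelled name can still be accepted as the same person, while clearly different names fall below the threshold and move on to the nickname lookup.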
After a while, a new requirement was added: the solution had to support companies that manage their contacts in files.
We agreed to develop a web application that would support upload of dataset files, like .csv.
The new web application should support third party integrations. Some of them are the ESP/CRM services, like Mailchimp.
The user does not need to see the result immediately. The system can notify the user later, when the process finishes. This is the key piece of information we leveraged when designing our solution.
So here’s how we did it.
Our initial approach went like this:
The web application would read a .csv file line-by-line and submit an array of entries to the backend REST API.
Processing happens in real time: as soon as the user requests the process to start, they expect a result.
We also applied vertical scaling to our AWS architecture, meaning we used larger, more expensive instances with more RAM and processing power.
This system worked well for smaller datasets, maybe up to 5k entries, but for larger datasets multiple problems arose.
This solution is prone to failure because of long request times. Some 3rd-party services also aren’t optimized for large-scale loads or don’t offer batch endpoints, and they tend to limit usage to protect their infrastructure and avoid a huge bill on their end.
The whole process was done in real time. The request could hang for a long time while the backend did all the work and returned the result to the client. Requests were limited to 300 seconds (at the time of writing) before AWS would cancel them.
The user also had to wait a few minutes to see the results and couldn’t use the app during the process, which makes for a bad user experience.
This did not go well. Besides the problems listed above, the process would break for various other reasons, such as:
insufficient instance memory
internet connection failure, connection timeout
manual interruption by the user (navigating back, closing the tab, etc.)
The process was not resilient to failure. When one of the critical parts failed, we sent an error status to the client. The user then had to start the whole process from the beginning, with no way to retry only the failed operation.
Another reason this wouldn’t work is that it doesn’t scale under heavy usage. Mailchimp has a limit of 10 concurrent connections and 1k contacts retrieved per request (at the time of writing). This is a problem when we want to do more than 10 operations at the same time. With 2 active users each fetching a Mailchimp list of more than 10k contacts, we would need at least 10 requests per user, so 20 or more in total. Sending them all at once would result in Mailchimp rejecting our calls.
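The arithmetic behind this limit can be sketched in a few lines. The two constants are the Mailchimp limits quoted above; the function names are hypothetical and exist only for this illustration.

```python
import math

MAX_CONCURRENT_CONNECTIONS = 10  # Mailchimp limit quoted in the text
CONTACTS_PER_REQUEST = 1_000     # Mailchimp limit quoted in the text


def requests_needed(list_size: int) -> int:
    """Number of requests required to fetch a whole contact list."""
    return math.ceil(list_size / CONTACTS_PER_REQUEST)


def exceeds_connection_limit(list_sizes: list[int]) -> bool:
    """True if fetching every list fully in parallel needs too many connections."""
    total = sum(requests_needed(size) for size in list_sizes)
    return total > MAX_CONCURRENT_CONNECTIONS
```

For example, `exceeds_connection_limit([10_000, 10_000])` is true: two lists of 10k contacts need 20 requests, twice the number of connections allowed at once.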
As mentioned before, the user does not need to see the result immediately. The system can notify the user later, when the process finishes. This information was crucial for designing our solution.
The whole process needed to be split into smaller, simpler operations. To decouple this long-running process from the usual tasks performed by our web application, we opted to use services from AWS.
SQS (Amazon Simple Queue Service) allows you to send, store, and receive messages between software components. It works at any volume, does not lose messages, and does not need other services to be available.
Here we used SQS to decouple services, which makes them easier to manage and helps with retry logic. It’s very helpful when a lot of heavy-duty, batch-oriented processing is required, and it makes the system fault tolerant and resilient.
S3, on the other hand, is a scalable cloud storage service that stores object data within buckets.
Here’s why we chose it:
These two services allowed us to split one long-running process into two separate tasks.
In the foreground we have synchronous tasks, and in the background an asynchronous workflow, which works as follows.
The user starts the fetch process.
The app uploads the .csv document to an S3 bucket.
Once the upload is finished, the app posts a message to SQS containing the URL of the uploaded file.
The app reports success to the user.
The app starts a worker process that reads messages from SQS.
The worker performs the work the messages describe.
After the process finishes, the system can use available asynchronous mechanisms to inform the user (polling, WebSockets).
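The foreground half of this workflow (upload, then enqueue) could look roughly like the sketch below. It assumes the `s3` and `sqs` arguments are boto3 clients, for example `boto3.client("s3")` and `boto3.client("sqs")`; the function name, the message fields and the `s3://` style URL are illustrative assumptions, not our production code.

```python
import json


def submit_dataset(s3, sqs, bucket: str, queue_url: str,
                   local_path: str, key: str) -> str:
    """Upload a .csv file to S3, then post a message that points at it.

    `s3` and `sqs` are expected to behave like boto3 clients; they are passed
    in so the function is easy to test with stand-ins.
    """
    # 1. Store the dataset in the bucket.
    s3.upload_file(local_path, bucket, key)

    # 2. Tell the background workers where the file is (field name is hypothetical).
    body = json.dumps({"file_url": f"s3://{bucket}/{key}"})
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)

    # 3. The caller can now report success to the user right away.
    return body
```

Because the request handler only uploads and enqueues, it returns quickly no matter how long the actual processing takes.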
When a message is read from SQS, it becomes invisible to other consumers for 30 seconds. This is the default visibility timeout, and it can be extended up to 12 hours. While the message is invisible, another worker cannot pick up the same work. The worker is expected to delete the message from SQS once it finishes processing it.
If anything goes wrong, the message is not deleted. It stays in the queue and reappears once the visibility timeout expires, so another worker process can pick it up and perform the work again.
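A worker that follows this receive, process, delete cycle might be sketched as follows. It assumes `sqs` is a boto3 SQS client; `poll_once` and `handle` are hypothetical names, and the batch size and wait time are arbitrary example values.

```python
def poll_once(sqs, queue_url: str, handle, visibility_timeout: int = 30) -> int:
    """Receive one batch of messages and delete only those handled successfully.

    Returns the number of messages that were processed and deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,   # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,       # long polling
        VisibilityTimeout=visibility_timeout,
    )
    processed = 0
    for message in response.get("Messages", []):
        try:
            handle(message["Body"])
        except Exception:
            # Do not delete: the message reappears after the visibility
            # timeout and can be retried by this or another worker.
            continue
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed
```

A real worker would call this in a loop. The important part is that deletion happens only after the handler succeeds, which is what makes the retry behaviour described above work.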
You can, and should, limit the number of these retries so that workers don’t retry a failing message indefinitely. Once the limit is exceeded, the message goes to a dead letter queue, where you can inspect and debug it.
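In SQS the retry limit is set through the queue’s redrive policy. The sketch below only builds the attributes; the helper name and the default of 5 receives are assumptions for illustration.

```python
import json


def redrive_attributes(dead_letter_queue_arn: str,
                       max_receive_count: int = 5) -> dict:
    """Build queue attributes that move a message to the dead letter queue
    after it has been received `max_receive_count` times without being deleted."""
    policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string under the RedrivePolicy attribute.
    return {"RedrivePolicy": json.dumps(policy)}
```

With a boto3 client, these attributes could then be applied with `sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=redrive_attributes(dlq_arn))`, where `queue_url` and `dlq_arn` stand for your own queue and dead letter queue.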
In this solution, each component ran independently of others and failed independently. This increased the system’s overall stability and fault tolerance.
We had to ensure the system scales and copes with heavy usage by many users at the same time.
There was also a possibility that a user would upload a huge dataset that takes much longer to process. We can adjust the system by increasing the message visibility timeout in SQS, but this can reintroduce problems similar to those of our initial solution: the process cannot continue after a failure and has to restart from the beginning. We need to consider how to make the solution even more resilient here.
If a .csv file is large, we can split it into chunks. One way to do this is to create many messages, each indicating the range of rows that make up one chunk to process.
After processing the last chunk, the worker should merge the results into one final file. In the end, the system should notify the user about the final result.
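One way to express this chunking is to compute row ranges up front and put one range into each message. The sketch below is an illustration under assumptions: the function names and the message fields (`file_url`, `start_row`, `end_row`) are hypothetical, and ranges are half-open, so `end_row` is exclusive.

```python
import json


def chunk_ranges(total_rows: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split rows 0..total_rows into half-open (start, end) ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total_rows))
            for start in range(0, total_rows, chunk_size)]


def chunk_messages(file_url: str, total_rows: int, chunk_size: int) -> list[str]:
    """Build one message body per chunk of the uploaded file."""
    return [
        json.dumps({"file_url": file_url, "start_row": start, "end_row": end})
        for start, end in chunk_ranges(total_rows, chunk_size)
    ]
```

If one chunk fails, only its message returns to the queue, so the retry covers that range of rows instead of the whole file.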
A similar optimization applies to 3rd-party services that offer a paginated API. We can take advantage of pagination by storing dataset ranges in SQS messages. Reading the ranges, the system can request the next chunk by passing the start and end index to their APIs.
We also considered horizontal scaling, that is, increasing the number of computing instances. Running extra instances wastes a lot of computing power when there is no demand for it, so we would need to configure autoscaling. Autoscaling also takes some time to come into effect.
Another possible optimization is to use AWS Lambda functions for on-demand instant computation.
AWS Lambda is a serverless computing service which lets you run code without needing to provision servers.
Pros of AWS Lambda
Developers can focus on more important work.
You only pay for the compute time you actually use.
Cons of AWS Lambda
Maximum invocation time of 15 minutes.
Requires more work and training for developers.
Developers need to adjust the code to take advantage of Lambda.
They also need to be careful not to end up in some kind of antipattern solution.
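As a rough idea of what adjusting the code for Lambda could mean here, the sketch below shows a handler for a function that receives SQS messages as its event. It assumes the queue is configured as the function’s event source with partial batch responses (`ReportBatchItemFailures`) enabled; `process_chunk` is a hypothetical placeholder for the real matching work.

```python
import json


def process_chunk(job: dict) -> None:
    """Hypothetical placeholder for the real work on one chunk."""
    if "file_url" not in job:
        raise ValueError("job is missing file_url")


def handler(event, context):
    """Process each SQS record and report which ones failed.

    Records listed in batchItemFailures go back to the queue for a retry;
    the others are removed once the function returns.
    """
    failures = []
    for record in event.get("Records", []):
        try:
            process_chunk(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Each invocation still has to finish within the 15 minute limit, which is one more reason to keep the chunks small.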
With optimization out of the way, let’s conclude.
Web applications usually communicate with backend services by using synchronous APIs.
Asynchronous workflows are desirable when the system needs to perform long-running tasks. We achieved this by using decoupled services like SQS and S3. Using these services, we made the system resilient, scalable and fault tolerant.
This allowed us to separate request ingestion and response from request processing.
SQS and S3 have very simple APIs for a reason: to make them easy to integrate with each other. They are also compatible with many other services available in the AWS cloud.
Skilled in React Native, iOS and backend, Toni has a demonstrated knowledge of the information technology and services industry, with plenty of hands-on experience to back it up. He’s also an experienced Cloud engineer in Amazon Web Services (AWS), passionate about leveraging cloud technologies to improve the agility and efficiency of businesses.
One of Toni’s most special traits is his talent for online shopping. In fact, our delivery guy is convinced that ‘Toni Vujević’ is a pseudonym for all DECODErs.