How do we scale the point of sale solution with serverless and asynchronous processing at Conta Azul?

Roberto Duessmann
Conta Azul Engineering Blog
6 min read · Sep 27, 2018


How do we get performance while maintaining maximum resilience? That was the question we faced when we came across the challenge of scaling our point of sale solution.

While developing a new solution for the retail market, we noticed some specific characteristics we had not dealt with before, such as low tolerance for latency, high sales volume, and the dynamics of a retail sales operation.

Point of sale

In addition, the solution we were developing had to be integrated with the government in order to issue consumer invoices (NFC-e), electronic documents that replace the traditional tax coupon issued by some printers. This is a new modality that the Brazilian government opened to retail businesses to make it easier for them to comply with the country’s current tax rules.

Photo by Rob Bye on Unsplash

We could not, for example, spend 1 minute issuing an invoice and leave the customer standing in line awaiting approval and the completion of the purchase.

Initial Scenario

To launch fast and test the market, we embedded our solution into an implementation that already existed within the application, basically using:

  • JavaEE + Rest API
  • Database-based Jobs
  • Synchronous processing
  • Invoice issuer service
Initial Scenario

When a customer saves a sale, we call a REST API to save the data from that request and start some job schedulers to issue the invoice and fetch the documents (XML and PDF).

Each step (3 in total) sets a new status in the database so that the next processor picks up the request and executes the next step.

At the same time, the front end polls the back end to get the status of the sale and update it for the customer.

When the jobs are finished, the status is set to OK and the front end updates the screen to show the document links and the authorized status.
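The status-driven flow above can be sketched roughly as follows. This is a hypothetical stand-in, not our actual code: the status names are illustrative and an in-memory map stands in for the database table that the jobs scan.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the database-status pipeline: each "job" looks for
// sales in a given status, does its work, and moves them to the next status.
class StatusPipeline {
    // saleId -> status; stands in for the real database table
    static final Map<String, String> DB = new ConcurrentHashMap<>();

    public static void saveSale(String saleId) {
        DB.put(saleId, "PENDING_ISSUE");          // REST API saves the sale
    }

    // Job 1: issue the invoice, then mark it for document retrieval
    public static void issueJob(String saleId) {
        if ("PENDING_ISSUE".equals(DB.get(saleId)))
            DB.put(saleId, "PENDING_DOCUMENTS");
    }

    // Job 2: fetch the XML and PDF, then mark the sale as finished
    public static void documentsJob(String saleId) {
        if ("PENDING_DOCUMENTS".equals(DB.get(saleId)))
            DB.put(saleId, "OK");
    }

    // What the front end polls: the current status of the sale
    public static String poll(String saleId) {
        return DB.get(saleId);
    }
}
```

Each job scheduler only finds work by querying for its input status, which is exactly where the per-step delay and the database load come from.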

We had some bottlenecks with this workflow:

  • Jobs (4 seconds): in each step of the issue flow we lost 4 seconds, the time a job takes to fetch the results from the database and start a new processing run
  • Unnecessary requests (~1 second): even when we already had all the information we needed, we made extra requests to some services
  • Polling (5 seconds): on the customer’s screen, we updated the invoice status only every 5 seconds
  • Integration with government web services (~8 seconds): the time SEFAZ WS takes to authorize and issue the consumer invoice
Initial times

In total, we were spending around 22.88 seconds to save a sale and issue the consumer’s invoice.

Proposed Scenario

Facing this, we realized that in order to scale our solution we would need something more, something different, to give our users resilience while delivering a good experience in terms of time.

After some team discussions, we decided to implement a more asynchronous, event-driven structure to improve resilience between the services involved.

Event notification is nice because it implies a low level of coupling, and is pretty simple to set up. (Martin Fowler)

The new solution is based on SQS messages to control the flow and a serverless Lambda function to receive notifications from the invoice issuer. In addition, we also reduced the screen polling interval from 5 seconds to 1 second.

Proposed Scenario

By removing the coupling between the front-end request and the issuing of the invoice, we started to work with messages between steps.

SQS was the solution we chose to control this flow and send information between steps. Here we had the first relevant result: we reduced the job time to 1 second per step (and without having to worry about database performance).
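Conceptually, the step-to-step messaging looks like the sketch below. To keep the example self-contained and runnable, a `BlockingQueue` stands in for the SQS queue and the JSON payload format is a hypothetical simplification; in the real flow the producer and consumer talk to SQS through the AWS SDK.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of decoupled steps exchanging messages; a local BlockingQueue
// stands in for SQS so the example runs anywhere.
class MessageFlow {
    static final BlockingQueue<String> ISSUE_QUEUE = new LinkedBlockingQueue<>();

    // Producer: instead of setting a status for a polling job to find later,
    // the previous step publishes a message carrying what the next step needs.
    public static void publish(String saleId) {
        ISSUE_QUEUE.offer("{\"saleId\":\"" + saleId + "\"}");
    }

    // Consumer: the next step processes messages as soon as they arrive,
    // at its own pace, instead of waiting for a scheduled database scan.
    public static String consume() {
        return ISSUE_QUEUE.poll();   // null when no message is waiting
    }
}
```

Because each message already carries the data for the next step, no step needs to re-query the database to discover pending work, which is where the per-step time savings come from.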

And what about Lambda? It receives notifications from the invoice issuer, validates and enriches the messages, and puts them in a queue.

The Lambda function

Ok, so let’s talk a little about the code, right?

AWS Lambda lets you run code without provisioning or managing servers. (AWS Documentation)

With a few lines of code, and without worrying about servers, about scaling the solution, or about whether a message will be received, we can deliver a solution that gets notifications from the invoice issuer, one we can trust to always be available to queue messages so that we can process them according to our capacity.

Basically, our Lambda receives a notification from an API (we used API Gateway to trigger it), validates the message in the business context, and, using the AWS library, queues it in SQS so that our back end can fetch the messages and finish the process according to its capacity, totally asynchronously.
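The shape of that handler logic could be sketched as below. This is an illustration, not our production code: the field names and statuses are hypothetical, and a local queue stands in for SQS. The real function implements a handler interface from the AWS Lambda Java libraries and sends messages via the AWS SDK. The one real detail used here is that an NFC-e access key is a 44-digit number, which makes a simple validation possible.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of what the Lambda does with each issuer notification:
// validate it, enrich it into a message, and queue it for the back end.
class NotificationHandler {
    static final Queue<String> OUT_QUEUE = new ArrayDeque<>(); // stands in for SQS

    // NFC-e access keys are 44-digit numbers
    public static boolean isValid(String invoiceKey) {
        return invoiceKey != null && invoiceKey.matches("\\d{44}");
    }

    public static String handle(String invoiceKey, String status) {
        if (!isValid(invoiceKey))
            return "rejected";                    // bad notification: drop it
        // enrich the message before queuing so the back end has all it needs
        String message = "{\"key\":\"" + invoiceKey + "\",\"status\":\"" + status + "\"}";
        OUT_QUEUE.offer(message);                 // hand off; back end consumes later
        return "queued";
    }
}
```

The important property is that the function only validates and queues: the heavy processing stays in the back end, which drains the queue at its own capacity.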

You can find the complete example here: https://github.com/robertoduessmann/lambda-sqs.

A tip for working with Lambda is to use the Serverless Framework. It manages your function in the AWS environment: you can use a YAML file to set a range of Lambda properties, from permission configuration to the initialization trigger.
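A minimal `serverless.yml` along those lines might look like this; the service name, handler class, and queue ARN are hypothetical placeholders, with the `http` event wiring the function to an API Gateway endpoint and an IAM statement granting it permission to send to SQS:

```yaml
# Hypothetical minimal configuration; all names are illustrative.
service: invoice-notifications

provider:
  name: aws
  runtime: java8
  iamRoleStatements:              # permission to queue messages in SQS
    - Effect: Allow
      Action:
        - sqs:SendMessage
      Resource: arn:aws:sqs:*:*:invoice-queue

functions:
  notify:
    handler: com.example.NotificationHandler
    events:
      - http:                     # API Gateway trigger
          path: notify
          method: post
```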

Results

After delivering this new solution and putting it into production, we are now able to offer a better experience for our customers, a truly WOW experience. Our main results were time reduction and resilience.

Results time

Regarding timing, the total average time to save a sale and issue the invoice is now around 9.24 seconds, keeping in mind that this time still includes the integration with the government web services that issue the invoice, which are somewhat slow.

Time comparison

In comparison, we can see that the whole process is much faster: the total time was cut by more than half, from around 22.88 to around 9.24 seconds.

Resilience is another great advantage that event driven architecture brings.

What we gain is greater resilience, since the recipient systems can function if the customer system becomes unavailable. (Martin Fowler)

As we use SQS, all integration between steps is asynchronous, and we also configured Dead-Letter Queues to handle errors.

If some part of the system is offline, for example, the message stays in the queue and is processed once the service is available again.

In addition, we reduced database usage, since we no longer have the recurring queries from our jobs to pick up requests and set statuses.

We still have some opportunities, like replacing polling with a WebSocket or something similar, and running some steps in parallel. We’re always trying to improve.

And what about you? How do you deal with resilience and asynchronous processing? We would be glad to hear how you are dealing with these challenges.
