Consumers have high expectations when shopping online, and website performance is one of them, even if it usually goes unspoken: 47% expect web pages to load in two seconds or less, and nearly 80% won’t return to a website they perceive to be slow. As eCommerce sites grow, so does the need to handle increasing amounts of traffic, and to handle it quickly. This is especially true for fraud detection, where even a minor delay can bring the buying process to a screeching halt. That is why performance is a key concern for us.
We need to provide our customers with a fraud risk assessment as quickly as possible, either immediately after login or in real time during the session. Whidegroup reported that a one-second delay in user verification or transaction processing could reduce conversions by 7% and page views by 11%. We tackle this problem by designing our service for scalability.
Google wisdom: “A scalable application is one that works well with 1 user or 1,000,000 users, and gracefully handles peaks and dips in traffic automatically.”
The Scalability Challenge
Applications must be ready to handle any amount of traffic at any time without sacrificing performance or stability. This is a complex problem in itself. Not only are our clients’ applications and websites required to handle large amounts of data quickly, but service providers like us need to provide the same level of service and reliability. Even more challenging is the need for us to support multiple clients regardless of their individual spikes and lulls in traffic.
Before we get to that stage, we have to test our system under load. Simulating real traffic conditions in a test environment is a delicate balance: the number of users, length of sessions, session activities, and environment settings all have to reflect production realistically.
Challenges that inhibit scalability include:
- The geographic distance between our and our customers’ servers.
- Traffic surges from events like marketing campaigns, flash sales and holidays. Some of these are predictable, but some can occur without warning.
- Multi-tenancy, where cloud computing resources are shared among multiple customers.
- Availability, including our ability to fail over and recover in case of failure.
Online retailers frequently focus more on improving the shopping experience than on scalability. During the most recent peak season, Costco’s website suffered from performance issues that reportedly cost them $11m in sales. Some retailers, including Lululemon and J.Crew, continue to face technical difficulties, many attributed to the inability to complete payments, year after year during the Cyber 5 season. In 2018, J.Crew experienced a website crash during Black Friday that cost an estimated $700,000 in lost sales. Even Amazon failed to scale appropriately for Prime Day, resulting in thousands of reports on downdetector.com. Proper testing reveals bottlenecks and pitfalls in your system. Bob Buffone, CTO at Yottaa, recommends “load testing your site at five times normal traffic volumes” in preparation for peak traffic. When online retailers can’t scale, they are taking a significant risk, and the effect on their bottom line is often immediate.
How These Challenges Influence System Architecture
The more effectively your architecture can scale, the more value it offers to the business. However, there are some considerations to keep in mind when designing a scalable architecture:
Systems are best measured by their slowest components. For example, when monitoring latency, measuring the average latency may return an acceptable value, while measuring the 99th percentile highlights the worst-case scenarios. Targeting and reducing these extreme cases can have a net positive effect for all users.
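To make the gap between average and tail latency concrete, here is a minimal sketch. The latency values are invented for illustration, and `percentile` is a small nearest-rank helper written for this example, not part of any monitoring product.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow outliers.
latencies_ms = [20] * 98 + [1500] * 2

mean_ms = statistics.mean(latencies_ms)   # looks acceptable on a dashboard
p99_ms = percentile(latencies_ms, 99)     # exposes the slow requests

print(f"mean = {mean_ms} ms, p99 = {p99_ms} ms")
```

With this made-up data the mean is 49.6 ms while the 99th percentile is 1500 ms, so a dashboard showing only the average would hide the requests that hurt users most.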
Improving performance often increases costs, especially on managed platforms. Adding vCPUs to a virtual machine, switching from HDDs to SSDs, and adding intermediate caches can have notable performance gains, but also increase your costs significantly. When evaluating an architecture change, understand the full impact it will have on both scalability and your budget.
Scaling is a bi-directional process: systems must scale up to meet increased demand and scale down when demand is low. This avoids consuming excess resources when idle and lowers your overall costs. However, the architecture must be fast and flexible enough to quickly respond to changes in traffic.
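As a rough illustration of scaling in both directions, the sketch below applies the replica rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the CPU figures, and the replica bounds are assumptions made for this example; the article does not state which autoscaling rule is used in production.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10):
    """Return ceil(current_replicas * current_metric / target_metric), clamped to the bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# Hypothetical service with a target of 60% average CPU utilization.
scale_up = desired_replicas(4, current_metric=90, target_metric=60)    # demand is high
scale_down = desired_replicas(4, current_metric=15, target_metric=60)  # demand is low
capped = desired_replicas(8, current_metric=120, target_metric=60)     # hits the upper bound

print(scale_up, scale_down, capped)
```

The same rule grows the deployment from 4 to 6 replicas under load and shrinks it to the minimum of 2 when idle. The upper bound keeps a traffic spike from translating into unbounded cost.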
The SecuredTouch Architecture
When ramping up SecuredTouch, we needed a platform that was flexible, scalable, and easy to automate. That’s why we went with Kubernetes, a container orchestration platform that is well suited to microservice workloads. With Kubernetes, we built our backend application into small, lightweight, portable units (services) and deployed them onto a cluster. As demand increases, we can scale horizontally by replicating services across the cluster, and we can grow the cluster itself by adding new nodes. Kubernetes automates much of this for us by tracking cluster resources, service deployments, and other factors.
Kubernetes also makes continuous deployment much easier. SecuredTouch uses an event-driven architecture, where each service is a fully independent unit and requests are pipelined between services. This means we can scale, restart, or upgrade individual services without affecting the entire pipeline. We can launch new changes without bringing down the application or impacting our customers’ quality of service.
[Figure: The SecuredTouch Architecture]
Our Performance Metrics and KPIs
SecuredTouch constantly monitors and collects KPIs to ensure a high quality of service. These include:
- Event processing latency: the time for a service to receive and respond to a request.
- API latency: the time it takes for the SecuredTouch API to handle a request.
- End-user response latency: the time it takes to communicate with the end user (i.e. the SecuredTouch SDK).
These metrics help us to identify potential failures before they happen and to optimize the system accordingly.
QA and Testing
Before each release, our QA team runs load tests that simulate real user activity. We run these tests with a greater volume of traffic than we expect to handle. This shows us how our application responds to excessively high traffic volumes and whether we need to optimize further.
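A load test of this kind can be sketched in a few lines. Everything here is hypothetical: `handle_request` is a stand-in for the system under test, and the request counts and the multiplier of 5 are illustrative values, not our actual test parameters. A real test would call the deployed service and replay realistic session activity.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(payload):
    """Stand-in for the system under test; a real test would call the service."""
    time.sleep(0.001)  # simulated processing time
    return len(payload)

def timed_call(payload):
    """Invoke the handler and return the observed latency in seconds."""
    start = time.perf_counter()
    handle_request(payload)
    return time.perf_counter() - start

def run_load(expected_requests, multiplier, workers=8):
    """Send multiplier times the expected request volume through a pool of concurrent workers."""
    total = expected_requests * multiplier
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed_call, ["event"] * total))

latencies = run_load(expected_requests=20, multiplier=5)
print(f"{len(latencies)} requests, slowest took {max(latencies) * 1000:.1f} ms")
```

Collecting one latency per request, rather than only a total duration, lets the same run feed percentile calculations and reveal which requests slow down first as volume rises.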
We have a number of alerts that fire if latency exceeds a certain threshold. If performance is slow, our teams are notified as quickly as possible.
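One simple way to express such an alert is a sliding window that fires only when several recent requests breach the threshold, so a single slow request does not page anyone. The class name, the threshold, and the window sizes below are assumptions for illustration, not our production alerting configuration.

```python
from collections import deque

class LatencyAlert:
    """Fires when at least min_breaches of the last `window` samples exceed the threshold."""

    def __init__(self, threshold_ms, window=5, min_breaches=3):
        self.threshold_ms = threshold_ms
        self.min_breaches = min_breaches
        self.recent = deque(maxlen=window)  # True for each sample over the threshold

    def observe(self, latency_ms):
        """Record one sample and return True if the alert should fire."""
        self.recent.append(latency_ms > self.threshold_ms)
        return sum(self.recent) >= self.min_breaches

alert = LatencyAlert(threshold_ms=200)
samples_ms = [100, 250, 300, 120, 400, 90, 80, 70]  # hypothetical latencies
fired = [alert.observe(sample) for sample in samples_ms]
print(fired)
```

With these samples the alert fires on the fifth and sixth observations, once three of the last five requests have exceeded 200 ms, and clears again as fast requests push the slow ones out of the window.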
Autoscaling relies on up-to-date metrics in order to respond quickly to changes in demand. The autoscaling process itself also needs to be quick; otherwise, existing nodes will become saturated while new nodes are still spinning up.
Scalability is a Requirement
Time is precious in eCommerce and fraud detection. When evaluating a potential threat, even a delay of a few seconds can mean the result arrives too late. That’s why we’re driven to return results as quickly as possible, no matter how much traffic you send our way. We’ve built our system using state-of-the-art tools and software engineering practices in order to create a comprehensive fraud detection solution that scales with you. The result: reliable real-time detection of fraudulent activities at any scale, while ensuring your customers can complete their transactions.