How is INDstocks by INDmoney delivering super fast trading experience?


Kausal Malladi

7 min read
Table Of Contents
  • Step 1 : Choice of Language: Why Go?
  • Step 2 : Finding the Right Concurrency for Goroutines
  • Step 3 : Leveraging open-source NATS framework for Ultra-Fast Tick Publishing
  • Step 4 : WebSocket server on top of Gorilla + Redis
  • Step 5 : Independent Microservices over Monoliths
  • Step 6 : Aurora database to our rescue for SCALE
  • Step 7 : Ensuring Fault Tolerance with AWS Multicast Groups
  • Step 8 : Data optimizations and smartness on Frontends
  • Step 9 : Enhancing Packet Routing Strategies
  • Step 10 : The Importance of Client-Side Latencies Over Backend Metrics
  • Challenges in Measuring End-to-End Latencies
  • Conclusion

Have you ever missed a trade because of sluggishness on your trading platform? 

At INDstocks, we take latency for traders very seriously. Here’s how we went about building a super fast trading platform.

 

Step 1 : Choice of Language: Why Go?

One of the foundational decisions was choosing the right programming language. Go (Golang) emerged as the clear winner, primarily due to its:

  • Efficient concurrency model through Goroutines
  • Low memory footprint and garbage collection efficiency
  • Fast execution speed compared to interpreted languages
  • Strong standard libraries that simplify networking and data handling

Just for reference, the optimized Go code that receives ticks from NSE and BSE and publishes them to our WebSocket server, for consumption by our apps and website, runs as a single container on just 2 vCPUs and 1 GB of memory!

Step 2 : Finding the Right Concurrency for Goroutines

While Goroutines are lightweight and efficient, tuning concurrency levels optimally required extensive trial-and-error experiments. We needed to strike a balance between too many Goroutines, which cause context-switching overhead, and too few, which limit throughput. Both extremes introduce latencies that are unacceptable in the stock trading world. Striking the right balance is critical for users to receive price ticks and order updates at lightning speed. Profiling tools such as pprof and trace helped us determine the ideal Goroutine count for maximum throughput with minimal latency.

Constant number of Goroutines at scale

Our Goroutine count stays almost constant throughout market hours, without spikes or aberrations. The same holds even under 2x traffic load, which is common in the stock broking world whenever external business, financial, or political factors drive activity.
 

Step 3 : Leveraging open-source NATS framework for Ultra-Fast Tick Publishing

We adopted the open-source NATS messaging framework to publish price ticks in microseconds, making them immediately available for consumption. NATS’ lightweight design, ability to handle millions of messages per second, and built-in clustering capabilities were critical in ensuring minimal latency in tick delivery. It also complements Go very well, so setting it up takes minimal effort. Tuning the NATS configuration correctly, especially for network performance, is the key to achieving millisecond end-user latency with this setup. NATS also supports clients over WebSockets natively, which makes integration on the mobile apps easier.
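The fan-out that NATS gives us across machines can be sketched in-process with plain channels. This is a stdlib illustration of the publish-subscribe pattern, not the NATS client API, and the subject names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Broker is a minimal in-process stand-in for a pub-sub fan-out.
type Broker struct {
	mu   sync.RWMutex
	subs map[string][]chan []byte
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[string][]chan []byte)}
}

// Subscribe registers a buffered channel for a subject.
func (b *Broker) Subscribe(subject string) <-chan []byte {
	ch := make(chan []byte, 64)
	b.mu.Lock()
	b.subs[subject] = append(b.subs[subject], ch)
	b.mu.Unlock()
	return ch
}

// Publish fans a message out to every subscriber without blocking the
// publisher: a slow consumer drops ticks rather than stalling the feed,
// which is the right trade-off for price data where only the latest
// value matters.
func (b *Broker) Publish(subject string, msg []byte) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[subject] {
		select {
		case ch <- msg:
		default: // subscriber too slow; drop the tick
		}
	}
}

func main() {
	b := NewBroker()
	sub := b.Subscribe("ticks.nse.INFY")
	b.Publish("ticks.nse.INFY", []byte("1500.25"))
	fmt.Println(string(<-sub))
}
```

NATS handles this same fan-out across processes and machines, with clustering and WebSocket transport on top.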

Backend latencies to publish messages to NATS
 

Step 4 : WebSocket server on top of Gorilla + Redis

Providing price ticks and order updates to users at the lowest latencies is of utmost importance. More than 76% of our order updates reach end-user devices in under 50ms from the time we receive them from the exchanges.

Latency distribution of order updates to client devices from exchanges

 

We use Redis pub-sub and a home-grown WebSocket server built on top of Gorilla to receive order update events and stream them out super fast. Switching from a traditional map with explicit locking to sync.Map brought down our infrastructure resource utilization by almost 75%. This works well in cases where writes are huge and reads are comparatively minimal.
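As a sketch of the kind of workload where sync.Map pays off, here is a minimal connection registry. Conn is a placeholder for the real Gorilla *websocket.Conn, and the function names are illustrative, not our actual server code.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a placeholder for a *websocket.Conn in a real server.
type Conn struct{ ID string }

// registry tracks live WebSocket sessions keyed by user ID. sync.Map
// avoids a single mutex becoming a hotspot when many goroutines
// register and deregister connections concurrently.
var registry sync.Map

func register(userID string, c *Conn) { registry.Store(userID, c) }

func deregister(userID string) { registry.Delete(userID) }

func lookup(userID string) (*Conn, bool) {
	v, ok := registry.Load(userID)
	if !ok {
		return nil, false
	}
	return v.(*Conn), true
}

func main() {
	// Simulate many goroutines registering connections concurrently,
	// as happens at market open.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			id := fmt.Sprintf("user-%d", i)
			register(id, &Conn{ID: id})
		}(i)
	}
	wg.Wait()
	if c, ok := lookup("user-7"); ok {
		fmt.Println("found", c.ID)
	}
}
```

With a plain map, every Store and Delete above would contend on one lock; sync.Map shards that contention away.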

In a few use cases, we use Clustered Redis instead of a single-node Redis, which lets us tide over peak market opens and unprecedented scale in a very elastic way.
 

Step 5 : Independent Microservices over Monoliths

At INDmoney, we prefer launching multiple microservices for different use cases instead of maintaining monoliths. Often, these microservices are simply different configurations of the same code repository, but this really helps us scale well with less infrastructure. It also lets us tune the scaling policies to the requirements of each underlying microservice.
 

Getting the best of both worlds - separation of concerns & scalability.
 

Step 6 : Aurora database to our rescue for SCALE

INDstocks leverages the AWS Aurora database for read- and write-heavy scaling, especially at market-open peak loads. Adding a few readers at market open and scaling them down once the peak passes gives us the capabilities of a traditional RDBMS along with the flexibility to adapt to scale.
 

Step 7 : Ensuring Fault Tolerance with AWS Multicast Groups

To enhance system resilience, INDstocks implemented Multicast Groups in AWS for many of our pricing use cases, so that multiple services push ticks to different data stores concurrently. This prevents single points of failure, enhances redundancy, and ensures high availability in the event of infrastructure failures. More importantly, it removes the network overhead of one backend service forwarding data to another.
 

Step 8 : Data optimizations and smartness on Frontends

Our frontend applications are designed with the intelligence to switch between multiple pricing sources dynamically. This ensures:

  • Continued price updates even if one source experiences latency issues
  • Minimal disruption to the user experience during backend fluctuations
  • A lower probability of user-facing impact when a single system misbehaves

Introducing protobuf over JSON for price packets reduced payloads by about 64% (from 500 B to 180 B), and latencies came down to a similar degree.

The amount of data that backends send to frontends plays a significant role in delivering a faster experience to users. A recent major optimization drive reduced our payloads by 90%, bringing the p95 load time of our trade screens down by more than 96%.
 

Step 9 : Enhancing Packet Routing Strategies

To ensure that packets travelling to and from client devices follow the fastest and shortest network path, we utilized Argo Smart Routing by Cloudflare. Argo’s real-time optimization dynamically routes traffic based on network congestion and latency patterns, reducing delivery time to client devices. We also experimented extensively with AWS mTLS and CloudFront to reduce DNS resolution times and thereby give our users a faster experience.

 

Step 10 : The Importance of Client-Side Latencies Over Backend Metrics

Traditional backend-only latency metrics often provide an incomplete picture of performance. Backend latencies may appear low, but real-world user experience depends on client-side latencies, all the way from exchanges like NSE and BSE to the end-user device. We measure:

  • End-to-end latency from exchange tick reception to user devices
  • End-to-end latency from exchange order update reception to user devices
  • End-to-end order punching latencies
  • Effects of network fluctuations and device processing capabilities
  • A histogram view of latencies, to account for users on the high end of the latency spectrum

We punch a staggering 90% of orders to exchanges in under 150ms (from the time users initiate orders from client devices).

 

Challenges in Measuring End-to-End Latencies

Accurately measuring end-to-end tick latencies posed multiple challenges:

  1. Time Sync Issues: Clock skew across distributed systems made accurate timestamping challenging. We implemented NTP synchronization to mitigate discrepancies.
  2. Logging Overhead: Excessive logging degraded performance, requiring us to use lightweight tracing mechanisms and intelligent sampling on client devices.
  3. Skewed Metrics: Traditional average and p99 metrics tend to get skewed by a few users, so a histogram view presents a holistic picture across multiple screens of the apps.

Striking the Right Balance: We optimized sampling techniques on user devices to get a representative latency measure without adding overhead.

Over 95% of all our price ticks reach our users’ client devices in under 200ms. That is the latency between a tick being received on our servers from the exchanges and it reaching client devices.

Conclusion

Building a lightning-fast trading system which is fault-tolerant and self-healing required a blend of Golang’s concurrency advantages, NATS messaging, AWS multicast groups, WebSockets, Clustered Redis & Aurora databases, interesting network routing strategies and experiments, and smart frontend optimizations. By focusing on client-side latencies and continuously refining our approach, we have successfully created one of the fastest trading systems in the Indian stock trading industry.

Stay tuned for more deep dives into how we continue to push the boundaries of stock market technology!
