
- Step 1 : Choice of Language: Why Go?
- Step 2 : Finding the Right Concurrency for Goroutines
- Step 3 : Leveraging open-source NATS framework for Ultra-Fast Tick Publishing
- Step 4 : WebSocket server on top of Gorilla + Redis
- Step 5 : Independent Microservices over Monoliths
- Step 6 : Aurora database to our rescue for SCALE
- Step 7 : Ensuring Fault Tolerance with AWS Multicast Groups
- Step 8 : Data optimizations and smartness on Frontends
- Step 9 : Enhancing Packet Routing Strategies
- Step 10 : The Importance of Client-Side Latencies Over Backend Metrics
- Challenges in Measuring End-to-End Latencies
- Conclusion
Have you ever missed a trade because of sluggishness on your trading platform?
At INDstocks, we take latency for traders very seriously. Here's how we went about building a super-fast trading platform.
Step 1 : Choice of Language: Why Go?
One of the foundational decisions was choosing the right programming language. Go (Golang) emerged as the clear winner, primarily due to its:
- Efficient concurrency model through Goroutines
- Low memory footprint and garbage collection efficiency
- Fast execution speed compared to interpreted languages
- Strong standard libraries that simplify networking and data handling
Just for reference, the optimized Go code that receives ticks from NSE and BSE and publishes them to our WebSocket server for consumption by our apps and website runs as a single container on just 2 vCPUs and 1 GB of memory!
Step 2 : Finding the Right Concurrency for Goroutines
While Goroutines are lightweight and efficient, managing concurrency levels optimally required numerous trial-and-error experiments. We needed to strike a balance between excessive Goroutines, which lead to context-switching overhead, and too few, which limit throughput. Both extremes cause latencies that are unacceptable in the stock-trading world. Striking the right balance is critical for users to receive price ticks and order updates at lightning speed. Profiling tools such as pprof and trace helped us determine the ideal Goroutine count for maximum throughput with minimal latency.
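To make the pattern concrete, here is a minimal sketch of a bounded pool of worker Goroutines draining a tick channel, with a pprof endpoint exposed for profiling. The Tick type, pool size, and handler are hypothetical placeholders for this example, not our production code.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // exposes /debug/pprof for inspecting goroutine counts and profiles
	"sync"
)

// Tick is a hypothetical price update; the real payload differs.
type Tick struct {
	Symbol string
	Price  float64
}

// processTicks drains the tick channel with a fixed number of worker
// goroutines. The pool size is the knob tuned via profiling: too many
// workers means context-switching overhead, too few limits throughput.
func processTicks(ticks <-chan Tick, workers int, handle func(Tick)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range ticks {
				handle(t)
			}
		}()
	}
	wg.Wait()
}

func main() {
	// pprof endpoint for profiling in a running process.
	go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()

	ticks := make(chan Tick, 1024) // buffered so bursts at market open don't block the feed reader
	go func() {
		// In production, ticks arrive from the exchange feed handler.
		ticks <- Tick{Symbol: "RELIANCE", Price: 2950.5}
		close(ticks)
	}()
	processTicks(ticks, 8, func(t Tick) {
		log.Printf("publishing %s @ %.2f", t.Symbol, t.Price)
	})
}
```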

Chart: Constant number of Goroutines at scale
Our Goroutine count stays almost constant right through market hours, without any spikes or aberrations. The same holds even under 2x traffic load, which is common in the stock-broking world whenever external business, financial, or political factors come into play.
Step 3 : Leveraging open-source NATS framework for Ultra-Fast Tick Publishing
We adopted the open-source NATS messaging framework to publish price ticks in microseconds, making them immediately available for consumption. NATS' lightweight design, ability to handle millions of messages per second, and built-in clustering capabilities were critical in ensuring minimal latency in tick delivery. It also complements Go's capabilities very well, so setup requires minimal effort. Tuning the NATS configuration correctly, especially for network performance, is the key to making this setup deliver millisecond end-user latency. NATS also supports WebSocket clients natively, which makes it easier to integrate on our mobile apps.
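For illustration, a minimal publisher sketch using the nats.go client is shown below. The subject name, URL, and connection options are assumptions for the example, not our actual configuration.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Hypothetical cluster URL; production tunes connection and flush
	// behaviour for network performance.
	nc, err := nats.Connect("nats://nats-cluster:4222",
		nats.Timeout(2*time.Second),
		nats.MaxReconnects(-1), // keep reconnecting across broker restarts
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Publish a price tick; Publish is fire-and-forget, so the hot path
	// stays in microseconds on the publisher side.
	if err := nc.Publish("ticks.NSE.RELIANCE", []byte(`{"ltp": 2950.5}`)); err != nil {
		log.Println("publish failed:", err)
	}
}
```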

Chart: Backend latencies to publish messages to NATS
Step 4 : WebSocket server on top of Gorilla + Redis
Providing price ticks and order updates to users at the lowest possible latency is of utmost importance. More than 76% of our order updates reach end-user devices in under 50ms from the time we receive them from the exchanges.

Chart: Latency distribution of order updates to client devices from exchanges
We use Redis pub/sub and a home-grown WebSocket server built on top of Gorilla to receive order-update events and stream them out extremely fast. Switching from a traditional map to sync.Map brought down our infrastructure resource utilization by almost 75%; it works well in our case, where writes are enormous and reads are comparatively few.

In a few use cases, we use clustered Redis instead of a single-node Redis, which helps us tide over peak market opens and unprecedented scale in a very elastic way.
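A stripped-down sketch of this pattern, using gorilla/websocket, go-redis, and a sync.Map connection registry, might look like the following. The channel name and addresses are assumptions, and a production server would add write deadlines, per-connection write locks, authentication, and read pumps.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"sync"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

var (
	upgrader = websocket.Upgrader{CheckOrigin: func(r *http.Request) bool { return true }}
	// clients is the connection registry; sync.Map suits a write-heavy,
	// rarely-iterated registry better than a mutex-guarded map here.
	clients sync.Map
)

func wsHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	clients.Store(conn, struct{}{})
}

// fanOut subscribes to the Redis channel carrying order updates and pushes
// each message to every connected WebSocket client.
func fanOut(ctx context.Context, rdb *redis.Client) {
	sub := rdb.Subscribe(ctx, "order_updates") // hypothetical channel name
	for msg := range sub.Channel() {
		clients.Range(func(key, _ any) bool {
			c := key.(*websocket.Conn)
			if err := c.WriteMessage(websocket.TextMessage, []byte(msg.Payload)); err != nil {
				clients.Delete(c) // drop dead connections
				c.Close()
			}
			return true
		})
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"})
	go fanOut(context.Background(), rdb)
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```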
Step 5 : Independent Microservices over Monoliths
At INDmoney, we prefer launching multiple microservices for different use cases instead of building monoliths. Often, these microservices are simply different configurations of the same code repository, but this really helps us scale well with less infrastructure. It also lets us tweak scaling policies in line with the requirements of each underlying microservice.
Getting the best of both worlds - separation of concerns & scalability.
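As a rough sketch of the "same repository, different configuration" idea, a single binary could pick its role at startup; the SERVICE_ROLE variable and role names below are hypothetical, not our actual deployment configuration.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Each role maps to its own deployment of the same repository,
	// with its own scaling policy.
	switch role := os.Getenv("SERVICE_ROLE"); role {
	case "tick-publisher":
		runTickPublisher()
	case "order-updates":
		runOrderUpdates()
	default:
		log.Fatalf("unknown SERVICE_ROLE %q", role)
	}
}

func runTickPublisher() { log.Println("starting tick publisher") }
func runOrderUpdates()  { log.Println("starting order-update streamer") }
```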
Step 6 : Aurora database to our rescue for SCALE
INDstocks leverages AWS Aurora for read-efficient and write-heavy scaling, especially at market-open peak loads. Adding a few readers at market open and scaling them down once the peak has passed gives us the capabilities of a traditional RDBMS along with the flexibility to adapt for scale.
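Conceptually, this means the application keeps separate connection pools for Aurora's writer and reader endpoints, so read replicas can be added or removed without touching the write path. The sketch below assumes a MySQL-compatible Aurora cluster and uses placeholder DSNs and a placeholder table.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // assuming a MySQL-compatible Aurora cluster
)

func main() {
	// Placeholder DSNs: Aurora exposes a cluster (writer) endpoint and a reader
	// endpoint that load-balances across the read replicas added at market open.
	writer, err := sql.Open("mysql", "app:secret@tcp(mycluster.cluster-xyz.rds.amazonaws.com:3306)/trades")
	if err != nil {
		log.Fatal(err)
	}
	defer writer.Close()
	reader, err := sql.Open("mysql", "app:secret@tcp(mycluster.cluster-ro-xyz.rds.amazonaws.com:3306)/trades")
	if err != nil {
		log.Fatal(err)
	}
	defer reader.Close()

	// Writes always go to the writer; reads go to the reader endpoint, so
	// scaling readers up or down never disturbs the write path.
	if _, err := writer.Exec("INSERT INTO orders (symbol, qty) VALUES (?, ?)", "RELIANCE", 10); err != nil {
		log.Println("write failed:", err)
	}
	var count int
	if err := reader.QueryRow("SELECT COUNT(*) FROM orders").Scan(&count); err != nil {
		log.Println("read failed:", err)
	}
}
```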
Step 7 : Ensuring Fault Tolerance with AWS Multicast Groups
To enhance system resilience, INDstocks implemented AWS multicast groups for many of our pricing use cases, so that multiple services are responsible for pushing ticks to different data stores concurrently. This prevents single points of failure, enhances redundancy, and ensures high availability in case of infrastructure failures. More importantly, it removes the network overhead of one backend service forwarding data to another.
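To make the idea concrete, here is a small sketch of multicast fan-out at the UDP level in Go: the publisher writes a tick once to a group address, and every subscribed consumer receives it independently. The group address and payload are hypothetical, and this single-process demo only illustrates the pattern; the real setup relies on AWS-managed multicast across services.

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Hypothetical multicast group: one write reaches every joined consumer
	// without service-to-service hops.
	group := &net.UDPAddr{IP: net.ParseIP("239.0.0.1"), Port: 5000}

	// Consumer: each service joins the group independently.
	conn, err := net.ListenMulticastUDP("udp4", nil, group)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Publisher: the tick ingestor writes once to the group address.
	go func() {
		out, err := net.DialUDP("udp4", nil, group)
		if err != nil {
			return
		}
		defer out.Close()
		out.Write([]byte(`{"symbol":"RELIANCE","ltp":2950.5}`))
	}()

	buf := make([]byte, 1500)
	n, _, err := conn.ReadFromUDP(buf)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("received tick: %s", buf[:n])
}
```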
Step 8 : Data optimizations and smartness on Frontends
Our frontend applications are designed with the intelligence to switch between multiple pricing sources dynamically. This ensures:
- Continued price updates even if one source experiences latency issues
- Minimal disruption to the user experience during backend fluctuations
- Reduced probability of user-facing impact when a single system misbehaves
Introducing protobuf in place of JSON for price packets reduced payloads by more than 65% (from 500 B to 180 B), and latencies dropped to a similar degree.
The amount of data that backends send to frontends plays a significant role in delivering a faster experience to users. A recent major optimization drive reduced our payloads by 90%, which brought the p95 load time of our trade screens down by more than 96%.
Step 9 : Enhancing Packet Routing Strategies
To ensure that packets travelling to and from client devices follow the fastest and shortest network path, we utilized Argo Smart Routing by Cloudflare. Argo's real-time optimization dynamically routes traffic based on network congestion and latency patterns, reducing delivery time to client devices. We also experimented quite a bit with AWS mTLS and CloudFront to reduce DNS resolution times and thereby give our users a faster experience.
Step 10 : The Importance of Client-Side Latencies Over Backend Metrics
Traditional backend-only latency metrics often provide an incomplete picture of performance. Backend latencies may appear low, but real-world user experience depends on client-side latencies from exchanges like NSE and BSE to the end-user device. We measure:
- End-to-end latency from exchange tick reception to user device
- End-to-end latency from exchange order update reception to user devices
- End-to-end order punching latencies
- Effects of network fluctuations and device processing capabilities
- A histogram view of latencies, to capture users on the high end of the latency spectrum (a minimal sketch follows below)
We punch a staggering 90% of orders to exchanges in under 150ms (from the time users initiate orders from client devices).
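Here is a minimal sketch of the histogram view referenced above: bucketing end-to-end latency samples instead of relying on a single average or percentile. The bucket boundaries and samples below are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Histogram bucket boundaries for end-to-end latency (hypothetical values).
// A histogram shows the whole distribution, whereas an average or p99 can be
// skewed by a handful of slow devices.
var bounds = []time.Duration{
	50 * time.Millisecond,
	100 * time.Millisecond,
	200 * time.Millisecond,
	500 * time.Millisecond,
}

// bucketize counts samples per bucket; the last slot is the overflow bucket.
func bucketize(samples []time.Duration) []int {
	counts := make([]int, len(bounds)+1)
	for _, s := range samples {
		i := 0
		for i < len(bounds) && s > bounds[i] {
			i++
		}
		counts[i]++
	}
	return counts
}

func main() {
	samples := []time.Duration{30 * time.Millisecond, 180 * time.Millisecond, 600 * time.Millisecond}
	counts := bucketize(samples)
	for i, b := range bounds {
		fmt.Printf("<= %v: %d\n", b, counts[i])
	}
	fmt.Printf("> %v: %d\n", bounds[len(bounds)-1], counts[len(bounds)])
}
```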
Challenges in Measuring End-to-End Latencies
Accurately measuring end-to-end tick latencies posed multiple challenges:
- Time Sync Issues: Clock skews across distributed systems made timestamping challenging. We implemented NTP synchronization to mitigate discrepancies.
- Logging Overhead: Excessive logging led to performance degradation, requiring us to adopt lightweight tracing mechanisms and intelligent sampling on client devices.
- Skewed metrics: Traditional average and p99 metrics tend to get skewed by a few users, so a histogram presents a more holistic view across multiple screens in the apps.
Striking the Right Balance: We optimized sampling techniques on user devices to get a representative latency measure without adding overhead.
Over 95% of all our price ticks reach our users' client devices in under 200ms, measured from the moment a tick is received on our servers from the exchanges to the moment it reaches the client device.

Conclusion
Building a lightning-fast trading system that is fault-tolerant and self-healing required a blend of Golang's concurrency advantages, NATS messaging, AWS multicast groups, WebSockets, clustered Redis and Aurora databases, interesting network routing strategies and experiments, and smart frontend optimizations. By focusing on client-side latencies and continuously refining our approach, we have successfully created one of the fastest trading systems in the Indian stock trading industry.
Stay tuned for more deep dives into how we continue to push the boundaries of stock market technology!