I’m going to use the build-out we did for a startup hedge fund as a showcase to explain how a typical, successful process works. Needless to say, in any HFT system the most important metric is the time between receiving a tick price and your system’s reaction. This is called “tick-to-trade”, and it is the key measure when evaluating low-latency trading systems. With today’s hardware, we can achieve tick-to-trade latencies under 3-5 microseconds, if, and only if, you follow best practices in multi-threaded design. The key factors are being able to handle a large amount of incoming information, fast response to external events, fast internal response times, and the capacity to deliver the highest throughput at the lowest latency.
About the FX market
The electronic forex market is an OTC market with different layers of participants, and as such it has the most fragmented trading environment. The advantage of this is that the liquidity is typically high, and spreads are low. The disadvantage is that information about prices, including bid/ask quotes, updates, and trade size data, isn’t available to all participants. When it comes to building an efficient, fast, and profitable trading system, we have to take into account certain things:
- Handling multiple Venues/Exchanges without losing performance.
- The system must be able to handle multiple strategies, making sure it will not underperform as we add more strategies.
- Since we are dealing with multiple venues at the same time, we need to make sure that the system will be able to reconstruct the Limit Order Book from each venue or exchange.
- Data normalization across all the different sources.
Building your Quant team
The goal must be to set up the best possible culture for the quant team. This includes specialized software engineers, network engineers, researchers, and quant traders. Researchers and quantitative traders are tasked with generating model ideas, based on mathematical models and market-microstructure anomalies between FX pairs, while the software engineers implement those ideas in the high-performance trading system. The network engineer handles the colocated server setup, improving and monitoring the network cross-connections with each ECN, and making sure the latencies between them stay constant.
The “arms race” in this area never stops, with market players continuously investing in more powerful solutions that can trade securities, derivatives, and other financial instruments in a matter of nanoseconds. Only HFT firms that are always at the forefront of technology will be able to stay ahead in the future. The following is a required list of the hardware needed:
- Colocated servers in any of the major financial centers: NY4, LD4, etc.
- Server: Intel Xeon (or any x86-based CPU) with more than 12 cores and 32 GB of RAM
- Networking: SolarFlare network card
To reduce the market data round trip, investment banks and hedge funds pay for high-quality software, networks, and computing facilities that maximize their time efficiency. These improvements come in two forms: software optimization and hardware acceleration. We are going to focus on the software part, since dedicated acceleration hardware can be as expensive as it gets.
FIX Protocol and connectivity to venues (ECNs and Exchanges)
FIX protocol is the way to get connected to any major ECN or venue. It is a well-known and highly standardized protocol that all financial institutions use on a regular basis. So, we need to make sure that our trading system can handle this protocol, both to receive market data and to send orders. However, FIX is not the fastest or most efficient way to communicate. There are some variations of this protocol that allow electronic traders to communicate in better and faster ways.
These variations are known as ITCH and OUCH, protocols originating at NASDAQ, which use a binary format to increase communication speed. All this must be handled by a FIX engine, which will be the interface between the data sources and our trading system.
Note that besides network and communication latency, there is also decode/parse latency, so we have to take care of that as well. Parsing is a string-manipulation task, hence very expensive in terms of CPU cycles and memory management. The FIX engine must be able to perform this in the best possible way. Some institutions opt to build their own FIX engine; others use open-source options (e.g., QuickFIX). I personally think the best option is something in between: there are many commercial FIX engines focused on low-latency connectivity.
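To make the parsing cost concrete, here is a toy tag=value parser splitting on the SOH (`\x01`) delimiter. Real engines avoid the map and the string copies on the hot path; the message fragment used below is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

// Toy FIX parser: splits "tag=value" pairs on the SOH (\x01) delimiter.
// Every field costs a find, a substring, and an integer conversion --
// exactly the kind of work that eats CPU cycles on the hot path.
std::unordered_map<int, std::string> parseFix(std::string_view msg) {
    std::unordered_map<int, std::string> fields;
    std::size_t pos = 0;
    while (pos < msg.size()) {
        std::size_t soh = msg.find('\x01', pos);
        if (soh == std::string_view::npos) soh = msg.size();
        std::string_view field = msg.substr(pos, soh - pos);
        std::size_t eq = field.find('=');
        if (eq != std::string_view::npos) {
            int tag = std::stoi(std::string(field.substr(0, eq)));
            fields[tag] = std::string(field.substr(eq + 1));
        }
        pos = soh + 1;
    }
    return fields;
}
```

A production engine would instead parse in place over the raw buffer, with no heap allocation per field.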
There are two options to implement a FIX engine: build a custom one, or buy a commercial one. A custom FIX engine requires many man-hours, but it lets you optimize the communication for a low-latency environment and own the entire code base, which is why large banks and institutions prefer it. We always prefer the commercial option: there are many high-quality products out there, and they will save you a huge amount of money, not to mention development time.
Data Feed Handler
The feed handler is one of the most important components of any algorithmic trading system. Without accurate market data, no high-frequency or algorithmic strategy can make correct decisions. The job of this module is to capture market data from the different venues, allow our strategies to make correct decisions, and keep a representation of each venue’s limit order book.
A large number of market updates can be received per second for each market venue, and your internal representation of each limit order book should change accordingly.
This process should go like this:
The receiving side takes in market updates and order statuses, making sure the system can handle high throughput so that it behaves consistently even in highly volatile markets. This market data is processed and transmitted to the algorithmic models and strategies, which make decisions and send back the corresponding orders (if any). Since FIX is not the fastest option, for venues that offer connectivity through newer and better protocols we can use FAST or ITCH/OUCH, which are binary protocols, and we always try to use them whenever they are available. These protocols behave very differently, so your data feed logic will probably have to be adjusted accordingly, but the main concept remains the same.
Unlike the stock or futures markets, there is no centralized exchange for currencies. In fact, there is no unique price for a given currency at any time, so quotes from different currency dealers may differ. Because this is an OTC (over-the-counter) market, there are many different providers for the same pair and hence different prices; depending on which provider you are looking at, you will see a different price. So, why do you need an FX aggregator?
- Minimize market impact when trading big sizes.
- Minimizing slippages.
- Being able to execute through smart order routers with better price discovery.
- Lowering trading cost, having access to the best spreads at any specific moment.
- Risk diversification with different sources
I’ve covered this in more depth in my blog post Why do you need an FX Aggregator.
Limit Order Book
Once the system has connectivity to the venues, we need to apply every event they report: order adds, updates, and deletions (and trades if needed). For each event received we must go into the internal order book and execute the corresponding action. This can be expensive, so we must be as efficient as possible. Usually, venues send these updates with a unique identifier (EntryID) and an update type (insert, update, or delete), so you can accurately replicate their limit order book on your end. Here is how the limit order book reconstruction works:
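In code, applying those per-EntryID actions could look something like the sketch below. The `VenueBook` class and the field layout are my own illustration, not any venue’s API, and a map is used here only for clarity (the next section argues for arrays on the hot path).

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Illustrative update actions, mirroring the insert/update/delete
// semantics that venues attach to each EntryID.
enum class UpdateAction { Insert, Update, Delete };

struct BookEntry {
    double price = 0.0;
    double quantity = 0.0;
};

class VenueBook {
public:
    void apply(std::uint64_t entryId, UpdateAction action,
               double price, double quantity) {
        switch (action) {
        case UpdateAction::Insert:
            entries_[entryId] = BookEntry{price, quantity};
            break;
        case UpdateAction::Update: {
            auto it = entries_.find(entryId);
            if (it != entries_.end())
                it->second = BookEntry{price, quantity};
            break;
        }
        case UpdateAction::Delete:
            entries_.erase(entryId);
            break;
        }
    }
    std::size_t depth() const { return entries_.size(); }

private:
    std::unordered_map<std::uint64_t, BookEntry> entries_;
};
```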
For high-performance systems such as these, the most important decision is the data structure used to hold the limit order book. Remember that we can get several million updates per second, so our data structure must be fast at finding, updating, and deleting an entry. As I’ve recommended before, the best choice is plain arrays: one for the bids, one for the asks. They provide the best performance. Moreover, pre-allocate ahead of time as much of the memory used on the hot path as you can; dynamic allocation is expensive and should never happen in critical paths like updating a limit order book, otherwise you will pay a large overhead on every allocation.

So, since we are building a low-latency system, we pre-allocate our data structures, and I’m going to show you the performance impact of pre-allocation versus dynamic allocation. During system initialization, pre-allocate the bid and ask arrays; say we are going to store a 10-level depth of the limit order book, so we pre-allocate a 10-element array per side. We then move and replace elements, NOT remove and create them, since the allocation process would consume too much time.
The following code shows how dynamic allocation works:
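A minimal sketch of the dynamic-allocation approach could look like this (the `Level` struct and function names are illustrative; the full example is in my gist):

```cpp
#include <algorithm>
#include <vector>

struct Level {
    double price;
    double qty;
};

// Dynamically allocated book side: the vector grows and shrinks
// on every change.
std::vector<Level> bids;

void addLevel(double price, double qty) {
    bids.push_back({price, qty});          // may trigger a reallocation
    std::sort(bids.begin(), bids.end(),    // re-sort to keep bids descending
              [](const Level& a, const Level& b) { return a.price > b.price; });
}

void deleteLevel(double price) {
    // erase-remove shifts all the remaining elements
    bids.erase(std::remove_if(bids.begin(), bids.end(),
                              [price](const Level& l) { return l.price == price; }),
               bids.end());
}
```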
As you can see, we add or delete an element and then sort the entire array, to keep an ordered structure.
Next, I will show how to pre-allocate and reuse.
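A sketch of the pre-allocated version, again with illustrative names, could look like this:

```cpp
#include <algorithm>
#include <array>

struct Level {
    double price = 0.0;
    double qty = 0.0;
};

constexpr std::size_t kDepth = 10;
std::array<Level, kDepth> bidsArray{};   // pre-allocated once at startup

// "Deleting" just clears the slot in place; no memory is freed.
void clearLevel(std::size_t i) {
    bidsArray[i] = Level{};
}

// Sorting a descending bid ladder pushes the cleared (price 0) slots
// out of the active region of the array.
void sortBids() {
    std::sort(bidsArray.begin(), bidsArray.end(),
              [](const Level& a, const Level& b) { return a.price > b.price; });
}
```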
You define the array with 10 elements and reuse those elements. Notice that instead of deleting an element, I clear its values, and once the array is sorted the cleared elements end up outside the active region. Running a performance test between the two approaches, counting CPU cycles over a sample of 100,000 iterations, I get the following results.
Wow, a 60% difference. Not a surprise to me: allocation/deallocation is expensive. And this was a very basic example! You can check this code example on my gist repository https://gist.github.com/silahian
Order Management System: OMS
This module manages all orders sent to the venues and their status changes, based on signals generated by your strategy. It handles sending, canceling, and replacing orders, as well as accessing information about executed orders, including pending and open orders.
We must send these orders in a very efficient and cost-effective manner, routing each order, depending on one or more of the following:
- the signal strategy
- venue costs
- latencies between venues
- best prices available
- shares or contracts available at each venue
Also, the system needs to be smart enough to know when an order was:
- Rejected or canceled by the venue
- Partially filled
- Fully filled
So, depending on the above statuses you receive, your order management system may execute different paths.
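A sketch of how those statuses might drive the order-management logic (the enum and the quantity convention are illustrative):

```cpp
// Possible terminal and intermediate order statuses, per the list above.
enum class OrderStatus { Rejected, Canceled, PartiallyFilled, Filled };

// Quantity still working in the market after a status update; a partial
// fill keeps the remainder alive, everything else leaves nothing working.
double remainingQty(OrderStatus s, double orderQty, double filledQty) {
    switch (s) {
    case OrderStatus::Rejected:
    case OrderStatus::Canceled:
        return 0.0;                   // venue killed the order
    case OrderStatus::PartiallyFilled:
        return orderQty - filledQty;  // keep working the remainder
    case OrderStatus::Filled:
        return 0.0;                   // done
    }
    return 0.0;
}
```

The OMS would branch on this result, for example re-routing the remainder of a partial fill to another venue.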
Strategies and Models
As the brain of our system, the strategies take the limit order book from each venue, as well as the aggregated feed, and make decisions based on different parameters and values. Some strategies need to analyze the entire depth of the book; others only the top of the book, that is, the best bid and best ask. Here you can apply an almost infinite variety of strategies, while of course making sure they are profitable ideas.
One could be a simple latency-arbitrage strategy: each venue receives market information at different times, and if our system is fast enough, we can take advantage of that price gap. Usually these discrepancies last no more than 500 microseconds, and after that all market participants rebalance.
Here’s one example. A big institution is in the market to buy a big order of a given stock. It will have algorithms execute the trade slowly, trying to get the best price and it will take whatever’s available at, say, $4.50 per share, and then what’s available at $4.51, and so on. This is where the “latency arbitrage” may come in. Our strategy can see that this fund’s algorithm is in the market and essentially buys up all the available shares at $4.50 an instant before they do. Now the firm’s algorithm moves on and looks for shares at $4.51. Our algorithm sells all the stock it just bought at $4.50, earning a completely risk-free penny a share.
Sounds small, but if you do this several thousand times per day, we will be adding up to many millions of dollars per trading day, and several billion per year.
Another example is the well-known ‘triangular arbitrage’: an arbitrage over price discrepancies between three currency pairs. Forex is traded in pairs, i.e., EUR/USD, EUR/GBP, EUR/CHF. During a big market event, for example a failed coup in Turkey as an extreme case, EUR/USD may move faster than it should to stay in ratio with EUR/GBP. That can be simply market mechanics, with traders selling EUR/USD before EUR/GBP without algorithms; or large orders can push the relationship between EUR/USD and EUR/GBP slightly off. Even if it is off by only a fraction, this can lead to millions in profit if you are fast enough.
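The cross-rate check behind a triangular arbitrage can be sketched in a few lines (the rates and the pip threshold below are illustrative, not market data):

```cpp
#include <cmath>

// Implied EUR/GBP cross from the two dollar pairs:
// (EUR/USD) / (GBP/USD) = EUR/GBP.
double impliedEurGbp(double eurUsd, double gbpUsd) {
    return eurUsd / gbpUsd;
}

// Flags an opportunity when the quoted cross deviates from the implied
// cross by more than a threshold (in pips; 1 pip = 0.0001 here).
bool hasTriangularEdge(double eurUsd, double gbpUsd, double eurGbp,
                       double thresholdPips = 0.5) {
    double discrepancy = std::fabs(eurGbp - impliedEurGbp(eurUsd, gbpUsd));
    return discrepancy > thresholdPips * 1e-4;
}
```

In practice the check has to account for bid/ask sides and trading costs on all three legs, but the core comparison is this one.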
Lastly, remember that nothing lasts forever. Once you or your team find a suitable model to run, given the nature of the markets and their players, it will have a short life; you must have resources in place to adjust, tweak, and improve as you go. This is an ongoing process.
Software Architecture and patterns
The fun starts when we must put all these pieces together, to interact concurrently and with the lowest latency possible between processes.
The basic architecture looks like this:
As we already know, concurrency is key, and to make it work properly we need synchronization methods (to concurrently access data in memory) and a design architecture that fits our low-latency needs. I’m going to go through the well-known “software design patterns” for trading systems and explain the advantages and disadvantages of each (in terms of achieving the lowest latencies possible). After that, I will explain which one we use to build ultra-low-latency trading systems. So, let’s start with the most common patterns we see out there:
Observer Design Pattern: a software design pattern in which an object, called the subject, maintains a list of its dependents, called observers, and notifies them automatically of any state change. This is fine, but if you have multiple strategies running on the same system, the notifications are processed one by one. The subject first notifies “strategy 1”, which does its calculations and sends orders if some criteria are met, and only then continues with “strategy 2”, which again does its calculations and checks its criteria. This sounds like a sequential process!
In our specific case, and from what we are trying to do, our Limit Order Book module will be the subject, and the strategy our observer. As I show below, the implementation using C++ will look something like this:
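A minimal version of that subject/observer wiring (class and method names are illustrative, not our production interfaces):

```cpp
#include <vector>

// Observer interface: each strategy reacts to book updates.
class IStrategy {
public:
    virtual ~IStrategy() = default;
    virtual void onBookUpdate(double bestBid, double bestAsk) = 0;
};

// Subject: the limit order book notifies every attached strategy.
class LimitOrderBook {
public:
    void attach(IStrategy* s) { observers_.push_back(s); }

    // Note: notification is strictly sequential -- strategy 2 waits
    // until strategy 1 has finished its calculations.
    void notify(double bestBid, double bestAsk) {
        for (IStrategy* s : observers_)
            s->onBookUpdate(bestBid, bestAsk);
    }

private:
    std::vector<IStrategy*> observers_;
};
```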
The Limit Order Book sends a notification whenever a price or quantity changes, so all the strategies receive the update and act on it accordingly. If a strategy meets its specific criteria, it triggers orders to the exchange. That process of evaluating criteria can add a lot of latency, so it needs to be designed as well as possible.
As you can see, the Observer Pattern is a serial process when you have multiple observers: if for any reason “strategy 1” takes a couple of milliseconds to do some fancy calculations, then by the time “strategy 2” gets the notification it is too late, and so on, until the last strategy is already making decisions on stale information. That’s the main reason why we do not use this design pattern at all. In case you are thinking of throwing threads at each notification or dispatching asynchronously, let me tell you that it will be even worse: the overhead is such that sequential execution would still be faster, but not fast enough for what we intend to achieve. You can find this code on my gist repository https://gist.github.com/silahian
Signal and Slots Pattern: used for communication between objects or processes. The underlying implementation is like the Observer Pattern: the subject can emit signals containing event information, which can be received by others through special functions known as slots. This is similar to callbacks (function pointers) in C++, but a signal/slot system ensures the type-correctness of callback arguments.
Since this pattern is also derived from (or very similar to) the observer pattern, we prefer not to use it either. All event/messaging/signal patterns are variations of the observer pattern, so they are not suitable for our low-latency purposes.
Ring buffer pattern: now we are getting closer. This pattern is very effective and is implemented in many low-level applications. The ring buffer is a circular queue data structure with a first-in-first-out (FIFO) characteristic. It has two indices, indicating where your process can read from and where it can write to. So no collisions occur, which translates into no need for synchronization; that’s why this data structure is also known as “lock-free”, allowing much better performance than the others described. One big adopter of this pattern is LMAX with its Disruptor, which makes their matching engine one of the quickest in the market.
Below is a diagram of their implementation using this pattern.
This kind of pattern is ideal for socket communications where serial data must be managed; so, in this case, it is not suitable for our trading architecture either.
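For reference, a minimal single-producer/single-consumer ring buffer, a greatly simplified cousin of the Disruptor, could be sketched like this (capacity and memory orderings are the standard SPSC recipe, not LMAX’s actual code):

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Lock-free SPSC ring buffer: one thread calls push(), one calls pop().
// N must be a power of two so the index wrap is a cheap bitmask.
template <typename T, std::size_t N>
class RingBuffer {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& item) {                       // producer side
        auto head = head_.load(std::memory_order_relaxed);
        auto next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;                            // buffer full
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& item) {                              // consumer side
        auto tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                            // buffer empty
        item = buffer_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    std::array<T, N> buffer_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```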
The design pattern and architecture we prefer
Busy/Wait or spinning: this is not categorized as a pattern; in fact, it is considered an “anti-pattern” and is usually not recommended, because it can take a huge amount of CPU time, blocking other processes. Which, in our specific case, is exactly what we want: the CPU’s full attention with no interruptions. We only care about latency, and this pattern is perfect for that. A process running in this pattern sits in a tight loop waiting for something, and that loop consumes 100% of a CPU core, avoiding context switches and cache misses, which are very expensive in terms of time. In our case, this process will be reading market data from the limit order book module and, if certain strategy criteria are met, sending the orders to execute that trade. This is by far the fastest way to get data from other modules. The following is a basic sketch of how it works:
But not everything is as good as it seems. Busy/wait processes are hard to design and dangerous for overall performance, since they can consume all the CPU and bring down the entire system’s performance. The key to using this pattern in our systems is to set thread affinity to a specific CPU core: we need to pin the process. That means telling the system to run this busy/wait process on only one CPU core (core 1, 2, 3, etc.), and we can pin as many processes as we have CPU cores. If we don’t do this, then because of how the thread scheduler works, the process will eat the entire CPU and defeat our main purpose. With this method, the threading model, I/O model, and memory management must all be designed to collaborate with each other to achieve the best overall performance. This goes against the OOP concept of loose coupling, but it’s necessary to gain the thing we desire most: low latency.
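On Linux, this pinning can be done with `pthread_setaffinity_np`; a small helper might look like this (the core number passed in is up to your deployment layout):

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a std::thread to one CPU core (Linux-specific). Pinning the
// busy/wait thread keeps the scheduler from migrating it across cores,
// avoiding context switches and cache misses.
bool pinToCore(std::thread& t, int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    return pthread_setaffinity_np(t.native_handle(),
                                  sizeof(cpu_set_t), &cpuset) == 0;
}
```

Typically you would pin the feed handler, each strategy spin loop, and the order gateway to their own dedicated cores, and keep the OS and housekeeping threads on the remaining ones.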
Of course, you still need synchronization (locks) where it is genuinely needed, to keep the system safe from corrupted data. Our approach is to design the data structures so that they require the lowest possible amount of synchronization. In conclusion, the design stage is the most sensitive part of our system, and using this technique will give you the best latency between your data structures and their processes.
Position and Risk Management
All orders sent by the strategy should be consolidated into positions, so the system can keep track of open/closed orders and, most important, what your exposure to the market is. Ideally, the strategy should keep a flat exposure, but certain strategies, like market making, may allow controlled exposures (when holding inventory). From stop losses per position or on the overall exposure, to portfolio management, the risk module is an important piece that interacts with your strategy and monitors in real time all open positions and the overall market exposure.
The following are some popular risk management rules:
- Hedging and Credit monitoring
- Position limit: Control the upper limit of the position of a specified instrument, or the sum of all positions of instruments for a specified product.
- Single-order limit: Control the upper limit of the volume of a single order. Sometimes there is also a lower limit, meaning the quantity of your order must be a multiple of it.
- Money control: Ensure the margin of all positions does not exceed the balance of the account.
- Illegal price detection: Ensure the price is within a reasonable range, such as not exceeding the price limit or not being too far from the current price.
- Self-trading detection: Ensure that orders from different strategies cannot trade against each other.
- Order cancellation rate: Track order cancellations and ensure the rate does not exceed the exchange’s limit.
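A couple of those rules can be sketched as a pre-trade check (the limits and thresholds here are examples only, not recommended values):

```cpp
#include <cmath>

// Example pre-trade risk limits: a per-instrument position cap and a
// maximum fractional deviation of the order price from the last price.
struct RiskLimits {
    double maxPosition;
    double maxPriceDeviation;
};

// Combines the position-limit and illegal-price rules from the list above.
bool passesRiskCheck(double currentPosition, double orderQty,
                     double orderPrice, double lastPrice,
                     const RiskLimits& limits) {
    if (std::fabs(currentPosition + orderQty) > limits.maxPosition)
        return false;   // would breach the position limit
    if (std::fabs(orderPrice - lastPrice) / lastPrice > limits.maxPriceDeviation)
        return false;   // price too far from the current market
    return true;
}
```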
Also, within this module, you may want to analyze different allocations on strategies or trades. There have been many studies proving that having an allocation strategy could lead you to lower volatility in your returns and great insurance if things go wrong.
Since we are building a fully automated system that must be able to open and close positions within microseconds, we must ensure proper monitoring systems are in place to control the operation, triggering alarms in situations that need to be escalated. Imagine what would happen if a human realized that some strategy is not doing what it should, or that a venue is not providing prices as it should. By the time you realize this and stop the system, unrecoverable losses may already have been made.
How many minutes would it take a human to shut the system down? Five? One? You can have thousands of wrongly opened orders within that time frame. Scary! That’s why any HFT operation needs monitoring systems in place, checking some of the following:
- Overall PnL: if there is, let’s say, a flash crash, the system must be able to close all open positions and shut itself down.
- Connectivity between venues: making sure that no one has been disconnected, activating reconnection systems in place.
- Monitoring latencies: let’s say some network switches start to fail, and you start to receive data with some delays. You will never realize that until you start to analyze some logs. We need to monitor latencies between venues, to ensure data delivery and alert us in the case of any issue.
The FX market has seen exponential growth in high-frequency trading players, and it is no surprise. By its nature, the forex market is almost perfect for running HFT strategies and profiting from small discrepancies in a highly fragmented environment. It also gives researchers a huge playground for new ideas, so the future is still bright.

Always keep in mind that building low-latency trading systems takes a great deal of work and expertise. Make sure you surround yourself with the best professionals; otherwise, chances are your system won’t perform as well as others. It is very important to create a seamless development process, with the ability to adjust as you go and very close performance monitoring. One of the first and most important steps is choosing the right hardware and using those resources efficiently through software techniques. As with any high-performance system, scalability is also essential, so make sure you plan for that as well.

Running an HFT business entails a number of risks, but no more than you could expect from any other business. With the right approach and expertise, you can avoid many of the troubles most face. So make sure you follow best practices and have the required expertise to create a powerful low-latency trading system that can compete with all the players in the forex market!