Support Board


Date/Time: Tue, 11 Mar 2025 01:59:29 +0000



[Locked] - Teton Interruption Around 4 PM US Eastern time

View Count: 1178

[2022-04-25 20:54:25]
Sierra Chart Engineering - Posts: 104368
We had an interruption of the Teton service for about 15 minutes around 4 PM US Eastern time today.

We will have more details later.

We encountered an event we have never encountered before (in the last 17 years or so): a server shut down due to a multi-bit memory error in one of the DRAM modules.

We are fully prepared for this kind of event. We had a real-time copy of all information on another server and were prepared to failover, but we brought the original server back up again instead. We will likely make a decision to failover to a backup server this evening during the CME downtime.

Today we installed another server in the Aurora data center that we are also going to use for Teton order routing. Each server will serve half of the users, so in an event like this only half of the users would be affected. That server is not yet ready.

We will have more comments a little later.

You need to examine your orders and positions and make sure they are all as they should be. During this time, OCO and bracket order functionality would not have been working so make sure any orders that should be canceled are canceled and verify your Positions are correct.

There is a very small possibility of an incorrect position quantity, due to a fill getting reprocessed if there was a fill event in the up to 120 seconds immediately preceding this server event. We will explain why that potentially can happen. It has to do with how redundancy is maintained on the primary server versus the real-time copy on our backup server. There is a balance between maintaining high performance and redundancy, which is rarely needed.

We do have triple redundancy with Internet connections, but none of that was useful in an event like this, which is extremely rare. As we said, we have not previously experienced a server shutdown like this.
Sierra Chart Support - Engineering Level

Your definitive source for support. Other responses are from users. Try to keep your questions brief and to the point. Be aware of support policy:
https://www.sierrachart.com/index.php?l=PostingInformation.php#GeneralInformation

For the most reliable, advanced, and zero cost futures order routing, *change* to the Teton service:
Sierra Chart Teton Futures Order Routing
Date Time Of Last Edit: 2022-04-26 07:09:08
[2022-04-25 23:30:47]
Sierra Chart Engineering - Posts: 104368
This is the error that was encountered:

The self-heal operation successfully completed at DIMM DIMM_B6. Mon 25 Apr 2022 20:01:56
Multi-bit memory errors detected on a memory device at location(s) DIMM_B6. Mon 25 Apr 2022 20:01:56
Multi-bit memory errors detected on a memory device at location(s) DIMM_B6. Mon 25 Apr 2022 19:58:05

We have some additional information about the issue. It is very important that, when reestablishing the FIX connection to the exchange, the last used sequence numbers are used. These are held in a file, which is normally written every five seconds. It was being written every two minutes instead. We have to check back for when the change was made, but it was likely due to a problem encountered with what we call our records file writer, where there was concern about it blocking the main thread for longer than intended or expected. Any issues with blocking were solved last year, but this interval had never been decreased again.

It has now been decreased back down to 5 seconds.

This is very important because what happened is that some execution reports were reprocessed. This is not necessarily an issue and there are many safeguards to prevent any significant consequences, but it can cause an incorrect position quantity for some accounts. The clearing firms will be notified of this during the overnight processing.

We are aware of one account this occurred with and the broker made an adjustment to correct it.


If your position quantity looks incorrect, then contact your broker.
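
As a rough illustration of why the write interval for the sequence number file matters, here is a minimal Python sketch. It is a simplified model and not the Teton implementation: the `ExecutionReport` type, the function names, and the sample data are all hypothetical, and it assumes that after a restart the exchange resends every message after the last saved sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionReport:
    seq_num: int     # inbound sequence number (illustrative values)
    time_s: float    # arrival time in seconds since start
    fill_qty: int    # signed fill quantity


def last_checkpoint_seq(reports, crash_time_s, interval_s):
    """Last sequence number on disk at the crash, assuming the file is
    written at every multiple of interval_s (an assumption of this model)."""
    last_write_time = (crash_time_s // interval_s) * interval_s
    saved = [r.seq_num for r in reports if r.time_s <= last_write_time]
    return max(saved, default=0)


def reprocessed_after_restart(reports, crash_time_s, interval_s):
    """Reports already applied before the crash that lie after the saved
    sequence number, so a resend from that point delivers them again."""
    saved_seq = last_checkpoint_seq(reports, crash_time_s, interval_s)
    return [r for r in reports
            if r.seq_num > saved_seq and r.time_s <= crash_time_s]


# Hypothetical fills leading up to a crash at t = 238 s.
reports = [
    ExecutionReport(1, 10.0, 1),
    ExecutionReport(2, 130.0, 2),
    ExecutionReport(3, 200.0, -1),
    ExecutionReport(4, 236.0, 1),
]
crash = 238.0

dup_120 = reprocessed_after_restart(reports, crash, 120.0)
dup_5 = reprocessed_after_restart(reports, crash, 5.0)

print([r.seq_num for r in dup_120])  # reports replayed with a 120 s interval
print([r.seq_num for r in dup_5])    # reports replayed with a 5 s interval
```

In this model the shorter interval narrows the window in which a fill can be replayed but does not remove it entirely, which is why additional safeguards against reprocessing are still relevant.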
Date Time Of Last Edit: 2022-04-25 23:50:57
[2022-04-26 09:39:39]
Sierra Chart Engineering - Posts: 104368
The server that encountered the memory error is also used for the Denali Exchange Data Feed. So any users receiving data from this server would also have had an interruption of no more than one minute.
[2022-04-26 12:22:15]
Sierra Chart Engineering - Posts: 104368
Information we sent to clearing firms:
Hello,

Yesterday at 4 PM US Eastern time we had an interruption with our Teton order routing server due to a hardware memory error. This is an extremely rare condition which has not previously occurred. We recovered from this condition 15 minutes later.

The state data for order routing is saved every two minutes to files. Any order fills that occurred within two minutes before this incident would have led to those fills getting reprocessed upon reconnection to the iLink. This would lead to incorrect position quantities.

Please check all of the current positions for users of the Teton order routing because we do not receive an end of day position file from you. You can find these in "Trade >> Trade Positions Window" within your installation of Sierra Chart used for risk management.

If you need to make an adjustment you can perform a manual fill following these instructions:
Trade Account and Risk Management: Add Manual/Correcting Order Fill

We have made a change to save the critical state information every five seconds.

Great care has been taken to ensure that order fills do not get reprocessed. There is also a backup safety check to not allow fills to alter the position quantity if they are older than 30 minutes.

This potentially could have occurred because we recovered on the same server and the state information was previously being saved every two minutes. As we said, we have reduced this to 5 seconds. The decision to use two minutes was made in order to maintain the highest performance and lowest latency at all times for order routing, and because an incident like this has never occurred with any of our dedicated servers in the last 15 years.

If we had not been able to recover on the same server, we did have a backup available with a real-time copy of all order and other trading state information and could have failed over to that, but it was not necessary. That backup server is refreshed every 3 to 5 seconds.

Yesterday morning we also had installed a second server in the Aurora data center colocated with the CME and we are going to be using that as a primary server and moving half of the users over to that server. This was already planned, and in progress before this incident. So in a rare event like this, fewer users would be affected.

Last week we also performed a failover to our backup server using a CME test iLink to verify failover procedures. This transition was successful. We do have all backup procedures in place, and we went through additional testing overnight.

We are going to be replacing one or more of the memory modules in the server that was affected, over the weekend. We do not expect any further incidents this week until the module is replaced.
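
The backup safety check mentioned in the message above (fills older than 30 minutes do not alter the position quantity) can be sketched as follows. This is only an illustration under stated assumptions: the function name `apply_fill`, its parameters, and the sample times are hypothetical and are not taken from Sierra Chart.

```python
from datetime import datetime, timedelta, timezone

# Age limit taken from the message above; everything else is illustrative.
MAX_FILL_AGE = timedelta(minutes=30)


def apply_fill(position_qty, fill_qty, fill_time, now, max_age=MAX_FILL_AGE):
    """Return the position after a fill, ignoring fills older than max_age."""
    if now - fill_time > max_age:
        return position_qty          # too old: do not alter the position
    return position_qty + fill_qty   # recent enough: apply normally


# Hypothetical reconnection time and two replayed fills.
now = datetime(2022, 4, 25, 20, 15, tzinfo=timezone.utc)
stale_fill_time = now - timedelta(minutes=45)
recent_fill_time = now - timedelta(minutes=17)

print(apply_fill(5, 2, stale_fill_time, now))   # stale fill is ignored
print(apply_fill(5, 2, recent_fill_time, now))  # recent fill is applied
```

A fill replayed from the two minutes before a 15 minute interruption would be roughly 17 minutes old at reconnection, so a check like this would let it through. That is consistent with it being described as a backup check rather than the primary safeguard.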

[2022-04-26 12:27:37]
Sierra Chart Engineering - Posts: 104368
We are only aware of two inaccurate positions at this time.

We also want to point out that in an incident like this, which is very rare and has not happened previously, OCO and Bracket order management would not be functioning. So there was no management of these types of orders from approximately 4 PM to 4:15 PM US Eastern time on 2022-04-25.

Additionally:
When we reestablish the connection to the CME iLink, any order that filled or was canceled during the interruption, and was open before the incident, would still show as open in Sierra Chart. You would have to cancel that order manually. Once we get a rejection from the exchange that the order does not exist, it is internally marked as canceled and you will see it as canceled in Sierra Chart.

Regarding the prior paragraph, why do we have this behavior? The reason is that the Sierra Chart order routing program maintains a connection to the exchange continuously. It is extremely rare for the connection to be lost, especially since there is redundant connectivity. We just do not want to run the risk of marking an order as canceled upon reconnection to the iLink when it does not show up in the mass order status response, in case there is some issue with the mass order status response.

How this is handled is switchable. We can do it either way, depending upon how the clearing firm wants us to do it. In other words, if they want us to internally cancel orders that are no longer in the mass order status response, we can do that.
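
The two reconnection behaviors described above can be sketched in Python as follows. This is a hedged illustration only: `MissingOrderPolicy`, `reconcile`, `on_cancel_reject_unknown_order`, and the order IDs are hypothetical names chosen for the example, not part of Sierra Chart or the CME iLink API.

```python
from enum import Enum


class MissingOrderPolicy(Enum):
    KEEP_OPEN = "keep_open"          # default behavior described above
    MARK_CANCELED = "mark_canceled"  # alternative a clearing firm may request


def reconcile(local_open_ids, exchange_open_ids, policy):
    """Return order_id -> status after comparing locally open orders with
    the orders reported in the mass order status response."""
    statuses = {}
    for order_id in sorted(local_open_ids):
        if order_id in exchange_open_ids:
            statuses[order_id] = "open"
        elif policy is MissingOrderPolicy.MARK_CANCELED:
            statuses[order_id] = "canceled"
        else:
            # Left open in case the status response itself was incomplete.
            statuses[order_id] = "open"
    return statuses


def on_cancel_reject_unknown_order(statuses, order_id):
    """A manual cancel was rejected because the order does not exist at the
    exchange, so the order is now marked as canceled."""
    updated = dict(statuses)
    updated[order_id] = "canceled"
    return updated


local_open = {"A", "B", "C"}   # open before the interruption
exchange_open = {"A"}          # still open per the status response

kept = reconcile(local_open, exchange_open, MissingOrderPolicy.KEEP_OPEN)
canceled = reconcile(local_open, exchange_open, MissingOrderPolicy.MARK_CANCELED)
after_manual_cancel = on_cancel_reject_unknown_order(kept, "B")

print(kept)
print(canceled)
print(after_manual_cancel)
```

The trade-off is the one stated above: keeping missing orders open requires a manual cancel to clear them, while marking them canceled automatically trusts that the status response is complete.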

The above is the only other information we need to provide.



Why did we not make a decision to failover to the backup server? In general, we do not like to do that, to avoid any unforeseen consequences. For example, the backup server is not actively used for order routing. While it does have dual connections to the CME that we use for market data, we have not recently used it for order routing. Perhaps we could have had some difficulty establishing the connection to the CME due to static route configuration issues (although unlikely). So we would rather only failover as a last resort. But we will make a decision to failover if we are looking at an interruption that is going to be longer than 10 minutes.

Once our second server in the Aurora DC is up and ready to be used, in about a week, it will be actively used for order routing and will handle half of the order flow. So we will have two servers in the Aurora DC, either one of which can serve as a backup to the other. And we will know that either one can gracefully and quickly failover without any issues, because both are used for active order routing.
Date Time Of Last Edit: 2022-04-26 12:31:51
