Support Board
Date/Time: Sat, 23 Nov 2024 12:03:32 +0000
[User Discussion] - Python for Sierra Chart
View Count: 38855
[2024-02-13 16:44:01] |
User150671 - Posts: 69 |
If you don't need the convenience of a dataframe, you can save the extra overhead by just reading into and writing from a np.array.

Example: read a scid file to a np.array:

```python
import sys
from pathlib import Path
import numpy as np

def get_scid_np(scidFile, limitsize=sys.maxsize):
    # Check path exists
    f = Path(scidFile)
    assert f.exists(), "SCID file not found"
    # Set the offset from limitsize
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)
    # Set up the SCID dtype
    sciddtype = np.dtype([
        ('Time', '<u8'),
        ('Open', '<f4'),
        ('High', '<f4'),
        ('Low', '<f4'),
        ('Close', '<f4'),
        ('Trades', '<i4'),
        ('Volume', '<i4'),
        ('BidVolume', '<i4'),
        ('AskVolume', '<i4')
    ])
    scid = np.fromfile(scidFile, dtype=sciddtype, offset=offset)
    return scid
```

Write a np.array to a scid file:

```python
def write_scid(scidarray, outfile, mode='w'):
    # Write the 56-byte header in write mode
    if mode == 'w':
        with open(outfile, 'wb') as fscid:
            header = b'SCID8\x00\x00\x00(\x00\x00\x00\x01\x00'
            fscid.write(header)
            for i in range(21):
                fscid.write(b'\x00\x00')
            fscid.write(scidarray.tobytes())
    # Otherwise append the scid array
    else:
        with open(outfile, 'ab') as fscid:
            fscid.write(scidarray.tobytes())
    return
```

Also, to whoever wrote this piece of code originally, which I see is still being used above:

```python
BCOLS = ['datetime', 'open', 'high', 'low', 'close', 'volume', 'trades', 'bidVolume', 'askVolume']
rectype = np.dtype([
    (BCOLS[0], '<u8'),
    (BCOLS[1], '<f4'),
    (BCOLS[2], '<f4'),
    (BCOLS[3], '<f4'),
    (BCOLS[4], '<f4'),
    (BCOLS[6], '<i4'),
    (BCOLS[5], '<i4'),
    (BCOLS[7], '<i4'),
    (BCOLS[8], '<i4')
])
```

I didn't notice what you had done and just took your BCOLS and rectype and made it one variable... because why define something in order to then use it to define another variable? Problem is, the BCOLS order DOES NOT MATCH THE ORDER THEY ARE DEFINED IN THE DTYPE (volume and trades are switched around). Took me like 40 minutes to figure out how I broke my scid file generation. Well played, sir.

You can define the np.dtype in one variable as follows:

```python
sciddtype = np.dtype([
    ('Time', '<u8'),
    ('Open', '<f4'),
    ('High', '<f4'),
    ('Low', '<f4'),
    ('Close', '<f4'),
    ('Trades', '<i4'),
    ('Volume', '<i4'),
    ('BidVolume', '<i4'),
    ('AskVolume', '<i4')
])
```

Date Time Of Last Edit: 2024-02-13 16:47:25
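A quick sanity check (my addition, not from the original posts) can make this kind of field-ordering mistake fail fast: structured dtype fields are purely positional, so asserting the record size and the field order catches a silent swap before it corrupts a file.

```python
import numpy as np

# SCID record layout as used in this thread:
# u8 timestamp + 4 x f4 OHLC + 4 x i4 counts = 40 bytes per record.
sciddtype = np.dtype([
    ('Time', '<u8'), ('Open', '<f4'), ('High', '<f4'), ('Low', '<f4'),
    ('Close', '<f4'), ('Trades', '<i4'), ('Volume', '<i4'),
    ('BidVolume', '<i4'), ('AskVolume', '<i4'),
])

# Each record must be exactly 40 bytes, matching the header/offset arithmetic.
assert sciddtype.itemsize == 40

# Field names are positional: Trades must come before Volume, or the two
# int32 columns silently swap meaning when reading/writing files.
assert sciddtype.names.index('Trades') < sciddtype.names.index('Volume')
```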
|
[2024-02-13 19:30:19] |
rudy5 - Posts: 5 |
Write a np.array to a scid file:
Nice, thanks. I've been using this to plot my custom stuff. https://github.com/louisnw01/lightweight-charts-python/tree/main
|
[2024-02-19 22:08:45] |
Deiter - Posts: 3 |
Many thanks to everyone who has contributed to this thread. I have borrowed from your great work and added some Polars (rather than Pandas) work to further my own work with Sierra Chart data. I present this work here (it's an extract of a Jupyter Notebook that I am running in VS Code), in the hope of repaying some of the goodness y'all have provided.

```python
# Purpose: Getting Sierra Chart .scid file information into a Polars dataframe
# for analysis and backtesting.
# Approach:
# - Use Polars, rather than Pandas, to reap increased speed and large-file
#   handling benefits.
# - Use Numpy to extract binary data from the scid file and cast it into
#   appropriate data types in a Polars dataframe.
# - Handling of the SCDateTime data is the hardest part, as it cleverly
#   contains both timestamp and trade count information in a single 64-bit
#   integer, so I use some string methods and regex to extract the desired
#   information.
# Outcome Notes:
# - My final dataframe (df_extract) requires further work to get the timestamp
#   info (in column "SCDateTime_ms", signifying that it is an integer of
#   milliseconds) into a Python datatype for time analysis. I also need to
#   account for the cases where the trade counter information maxes out at 999
#   (see the Reddit discussion referenced below for details).
# - According to the Jupyter Notebook I ran this in, getting to my final
#   dataframe (thus far) took 4.4 seconds for a 41-million-row ES futures
#   .scid file.
# - I'm not interested in comparing different approaches to see speed
#   differences (there are many articles online about the speed of vectorized
#   Numpy operations vs loops and Polars vs Pandas), so I don't plan to run a
#   bunch of speed tests. 41M rows in 4.4s is good enough for me right now.
# References:
# - Sierra Chart Support Board thread: Python for Sierra Chart
# - Reddit post on reading scid data:
#   https://www.reddit.com/r/algotrading/comments/169ax2x/the_hello_world_guide_to_reading_execution_data/
# - Awesome Sierra Chart documentation: Intraday Data File Format
# - The Polars website: https://pola.rs/

import sys
from pathlib import Path

import numpy as np
import polars as pl


# The following function is borrowed, with one small change to the data type
# definition, from the Sierra Chart Support Board thread referenced above.
def get_scid_np(scidFile, limitsize=sys.maxsize):
    # Check path exists
    f = Path(scidFile)
    assert f.exists(), "SCID file not found"
    # Set the offset from limitsize
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)
    # Set up the SCID dtype
    sciddtype = np.dtype(
        [
            ("SCDateTime", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("NumTrades", "<u4"),
            ("TotalVolume", "<u4"),
            ("BidVolume", "<u4"),
            ("AskVolume", "<u4"),
        ]
    )
    scid_as_np_array = np.fromfile(scidFile, dtype=sciddtype, offset=offset)
    return scid_as_np_array


intermediate_np_array = get_scid_np("ESH24_FUT_CME.scid")
df_raw = pl.DataFrame(intermediate_np_array)

# Now convert the SCDateTime 64-bit integer into a string.
df_SCDateTime_as_string = df_raw.with_columns(
    pl.col("SCDateTime").cast(pl.Utf8).alias("SCDateTime_string"),
)

# Trickiest part thus far. Use regex to extract the timestamp info and trade
# counter info and cast them back into integers for further analysis.
df_extract = df_SCDateTime_as_string.with_columns(
    pl.col("SCDateTime_string")
    .str.extract(r"^.{13}", 0)
    .cast(pl.Int64)
    .alias("SCDateTime_ms"),
    pl.col("SCDateTime_string")
    .str.extract(r".{3}$", 0)
    .cast(pl.Int32)
    .alias("SCDateTime_trade_counter"),
)
```

Date Time Of Last Edit: 2024-02-19 22:19:11
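As an aside (my own sketch, not from the post above): the same millisecond/counter split can be done with plain integer arithmetic instead of strings and regex, assuming the low three decimal digits of the microsecond SCDateTime hold the per-millisecond counter. In Polars this maps to `//` and `%` on the integer column; in pure Python it is one `divmod`.

```python
# Hypothetical raw SCDateTime value (microseconds since 1899-12-30):
# the leading digits are milliseconds, the last three are the trade counter.
scdatetime = 3_476_563_241_123_456

# Split into milliseconds and per-millisecond trade counter.
ms, counter = divmod(scdatetime, 1000)

assert ms == 3_476_563_241_123
assert counter == 456
```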
|
[2024-02-19 23:15:07] |
User150671 - Posts: 69 |
Handling of the SCDateTime data is the hardest part, as it cleverly contains both timestamp and trade count information in a single 64 bit integer, so I use some string methods and regex to extract the desired information.

I was not tracking that it contains any trade count info whatsoever. Ref: Intraday Data File Format: DateTime

I have the following functions for converting Sierra time integers:

```python
import datetime as dt
import pandas as pd

# Takes a Sierra time and returns it as a dt.datetime
def unSierraTime(timeint):
    exceldate = -2209161600.0
    timeint = timeint / 1000000
    delta = timeint + exceldate
    delta = dt.datetime.fromtimestamp(delta).replace(tzinfo=dt.timezone.utc)
    return delta

# Takes a Pandas Timestamp or a datetime index and turns it into a Sierra time integer
def sierraTime(dtg):
    excelDate = (pd.to_datetime('1899-12-30')).tz_localize(tz="utc")
    delta = dtg - excelDate
    if type(dtg) == pd._libs.tslibs.timestamps.Timestamp:
        delta = int(delta.total_seconds() * 1000000)
    else:
        delta = (delta.total_seconds() * 1000000).astype('int64')
    return delta
```

|
[2024-02-20 16:54:53] |
Deiter - Posts: 3 |
Yeah, I was surprised as well, but I think this is the case, based on three things:

- The page in the SC docs you linked has this section: "When writing tick by tick data to an Intraday data file, where each tick is a single record in the file, and multiple ticks/trades have the same timestamp, then it is acceptable for the Date-Time of the record to be the same timestamp as prior records. What Sierra Chart itself does in this case, is increment the microsecond portion of the Date-Time to create unique timestamps for each trade within a millisecond."
- The Reddit article that first got me thinking about this: https://www.reddit.com/r/algotrading/comments/169ax2x/the_hello_world_guide_to_reading_execution_data/
- And finally, my own read of the .scid file. I have attached a screenshot of the raw SCDateTime data that I imported into my dataframe, showing two runs of tick data where the least significant digit/byte of the data is incrementing by one, while the rest of the SCDateTime is not changing.

|
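That incrementing least-significant-digit pattern can be checked directly in code (my own sketch, using hypothetical values, not the attached data): within a run of ticks sharing one millisecond, consecutive raw SCDateTime values should differ by exactly one microsecond.

```python
import numpy as np

# Hypothetical raw SCDateTime values for four ticks in the same millisecond:
# identical leading (millisecond) digits, counter incrementing in the last 3.
raw = np.array([3926563241123000, 3926563241123001,
                3926563241123002, 3926563241123003], dtype=np.uint64)

# All four ticks collapse to a single millisecond timestamp...
assert len(np.unique(raw // 1000)) == 1

# ...and the raw values step by exactly one microsecond each.
assert (np.diff(raw.astype(np.int64)) == 1).all()
```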
ES tick by tick SCDateTime extract.png / V - Attached On 2024-02-20 16:53:46 UTC - Size: 109.09 KB - 174 views |
[2024-02-20 23:03:17] |
sgne - Posts: 104 |
One thing I noticed is that if you use a custom chart bar function loaded on a chart, then the extracted datetime follows the time zone of the chart, but when you do a direct extract from the scid file, then the time is GMT/UTC.

The C++ commands to use in a {SCCustomChartBarInterfaceRef ChartBarInterface} function could be these:

```cpp
SCDateTime tTime = ChartBarInterface.NewFileRecord.DateTime;
long long tickTime = reinterpret_cast<long long&>(tTime);
tickTime /= 1000; // to get rid of microseconds, which are trade numbers
```

tTime can be used to get all sorts of data out of the SCDateTime constant reference, e.g.:

```cpp
double xlTime = tTime.GetAsDouble();
```

And this is all easily done in AHK as well, with commands like this:

```autohotkey
hdr := 56, numBytes := 40
scDateTimePosition := 0
scPricePosition := 12

; scDateTime position (int64) - 0 bytes in
fo.Pos := hdr + (A_Index-1)*numBytes + scDateTimePosition
scDateTime := fo.ReadInt64() // 1000  ; get rid of trade number
dtetme := 18991230000000
dtetme += scDateTime // 1000, seconds
dte := SubStr(dtetme, 1, 8)

; scPrice position (float) - 12 bytes in
fo.Pos := hdr + (A_Index-1)*numBytes + scPricePosition
floatPrice := fo.ReadFloat()
```

|
[2024-03-16 20:03:08] |
User726340 - Posts: 30 |
Updated the original func for users interested in a pandas df:

```python
import sys
from pathlib import Path

import numpy as np
import pandas as pd


def get_scid_df(filename, limitsize=sys.maxsize):
    f = Path(filename)
    assert f.exists(), f"{f} file not found"
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)
    sciddtype = np.dtype(
        [
            ("Time", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("Trades", "<i4"),
            ("Volume", "<i4"),
            ("BidVolume", "<i4"),
            ("AskVolume", "<i4"),
        ]
    )
    df = pd.DataFrame(
        data=np.memmap(f, dtype=sciddtype, offset=offset, mode="r"), copy=False
    )
    df.dropna(inplace=True)
    df["Time"] = df["Time"] - 2209161600000000
    df.drop(df[(df.Time < 1)].index, inplace=True)
    df.set_index("Time", inplace=True)
    df.index = pd.to_datetime(df.index, unit="us")
    df.index = df.index.tz_localize(tz="utc")
    return df


def resample_scdf(scdf_or_filename, period="1Min", tz="UTC", limitsize=sys.maxsize):
    df = (
        get_scid_df(scdf_or_filename, limitsize)
        if isinstance(scdf_or_filename, str)
        else scdf_or_filename
    )
    assert isinstance(
        df, pd.DataFrame
    ), f"{scdf_or_filename} has to be SC df or valid scid file"
    df = df.resample(period).agg(
        {
            "Open": "first",
            "High": "max",
            "Low": "min",
            "Close": "last",
            "Trades": "sum",
            "Volume": "sum",
            "BidVolume": "sum",
            "AskVolume": "sum",
        }
    )
    if tz != "UTC":
        tz = "America/New_York" if tz == "EST" else tz
        tz = "America/Los_Angeles" if tz == "PST" else tz
        df.index = df.index.tz_convert(tz)
    return df
```

To get a scid 1Min df in EST:

```python
print(resample_scdf("ESH24.scid", tz="EST"))
```

To get a scid 30Min df in EST:

```python
print(resample_scdf("ESH24.scid", "30Min", tz="EST"))
```

Date Time Of Last Edit: 2024-03-17 19:36:01
|
[2024-05-01 00:24:46] |
User656492 - Posts: 143 |
This is fun stuff. Thank you all for sharing! I just tried the code above for NQM4 and got this result. I think something's amiss with the time format, which causes Open to be a strange value:

```python
print(resample_scdf("NQM4.CME.scid", tz="EST"))
```

```text
                                   Open      High       Low     Close  Trades
Time
2024-02-25 19:01:00-05:00  0.000000e+00  18223.00  18221.25  18221.25       1
2024-02-25 19:02:00-05:00 -1.999001e+35  18207.75  18202.75  18207.75       2
2024-02-25 19:03:00-05:00  0.000000e+00  18208.50  18202.50  18208.75       4
2024-02-25 19:04:00-05:00  0.000000e+00  18209.00  18200.25  18206.25      10
2024-02-25 19:05:00-05:00  0.000000e+00  18207.00  18199.00  18206.00       6
```

Also, because maybe it helps, df.describe():

```text
               Open          High           Low         Close        Trades        Volume     BidVolume     AskVolume
count  5.584600e+04  55846.000000  55846.000000  55846.000000  93144.000000  93144.000000  93144.000000  93144.000000
mean           -inf  18176.611328  18170.449219  18173.523438    235.347591    256.042236    128.308200    127.734035
std             inf    360.248291    361.233368    360.750702    571.431255    625.932788    316.699976    313.838125
min   -1.999001e+35  17130.500000  17113.750000  17121.250000      0.000000      0.000000      0.000000      0.000000
25%    0.000000e+00  17912.500000  17908.750000  17910.750000      0.000000      0.000000      0.000000      0.000000
50%    0.000000e+00  18278.250000  18271.750000  18275.000000     21.000000     22.000000     10.000000     10.000000
75%    0.000000e+00  18463.500000  18458.750000  18461.250000    141.000000    154.000000     77.000000     77.000000
max    0.000000e+00  18708.500000  18702.750000  18706.000000   9990.000000  11664.000000   6341.000000   5972.000000
```

Date Time Of Last Edit: 2024-05-01 01:28:22
|
[2024-05-01 00:59:48] |
rudy5 - Posts: 5 |
Can you please edit that to put your code output into a code block?
|
[2024-05-01 01:15:46] |
User656492 - Posts: 143 |
I don't understand the suggestion. ...?
|
[2024-05-01 02:36:20] |
rudy5 - Posts: 5 |
You did it thanks 👌🏼
|
[2024-05-05 15:10:40] |
User656492 - Posts: 143 |
<BUMP> I'm using code from above:

```python
import sys
from pathlib import Path

import numpy as np
import pandas as pd


def get_scid_df(filename, limitsize=sys.maxsize):
    f = Path('/Users/Me/SierraPython/NQM4.CME.scid')
    assert f.exists(), f"{f} file not found"
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)
    sciddtype = np.dtype(
        [
            ("Time", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("Trades", "<i4"),
            ("Volume", "<i4"),
            ("BidVolume", "<i4"),
            ("AskVolume", "<i4"),
        ]
    )
    df = pd.DataFrame(
        data=np.memmap(f, dtype=sciddtype, offset=offset, mode="r"), copy=False
    )
    df.dropna(inplace=True)
    df["Time"] = df["Time"] - 2209161600000000
    df.drop(df[(df.Time < 1)].index, inplace=True)
    df.set_index("Time", inplace=True)
    df.index = pd.to_datetime(df.index, unit="us")
    df.index = df.index.tz_localize(tz="utc")
    return df


def resample_scdf(scdf_or_filename, period="1Min", tz="UTC", limitsize=sys.maxsize):
    df = (
        get_scid_df(scdf_or_filename, limitsize)
        if isinstance(scdf_or_filename, str)
        else scdf_or_filename
    )
    assert isinstance(
        df, pd.DataFrame
    ), f"{scdf_or_filename} has to be SC df or valid scid file"
    df = df.resample(period).agg(
        {
            "Open": "first",
            "High": "max",
            "Low": "min",
            "Close": "last",
            "Trades": "sum",
            "Volume": "sum",
            "BidVolume": "sum",
            "AskVolume": "sum",
        }
    )
    if tz != "UTC":
        tz = "America/New_York" if tz == "EST" else tz
        tz = "America/Los_Angeles" if tz == "PST" else tz
        df.index = df.index.tz_convert(tz)
    return df


print(resample_scdf("NQM4.CME.scid", tz="EST"))
```

And I get this:

```text
                                   Open       High        Low      Close
Time
2024-02-25 19:01:00-05:00  0.000000e+00  1822300.0  1822125.0  1822125.0
2024-02-25 19:02:00-05:00 -1.999001e+37  1820775.0  1820275.0  1820775.0
2024-02-25 19:03:00-05:00  0.000000e+00  1820850.0  1820250.0  1820875.0
2024-02-25 19:04:00-05:00  0.000000e+00  1820900.0  1820025.0  1820625.0
2024-02-25 19:05:00-05:00  0.000000e+00  1820700.0  1819900.0  1820600.0
...                                 ...        ...        ...        ...
2024-04-30 12:20:00-04:00  0.000000e+00  1777300.0  1776650.0  1777175.0
2024-04-30 12:21:00-04:00  0.000000e+00  1777625.0  1776950.0  1777000.0
2024-04-30 12:22:00-04:00  0.000000e+00  1777350.0  1776950.0  1777150.0
2024-04-30 12:23:00-04:00 -1.999001e+37  1777575.0  1776900.0  1777175.0
2024-04-30 12:24:00-04:00 -1.999001e+37  1777350.0  1777125.0  1777175.0

                           Trades  Volume  BidVolume  AskVolume
Time
2024-02-25 19:01:00-05:00       1       1          1          0
2024-02-25 19:02:00-05:00       2       2          0          2
2024-02-25 19:03:00-05:00       4       4          0          4
2024-02-25 19:04:00-05:00      10      10          0         10
2024-02-25 19:05:00-05:00       6       6          2          4
...                           ...     ...        ...        ...
2024-04-30 12:20:00-04:00     869     959        376        583
2024-04-30 12:21:00-04:00     861    1023        430        593
2024-04-30 12:22:00-04:00     462     518        225        293
2024-04-30 12:23:00-04:00     522     548        279        269
2024-04-30 12:24:00-04:00      58      58         18         40

[93144 rows x 8 columns]
```

As you can see, the Open field is fubar. I think the timestamp is being parsed incorrectly but have no idea how to address it. Any thoughts?

|
[2024-05-06 18:47:20] |
sgne - Posts: 104 |
I think O is that way because the SCID file is 1-tick, instead of 1-second or greater. In 1-tick files, O is zero, and H and L are bid/ask values. Good luck.
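Given that explanation, one possible workaround (my own sketch, not from the thread) when resampling tick-by-tick files is to build OHLC bars from the trade price column (Close) alone and ignore the per-tick Open/High/Low fields, since for 1-tick files Open is zero and High/Low hold bid/ask values:

```python
import pandas as pd

# Hypothetical tick data: only Close (the trade price) and Volume are used,
# mirroring what a 1-tick .scid file reliably provides per trade.
ticks = pd.DataFrame(
    {"Close": [100.0, 100.5, 100.25, 101.0],
     "Volume": [1, 2, 1, 3]},
    index=pd.to_datetime(
        ["2024-05-01 09:30:10", "2024-05-01 09:30:40",
         "2024-05-01 09:31:05", "2024-05-01 09:31:50"], utc=True),
)

# Build 1-minute OHLC bars from the trade price alone.
bars = ticks["Close"].resample("1Min").ohlc()
bars["Volume"] = ticks["Volume"].resample("1Min").sum()

assert bars.iloc[0]["open"] == 100.0   # first trade of the 09:30 bar
assert bars.iloc[1]["high"] == 101.0   # highest trade of the 09:31 bar
assert list(bars["Volume"]) == [3, 4]
```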
|
[2024-05-07 21:48:56] |
User150671 - Posts: 69 |
This is explained in the documentation here: Intraday Data File Format |
[2024-08-17 16:42:00] |
User150671 - Posts: 69 |
Okay. So I fucked up my code to convert time to and from SCID files. I just noticed today as I was trying to do some backtesting on things in different timezones, and realised it was always converting from SCID to NYC time.

Fixed code:

```python
import datetime as dt
import numpy as np

# Takes a scid time and returns it as a dt.datetime
def un_sierra_time(scidtime):
    sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
    dtg = sierra_epoch + dt.timedelta(microseconds=int(scidtime))
    return dtg

# Takes a dt.datetime object and returns it as a np.uint64 for working with scid files
def re_sierra_time(dtg):
    sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
    delta = dtg - sierra_epoch
    scidtime = np.uint64(delta.total_seconds() * 1000000)
    return scidtime
```

Explanation: as per the Intraday File Format at Intraday Data File Format: DateTime, the SCID DateTime is "...a 64-bit integer. It represents the number of microseconds since the SCDateTime epoch of December 30, 1899."

un_sierra_time is fairly straightforward: we take the Sierra epoch date and add the value of the SCID DateTime to it as a dt.timedelta, which gives us a datetime object.

re_sierra_time is also fairly straightforward, we just do the opposite: we subtract the Sierra epoch from the datetime to get a datetime.timedelta object. That delta object stores total days, seconds, and microseconds, and if you call total_seconds() on it you'll get a float with the microseconds as the fraction. We can then multiply that float by 1000000 and cast it to an int in order to store in or compare with an SCID time value (in this case I use a numpy uint64, as that's the format I am working with SCIDs in).

Apologies if anyone stole my code and was fucked around by it!

|
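A quick round-trip check (my addition, not from the post) confirms the two fixed functions invert each other for a UTC timestamp; the microsecond component is left at zero here to stay safely within float precision in total_seconds().

```python
import datetime as dt
import numpy as np

# Same epoch and conversions as the fixed functions above.
def un_sierra_time(scidtime):
    sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
    return sierra_epoch + dt.timedelta(microseconds=int(scidtime))

def re_sierra_time(dtg):
    sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
    delta = dtg - sierra_epoch
    return np.uint64(delta.total_seconds() * 1000000)

# Round trip: a UTC datetime should survive datetime -> uint64 -> datetime.
original = dt.datetime(2024, 2, 13, 16, 44, 1, tzinfo=dt.timezone.utc)
assert un_sierra_time(re_sierra_time(original)) == original
```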