Support Board
Date/Time: Sat, 23 Nov 2024 15:21:46 +0000
Post From: Python for Sierra Chart
[2024-02-19 22:08:45] |
Deiter - Posts: 3 |
Many thanks to everyone who has contributed to this thread. I have borrowed from your great work and added some Polars (rather than Pandas) work to further my own work with Sierra Chart data. I present this work here (it's an extract of a Jupyter Notebook that I am running in VS Code), in the hope of repaying some of the goodness y'all have provided. # Purpose: Getting Sierra Chart .scid file information into a Polars dataframe for analysis and backtesting # Approach: # - Use Polars, rather than Pandas, to reap increased speed and large-file handling benefits. # - Use Numpy to extract binary data from scid file and cast into appropriate data types in a Polars dataframe. # - Handling of the SCDateTime data is the hardest part, as it cleverly countains both timestamp and trade count information in a single 64 bit integer, so I use some string methods and regex to extract the desired information. # Outcome Notes: # - My final dataframe (df_extract) requires further work to get the timestamp info (in column "SCDateTime_ms", signifying that it is an integer of milliseconds), into a Python datatype for time analysis. I also need to account for the cases where the trade counter information maxes out at 999 (see the Reddit discussion referenced below for details). # - According to the Jupyter Notebook I ran this in, getting to my final dataframe (thus far) took 4.4 seconds for a 41 million row ES futures .scid file. # - I'm not interesting in comparing different approaches to see speed differences (there are many articles online about the speed of vectorized Numpy operations vs loops and Polars vs Pandas), so I don't plan to run a bunch of speed tests. 41M rows in 4.4s is good enough for me right now. # References: # - Sierra Chart Support Board thread Python for Sierra Chart # - Reddit post on reading scid data https://www.reddit.com/r/algotrading/comments/169ax2x/the_hello_world_guide_to_reading_execution_data/ # - Awesome Sierra Chart documentation Intraday Data File Format # - The Polars website https://pola.rs/ import sys
from pathlib import Path import numpy as np import polars as pl # The following section is borrowed, with one small change to the data type definition, from the Sierra Chart Support Board referenced above. def get_scid_np(scidFile, limitsize=sys.maxsize):
# Check Path Exists f = Path(scidFile) assert f.exists(), "SCID file not found" # set the offset from limitsize stat = f.stat() offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40) ### Setup the SCID dtype sciddtype = np.dtype( [ ("SCDateTime", "<u8"), ("Open", "<f4"), ("High", "<f4"), ("Low", "<f4"), ("Close", "<f4"), ("NumTrades", "<u4"), ("TotalVolume", "<u4"), ("BidVolume", "<u4"), ("AskVolume", "<u4"), ] ) scid_as_np_array = np.fromfile(scidFile, dtype=sciddtype, offset=offset) return scid_as_np_array intermediate_np_array = get_scid_np("ESH24_FUT_CME.scid")
df_raw = pl.DataFrame(intermediate_np_array)
# Now convert the SCDateTime 64 bit integer into a string. df_SCDateTime_as_string = df_raw.with_columns(
pl.col("SCDateTime").cast(pl.Utf8).alias("SCDateTime_string"), ) # Trickiest part thus far. Use regex to extract the timestamp info and trade counter info and cast them back into integers for further analysis. df_extract = df_SCDateTime_as_string.with_columns( pl.col("SCDateTime_string") .str.extract(r"^.{13}", 0) .cast(pl.Int64) .alias("SCDateTime_ms"), pl.col("SCDateTime_string") .str.extract(r".{3}$", 0) .cast(pl.Int32) .alias("SCDateTime_trade_counter"), ) Date Time Of Last Edit: 2024-02-19 22:19:11
|