Support Board


Date/Time: Sat, 23 Nov 2024 12:03:32 +0000



[User Discussion] - Python for Sierra Chart

View Count: 38855

[2024-02-13 16:44:01]
User150671 - Posts: 69
If you don't need the convenience of a dataframe, you can save the extra overhead by just reading into and writing from a np.array.
Example:

read a scid file to a np.array:

import sys
from pathlib import Path

import numpy as np

def get_scid_np(scidFile, limitsize=sys.maxsize):
  # Check the path exists
  f = Path(scidFile)
  assert f.exists(), "SCID file not found"
  # set the offset from limitsize
  stat = f.stat()
  offset = 56 if stat.st_size < limitsize else stat.st_size - (
    (limitsize // 40) * 40)

  ### Setup the SCID dtype
  sciddtype = np.dtype([
    ('Time', '<u8'),
    ('Open', '<f4'),
    ('High', '<f4'),
    ('Low', '<f4'),
    ('Close', '<f4'),
    ('Trades','<i4'),
    ('Volume', '<i4'),
    ('BidVolume', '<i4'),
    ('AskVolume', '<i4')
  ])
  scid = np.fromfile(scidFile, dtype=sciddtype, offset=offset)
  return scid
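To turn the raw Time column into datetimes without a dataframe, the epoch shift can be done directly in NumPy. A sketch with a made-up timestamp; 2,209,161,600 seconds separate the SCID epoch (1899-12-30) from the Unix epoch:

```python
import numpy as np

# Microseconds between the SCID epoch (1899-12-30) and the Unix epoch (1970-01-01)
SCID_EPOCH_OFFSET_US = 2209161600000000

t = np.array([3913228800000000], dtype='<u8')  # hypothetical raw Time value
# Shift to the Unix epoch and reinterpret as datetime64 microseconds
dts = (t.astype('int64') - SCID_EPOCH_OFFSET_US).astype('datetime64[us]')
```

The same subtraction shows up later in this thread when loading into pandas.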



Write a np.array to a scid file:

def write_scid(scidarray, outfile, mode='w'):
  # Write the 56-byte header first when in write mode
  if mode == 'w':
    with open(outfile, 'wb') as fscid:
      header = b'SCID8\x00\x00\x00(\x00\x00\x00\x01\x00'
      fscid.write(header)
      for i in range(21):
        fscid.write(b'\x00\x00')
      fscid.write(scidarray.tobytes())

  # Otherwise append the scid array
  else:
    with open(outfile, 'ab') as fscid:
      fscid.write(scidarray.tobytes())

  return
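As a sanity check on the 56-byte header and 40-byte record layout, here is a round-trip sketch against a temp file, with the reader/writer logic restated inline and purely synthetic data:

```python
import os
import tempfile

import numpy as np

sciddtype = np.dtype([
  ('Time', '<u8'), ('Open', '<f4'), ('High', '<f4'), ('Low', '<f4'),
  ('Close', '<f4'), ('Trades', '<i4'), ('Volume', '<i4'),
  ('BidVolume', '<i4'), ('AskVolume', '<i4')
])

# One hypothetical record: 2024-01-01 00:00:00 UTC in SCID microseconds
rec = np.array([(3913228800000000, 0.0, 4750.25, 4749.75, 4750.0, 3, 5, 2, 3)],
               dtype=sciddtype)

path = os.path.join(tempfile.mkdtemp(), 'roundtrip.scid')

# Write: 14 header bytes + 42 zero bytes = 56-byte header, then the records
with open(path, 'wb') as fscid:
  fscid.write(b'SCID8\x00\x00\x00(\x00\x00\x00\x01\x00')
  fscid.write(b'\x00' * 42)
  fscid.write(rec.tobytes())

# Read back, skipping the 56-byte header
back = np.fromfile(path, dtype=sciddtype, offset=56)
```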

Also, to whoever originally wrote this piece of code, which I see is still being used above:
BCOLS = ['datetime', 'open', 'high', 'low', 'close', 'volume', 'trades', 'bidVolume', 'askVolume']

rectype = np.dtype([
  (BCOLS[0], '<u8'), (BCOLS[1], '<f4'), (BCOLS[2], '<f4'),
  (BCOLS[3], '<f4'), (BCOLS[4], '<f4'), (BCOLS[6], '<i4'),
  (BCOLS[5], '<i4'), (BCOLS[7], '<i4'), (BCOLS[8], '<i4')
])


I didn't notice what you had done and just took your BCOLS and rectype and merged them into one variable... because why define something only to use it to define another variable?
Problem is, the BCOLS order DOES NOT MATCH THE ORDER THEY ARE DEFINED IN THE DTYPE (volume and trades are switched around). Took me about 40 minutes to figure out how I had broken my scid file generation. Well played, sir.

You can define the np.dtype in one variable as follows:

sciddtype = np.dtype([
('Time', '<u8'),
('Open', '<f4'),
('High', '<f4'),
('Low', '<f4'),
('Close', '<f4'),
('Trades','<i4'),
('Volume', '<i4'),
('BidVolume', '<i4'),
('AskVolume', '<i4')
])
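A quick check that this dtype really matches the 40-byte record size assumed by the offset arithmetic earlier in the thread:

```python
import numpy as np

sciddtype = np.dtype([
  ('Time', '<u8'), ('Open', '<f4'), ('High', '<f4'), ('Low', '<f4'),
  ('Close', '<f4'), ('Trades', '<i4'), ('Volume', '<i4'),
  ('BidVolume', '<i4'), ('AskVolume', '<i4')
])

print(sciddtype.itemsize)  # bytes per record; should be 40 (8 + 8*4)
```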

Date Time Of Last Edit: 2024-02-13 16:47:25
[2024-02-13 19:30:19]
rudy5 - Posts: 5
Write a np.array to a scid file:
Nice thanks. I've been using this to plot my custom stuff. https://github.com/louisnw01/lightweight-charts-python/tree/main
[2024-02-19 22:08:45]
Deiter - Posts: 3
Many thanks to everyone who has contributed to this thread. I have borrowed from your great work and added some Polars (rather than Pandas) work to further my own work with Sierra Chart data. I present this work here (it's an extract of a Jupyter Notebook that I am running in VS Code), in the hope of repaying some of the goodness y'all have provided.

# Purpose: Getting Sierra Chart .scid file information into a Polars dataframe for analysis and backtesting

# Approach:
# - Use Polars, rather than Pandas, to reap increased speed and large-file handling benefits.
# - Use Numpy to extract binary data from scid file and cast into appropriate data types in a Polars dataframe.
# - Handling the SCDateTime data is the hardest part: it cleverly contains both timestamp and trade count information in a single 64-bit integer, so I use some string methods and regex to extract the desired information.

# Outcome Notes:
# - My final dataframe (df_extract) requires further work to get the timestamp info (in column "SCDateTime_ms", signifying that it is an integer of milliseconds), into a Python datatype for time analysis. I also need to account for the cases where the trade counter information maxes out at 999 (see the Reddit discussion referenced below for details).
# - According to the Jupyter Notebook I ran this in, getting to my final dataframe (thus far) took 4.4 seconds for a 41 million row ES futures .scid file.
# - I'm not interested in comparing different approaches to see speed differences (there are many articles online about the speed of vectorized Numpy operations vs loops and Polars vs Pandas), so I don't plan to run a bunch of speed tests. 41M rows in 4.4s is good enough for me right now.

# References:
# - Sierra Chart Support Board thread Python for Sierra Chart
# - Reddit post on reading scid data https://www.reddit.com/r/algotrading/comments/169ax2x/the_hello_world_guide_to_reading_execution_data/
# - Awesome Sierra Chart documentation Intraday Data File Format
# - The Polars website https://pola.rs/

import sys
from pathlib import Path

import numpy as np
import polars as pl


# The following section is borrowed, with one small change to the data type definition, from the Sierra Chart Support Board referenced above.


def get_scid_np(scidFile, limitsize=sys.maxsize):
    # Check path exists
    f = Path(scidFile)
    assert f.exists(), "SCID file not found"

    # Set the offset from limitsize
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)

    # Set up the SCID dtype
    sciddtype = np.dtype(
        [
            ("SCDateTime", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("NumTrades", "<u4"),
            ("TotalVolume", "<u4"),
            ("BidVolume", "<u4"),
            ("AskVolume", "<u4"),
        ]
    )

    scid_as_np_array = np.fromfile(scidFile, dtype=sciddtype, offset=offset)
    return scid_as_np_array

intermediate_np_array = get_scid_np("ESH24_FUT_CME.scid")

df_raw = pl.DataFrame(intermediate_np_array)


# Now convert the SCDateTime 64 bit integer into a string.
df_SCDateTime_as_string = df_raw.with_columns(
    pl.col("SCDateTime").cast(pl.Utf8).alias("SCDateTime_string"),
)


# Trickiest part thus far. Use regex to extract the timestamp info and trade counter info and cast them back into integers for further analysis.

df_extract = df_SCDateTime_as_string.with_columns(
    pl.col("SCDateTime_string")
    .str.extract(r"^.{13}", 0)
    .cast(pl.Int64)
    .alias("SCDateTime_ms"),
    pl.col("SCDateTime_string")
    .str.extract(r".{3}$", 0)
    .cast(pl.Int32)
    .alias("SCDateTime_trade_counter"),
)
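The regex split assumes a fixed number of decimal digits in the stringified value. Since the counter lives in the low three decimal digits of the microsecond integer, plain integer arithmetic does the same job without the string round-trip; a sketch with a made-up value (in Polars the equivalent expressions would be `pl.col("SCDateTime") // 1000` and `pl.col("SCDateTime") % 1000`):

```python
scid_dt = 3913228800000123  # hypothetical raw SCDateTime (microseconds since 1899-12-30)

# High part: milliseconds since the SCID epoch; low part: per-millisecond trade counter
ms_part, trade_counter = divmod(scid_dt, 1000)
print(ms_part, trade_counter)
```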

Date Time Of Last Edit: 2024-02-19 22:19:11
[2024-02-19 23:15:07]
User150671 - Posts: 69
Handling of the SCDateTime data is the hardest part, as it cleverly contains both timestamp and trade count information in a single 64 bit integer, so I use some string methods and regex to extract the desired information.

I wasn't aware that it contains any trade count info whatsoever; ref:
Intraday Data File Format: DateTime

I have the following functions for converting Sierra time integers:

# takes a sierra Time and returns it as a dt.datetime
def unSierraTime(timeint):
  exceldate = -2209161600.0
  timeint = timeint/1000000
  delta = timeint + exceldate
  delta = dt.datetime.fromtimestamp(delta).replace(tzinfo=dt.timezone.utc)
  return delta

# Takes a Pandas TimeStamp or a datetime index and turns it into a sierra time integer
def sierraTime(dtg):
  excelDate = (pd.to_datetime('1899-12-30')).tz_localize(tz="utc")
  delta = dtg - excelDate

  if type(dtg) == pd._libs.tslibs.timestamps.Timestamp:
    delta = int(delta.total_seconds()*1000000)

  else:
    delta = (delta.total_seconds()*1000000).astype('int64')
  
  return delta

[2024-02-20 16:54:53]
Deiter - Posts: 3
Yeah, I was surprised as well, but I think this is the case, based on three things:
- The page in the SC docs you linked has this section: "When writing tick by tick data to an Intraday data file, where each tick is a single record in the file, and multiple ticks/trades have the same timestamp, then it is acceptable for the Date-Time of the record to be the same timestamp as prior records. What Sierra Chart itself does in this case, is increment the microsecond portion of the Date-Time to create unique timestamps for each trade within a millisecond."
- The Reddit article that first got me thinking about this: https://www.reddit.com/r/algotrading/comments/169ax2x/the_hello_world_guide_to_reading_execution_data/
- and finally my own read of the .scid file. I have attached a screenshot of the raw SCDateTime data that I imported into my dataframe, showing two runs of tick data where the least significant digit/byte of the data is incrementing by one, while the rest of the SCDateTime is not changing.
Attached: ES tick by tick SCDateTime extract.png (109.09 KB, attached 2024-02-20 16:53:46 UTC)
[2024-02-20 23:03:17]
sgne - Posts: 104
One thing I noticed: if you use a custom chart bar function loaded on a chart, the extracted datetime follows the chart's time zone, but when you do a direct extract from the scid file, the time is GMT/UTC.

The C++ commands to use in a {SCCustomChartBarInterfaceRef ChartBarInterface} function could be these:

SCDateTime tTime=ChartBarInterface.NewFileRecord.DateTime;
long long tickTime=reinterpret_cast<long long&>(tTime);
tickTime/=1000; // to get rid of microseconds, which are trade numbers

tTime can be used to get all sorts of data out of the SCDateTime constant reference, e.g.:

double xlTime=tTime.GetAsDouble();


And this is all easily done in AHK (AutoHotkey) as well, with commands like these:

hdr:=56,numBytes:=40
scDateTimePosition:=0
scPricePosition:=12

; scDateTime position (int64) - 0 bytes in
fo.Pos := hdr+(A_Index-1)*numBytes+scDateTimePosition
scDateTime := fo.ReadInt64()//1000 ; get rid of trade number
dtetme := 18991230000000
dtetme += scDateTime//1000,seconds
dte := SubStr(dtetme,1,8)

; scPrice position (float) - 12 bytes in
fo.Pos := hdr+(A_Index-1)*numBytes+scPricePosition
floatPrice := fo.ReadFloat()
[2024-03-16 20:03:08]
User726340 - Posts: 30
Updated the original function for users interested in a pandas df.


import sys
from pathlib import Path
import numpy as np
import pandas as pd

def get_scid_df(filename, limitsize=sys.maxsize):
    f = Path(filename)
    assert f.exists(), f"{f} file not found"
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)
    sciddtype = np.dtype(
        [
            ("Time", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("Trades", "<i4"),
            ("Volume", "<i4"),
            ("BidVolume", "<i4"),
            ("AskVolume", "<i4"),
        ]
    )
    df = pd.DataFrame(
        data=np.memmap(f, dtype=sciddtype, offset=offset, mode="r"), copy=False
    )
    df.dropna(inplace=True)
    df["Time"] = df["Time"] - 2209161600000000
    df.drop(df[(df.Time < 1)].index, inplace=True)
    df.set_index("Time", inplace=True)
    df.index = pd.to_datetime(df.index, unit="us")
    df.index = df.index.tz_localize(tz="utc")
    return df


def resample_scdf(scdf_or_filename, period="1Min", tz="UTC", limitsize=sys.maxsize):
    df = (
        get_scid_df(scdf_or_filename, limitsize)
        if isinstance(scdf_or_filename, str)
        else scdf_or_filename
    )
    assert isinstance(
        df, pd.DataFrame
    ), f"{scdf_or_filename} has to be SC df or valid scid file"
    df = df.resample(period).agg(
        {
            "Open": "first",
            "High": "max",
            "Low": "min",
            "Close": "last",
            "Trades": "sum",
            "Volume": "sum",
            "BidVolume": "sum",
            "AskVolume": "sum",
        }
    )
    if tz != "UTC":
        tz = "America/New_York" if tz == "EST" else tz
        tz = "America/Los_Angeles" if tz == "PST" else tz
        df.index = df.index.tz_convert(tz)
    return df

To get a 1Min scid df in EST: print(resample_scdf("ESH24.scid", tz="EST"))
To get a 30Min scid df in EST: print(resample_scdf("ESH24.scid", "30Min", tz="EST"))
Date Time Of Last Edit: 2024-03-17 19:36:01
[2024-05-01 00:24:46]
User656492 - Posts: 143
This is fun stuff. Thank you all for sharing! I just tried the code above for NQM4 and got this result. I think something's amiss with the time format, which causes Open to be a strange value:

print(resample_scdf("NQM4.CME.scid", tz="EST"))

Open High Low Close Trades \
Time
2024-02-25 19:01:00-05:00 0.000000e+00 18223.00 18221.25 18221.25 1
2024-02-25 19:02:00-05:00 -1.999001e+35 18207.75 18202.75 18207.75 2
2024-02-25 19:03:00-05:00 0.000000e+00 18208.50 18202.50 18208.75 4
2024-02-25 19:04:00-05:00 0.000000e+00 18209.00 18200.25 18206.25 10
2024-02-25 19:05:00-05:00 0.000000e+00 18207.00 18199.00 18206.00 6

also, because maybe it helps:

df.describe()

  Open  High  Low  Close  Trades  Volume  BidVolume  AskVolume
count  5.584600e+04  55846.000000  55846.000000  55846.000000  93144.000000  93144.000000  93144.000000  93144.000000
mean  -inf  18176.611328  18170.449219  18173.523438  235.347591  256.042236  128.308200  127.734035
std  inf  360.248291  361.233368  360.750702  571.431255  625.932788  316.699976  313.838125
min  -1.999001e+35  17130.500000  17113.750000  17121.250000  0.000000  0.000000  0.000000  0.000000
25%  0.000000e+00  17912.500000  17908.750000  17910.750000  0.000000  0.000000  0.000000  0.000000
50%  0.000000e+00  18278.250000  18271.750000  18275.000000  21.000000  22.000000  10.000000  10.000000
75%  0.000000e+00  18463.500000  18458.750000  18461.250000  141.000000  154.000000  77.000000  77.000000
max  0.000000e+00  18708.500000  18702.750000  18706.000000  9990.000000  11664.000000  6341.000000  5972.000000

Date Time Of Last Edit: 2024-05-01 01:28:22
[2024-05-01 00:59:48]
rudy5 - Posts: 5
Can you please edit that to put your code output into a code block?
[2024-05-01 01:15:46]
User656492 - Posts: 143
I don't understand the suggestion. ...?
[2024-05-01 02:36:20]
rudy5 - Posts: 5
You did it thanks 👌🏼
[2024-05-05 15:10:40]
User656492 - Posts: 143
<BUMP> I'm using code from above:

import sys
from pathlib import Path

import numpy as np
import pandas as pd


def get_scid_df(filename, limitsize=sys.maxsize):
    f = Path('/Users/Me/SierraPython/NQM4.CME.scid')
    assert f.exists(), f"{f} file not found"
    stat = f.stat()
    offset = 56 if stat.st_size < limitsize else stat.st_size - ((limitsize // 40) * 40)
    sciddtype = np.dtype(
        [
            ("Time", "<u8"),
            ("Open", "<f4"),
            ("High", "<f4"),
            ("Low", "<f4"),
            ("Close", "<f4"),
            ("Trades", "<i4"),
            ("Volume", "<i4"),
            ("BidVolume", "<i4"),
            ("AskVolume", "<i4"),
        ]
    )
    df = pd.DataFrame(
        data=np.memmap(f, dtype=sciddtype, offset=offset, mode="r"), copy=False
    )
    df.dropna(inplace=True)
    df["Time"] = df["Time"] - 2209161600000000
    df.drop(df[(df.Time < 1)].index, inplace=True)
    df.set_index("Time", inplace=True)
    df.index = pd.to_datetime(df.index, unit="us")
    df.index = df.index.tz_localize(tz="utc")
    return df


def resample_scdf(scdf_or_filename, period="1Min", tz="UTC", limitsize=sys.maxsize):
    df = (
        get_scid_df(scdf_or_filename, limitsize)
        if isinstance(scdf_or_filename, str)
        else scdf_or_filename
    )
    assert isinstance(
        df, pd.DataFrame
    ), f"{scdf_or_filename} has to be SC df or valid scid file"
    df = df.resample(period).agg(
        {
            "Open": "first",
            "High": "max",
            "Low": "min",
            "Close": "last",
            "Trades": "sum",
            "Volume": "sum",
            "BidVolume": "sum",
            "AskVolume": "sum",
        }
    )
    if tz != "UTC":
        tz = "America/New_York" if tz == "EST" else tz
        tz = "America/Los_Angeles" if tz == "PST" else tz
        df.index = df.index.tz_convert(tz)
    return df

print(resample_scdf("NQM4.CME.scid", tz="EST"))

And I get this:
Open High Low Close \
Time
2024-02-25 19:01:00-05:00 0.000000e+00 1822300.0 1822125.0 1822125.0
2024-02-25 19:02:00-05:00 -1.999001e+37 1820775.0 1820275.0 1820775.0
2024-02-25 19:03:00-05:00 0.000000e+00 1820850.0 1820250.0 1820875.0
2024-02-25 19:04:00-05:00 0.000000e+00 1820900.0 1820025.0 1820625.0
2024-02-25 19:05:00-05:00 0.000000e+00 1820700.0 1819900.0 1820600.0
... ... ... ... ...
2024-04-30 12:20:00-04:00 0.000000e+00 1777300.0 1776650.0 1777175.0
2024-04-30 12:21:00-04:00 0.000000e+00 1777625.0 1776950.0 1777000.0
2024-04-30 12:22:00-04:00 0.000000e+00 1777350.0 1776950.0 1777150.0
2024-04-30 12:23:00-04:00 -1.999001e+37 1777575.0 1776900.0 1777175.0
2024-04-30 12:24:00-04:00 -1.999001e+37 1777350.0 1777125.0 1777175.0

Trades Volume BidVolume AskVolume
Time
2024-02-25 19:01:00-05:00 1 1 1 0
2024-02-25 19:02:00-05:00 2 2 0 2
2024-02-25 19:03:00-05:00 4 4 0 4
2024-02-25 19:04:00-05:00 10 10 0 10
2024-02-25 19:05:00-05:00 6 6 2 4
... ... ... ... ...
2024-04-30 12:20:00-04:00 869 959 376 583
2024-04-30 12:21:00-04:00 861 1023 430 593
2024-04-30 12:22:00-04:00 462 518 225 293
2024-04-30 12:23:00-04:00 522 548 279 269
2024-04-30 12:24:00-04:00 58 58 18 40

[93144 rows x 8 columns]

As you can see, the Open field is fubar. I think the timestamp is being parsed incorrectly but have no idea how to address it. Any thoughts?
[2024-05-06 18:47:20]
sgne - Posts: 104
I think O is that way because the SCID file is 1-tick, instead of 1-second or greater. In 1-tick files, O is zero, and H and L are bid/ask values. Good luck.
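In practice that means the Open column of a tick-by-tick file can be dropped or rebuilt before resampling. A hedged sketch with made-up values that replaces the zero/sentinel Open with the trade price carried in Close:

```python
import pandas as pd

# Hypothetical tick rows: in 1-tick .scid files Open is unused
# (0.0 or a sentinel like -1.999e37) and Close holds the trade price
df = pd.DataFrame({
    "Open": [0.0, -1.999001e37, 0.0],
    "Close": [18221.25, 18207.75, 18208.50],
})

# Rebuild Open from the trade price so resample(...).agg({"Open": "first"})
# gives a sensible bar open
df["Open"] = df["Close"]
```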
[2024-05-07 21:48:56]
User150671 - Posts: 69
This is explained in the documentation here:
Intraday Data File Format
[2024-08-17 16:42:00]
User150671 - Posts: 69
Okay, so I fucked up my code to convert time to and from SCID files. I only noticed today while trying to do some backtesting on things in different timezones, and realised it was always converting from SCID to NYC time.

Fixed Code:

import datetime as dt
import numpy as np

# takes a scid time and returns it as a dt.datetime
def un_sierra_time(scidtime):
  sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
  dtg = sierra_epoch + dt.timedelta(microseconds=int(scidtime))
  return dtg

# takes a dt.datetime object and returns it as a np.uint64 for working with scid files
def re_sierra_time(dtg):
  sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
  delta = dtg - sierra_epoch
  scidtime = np.uint64(delta.total_seconds()*1000000)
  return scidtime
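A quick round-trip check of the fixed conversions (functions restated so the snippet stands alone; the example date is arbitrary):

```python
import datetime as dt

import numpy as np

def un_sierra_time(scidtime):
    # SCID epoch is 1899-12-30, and the value is microseconds since then
    sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
    return sierra_epoch + dt.timedelta(microseconds=int(scidtime))

def re_sierra_time(dtg):
    sierra_epoch = dt.datetime(1899, 12, 30, tzinfo=dt.timezone.utc)
    delta = dtg - sierra_epoch
    return np.uint64(delta.total_seconds() * 1000000)

stamp = dt.datetime(2024, 1, 1, tzinfo=dt.timezone.utc)
raw = re_sierra_time(stamp)   # SCID microseconds for 2024-01-01 00:00:00 UTC
back = un_sierra_time(raw)    # and back again
```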

Explanation:

As per the Intraday File Format at Intraday Data File Format: DateTime The SCID DateTime is:
"...a 64-bit integer. It represents the number of microseconds since the SCDateTime epoch of December 30, 1899."


un_sierra_time is fairly straightforward:
We take the Sierra epoch date and add the SCID DateTime value to it as a dt.timedelta to get a datetime object.

re_sierra_time is also fairly straightforward; we just do the opposite:
We subtract the Sierra epoch from the datetime to get a timedelta object.
That delta object stores total days, seconds, and microseconds, and if you call total_seconds() you'll get a float with the microseconds as the fraction.
We can then multiply that float by 1,000,000 and cast it to an int in order to store it in, or compare it with, an SCID time value (in this case I use a numpy uint64, as that's the format I'm working with SCIDs in).
Apologies if anyone stole my code and was fucked around by it!
