API Reference
This part of the documentation details the complete SeFEF API.
Modules
sefef.evaluation
This module contains functions to implement time-series cross validation (TSCV).
- copyright:
2024 by Ana Sofia Carmo
- license:
BSD 3-clause License, see LICENSE for more details.
- class sefef.evaluation.Dataset(timestamps, samples_duration, sz_onsets)
Bases: object
Create a Dataset with metadata on the data that will be used for training and testing.
- timestamps
Unix timestamp (in seconds) of the start of each sample.
- Type:
array-like, shape (#samples,)
- samples_duration
Duration of samples in seconds.
- Type:
array-like, shape (#samples,)
- sz_onsets
Contains the Unix-time timestamps (in seconds) corresponding to the onsets of seizures.
- Type:
np.array
- sampling_frequency
Frequency at which the data is stored in each file.
- Type:
int
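As an illustration of the inputs described above, the following is a minimal sketch of how the three parallel arrays could be assembled for a hypothetical recording. The helper "build_metadata", the recording start time, the 1-minute sample duration and the seizure onsets are all assumptions made for this example and are not part of SeFEF.

```python
# Hypothetical recording: 1-minute samples over 6 hours, with a 1-hour gap.
SAMPLE_DURATION = 60             # seconds (assumed value)
RECORDING_START = 1_700_000_000  # arbitrary Unix time, in seconds


def build_metadata(start, n_samples, sample_duration, gap=None):
    """Return (timestamps, samples_duration) as parallel lists.

    `gap` is an optional (first_missing_index, n_missing) pair that
    simulates an interruption in the recording.
    """
    timestamps = []
    for i in range(n_samples):
        if gap is not None and gap[0] <= i < gap[0] + gap[1]:
            continue  # this sample was never recorded
        timestamps.append(start + i * sample_duration)
    samples_duration = [sample_duration] * len(timestamps)
    return timestamps, samples_duration


timestamps, samples_duration = build_metadata(
    RECORDING_START, n_samples=360, sample_duration=SAMPLE_DURATION, gap=(120, 60)
)
# Seizure onsets are also Unix timestamps in seconds (illustrative values).
sz_onsets = [RECORDING_START + 2 * 3600 - 300, RECORDING_START + 5 * 3600]
```

Following the constructor signature documented above, these arrays would then be passed as Dataset(timestamps, samples_duration, sz_onsets).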
- class sefef.evaluation.TimeSeriesCV(preictal_duration, prediction_latency, n_min_events_train=3, n_min_events_test=1, post_sz_interval=3600, pre_lead_sz_interval=14400, initial_train_duration=None, test_duration=None)
Bases: object
Implements time series cross validation (TSCV).
- preictal_duration
Duration of the period (in seconds) that will be labeled as preictal, i.e. the period that we expect to contain useful information for the forecast.
- Type:
int, defaults to 3600 (60min)
- prediction_latency
Latency (in seconds) of the preictal period with regards to seizure onset.
- Type:
int, defaults to 600 (10min)
- n_min_events_train
Minimum number of lead seizures to include in the train set. Should guarantee at least one lead seizure is left for testing.
- Type:
int, defaults to 3
- n_min_events_test
Minimum number of lead seizures to include in the test set. Should guarantee at least one lead seizure is left for testing.
- Type:
int, defaults to 1
- post_sz_interval
Time interval (in seconds) after a lead seizure that should be included in the same set as the corresponding seizure. This time will be removed from the train set, along with the seizure onset and prediction_latency.
- Type:
int
- pre_lead_sz_interval
Time interval (in seconds) free of seizures by which a seizure should be preceded to be considered a lead seizure.
- Type:
int
- initial_train_duration
Set duration of train for initial split (in seconds).
- Type:
int, defaults to 1/3 of total recorded duration
- test_duration
Set duration of test (in seconds).
- Type:
int, defaults to 1/2 of ‘initial_train_duration’
- method
Method for TSCV. Can be either ‘expanding’ or ‘sliding’. Only ‘expanding’ is currently implemented.
- Type:
str
- n_folds
Number of folds for the TSCV, determined according to the attributes set by the user and available data.
- Type:
int
- split_ind_ts
Contains split timestamp indices (train_start_ts, test_start_ts, test_end_ts) for each fold. Initialized as None and populated by the ‘split’ method.
- Type:
array-like, shape (n_folds, 3)
- split(dataset, iteratively) :
Get timestamp indices to split data for time series cross-validation.
- The train set can be obtained by metadata.loc[train_start_ts : test_start_ts].
- The test set can be obtained by metadata.loc[test_start_ts : test_end_ts].
- plot(dataset) :
Plots the TSCV folds with the available data.
- iterate() :
Iterates over the TSCV folds and at each iteration returns a train set and a test set.
- Raises:
ValueError : – Raised whenever TSCV cannot be performed under the attributes set by the user and the available data.
AttributeError : – Raised when ‘plot’ is called before ‘split’.
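The notion of a lead seizure used by “pre_lead_sz_interval” can be illustrated with a short sketch. The function name "find_lead_seizures" is hypothetical, and two choices here are assumptions rather than documented SeFEF behavior: the first seizure in the record is always treated as a lead seizure, and an interval exactly equal to “pre_lead_sz_interval” is accepted.

```python
def find_lead_seizures(sz_onsets, pre_lead_sz_interval=14400):
    """Return onsets preceded by at least `pre_lead_sz_interval` seizure-free seconds.

    Assumption: the first seizure in the record always counts as a lead seizure.
    """
    onsets = sorted(sz_onsets)
    leads = []
    for i, onset in enumerate(onsets):
        # Compare each onset with the one immediately before it.
        if i == 0 or onset - onsets[i - 1] >= pre_lead_sz_interval:
            leads.append(onset)
    return leads


# Illustrative onsets (seconds): two clusters and one isolated seizure.
onsets = [10_000, 12_000, 40_000, 41_000, 70_000]
lead_onsets = find_lead_seizures(onsets)
```

With the default interval of 14400 s (4 h), the seizures at 12000 s and 41000 s follow another seizure too closely and are not counted as lead seizures.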
- get_TSCV_fold(h5dataset, ifold, remove_non_preictal_interictal_samples=True)
Returns a train set and a test set from corresponding TSCV fold.
- Parameters:
h5dataset (HDF5 file) – HDF5 file object with the following datasets:
- “data”: each entry corresponds to a sample with shape (embedding shape), e.g. (#features, ) or (sample duration, #channels).
- “timestamps”: contains the start timestamp (Unix time, in seconds) of each sample in the “data” dataset, with shape (#samples, ).
- “annotations”: contains the labels (0: interictal, 1: preictal) for each sample in the “data” dataset, with shape (#samples, ).
- “sz_onsets”: contains the Unix timestamps of the onsets of seizures, with shape (#sz_onsets, ).
ifold (int) – Index corresponding to TSCV fold.
remove_non_preictal_interictal_samples (bool) – Whether to remove samples that are neither preictal nor interictal, i.e. samples containing the onsets of seizures, as well as the intervals corresponding to “prediction_latency” and “post_sz_interval”.
- Returns:
tuple –
((train_data, train_annotations, train_timestamps, train_sz_onsets), (test_data, test_annotations, test_timestamps, test_sz_onsets))
- Where:
“[]_data”: A slice of h5dataset["data"], with shape (#samples, embedding shape), e.g. (#samples, #features) or (#samples, sample duration, #channels), and dtype “float32”.
“[]_annotations”: A slice of h5dataset["annotations"], with shape (#samples, ) and dtype “bool”.
“[]_timestamps”: A slice of h5dataset["timestamps"], with shape (#samples, ) and dtype “int64”.
“[]_sz_onsets”: A slice of h5dataset["sz_onsets"], with shape (#sz onsets, ) and dtype “int64”.
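The returned structure can be pictured with a small sketch that selects the samples and seizure onsets falling between two split timestamps. This is an illustration only: "slice_fold" is a hypothetical helper, plain lists stand in for the HDF5 datasets, and the half-open interval (start included, end excluded) is an assumption chosen so that train and test do not share samples.

```python
def slice_fold(data, annotations, timestamps, sz_onsets, start_ts, end_ts):
    """Return samples and seizure onsets with start_ts <= t < end_ts."""
    keep = [i for i, t in enumerate(timestamps) if start_ts <= t < end_ts]
    return (
        [data[i] for i in keep],
        [annotations[i] for i in keep],
        [timestamps[i] for i in keep],
        [t for t in sz_onsets if start_ts <= t < end_ts],
    )


# Toy dataset: six 1-minute samples with one feature each (illustrative values).
timestamps = [0, 60, 120, 180, 240, 300]
data = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
annotations = [False, False, True, True, False, False]
sz_onsets = [250]

# Hypothetical split timestamps for a single fold.
train_start_ts, test_start_ts, test_end_ts = 0, 180, 360

train = slice_fold(data, annotations, timestamps, sz_onsets, train_start_ts, test_start_ts)
test = slice_fold(data, annotations, timestamps, sz_onsets, test_start_ts, test_end_ts)
```

Each of "train" and "test" is then a 4-tuple in the same order as the documented return value: data, annotations, timestamps and seizure onsets.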
- iterate(h5dataset, remove_non_preictal_interictal_samples=True)
Iterates over the TSCV folds and at each iteration returns a train set and a test set.
- Parameters:
h5dataset (HDF5 file) – HDF5 file object with the following datasets:
- “data”: each entry corresponds to a sample with shape (embedding shape), e.g. (#features, ) or (sample duration, #channels).
- “timestamps”: contains the start timestamp (Unix time, in seconds) of each sample in the “data” dataset, with shape (#samples, ).
- “annotations”: contains the labels (0: interictal, 1: preictal) for each sample in the “data” dataset, with shape (#samples, ).
- “sz_onsets”: contains the Unix timestamps of the onsets of seizures, with shape (#sz_onsets, ).
remove_non_preictal_interictal_samples (bool) – Whether to remove samples that are neither preictal nor interictal, i.e. samples containing the onsets of seizures, as well as the intervals corresponding to “prediction_latency” and “post_sz_interval”.
- Returns:
tuple –
((train_data, train_annotations, train_timestamps), (test_data, test_sz_onsets, test_timestamps))
- Where:
“[]_data”: A slice of h5dataset["data"], with shape (#samples, embedding shape), e.g. (#samples, #features) or (#samples, sample duration, #channels), and dtype “float32”.
“[]_annotations”: A slice of h5dataset["annotations"], with shape (#samples, ) and dtype “bool”.
“[]_sz_onsets”: A slice of h5dataset["sz_onsets"], with shape (#sz onsets, ) and dtype “int64”.
“[]_timestamps”: A slice of h5dataset["timestamps"], with shape (#samples, ) and dtype “int64”.
- plot(dataset, folder_path=None, filename=None, mode='lines')
Plots the TSCV folds with the available data.
- Parameters:
dataset (Dataset) – Instance of Dataset.
mode (str) – Trace scatter mode (“lines” or “markers”). For sparse data, “markers” is a more suitable option, despite being heavier to plot.
- split(dataset, iteratively=False, plot=False, extend_final_test_set=False)
Get timestamp indices to split data for time series cross-validation.
- The train set would be given by metadata.loc[train_start_ts : test_start_ts].
- The test set would be given by metadata.loc[test_start_ts : test_end_ts].
Parameters:
- dataset (Dataset)
Instance of Dataset.
- iteratively (bool, defaults to False)
Whether the split should return the timestamp indices for each fold iteratively (True) or simply update ‘split_ind_ts’ (False).
- plot (bool, defaults to False)
Whether a diagram illustrating the TSCV should be shown at the end. Cannot be used when ‘iteratively’ is set to True.
- extend_final_test_set (bool)
Whether to extend the test set in the final fold to include all remaining data or to keep the test duration approximately the same across folds.
Returns:
- train_start_ts (int)
Timestamp index for the start of the train set.
- test_start_ts (int)
Timestamp index for the start of the test set (and end of train set).
- test_end_ts (int)
Timestamp index for the end of the test set.
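To make the ‘expanding’ method concrete, the sketch below generates (train_start_ts, test_start_ts, test_end_ts) triples purely from durations, using the documented defaults of 1/3 of the recording for the initial train set and 1/2 of that for each test set. It is a simplified illustration, not SeFEF's algorithm: the function name is hypothetical, the values are timestamps rather than indices, and the lead-seizure constraints (“n_min_events_train”, “n_min_events_test”) that the library uses when placing folds are ignored.

```python
def expanding_window_splits(record_start, record_end,
                            initial_train_duration=None, test_duration=None):
    """Return a list of (train_start_ts, test_start_ts, test_end_ts) tuples.

    The train set always starts at `record_start` and grows by one test
    duration per fold (expanding window).
    """
    total = record_end - record_start
    if initial_train_duration is None:
        initial_train_duration = total // 3        # documented default
    if test_duration is None:
        test_duration = initial_train_duration // 2  # documented default

    splits = []
    test_start = record_start + initial_train_duration
    # Add folds while a full test set still fits in the recording.
    while test_start + test_duration <= record_end:
        splits.append((record_start, test_start, test_start + test_duration))
        test_start += test_duration
    return splits


# A 10-hour recording (36000 s) starting at t = 0.
splits = expanding_window_splits(0, 36_000)
```

For this recording the sketch yields four folds, each with a 6000 s test set, while the train set grows from 12000 s to 30000 s.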
sefef.labeling
This module contains functions to automatically label samples according to the desired pre-ictal duration and prediction latency.
- copyright:
2024 by Ana Sofia Carmo
- license:
BSD 3-clause License, see LICENSE for more details.
- sefef.labeling.add_annotations(h5dataset, sz_onsets_ts, preictal_duration=3600, prediction_latency=600)
Add “annotations”, with shape (#samples, ) and dtype “bool”, to HDF5 file object according to the variables “preictal_duration” and “prediction_latency”. Annotations are either 0 (inter-ictal), or 1 (pre-ictal).
- Parameters:
h5dataset (HDF5 file) – HDF5 file object with the following datasets:
- “data”: each entry corresponds to a sample with shape (embedding shape), e.g. (#features, ) or (sample duration, #channels).
- “timestamps”: contains the start timestamp (Unix time, in seconds) of each sample in the “data” dataset, with shape (#samples, ).
- “sz_onsets” (optional): contains the Unix timestamps of the onsets of seizures, with shape (#sz_onsets, ).
sz_onsets_ts (array-like, shape (#sz onsets, )) – Contains the Unix timestamps (in seconds) of the onsets of seizures.
preictal_duration (int, defaults to 3600 (60min)) – Duration of the period (in seconds) that will be labeled as preictal, i.e. the period that we expect to contain useful information for the forecast.
prediction_latency (int, defaults to 600 (10min)) – Latency (in seconds) of the preictal period with regards to seizure onset.
- Returns:
None, but adds a dataset instance to the h5dataset file object.
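The labeling rule described by “preictal_duration” and “prediction_latency” can be sketched as follows. This is a hedged illustration on plain lists rather than an HDF5 file: the function name "label_samples" is hypothetical, and it assumes that a sample is preictal when its start timestamp falls in the half-open window [onset - prediction_latency - preictal_duration, onset - prediction_latency). The library's exact boundary handling may differ.

```python
def label_samples(timestamps, sz_onsets_ts, preictal_duration=3600, prediction_latency=600):
    """Return one boolean per sample: True for preictal, False for interictal."""
    annotations = []
    for t in timestamps:
        is_preictal = any(
            onset - prediction_latency - preictal_duration <= t < onset - prediction_latency
            for onset in sz_onsets_ts
        )
        annotations.append(is_preictal)
    return annotations


# Twelve samples, 10 minutes apart, and a single seizure at t = 6000 s.
timestamps = list(range(0, 7200, 600))
annotations = label_samples(timestamps, sz_onsets_ts=[6000])
```

With the default values, the preictal window for this seizure spans 1800 s to 5400 s, so the six samples starting in that window are labeled 1 and all others 0.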
- sefef.labeling.add_sz_onsets(h5dataset, sz_onsets_ts)
Add “sz_onsets”, with shape (#seizures, ) and dtype “int64”, to HDF5 file object, corresponding to the Unix timestamps of each seizure onset.
- Parameters:
h5dataset (HDF5 file) – HDF5 file object with the following datasets:
- “data”: each entry corresponds to a sample with shape (embedding shape), e.g. (#features, ) or (sample duration, #channels).
- “timestamps”: contains the start timestamp (Unix time, in seconds) of each sample in the “data” dataset, with shape (#samples, ).
- “annotations” (optional): contains the annotations (aka labels) of each sample.
sz_onsets_ts (array-like, shape (#sz onsets, )) – Contains the Unix timestamps (in seconds) of the onsets of seizures.
- Returns:
None, but adds a dataset instance to the h5dataset file object.
sefef.postprocessing
This module contains functions to process individual predicted probabilities into a unified forecast according to the desired forecast horizon. Author: Ana Sofia Carmo
- copyright:
2024 by Ana Sofia Carmo
- license:
BSD 3-clause License, see LICENSE for more details.
- class sefef.postprocessing.Forecast(pred_proba, timestamps)
Bases: object
Stores the forecasts made by the model and processes them.
- pred_proba
Contains the probability predicted by the model for each sample belonging to the pre-ictal class.
- Type:
array-like, shape (#samples, ), dtype “float64”
- timestamps
Contains the unix timestamps (in seconds) corresponding to the start-time of each sample.
- Type:
array-like, shape (#samples, ), dtype “int64”
- append(pred_proba, timestamps) :
Appends new predicted probabilities to the ones already in the Forecast object.
- postprocess(forecast_horizon) :
Applies postprocessing methodology to the predictions stored in “pred_proba”, according to “forecast horizon” (in seconds). Returns an array with the new probabilities.
- Raises:
ValueError : – Description
- postprocess(forecast_horizon, smooth_win, smooth_sliding=False, origin='clock-time')
Applies post-processing methodology to the predictions stored in “pred_proba”. For each time period with duration equal to “forecast_horizon”, mean predicted probabilities are calculated for groups of consecutive samples (with a window of duration “smooth_win”, in seconds), with or without overlap, and the maximum across the full period is obtained.
- Parameters:
forecast_horizon (int) – Forecast horizon in seconds, i.e. time in the future for which the forecasts will be issued.
smooth_win (int) – Duration of the window, in seconds, used to smooth the predicted probabilities. If “smooth_sliding” is set to False, consecutive windows of this duration should sum up to “forecast_horizon”.
smooth_sliding (bool, defaults to False) – Whether to use a sliding-window approach during smoothing (with a step of 1 sample) or non-overlapping smoothing windows. The sliding-window option (True) is not yet implemented.
origin (str, defaults to "clock-time") – Determines if the forecasts are issued at clock-time (e.g. at the start of each hour) or according to the start-time of the first sample. Options are “clock-time” and “sample-time”, respectively.
- Returns:
result1 (array-like, shape (#forecasts, ), dtype “float64”) – Contains the predicted probabilities of seizure occurrence for the period with duration “forecast_horizon” and starting at the timestamps in “result2”.
result2 (array-like, shape (#forecasts, ), dtype “int64”) – Contains the Unix timestamps, in seconds, for the start of the period for which the forecasts (in “result1”) are valid.
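The post-processing described above (mean over non-overlapping smoothing windows, then maximum over each forecast period) can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the function name "smooth_and_max" is hypothetical, periods are anchored at the first sample (the “sample-time” origin), and “forecast_horizon” is assumed to be a multiple of “smooth_win”.

```python
def smooth_and_max(pred_proba, timestamps, forecast_horizon, smooth_win):
    """Return (forecasts, forecast_timestamps) for consecutive forecast periods."""
    start = timestamps[0]
    # Group probabilities by forecast period, then by smoothing window within it.
    windows = {}
    for p, t in zip(pred_proba, timestamps):
        period = (t - start) // forecast_horizon
        win = ((t - start) % forecast_horizon) // smooth_win
        windows.setdefault(period, {}).setdefault(win, []).append(p)

    forecasts, forecast_ts = [], []
    for period in sorted(windows):
        # Mean probability of each smoothing window in this period.
        means = [sum(v) / len(v) for v in windows[period].values()]
        forecasts.append(max(means))  # keep the highest smoothed value
        forecast_ts.append(start + period * forecast_horizon)
    return forecasts, forecast_ts


# Samples every 5 minutes; 20-minute horizon; 10-minute smoothing windows.
timestamps = [0, 300, 600, 900, 1200, 1500, 1800, 2100]
pred_proba = [0.1, 0.3, 0.6, 0.2, 0.0, 0.2, 0.5, 0.9]
forecasts, forecast_ts = smooth_and_max(pred_proba, timestamps, 1200, 600)
```

In this example the first period has window means of 0.2 and 0.4 and the second has 0.1 and 0.7, so the forecasts are approximately 0.4 and 0.7, issued at 0 s and 1200 s.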
sefef.scoring
This module contains functions to compute both deterministic and probabilistic metrics according to the horizon of the forecast.
- copyright:
2024 by Ana Sofia Carmo
- license:
BSD 3-clause License, see LICENSE for more details.
- class sefef.scoring.Scorer(metrics2compute, sz_onsets, forecast_horizon, reference_method='prior_prob', hist_prior_prob=None)
Bases: object
Class description
- metrics2compute
List of metrics to compute. The metrics can be either deterministic or probabilistic, and metric names should be taken from the following list:
- Deterministic: “Sen” (i.e. sensitivity), “FPR” (i.e. false positive rate), “TiW” (i.e. time in warning), “AUC_TiW” (i.e. area under the curve of Sen vs TiW).
- Probabilistic: “resolution”, “reliability”, “BS” (i.e. Brier score), “skill” or “BSS” (i.e. Brier skill score).
- Type:
list<str>
- sz_onsets
Contains the Unix timestamps, in seconds, for the start of each seizure onset.
- Type:
array-like, shape (#seizures, ), dtype “int64”
- forecast_horizon
Forecast horizon in seconds, i.e. time in the future for which the forecasts are valid.
- Type:
int
- performance
Dictionary where the keys are the metrics’ names (as in “metrics2compute”) and the value is the corresponding performance. It is initialized as an empty dictionary and populated in “compute_metrics”.
- Type:
dict
- reference_method
Method to compute the reference forecasts.
- Type:
str, defaults to “prior_prob”
- hist_prior_prob
Prior probability, aka historical likelihood (relative frequency) of seizures in train data. Used only as the reference forecast when computing the skill measure.
- Type:
float64, defaults to None
- compute_metrics(forecasts, timestamps):
Computes metrics in “metrics2compute” for the probabilities in “forecasts” and populates the “performance” attribute. This method uses techniques described in [Mason2004] and [Stephenson2008].
- reliability_diagram() :
Description
- Raises:
ValueError : – Raised when a metric name in “metrics2compute” is not a valid metric or when “reference_method” is not a valid method.
AttributeError : – Raised when ‘reliability_diagram’ is called before ‘compute_metrics’.
References
[Mason2004] Mason, “On Using ‘Climatology’ as a Reference Strategy in the Brier and Ranked Probability Skill Scores,” Jul. 2004. Accessed: Nov. 06, 2024. [Online]. Available: https://journals.ametsoc.org/view/journals/mwre/132/7/1520-0493_2004_132_1891_oucaar_2.0.co_2.xml
[Stephenson2008] Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe. “Two Extra Components in the Brier Score Decomposition”, Weather and Forecasting 23, 4 (2008): 752-757, doi: https://doi.org/10.1175/2007WAF2006116.1
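As a rough illustration of two of the deterministic metrics listed in “metrics2compute”, the sketch below computes sensitivity (“Sen”) and time in warning (“TiW”) from thresholded forecasts. The definitions used are common ones and are assumptions here, not taken from SeFEF: a forecast at timestamp t covers the period [t, t + forecast_horizon), a seizure counts as predicted when its onset falls in a period whose forecast reaches the threshold, and TiW is the fraction of forecast periods spent in warning. The function name is hypothetical.

```python
def sensitivity_and_tiw(forecasts, timestamps, sz_onsets, forecast_horizon, threshold=0.5):
    """Return (Sen, TiW) for forecasts thresholded at `threshold`."""
    warnings = [p >= threshold for p in forecasts]
    tiw = sum(warnings) / len(warnings)

    predicted = 0  # seizures whose period was in warning
    covered = 0    # seizures that fall inside any forecast period
    for onset in sz_onsets:
        for warn, t in zip(warnings, timestamps):
            if t <= onset < t + forecast_horizon:
                covered += 1
                predicted += warn
                break
    sen = predicted / covered if covered else float("nan")
    return sen, tiw


# Four hourly forecasts and two seizures (illustrative values).
forecasts = [0.1, 0.7, 0.2, 0.9]
timestamps = [0, 3600, 7200, 10800]
sen, tiw = sensitivity_and_tiw(forecasts, timestamps, [4000, 8000], forecast_horizon=3600)
```

Here two of the four periods are in warning and one of the two seizures falls in a warning period, so both values equal 0.5.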
- compute_metrics(forecasts, timestamps, threshold=0.5, binning_method='quantile', num_bins=10, draw_diagram=True)
Computes metrics in “metrics2compute” for the probabilities in “forecasts” and populates the “performance” attribute.
- Parameters:
forecasts (array-like, shape (#forecasts, ), dtype "float64") – Contains the predicted probabilities of seizure occurrence for the period with duration equal to the forecast horizon and starting at the timestamps in “timestamps”.
timestamps (array-like, shape (#forecasts, ), dtype "int64") – Contains the Unix timestamps, in seconds, for the start of the period for which the forecasts (in “forecasts”) are valid.
threshold (float64, defaults to 0.5) – Probability value to apply as the high-likelihood threshold.
binning_method (str, defaults to "quantile") –
- Method used to determine the bins used to compute probabilistic metrics. Available methods are:
“uniform”: number of bins corresponds to np.ceil(#forecasts^(1/3)), set at approximately equal distances.
“quantile”: number of bins corresponds to np.ceil(#forecasts^(1/3)), which are populated with an approximately equal number of forecasts.
num_bins (int64, defaults to 10) – Number of bins used to compute probabilistic metrics. If None, it is calculated as np.ceil(#forecasts^(1/3)), otherwise “num_bins” is used as the number of bins.
draw_diagram (bool, defaults to True) – Whether to draw the reliability diagram after computing all required metrics.
- Returns:
performance (dict) – Dictionary where the keys are the metrics’ names (as in “metrics2compute”) and the value is the corresponding performance.
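The probabilistic metrics “BS” and “BSS” can be illustrated with the standard Brier score definitions, using a constant prior probability as the reference forecast (as in the “prior_prob” reference method). This is a sketch under assumptions, not the library's code: the function names are hypothetical, and the observed outcome of a forecast is taken to be 1 when a seizure onset falls in [t, t + forecast_horizon).

```python
def observed_outcomes(timestamps, sz_onsets, forecast_horizon):
    """Return 1 for each forecast period that contains a seizure onset, else 0."""
    return [
        int(any(t <= onset < t + forecast_horizon for onset in sz_onsets))
        for t in timestamps
    ]


def brier_score_and_skill(forecasts, outcomes, prior_prob):
    """Return (BS, BSS), with BSS relative to a constant `prior_prob` forecast."""
    n = len(forecasts)
    bs = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    bs_ref = sum((prior_prob - o) ** 2 for o in outcomes) / n
    # BSS > 0 means the forecasts beat the reference; bs_ref is assumed non-zero.
    return bs, 1 - bs / bs_ref


# Four hourly forecasts and two seizures (illustrative values).
forecasts = [0.1, 0.7, 0.2, 0.9]
timestamps = [0, 3600, 7200, 10800]
outcomes = observed_outcomes(timestamps, [4000, 12000], forecast_horizon=3600)
bs, bss = brier_score_and_skill(forecasts, outcomes, prior_prob=0.5)
```

For these values the Brier score is 0.0375 against a reference score of 0.25, giving a skill score of 0.85.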
sefef.visualization
This is a helper module for visualization.
- copyright:
2024 by Ana Sofia Carmo
- license:
BSD 3-clause License, see LICENSE for more details.
- sefef.visualization.aggregate_plots(figs, folder_path=None, filename=None, show=True)
Receives go.Figure objects created using “plot_forecasts” and aggregates them into a single Figure.
- Parameters:
figs (go.Figure) – Figures to aggregate into a single plot.
- sefef.visualization.hex_to_rgba(h, alpha)
Converts color value in hex format to rgba format with alpha transparency
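A conversion of this kind can be sketched as below. The function name "hex_to_rgba_sketch" is hypothetical, and the output format, a CSS-style "rgba(r, g, b, a)" string of the kind Plotly accepts for colors, is an assumption about what the library returns.

```python
def hex_to_rgba_sketch(h, alpha):
    """Convert a 6-digit hex color (with or without '#') to an 'rgba(...)' string."""
    h = h.lstrip("#")
    # Read the red, green and blue channels, two hex digits each.
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {alpha})"


orange_half_transparent = hex_to_rgba_sketch("#ff8000", 0.5)
```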
- sefef.visualization.plot_forecasts(forecasts, ts, sz_onsets, high_likelihood_thr, forecast_horizon, title='Seizure probability', folder_path=None, filename=None, show=True, return_plot=False, n_points=100)
Provide visualization of forecasts.
- Parameters:
forecasts (array-like, shape (#forecasts, ), dtype "float64") – Contains the predicted probabilities of seizure occurrence for the period with duration “forecast_horizon” and starting at the timestamps in “ts”.
ts (array-like, shape (#forecasts, ), dtype "int64") – Contains the Unix timestamps, in seconds, for the start of the period for which the forecasts (in “forecasts”) are valid.
sz_onsets (array-like, shape (#sz onsets, )) – Contains the Unix timestamps (in seconds) of the onsets of seizures.
high_likelihood_thr (float64) – Value between 0 and 1 corresponding to the threshold of high-likelihood.