gluonts.dataset.split module

Train/test splitter

This module defines strategies for splitting a whole dataset into train and test subsets. The same logic can also be triggered through the split() function, which accepts either an offset or a date keyword argument.

For uniform datasets, where all time series start and end at the same point in time, OffsetSplitter can be used:

from gluonts.dataset.split import OffsetSplitter

splitter = OffsetSplitter(offset=7)
train, test_template = splitter.split(whole_dataset)

For all other datasets, the more flexible DateSplitter can be used:

import pandas as pd
from gluonts.dataset.split import DateSplitter

splitter = DateSplitter(
    date=pd.Period('2018-01-31', freq='D')
)
train, test_template = splitter.split(whole_dataset)

In the above examples, the train output is a regular Dataset that can be used for training purposes; test_template can generate test instances as follows:

test_dataset = test_template.generate_instances(
    prediction_length=7,
    windows=2,
)

The windows argument controls how many test windows are generated from each entry in the original dataset. Each window begins after the split point, so it contains no training data. By default, windows do not overlap; this can be changed with the optional distance argument.

test_dataset = test_template.generate_instances(
    prediction_length=7,
    windows=2,
    distance=3,  # consecutive windows start three time steps apart
)
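The interaction between windows, prediction_length and distance can be sketched with plain index arithmetic. This is an illustration rather than the library's code: window_ranges and split_index are hypothetical names, and the sketch assumes that an omitted distance falls back to prediction_length, which is one way to obtain the non-overlapping default described above.

```python
from typing import List, Optional, Tuple


def window_ranges(
    split_index: int,
    prediction_length: int,
    windows: int = 1,
    distance: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return the (start, stop) index range covered by each test window.

    Assumption: when ``distance`` is None it falls back to
    ``prediction_length``, so windows sit back to back without overlap.
    """
    if distance is None:
        distance = prediction_length
    ranges = []
    for window in range(windows):
        # Every window starts at or after the split point.
        start = split_index + window * distance
        ranges.append((start, start + prediction_length))
    return ranges


# Split after 10 observations, two windows of length 7.
default_ranges = window_ranges(10, prediction_length=7, windows=2)
# With distance=3 the second window starts 3 steps after the first.
overlapping_ranges = window_ranges(
    10, prediction_length=7, windows=2, distance=3
)
print(default_ranges)      # [(10, 17), (17, 24)]
print(overlapping_ranges)  # [(10, 17), (13, 20)]
```

With distance=3 and prediction_length=7, consecutive windows share four time steps.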

class gluonts.dataset.split.AbstractBaseSplitter

Bases: abc.ABC

Base class for all other splitters.

generate_test_pairs(dataset: gluonts.dataset.Dataset, prediction_length: int, windows: int = 1, distance: Optional[int] = None, max_history: Optional[int] = None) → Generator[Tuple[Dict[str, Any], Dict[str, Any]], None, None]

generate_training_entries(dataset: gluonts.dataset.Dataset) → Generator[Dict[str, Any], None, None]

split(dataset: gluonts.dataset.Dataset) → Tuple[gluonts.dataset.split.TrainingDataset, gluonts.dataset.split.TestTemplate]

abstract test_pair(entry: Dict[str, Any], prediction_length: int, offset: int = 0) → Tuple[Dict[str, Any], Dict[str, Any]]

abstract training_entry(entry: Dict[str, Any]) → Dict[str, Any]

class gluonts.dataset.split.DateSplitter(date: pandas._libs.tslibs.period.Period)

Bases: gluonts.dataset.split.AbstractBaseSplitter

A splitter that slices training and test data based on a pandas.Period.

Training entries obtained from this class are limited to observations up to and including the given date.

Parameters: date (pandas._libs.tslibs.period.Period) – pandas.Period determining where the training data ends.
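The inclusive cut-off can be illustrated with standard-library dates. The sketch below is not the DateSplitter implementation; training_length, the daily step and the sample series are assumptions made for the example.

```python
from datetime import date, timedelta


def training_length(
    start: date, split_date: date, step: timedelta = timedelta(days=1)
) -> int:
    """Count observations from ``start`` up to and including ``split_date``."""
    return (split_date - start) // step + 1


series_start = date(2018, 1, 1)
target = list(range(60))  # 60 daily observations starting 2018-01-01

n_train = training_length(series_start, date(2018, 1, 31))
train_target = target[:n_train]

print(n_train)           # 31
print(train_target[-1])  # 30, the observation for 2018-01-31
```

Because the split date itself belongs to the training slice, the first test observation here would be the one for 2018-02-01.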

date: pandas._libs.tslibs.period.Period

test_pair(entry: Dict[str, Any], prediction_length: int, offset: int = 0) → Tuple[Dict[str, Any], Dict[str, Any]]

training_entry(entry: Dict[str, Any]) → Dict[str, Any]

class gluonts.dataset.split.InputDataset(test_data: gluonts.dataset.split.TestData)

Bases: object

test_data: gluonts.dataset.split.TestData

class gluonts.dataset.split.LabelDataset(test_data: gluonts.dataset.split.TestData)

Bases: object

test_data: gluonts.dataset.split.TestData

class gluonts.dataset.split.OffsetSplitter(offset: int)

Bases: gluonts.dataset.split.AbstractBaseSplitter

A splitter that slices training and test data based on a fixed integer offset.

Parameters: offset (int) – Offset determining where the training data ends. A positive offset indicates how many observations since the start of each series should be in the training slice; a negative offset indicates how many observations before the end of each series should be excluded from the training slice.
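The sign convention matches ordinary Python slicing, as the following sketch on plain lists shows. It is an illustration only, not the OffsetSplitter implementation; training_slice and the sample series are hypothetical names introduced here.

```python
def training_slice(target, offset):
    """Return the training part of one series under the documented offset rule.

    Hypothetical helper: a positive offset keeps a prefix of that length,
    a negative offset drops that many observations from the end.
    """
    return target[:offset]


short_series = list(range(10))  # observations 0..9
long_series = list(range(12))   # observations 0..11

# Positive offset: the first 7 observations of every series.
print(training_slice(short_series, 7))   # [0, 1, 2, 3, 4, 5, 6]
print(training_slice(long_series, 7))    # [0, 1, 2, 3, 4, 5, 6]

# Negative offset: everything except the last 3 observations.
print(training_slice(short_series, -3))  # [0, 1, 2, 3, 4, 5, 6]
print(training_slice(long_series, -3))   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The two conventions coincide only when all series have the same length, which is why an offset-based split is best suited to uniform datasets.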

offset: int

test_pair(entry: Dict[str, Any], prediction_length: int, offset: int = 0) → Tuple[Dict[str, Any], Dict[str, Any]]

training_entry(entry: Dict[str, Any]) → Dict[str, Any]

class gluonts.dataset.split.TestData(dataset: gluonts.dataset.Dataset, splitter: gluonts.dataset.split.AbstractBaseSplitter, prediction_length: int, windows: int = 1, distance: Optional[int] = None, max_history: Optional[int] = None)

Bases: object

An iterable type used for wrapping test data.

Elements of a TestData object are pairs (input, label), where input is input data for models, while label is the future ground truth that models are supposed to predict.

Parameters

dataset (gluonts.dataset.Dataset) – Whole dataset used for testing.
splitter (gluonts.dataset.split.AbstractBaseSplitter) – A specific splitter that knows how to slice training and test data.
prediction_length (int) – Length of the prediction interval in test data.
windows (int) – Indicates how many test windows to generate for each original dataset entry.
distance (Optional[int]) – Number of time steps between the starts of consecutive test windows generated for each original dataset entry. If omitted, the windows do not overlap.
max_history (Optional[int]) – If given, every entry in the test set has a length of at most max_history. This can be used to produce smaller files.
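How one (input, label) pair relates to these parameters can be sketched on a plain list. The helper below is hypothetical and only mirrors the documented behaviour; it assumes that the input consists of everything before the window and that max_history keeps the most recent observations.

```python
from typing import List, Optional, Tuple


def make_test_pair(
    target: List[int],
    split_index: int,
    prediction_length: int,
    offset: int = 0,
    max_history: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Build one (input, label) pair from a single series (illustrative)."""
    label_start = split_index + offset
    # Input: everything the model may see before the window starts.
    input_part = target[:label_start]
    # Label: the next prediction_length values the model should predict.
    label_part = target[label_start:label_start + prediction_length]
    if max_history is not None:
        # Assumption: keep only the most recent max_history observations.
        input_part = input_part[-max_history:]
    return input_part, label_part


series = list(range(20))
input_part, label_part = make_test_pair(
    series, split_index=10, prediction_length=3, offset=3, max_history=5
)
print(input_part)  # [8, 9, 10, 11, 12]
print(label_part)  # [13, 14, 15]
```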

dataset: gluonts.dataset.Dataset

distance: Optional[int] = None

property input: gluonts.dataset.split.InputDataset

property label: gluonts.dataset.split.LabelDataset

max_history: Optional[int] = None

prediction_length: int

splitter: gluonts.dataset.split.AbstractBaseSplitter

windows: int = 1

class gluonts.dataset.split.TestTemplate(dataset: gluonts.dataset.Dataset, splitter: gluonts.dataset.split.AbstractBaseSplitter)

Bases: object

A class used for generating test data.

Parameters

dataset (gluonts.dataset.Dataset) – Whole dataset used for testing.
splitter (gluonts.dataset.split.AbstractBaseSplitter) – A specific splitter that knows how to slice training and test data.

dataset: gluonts.dataset.Dataset

generate_instances(prediction_length: int, windows: int = 1, distance: Optional[int] = None, max_history: Optional[int] = None) → gluonts.dataset.split.TestData

Generate an iterable test dataset whose elements each consist of an input part and a label part.

Parameters

prediction_length – Length of the prediction interval in test data.
windows – Indicates how many test windows to generate for each original dataset entry.
distance – Number of time steps between the starts of consecutive test windows generated for each original dataset entry. If omitted, the windows do not overlap.
max_history – If given, every entry in the test set has a length of at most max_history. This can be used to produce smaller files.

splitter: gluonts.dataset.split.AbstractBaseSplitter

class gluonts.dataset.split.TimeSeriesSlice(entry: Dict[str, Any], prediction_length: int = 0)

Bases: object

property end: pandas._libs.tslibs.period.Period

entry: Dict[str, Any]

prediction_length: int = 0

property start: pandas._libs.tslibs.period.Period

to_data_entry() → Dict[str, Any]

class gluonts.dataset.split.TrainingDataset(dataset: gluonts.dataset.Dataset, splitter: gluonts.dataset.split.AbstractBaseSplitter)

Bases: object

dataset: gluonts.dataset.Dataset

splitter: gluonts.dataset.split.AbstractBaseSplitter

gluonts.dataset.split.periods_between(start: pandas._libs.tslibs.period.Period, end: pandas._libs.tslibs.period.Period) → int

Count how many periods fit between start and end (inclusive). The frequency is taken from start.

For example:

>>> start = pd.Period("2021-01-01 00", freq="2H")
>>> end = pd.Period("2021-01-01 11", freq="2H")
>>> periods_between(start, end)
6

>>> start = pd.Period("2021-03-03 23:00", freq="30T")
>>> end = pd.Period("2021-03-04 03:29", freq="30T")
>>> periods_between(start, end)
9
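The two results above can be reproduced with standard-library arithmetic. The sketch below is not how periods_between is implemented (the function operates on pandas.Period objects); count_periods is a hypothetical stand-in that shows the inclusive counting rule.

```python
from datetime import datetime, timedelta


def count_periods(start: datetime, end: datetime, freq: timedelta) -> int:
    """Count periods of width ``freq`` from ``start`` through the one containing ``end``."""
    # Whole steps from start to end, plus one for the starting period itself.
    return (end - start) // freq + 1


two_hourly = count_periods(
    datetime(2021, 1, 1, 0), datetime(2021, 1, 1, 11), timedelta(hours=2)
)
half_hourly = count_periods(
    datetime(2021, 3, 3, 23, 0), datetime(2021, 3, 4, 3, 29), timedelta(minutes=30)
)
print(two_hourly)   # 6
print(half_hourly)  # 9
```

In the second case, 03:29 falls inside the half-hour period that starts at 03:00, which is the ninth period counting from 23:00.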

gluonts.dataset.split.slice_data_entry(entry: Dict[str, Any], slice_: slice, prediction_length: int = 0) → Dict[str, Any]

gluonts.dataset.split.split(dataset: gluonts.dataset.Dataset, *, offset: Optional[int] = None, date: Optional[pandas._libs.tslibs.period.Period] = None) → Tuple[gluonts.dataset.split.TrainingDataset, gluonts.dataset.split.TestTemplate]

gluonts.dataset.split.to_integer_slice(slice_: slice, start: pandas._libs.tslibs.period.Period) → slice

Return an equivalent slice with integer bounds, given the start timestamp of the sequence it will apply to.

gluonts.dataset.split.to_positive_slice(slice_: slice, length: int) → slice

Return an equivalent slice with positive bounds, given the length of the sequence it will apply to.
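The idea of resolving negative bounds against a known length can be illustrated with the built-in slice.indices method. This is a standard-library stand-in, not the to_positive_slice implementation: positive_bounds is a hypothetical name, and unlike the library function it may also fill in missing bounds and the step.

```python
def positive_bounds(slice_: slice, length: int) -> slice:
    """Resolve negative or missing bounds of a forward slice against ``length``."""
    start, stop, step = slice_.indices(length)
    return slice(start, stop, step)


data = list(range(10))
negative = slice(-3, None)  # "the last three items"
positive = positive_bounds(negative, len(data))

print(positive)        # slice(7, 10, 1)
print(data[positive])  # [7, 8, 9]
```

Both slices select the same items from any sequence of length 10, but only the second can be compared or combined with other integer positions without knowing the length.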