Datasets¶
Available datasets:
Name | Type of experiment | Channels | Sampling rate | Original size |
---|---|---|---|---|
CWRU | Initiated faults | up to 3 vibrations | 12 kHz / 48 kHz | 656 MB |
Dcase2020 | Labelled data | 1 acoustic | 16 kHz | 16.5 GB |
DIRG | Initiated faults & Run-to-failure | 6 vibrations | 51.2 kHz / 102.4 kHz | 3 GB |
FEMTO | Run-to-failure | up to 3: 2 vibrations + 1 temperature | 25.6 kHz / 10 Hz | 3 GB |
Fraunhofer151 | Labelled data | 5: 1 voltage + 1 rpm + 3 vibrations | 4096 Hz | 11 GB |
IMS | Run-to-failure | up to 8 vibrations | 20480 Hz | 6.1 GB |
Paderborn | Initiated faults & Run-to-failure | 8: 2 currents + 1 vibration + 3 mechanical + 1 temperature | 64 kHz / 4 kHz / 1 Hz | 20.8 GB |
Meaning of different types of experiment:
- Initiated faults: fault is precisely labelled, e.g. by the name of the faulty component.
- Labelled data: fault is roughly labelled, e.g. by 'normal' or 'faulty'.
- Run-to-failure: continuous experiment; in general, no information about the faulty state is available.
Installation & Preprocessing¶
Nomenclature
In `dpmhm` we distinguish three concepts:
- Original dataset: what is provided by the original source
- Built dataset: original dataset formatted by tensorflow-datasets
- Preprocessed dataset: built dataset after preprocessing steps
As mentioned in Workflow, a dataset needs to be installed and preprocessed before being fed to ML models.
Installation¶
Let's take the dataset CWRU as an example. For installation, simply use:
from dpmhm import datasets
datasets.install('CWRU')
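Once installed, the built dataset can be loaded like any other `tensorflow-datasets` dataset. Below is a minimal sketch; the registered name `'cwru'` and the data directory are assumptions and may differ on your system:

```python
import tensorflow_datasets as tfds

# Load the built dataset; the name 'cwru' and data_dir are assumptions.
ds = tfds.load('cwru', data_dir='~/tensorflow_datasets')

# Inspect the standard interface of the built dataset.
print(ds['train'].element_spec)
```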
See the tutorials Installation of Datasets and Preprocessing of Datasets for an in-depth walkthrough.
Preprocessing¶
A preprocessing pipeline consists of 3 levels of transformations:
- File-level preprocessing: data selection, label modification, signal resampling & truncation etc.
- Data-level preprocessing: feature extraction, windowed view, data augmentation etc.
- Model-level preprocessing: adaptation to the specification of a machine learning model.
Let's take the example of a dataset of 3-channel acoustic records, 10 seconds per file. The corresponding steps would be (see the sketch below):
- Select only the first channel and refine the labels.
- Compute the spectrogram of each record and split into patches of shape (64, 64).
- Make batches of paired views of the patches, for training a contrastive learning model.
See the page Datasets for more details.
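A minimal sketch of the data-level step above, assuming `ds` is a built dataset with a hypothetical channel key `'channel1'`; the STFT parameters are illustrative and not part of `dpmhm`:

```python
import tensorflow as tf

def to_patches(record):
    # Keep only the first channel (hypothetical key 'channel1').
    x = record['signal']['channel1']
    # Spectrogram via the short-time Fourier transform (illustrative parameters).
    spec = tf.abs(tf.signal.stft(x, frame_length=256, frame_step=128))
    # Split the (time, frequency) image into non-overlapping 64x64 patches.
    spec = spec[tf.newaxis, ..., tf.newaxis]   # shape (1, time, freq, 1)
    patches = tf.image.extract_patches(
        images=spec,
        sizes=[1, 64, 64, 1], strides=[1, 64, 64, 1],
        rates=[1, 1, 1, 1], padding='VALID')
    return tf.reshape(patches, (-1, 64, 64))

# Each file yields several patches; unbatch() flattens them into single patches.
ds_patches = ds['train'].map(to_patches).unbatch()
```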
Convention for the data structure¶
Built datasets have a standard interface that can be inspected via the property `.element_spec`. In `dpmhm` a built dataset contains the following fields:
Name | Type | Description |
---|---|---|
`signal` | dict | data of this record |
`sampling_rate` | int or dict | sampling rate (Hz) of `signal` |
`metadata` | dict | all other information about this record |
The same structure is shared by all items of the dataset. When `sampling_rate` is a dictionary, it has the same keys as `signal` and its values are the sampling rates of the corresponding channels in `signal`; otherwise it is a number representing the common sampling rate of all channels.
For example, the structure of the dataset CWRU looks like:
>>> ds['train'].element_spec
{'metadata': {'FaultComponent': TensorSpec(shape=(), dtype=tf.string, name=None),
'FaultLocation': TensorSpec(shape=(), dtype=tf.string, name=None),
'FaultSize': TensorSpec(shape=(), dtype=tf.float32, name=None),
'FileName': TensorSpec(shape=(), dtype=tf.string, name=None),
'LoadForce': TensorSpec(shape=(), dtype=tf.uint32, name=None),
'NominalRPM': TensorSpec(shape=(), dtype=tf.uint32, name=None),
'RPM': TensorSpec(shape=(), dtype=tf.uint32, name=None)},
'sampling_rate': TensorSpec(shape=(), dtype=tf.uint32, name=None),
'signal': {'BA': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
'DE': TensorSpec(shape=(None,), dtype=tf.float32, name=None),
'FE': TensorSpec(shape=(None,), dtype=tf.float32, name=None)}}
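Individual records can then be accessed as plain dictionaries following this specification (a sketch, assuming `ds` is the built CWRU dataset loaded earlier):

```python
for record in ds['train'].take(1):
    x = record['signal']['DE']        # drive-end vibration waveform
    sr = record['sampling_rate']      # a scalar here: common rate of all channels
    meta = record['metadata']
    print(meta['FaultComponent'].numpy(), sr.numpy(), x.shape)
```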
Data structure
We follow the principle that all original information must be preserved in the built dataset. However, some of it may be non-essential and can be dropped in subsequent preprocessing steps, which may modify the element specification of the dataset.
Performance¶
The subsequent preprocessing steps of a dataset actually define a pipeline of transformations, with most of the heavy lifting done by the method `.map()`. In the graph execution mode of TensorFlow these transformations are not executed until an element of the preprocessed dataset is loaded into memory. Some intermediate steps may be repeated many times and hinder performance. As a remedy, one can first pre-compute the heavyweight intermediate transformations and export the transformed dataset to disk, then apply only the lightweight final transformations in memory.
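A sketch of this export-then-reload pattern follows; the path and the heavyweight transformation `to_patches` are assumptions, and `Dataset.save()`/`Dataset.load()` require TensorFlow ≥ 2.10 (older versions use `tf.data.experimental.save()`/`load()`):

```python
import tensorflow as tf

# Pre-compute the heavyweight intermediate transformation once and export it to disk.
ds_heavy = ds['train'].map(to_patches, num_parallel_calls=tf.data.AUTOTUNE).unbatch()
ds_heavy.save('/tmp/cwru_patches')

# Later: reload the pre-computed dataset and keep only lightweight steps in memory.
ds_final = (tf.data.Dataset.load('/tmp/cwru_patches')
            .shuffle(1024)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))
```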