Data Writing Step#

Introduction#

To make it easy to get data onto a device for experimentation and tutorials, OctaiPipe offers a data writing step.

The data writing step has two main functionalities:

  1. Get a single dataset onto a device, e.g. read sample data from a SQL table and save it in a CSV on the device

  2. Write data in intervals to a device, simulating live data

The data is loaded from one of three sources and then written based on the output_data_specs.

Data sources#

The data writing step has three main sources of data:

  1. Data downloaded from a link

  2. Data from a local file (can be uploaded through deployment)

  3. Data from input data specs

These are checked in turn: if a link is provided, data will not be retrieved from a file or from input data specs. Input data specs are only used if both link and filepath are null.
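For example, with a data_feeding_specs like the following sketch (the path is a placeholder), the step would read from the local file, because no link is provided:

data_feeding_specs:
  from_link: null  # no link, so the step falls back to from_file
  from_file:
    default: './configs/my-data.csv'
  # input_data_specs would only be used if from_file were also null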

Let’s go through these three sources individually:

Download from a link#

If a URL is provided under from_link, the step downloads the data from that link and writes it to the location specified by the output_data_specs. The file behind the link needs to be a CSV.

A link can be provided per device to put different data on each device, or the same data on all devices.

Read from a file#

This allows data to be read from a local file, then written to the location specified by the output_data_specs.

NOTE: Like data downloaded from a link, this needs to be a CSV file.

A filepath can be provided per device to put different data on each device, or the same data on all devices.

The file(s) can also be transferred down to the device using OctaiPipe’s deployment mechanism. If the file to read from is in the configs folder of the workspace, it will be in the configs folder on the device as well. So, a file in ./configs/data.csv can be specified as ./configs/data.csv in the data writing step.

From input data specs#

This simply uses input_data_specs with the same pattern as those for Federated Learning or Federated EDA.

Input data specs are not required for the step, but can be used if you want to, for example, read from a SQL database and write to InfluxDB on your device.

Configuring and running the step#

Configuring the step in OctaiPipe is similar to other pipeline steps: a YAML config file is used, and the step can be run on edge devices using deploy_to_edge.

An example config file is shown below:

name: data_writing

# Input data specs not necessary
input_data_specs:
  devices:
  - device: default
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/influx_query.txt
    query_values:
      start: "2024-04-11T16:00:00.000Z"
      stop: "2024-04-11T16:30:00.000Z"
      bucket: test-bucket
      measurement: test-measurement
    data_converter: {}
  - device: feda-test-1
    datastore_type: influxdb
    query_type: dataframe
    query_template_path: ./configs/influx_query.txt
    query_values:
      start: "2024-04-11T16:00:00.000Z"
      stop: "2024-04-11T16:30:00.000Z"
      bucket: test-bucket
      measurement: test-measurement
    data_converter: {}

output_data_specs:
  - datastore_type: influxdb
    settings:
      bucket: test-bucket
      measurement: live-model-predictions

data_feeding_specs:
  from_link: # Link to download from
    default: 'https://link-to-csv.org/my-csv.csv'
    device-1: 'https://link-to-csv.org/my-csv-dev-1.csv'
  from_file: # Filepath to CSV
    default: './configs/my-data.csv'
    device-1: './configs/my-data-dev-1.csv'
  write_once: false # Whether to write all at once or simulate live data
  chunk_size: 10 # Number of rows to write at once
  interval: 10 # How long to sleep between writes

run_specs: {}

The key things from the above config are the following:

input_data_specs#

Where to read data from if from_link and from_file are None. These specs are ignored if a link or file is provided.

data_feeding_specs#

Contains all configuration specific to the data writing step.

from_link#

This is a dictionary where each key is a device ID and the value the URL of a CSV file to download. The key default can also be used for any device ID not listed.

This defaults to None.

from_file#

This is a dictionary where each key is a device ID and the value the filepath to read from. The key default can also be used for any device ID not listed.

This defaults to None.
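As a sketch, the following would put a device-specific file on feda-test-1 (the device ID from the example config) and the default file on every other device:

from_file:
  default: './configs/my-data.csv'          # any device ID not listed below
  feda-test-1: './configs/feda-test-1.csv'  # this device gets its own data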

write_once#

Whether to write all data at once or in intervals to mimic live data. If simulating live data, the data will be read from the top down.

This defaults to True, meaning data is written once only.

chunk_size#

How many rows of data to write at once if writing in intervals. Defaults to 10.

interval#

How long to wait between writes, in seconds. Defaults to 10 seconds.
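Taken together, write_once, chunk_size and interval control how live data is simulated. As a sketch, with the following (hypothetical) values the step writes 50 rows every 5 seconds, so a 1,000-row CSV is streamed in 20 chunks over roughly 100 seconds:

data_feeding_specs:
  from_file:
    default: './configs/my-data.csv'
  write_once: false  # stream in intervals instead of writing once
  chunk_size: 50     # rows per write
  interval: 5        # seconds to sleep between writes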

Notes on data formatting#

In general, the data needs to already be in the format you wish it to be saved in. For example, if the dataset includes a column that you do not wish to save, remove it before running the step, as the step does no formatting.

If you are writing to an InfluxDB database and write_once is set to True, the data needs a column called _time containing the timestamps of the data. If you write continuously to InfluxDB, the timestamp is set by the data writing step.
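For instance, a CSV to be written once to InfluxDB might look like the following sketch (the column names other than _time are hypothetical):

_time,temperature,pressure
2024-04-11T16:00:00.000Z,21.4,101.2
2024-04-11T16:00:10.000Z,21.6,101.1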