Data Writing Step#
Introduction#
To make it easy to get data onto a device for experimentation and the Tutorials, OctaiPipe offers a data writing step.
The data writing step has two main functionalities:
Get a single dataset onto a device, e.g. read sample data from a SQL table and save it in a CSV on the device
Write data in intervals to a device, simulating live data
The data is loaded from one of three sources and then written based on the output_data_specs.
Data sources#
The data writing step has three main sources of data:
Data downloaded from a link
Data from a local file (can be uploaded through deployment)
Data from input data specs
These are checked in turn, i.e. if a link is provided, data will not be retrieved from a file or from input data specs. Input data specs are only used if both the link and the filepath are null.
Let’s go through these three sources individually:
Downloaded from a link#
This allows data to be downloaded from a link to a CSV file and then written to the location specified by the output_data_specs.
NOTE: The link has to point to a comma-separated CSV file. No processing of the data is done on the device; it is assumed to be ready to write.
A link can be provided per device to put different data on each device, or the same data on all devices.
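A minimal sketch of data_feeding_specs using from_link is shown below; the URLs and the device ID device-1 are placeholders:

data_feeding_specs:
  from_link: # Link to download from
    default: 'https://example.org/sample.csv'          # used by any device not listed
    device-1: 'https://example.org/sample-dev-1.csv'   # device-specific data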
Read from a file#
This allows data to be read from a local file and then written to the location specified by the output_data_specs.
NOTE: As with the link, this needs to be a comma-separated CSV file.
A filepath can be provided per device to put different data on each device, or the same data on all devices.
The file(s) can also be transferred down to the device using OctaiPipe’s deployment mechanism. If the file to read from is in the configs folder of the workspace, it will be in the configs folder on the device as well. So, a file in ./configs/data.csv can be specified as ./configs/data.csv in the data writing step.
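A minimal sketch of data_feeding_specs using from_file, assuming the files sit in the workspace configs folder; the filenames and the device ID device-1 are placeholders:

data_feeding_specs:
  from_file: # Filepath to CSV
    default: './configs/data.csv'          # used by any device not listed
    device-1: './configs/data-dev-1.csv'   # device-specific data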
From input data specs#
This simply uses input_data_specs with the same pattern as those for Data Loading and Writing Utilities.
Input data specs are not required, but can be used if you want to, for example, read from a SQL database and write to influx on your device.
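As a sketch, the config below reads via input_data_specs because from_link and from_file are left unset. It follows the same influxdb pattern as the full example in the next section; the bucket and measurement names are placeholders:

# from_link and from_file are not set, so input_data_specs is used
input_data_specs:
  default:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/influx_query.txt
        query_config:
          start: "2024-04-11T16:00:00.000Z"
          stop: "2024-04-11T16:30:00.000Z"
          bucket: source-bucket
          measurement: source-measurement

output_data_specs:
  default:
    - datastore_type: influxdb
      settings:
        bucket: device-bucket
        measurement: sample-data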
Configuring and running step#
Configuring the step in OctaiPipe is similar to other pipeline steps. A YAML config file is used, and the step can be run on edge devices using deploy_to_edge.
An example config file is shown below:
name: data_writing

# Input data specs not necessary
input_data_specs:
  default:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/influx_query.txt
        query_config:
          start: "2024-04-11T16:00:00.000Z"
          stop: "2024-04-11T16:30:00.000Z"
          bucket: test-bucket
          measurement: test-measurement
  feda-test-1:
    - datastore_type: influxdb
      settings:
        query_type: dataframe
        query_template_path: ./configs/influx_query.txt
        query_config:
          start: "2024-04-11T16:00:00.000Z"
          stop: "2024-04-11T16:30:00.000Z"
          bucket: test-bucket
          measurement: test-measurement

output_data_specs:
  default:
    - datastore_type: influxdb
      settings:
        bucket: test-bucket
        measurement: live-model-predictions

data_feeding_specs:
  from_link: # Link to download from
    default: 'https://link-to-csv.org/my-csv.csv'
    device-1: 'https://link-to-csv.org/my-csv-dev-1.csv'
  from_file: # Filepath to CSV
    default: './configs/my-data.csv'
    device-1: './configs/my-data-dev-1.csv'
  write_once: false # Whether to write all at once or simulate live data
  chunk_size: 10 # Number of rows to write at once
  interval: 10 # How long to sleep between writes
  index_cols: ['_time'] # columns to use as index if any
  exclude_cols: ['x_log_3'] # columns to exclude if any

run_specs: {}
The key parts of the above config are the following:
input_data_specs#
Where to read data from if from_link and from_file are None. This will be overridden if a link or file is provided.
data_feeding_specs#
Contains all configuration specific to the data writing step.
from_link#
A dictionary where each key is a device ID and the value is the link to download from. The key default can be used for any device ID not listed. This defaults to None.
from_file#
A dictionary where each key is a device ID and the value is the filepath to read from. The key default can be used for any device ID not listed. This defaults to None.
write_once#
Whether to write all data at once or in intervals to mimic live data. If simulating live data, the data is read from the top down. This defaults to True, meaning data is written only once.
chunk_size#
How many rows of data to write at once if writing in intervals. Defaults to 10.
interval#
How often to write, in seconds. Defaults to 10 seconds.
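As an illustration, the data_feeding_specs sketch below (with arbitrary values) writes 50 rows every 5 seconds, so a 600-row file would take roughly a minute to stream:

data_feeding_specs:
  from_file:
    default: './configs/my-data.csv'
  write_once: false   # stream in chunks rather than writing all at once
  chunk_size: 50      # rows per write
  interval: 5         # seconds to sleep between writes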
index_cols#
List of columns in the data to use as the index, if any, when using from_file or from_link.
exclude_cols#
List of columns in the data to exclude, if any, when using from_file or from_link.
Notes on data formatting#
In general, the data needs to be in the exact format you want it saved in. For example, if the dataset includes a column that you do not wish to save, remove it before running the step, as the step does no formatting.
If you are writing to an influx database and write_once is set to True, the data needs to have a column called _time, which holds the timestamps of the data. If you write continuously to influx, the timestamp is set by the data writing step.
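For illustration, a CSV suitable for a write_once write to influx might begin like this; the column names other than _time are made up:

_time,temperature,pressure
2024-04-11T16:00:00.000Z,21.4,1.012
2024-04-11T16:00:10.000Z,21.6,1.013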