
Encoding ODB-2 Data

Trivial Example

To encode a pandas DataFrame as ODB-2 data, pass it to the encode_odb() function together with the output file name:

[1]:
import sys
import os
sys.path.insert(0, os.path.abspath('../../..'))
[2]:
import pandas as pd
import pyodc as odc

df = pd.read_csv('data-1.csv')

odc.encode_odb(df, 'example-1.odb')

File Type Object

Encoding of ODB-2 data works with file-like objects as well as with file names:

[3]:
with open('example-1.odb', 'wb') as f:
    odc.encode_odb(df, f)

Configuring Encoded Columns

By default, pyodc will always encode ODB-2 data in a lossless manner. In particular, most values are encoded as 8-byte DOUBLE values.

Typically, the encoder will automatically select a data type and corresponding encoder to use. This data type can be overridden by supplying a types dictionary, for example to encode a column as a 4-byte REAL value:

[4]:
odc.encode_odb(df, 'example-3.odb', types={'obsvalue@body': odc.REAL})

Inspecting the frame headers shows that the data type has changed:

[5]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r3 = odc.Reader('example-3.odb', aggregated=False)

print('original:', r1.frames[0].column_dict['obsvalue@body'].dtype)
print('updated: ', r3.frames[0].column_dict['obsvalue@body'].dtype)
original: DataType.DOUBLE
updated:  DataType.REAL

The decoded data also confirms that the precision of the obsvalue@body column has been reduced accordingly:

[6]:
df_decoded = odc.read_odb('example-3.odb', single=True)
print(df_decoded)
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890       0.000000
1       1  20210420     stat01  0-12345-0-67891      12.345600
2       1  20210420     stat02  0-12345-0-67892      24.691200
3       1  20210420     stat03  0-12345-0-67893      37.036800
4       1  20210420     stat04  0-12345-0-67894      49.382401
5       1  20210420     stat05  0-12345-0-67895      61.728001
6       1  20210420     stat06  0-12345-0-67896      74.073601
7       1  20210420     stat07  0-12345-0-67897      86.419197
8       1  20210420     stat08  0-12345-0-67898      98.764801
9       1  20210420     stat09  0-12345-0-67899     111.110397

   integer_missing  double_missing  bf_column  bf_missing
0           1234.0           12.34          0         0.0
1           4321.0           43.21          9         9.0
2              NaN             NaN          6         6.0
3           1234.0           12.34         10        10.0
4           4321.0           43.21          5         5.0
5              NaN             NaN          7         NaN
6           1234.0           12.34         15        15.0
7           4321.0           43.21          0         0.0
8              NaN             NaN          9         9.0
9           1234.0           12.34          6         6.0
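The precision loss seen in obsvalue@body can be reproduced without pyodc. The sketch below assumes that REAL corresponds to an IEEE 754 single-precision value and DOUBLE to a double-precision value, and uses Python's struct module to round-trip one value from the table. The roundtrip helper is hypothetical and only illustrates the effect; it is not pyodc's encoder.

```python
import struct


def roundtrip(value, fmt):
    """Pack a Python float into the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


original = 49.3824

as_double = roundtrip(original, '<d')  # 8 bytes: assumed equivalent of DOUBLE
as_real = roundtrip(original, '<f')    # 4 bytes: assumed equivalent of REAL

print(as_double == original)       # the 8-byte round trip is exact
print(as_real == original)         # the 4-byte round trip is not
print(abs(as_real - original))     # size of the rounding error
```

The rounding error is small but non-zero, which is why the REAL column prints values such as 49.382401 instead of 49.3824.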

Configuring Frame Structure

ODB-2 data is broken down into frames. By default, at most 10 000 rows are encoded into each frame. If more than 10 000 rows are supplied, the data is split into a sequence of frames of at most 10 000 rows each.

To change this threshold, pass the rows_per_frame argument:

[7]:
odc.encode_odb(df, 'example-4.odb', rows_per_frame=3)

Examining the frame structure shows that the data now spans multiple frames:

[8]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r4 = odc.Reader('example-4.odb', aggregated=False)

print('original frames:', r1.frames)
print('updated  frames:', r4.frames)

print('original row counts:', [f.nrows for f in r1.frames])
print('updated  row counts:', [f.nrows for f in r4.frames])
original frames: [<pyodc.frame.Frame object at 0x1218d4b80>]
updated  frames: [<pyodc.frame.Frame object at 0x1080e3c70>, <pyodc.frame.Frame object at 0x121867610>, <pyodc.frame.Frame object at 0x121867d30>, <pyodc.frame.Frame object at 0x121866e60>]
original row counts: [10]
updated  row counts: [3, 3, 3, 1]
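The row counts above follow from splitting the rows sequentially into chunks of at most rows_per_frame. The sketch below reproduces that arithmetic in plain Python. The frame_row_counts helper is hypothetical and assumes a simple sequential split, which is consistent with the output shown but is not taken from pyodc's source.

```python
def frame_row_counts(total_rows, rows_per_frame=10000):
    """Return the number of rows in each frame for a sequential split."""
    if rows_per_frame <= 0:
        raise ValueError("rows_per_frame must be positive")
    return [
        min(rows_per_frame, total_rows - start)
        for start in range(0, total_rows, rows_per_frame)
    ]


print(frame_row_counts(10))                    # [10]
print(frame_row_counts(10, rows_per_frame=3))  # [3, 3, 3, 1]
```

Only the final frame can hold fewer rows than the threshold.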

Despite these differences in frame structure, decoding the file produces the same data:

[9]:
df_decoded = odc.read_odb('example-4.odb', single=True)
print(df_decoded)
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890         0.0000
1       1  20210420     stat01  0-12345-0-67891        12.3456
2       1  20210420     stat02  0-12345-0-67892        24.6912
3       1  20210420     stat03  0-12345-0-67893        37.0368
4       1  20210420     stat04  0-12345-0-67894        49.3824
5       1  20210420     stat05  0-12345-0-67895        61.7280
6       1  20210420     stat06  0-12345-0-67896        74.0736
7       1  20210420     stat07  0-12345-0-67897        86.4192
8       1  20210420     stat08  0-12345-0-67898        98.7648
9       1  20210420     stat09         0-12345-       111.1104

   integer_missing  double_missing  bf_column  bf_missing
0           1234.0           12.34          0         0.0
1           4321.0           43.21          9         9.0
2              NaN             NaN          6         6.0
3           1234.0           12.34         10        10.0
4           4321.0           43.21          5         5.0
5              NaN             NaN          7         NaN
6           1234.0           12.34         15        15.0
7           4321.0           43.21          0         0.0
8              NaN             NaN          9         9.0
9           1234.0           12.34          6         6.0

Additional Properties

To encode additional properties as part of a frame's data, pass a dictionary of the values you want to include as the properties parameter of the encode_odb() function:

[10]:
metadata = {
    'encoded_by': 'ECMWF',
    'data_source': 'pyodc_docs',
}
odc.encode_odb(df, 'example-5.odb', properties=metadata)

Encoded properties are accessible via the properties attribute of the frame object:

[11]:
r1 = odc.Reader('example-5.odb')
print([f.properties for f in r1.frames])
[{'encoded_by': 'ECMWF', 'data_source': 'pyodc_docs'}]

Encoding Bitfields

Bitfield columns encode integer values accompanied by metadata describing the meaning of the bits. A column cannot be auto-detected as a bitfield, because its data will be treated as integers, so the type must be set explicitly.

An additional dictionary may be passed to the encode function to describe the bitfield structure. For each bitfield column, supply a sequence of values corresponding to the individual bit fields. Each of these values can take one of two forms:

  • A string, naming the bit field (which will be assumed to comprise a single bit)

  • A tuple of the name of the bit field and the number of corresponding bits

The bit widths in the supplied sequence should add up to the number of bits used by the values.

[12]:
types = {
    'bf_column': odc.BITFIELD,
    'bf_missing': odc.BITFIELD,
}

bitfields = {
    'bf_column': ['bit1', ('bitpair', 2), ('bit4', 1)],
    'bf_missing': ['bit1', ('bitpair', 2), ('bit4', 1)]
}

odc.encode_odb(df, 'example-6.odb', types=types, bitfields=bitfields)

The individual bit fields can be inspected by requesting them explicitly as columns when decoding:

[13]:
df_decoded = odc.read_odb('example-6.odb',
                          columns=['bf_column.bit1', 'bf_column.bitpair', 'bf_column.bit4',
                                   'bf_missing.bit1', 'bf_missing.bitpair', 'bf_missing.bit4'],
                          single=True)
print(df_decoded)
   bf_column.bit1  bf_column.bitpair  bf_column.bit4 bf_missing.bit1  \
0           False                  0           False           False
1            True                  0            True            True
2           False                  3           False           False
3           False                  1            True           False
4            True                  2           False            True
5            True                  3           False            None
6            True                  3            True            True
7           False                  0           False           False
8            True                  0            True            True
9           False                  3           False           False

   bf_missing.bitpair bf_missing.bit4
0                 0.0           False
1                 0.0            True
2                 3.0           False
3                 1.0            True
4                 2.0           False
5                 NaN            None
6                 3.0            True
7                 0.0           False
8                 0.0            True
9                 3.0           False
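The decoded table can be checked by hand with a little bit arithmetic. The sketch below assumes that fields are laid out starting from the least significant bit, in the order given in the bitfields sequence; this ordering is inferred from the output above (for example, the value 9 decodes to bit1 True, bitpair 0, bit4 True) and not from pyodc's source. The split_bitfield helper is hypothetical.

```python
def split_bitfield(value, layout):
    """Split an integer into named fields, least significant bit first."""
    fields = {}
    shift = 0
    for item in layout:
        # A bare string names a single-bit field; a tuple gives (name, width).
        name, width = (item, 1) if isinstance(item, str) else item
        fields[name] = (value >> shift) & ((1 << width) - 1)
        shift += width
    return fields


layout = ['bit1', ('bitpair', 2), ('bit4', 1)]

print(split_bitfield(9, layout))   # {'bit1': 1, 'bitpair': 0, 'bit4': 1}
print(split_bitfield(6, layout))   # {'bit1': 0, 'bitpair': 3, 'bit4': 0}
print(split_bitfield(10, layout))  # {'bit1': 0, 'bitpair': 1, 'bit4': 1}
```

These results agree with rows 1, 2 and 3 of the decoded bf_column fields, with single-bit fields shown there as booleans.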

A Sequence of (Unrelated) Data

ODB-2 frames are self-contained and passed as a stream of data, which means there is no requirement that they are related to each other.

For example, we can encode frames of two different structures (also known as incompatible data):

[14]:
df2 = pd.read_csv('data-2.csv')

with open('example-2.odb', 'wb') as f:
    odc.encode_odb(df, f)
    odc.encode_odb(df2, f)

Decoding this file into a single DataFrame now yields a substantial number of missing values, because columns absent from a frame are filled in as missing:

[15]:
with open('example-2.odb', 'rb') as f:
    df_decoded = odc.read_odb(f, single=True)

print(df_decoded)
    expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0        1  20210420     stat00  0-12345-0-67890         0.0000
1        1  20210420     stat01  0-12345-0-67891        12.3456
2        1  20210420     stat02  0-12345-0-67892        24.6912
3        1  20210420     stat03  0-12345-0-67893        37.0368
4        1  20210420     stat04  0-12345-0-67894        49.3824
5        1  20210420     stat05  0-12345-0-67895        61.7280
6        1  20210420     stat06  0-12345-0-67896        74.0736
7        1  20210420     stat07  0-12345-0-67897        86.4192
8        1  20210420     stat08  0-12345-0-67898        98.7648
9        1  20210420     stat09  0-12345-0-67899       111.1104
10       2  20210420     stat00             None         0.0000
11       2  20210420     stat01             None        12.3456
12       2  20210420     stat02             None        24.6912
13       2  20210420     stat03             None        37.0368
14       2  20210420     stat04             None        49.3824
15       2  20210420     stat05             None        61.7280
16       2  20210420     stat06             None        74.0736
17       2  20210420     stat07             None        86.4192
18       2  20210420     stat08             None        98.7648
19       2  20210420     stat09             None       111.1104

    integer_missing  double_missing  bf_column  bf_missing
0            1234.0           12.34        0.0         0.0
1            4321.0           43.21        9.0         9.0
2               NaN             NaN        6.0         6.0
3            1234.0           12.34       10.0        10.0
4            4321.0           43.21        5.0         5.0
5               NaN             NaN        7.0         NaN
6            1234.0           12.34       15.0        15.0
7            4321.0           43.21        0.0         0.0
8               NaN             NaN        9.0         9.0
9            1234.0           12.34        6.0         6.0
10              NaN             NaN        NaN         NaN
11              NaN             NaN        NaN         NaN
12              NaN             NaN        NaN         NaN
13              NaN             NaN        NaN         NaN
14              NaN             NaN        NaN         NaN
15              NaN             NaN        NaN         NaN
16              NaN             NaN        NaN         NaN
17              NaN             NaN        NaN         NaN
18              NaN             NaN        NaN         NaN
19              NaN             NaN        NaN         NaN
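The pattern of missing values is what one would expect when rows with different column sets are stacked into one table. The sketch below mimics this in plain Python, with each frame as a list of row dictionaries. The combine_frames helper and the sample rows are hypothetical stand-ins for illustration and say nothing about how pyodc performs the merge internally.

```python
def combine_frames(*frames):
    """Concatenate lists of row dicts, filling absent columns with None."""
    columns = []
    for frame in frames:
        for row in frame:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Every output row carries the union of all columns seen in any frame.
    return [
        {name: row.get(name) for name in columns}
        for frame in frames
        for row in frame
    ]


frame_a = [{'expver': 1, 'statid@hdr': 'stat00', 'wigos@hdr': '0-12345-0-67890'}]
frame_b = [{'expver': 2, 'statid@hdr': 'stat00'}]

combined = combine_frames(frame_a, frame_b)
for row in combined:
    print(row)
```

The row taken from frame_b has no wigos@hdr value, so it is reported as None, just as rows 10 to 19 are in the decoded DataFrame above.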