convtools Table - stream processing

1. Installation

pip install convtools

For the sake of conciseness, let’s assume the following imports are in place:

from convtools.contrib.tables import Table
from convtools import conversion as c

This is an object which exposes public API.

Please make sure you’ve read - base info here.

2. Basics

  • table conversion - an instance of convtools.contrib.tables.Table, which works with iterables, so it cannot be reused once consumed

  • duplicate column names - there’s a special treatment to these:

    • when initializing from iterable of rows, by default it raises ValueError when a duplicate column name is detected (duplicate_columns option is set to "raise" by default)

    • when initializing from CSV, by default it mangles duplicate column names like: a, a_1, a_2, etc. (duplicate_columns option is set to "mangle" by default)

    duplicate_columns option accepts the following values: "raise", "keep", "drop", "mangle", please see convtools.contrib.tables.Table.from_rows docs for more information

3. Reading CSV

Let’s:
  • read tab-delimited tests/csvs/ac.csv file, which contains two columns a and c and a header in the first row

  • take a and c columns, omitting other imaginary ones :)

  • add a new column B, which is a sum of a and c (casting to types is needed, since the simple csv.reader is used under the hood, so everything is a string)

  • rename a column to A

  • drop c column in the end

  • store the result in tests/csvs/out.csv file

(
    Table.from_csv(
        "tests/csvs/ac.csv",
        header=True,
        dialect=Table.csv_dialect(delimiter="\t"),
    )
    .take("a", "c")
    .update(B=c.col("a").as_type(int) + c.col("c").as_type(int))
    .rename({"a": "A"})
    .drop("c")
    .into_csv("tests/csvs/out.csv")
)

Reading dicts like csv.DictReader, but faster because dicts are initialized as literals:

Table.from_csv("tests/csvs/ac.csv", True).into_iter_rows(dict)

Custom dialect (e.g. different delimiter), custom header:

Table.from_csv(
    "tests/csvs/ac.csv",
    header={"A": 1, "B": 0},  # indices of list (row from csv.reader)
    skip_rows=1,  # skipping the heading row
    dialect=Table.csv_dialect(delimiter="\t"),
).into_iter_rows(dict)

Warning

Providing own headers, be sure Table will raise ValueError if numbers of columns don’t match.

For more details see:

  1. convtools.contrib.tables.Table.from_csv

  2. convtools.contrib.tables.Table.into_iter_rows

  3. convtools.contrib.tables.Table.into_csv


It’s important to note what is going on under the hood. We can wrap the above like below to see ad hoc code convtools.contrib.tables.Table generates under the hood (using convtools conversions):

with c.OptionsCtx() as options:
    options.debug = True
    Table.from_csv("tests/csvs/ab.csv", header=True).update(
        c=c.col("a") + c.col("b")
    ).into_csv("tests/csvs/out.csv")

Points to comprehend:

  1. table conversions embed indices and don’t have superfluous loops inside. This allows them to work just as fast as simple bare python code.

  2. table conversions work with iterables, so they cannot be reused once consumed

  3. table conversions do their best to be lazy except for cases where it’s impossible (e.g. when convtools.contrib.tables.Table.join decides to use hash-join, it builds a full hashmap, consuming the right side iterable

4. Reading rows

Just pass an iterable of one of tuple/list/dict:

# if no header passed, columns get names like: "COLUMN_0", "COLUMN_1", etc.
Table.from_rows([(1, 2, 3), (2, 3, 4)])

Table.from_rows([[1, 2, 3], [2, 3, 4]], header=["a", "b", "c"])

Table.from_rows([{"a": 1, "b": 2}, {"a": 2, "b": 3}])

For more details see:

  1. convtools.contrib.tables.Table.from_rows

  2. convtools.contrib.tables.Table.into_iter_rows

  3. convtools.contrib.tables.Table.into_csv

5. Rename, take / rearrange, drop columns

These methods operate with column names and can accept multiple values:

# just to show all at once
list(
    Table.from_rows([("A", "b", "c"), (1, 2, 3), (2, 3, 4)], header=True)
    .rename({"A": "a"})
    .drop("b")
    .take("c", ...)  # put "c" column as the first one
    .into_iter_rows(dict)
)
# Out[2]: [{'c': 3, 'a': 1}, {'c': 4, 'a': 2}]

For more details see:

  1. convtools.contrib.tables.Table.rename

  2. convtools.contrib.tables.Table.drop

  3. convtools.contrib.tables.Table.take

6. Add, update columns

To process data:
  • you should be comfortable with convtools conversions

  • use c.col("a") syntax to reference "a" column values (all conversions are element-wise).

list(
    Table.from_rows([(1, -2), (2, -3)], ["a", "b"])
    .update(c=c.col("a") + c.col("b"))  # adding new column: "c"
    .update(c=c.call_func(abs, c.col("c")))  # updating new column: "c"
    .into_iter_rows(dict)
)
# Out[3]: [{'a': 1, 'b': -2, 'c': 1}, {'a': 2, 'b': -3, 'c': 1}]

For more details see:

  1. convtools.contrib.tables.Table.update

7. Filter rows

You can filter rows by passing a conversion - convtools.contrib.tables.Table.filter

list(
    Table.from_rows([(1, -2), (2, -3)], ["a", "b"])
    .filter(c.col("b") < -2)
    .into_iter_rows(dict)
)

8. Join tables

When you join two tables, conflicting columns (except for ones, specified as list of columns, passed as on argument) get suffixed with “_LEFT” and “_RIGHT” for columns of left and right tables correspondingly. Suffixes can be overridden using suffixes option.

There are two ways to join tables:

  1. passing list of column names as on argument, joining rows based on equality

    list(
        Table.from_rows([(1, 2), (2, 3)], ["a", "b"])
        .join(
            Table.from_rows([(1, 3), (2, 4)], ["a", "c"]),
            how="inner",
            on=["a"],
        )
        .into_iter_rows(dict)
    )
    
  2. passing a custom join condition as on argument, where c.LEFT.col("a") references an element in column "a" of the left table and c.RIGHT.col("a") references an element in column "a" of the right table

    list(
        Table.from_rows([(1, 2), (2, 3)], ["a", "b"])
        .join(
            Table.from_rows([(1, 3), (2, 4)], ["a", "c"]),
            how="full",
            on=c.and_(
                c.LEFT.col("a") == c.RIGHT.col("a"),
                c.LEFT.col("b") < c.RIGHT.col("c")
            )
        )
        .into_iter_rows(dict)
    )
    

For more details see:

  1. convtools.contrib.tables.Table.join

9. Chain tables

Table.chain(table: convtools.contrib.tables.Table, fill_value=None) convtools.contrib.tables.Table[source]

Chain tables, putting them one after another.

Let’s assume fill_value is set to ” “:

>>> Table 1      Table 2
>>> | a | b |    | b | c |
>>> | 1 | 2 |    | 3 | 4 |
>>>
>>> table1.chain(table2, fill_value=" ")
>>>
>>> Result:
>>> | a | b | c |
>>> | 1 | 2 |   |
>>> |   | 3 | 4 |
Parameters
  • table (-) – table to be chained

  • fill_value (-) – value to use for filling gaps

10. Zip tables

Table.zip(table: convtools.contrib.tables.Table, fill_value=None) convtools.contrib.tables.Table[source]

Zip tables one to another. Before using this method, make sure you are not looking for convtools.contrib.tables.Table.join

Let’s assume fill_value is set to ” “:

>>> Table 1      Table 2
>>> | a | b |    | b | c |
>>> | 1 | 2 |    | 3 | 4 |
>>>              | 5 | 6 |
>>>
>>> table1.zip(table2, fill_value=" ")
>>>
>>> Result:
>>> | a | b | b | c |
>>> | 1 | 2 | 3 | 4 |
>>> |   |   | 5 | 6 |
Parameters
  • table (-) – table to be chained

  • fill_value (-) – value to use for filling gaps

11. Using inside other conversions

It’s impossible to make Table work directly inside other conversions, because it would introduce ambiguity on which code generating layer is to transform the conversion into code: Table or the parent conversion.

But you most definitely can leverage piping to callables like this:

input_data = [["a", "b"], [1, 2], [3, 4]]
conversion = c.this.pipe(
    lambda it: Table.from_rows(it, header=True).into_iter_rows(dict)
).as_type(list)
conversion.execute(input_data)

12. Explode tables

Table.explode(column_name: str)[source]

Explodes a table to a long format by exploding a column with iterables, e.g.:

>>> | a |   b    |
>>> | 1 | [2, 3] |
>>> | 4 | [5, 6] |
>>>
>>> table.explode("b")
>>>
>>> | a | b |
>>> | 1 | 2 |
>>> | 1 | 3 |
>>> | 2 | 5 |
>>> | 2 | 6 |
Parameters

column_name (-) – column with iterables to be exploded