Group By and Aggregate

Warning

Please make sure you've covered Reference / Basics first.

The syntax to define aggregations is as follows:

  • c.group_by(key1, key2, ...).aggregate(result) returns a list of results
  • c.aggregate(result) returns the result

where result is any conversion (dict, c.call_func, whatever) made up of:

  • keys - key1, key2, ...
  • reducers - e.g. c.ReduceFuncs.Sum(c.item("abc"))

c.group_by

from convtools import conversion as c

input_data = [
    {"a": 5, "b": "foo"},
    {"a": 10, "b": "foo"},
    {"a": 10, "b": "bar"},
    {"a": 10, "b": "bar"},
    {"a": 20, "b": "bar"},
]

conv = (
    c.group_by(c.item("b"))
    .aggregate(
        {
            "b": c.item("b"),
            "a_first": c.ReduceFuncs.First(c.item("a"), where=c.item("a") > 5),
            "a_max": c.ReduceFuncs.Max(c.item("a")),
        }
    )
    .gen_converter(debug=True)
)

assert conv(input_data) == [
    {"b": "foo", "a_first": 10, "a_max": 10},
    {"b": "bar", "a_first": 10, "a_max": 20},
]

Generated code, printed because of debug=True:

class AggData_:
    __slots__ = ["v0", "v1"]

    def __init__(self, _none=__none__):
        self.v0 = _none
        self.v1 = _none

def group_by_(_none, data_):
    signature_to_agg_data_ = defaultdict(AggData_)

    for row_ in data_:
        agg_data_ = signature_to_agg_data_[row_["b"]]
        _r0_ = row_["a"]
        if row_["a"] > 5:
            if agg_data_.v0 is _none:
                agg_data_.v0 = row_["a"]
            else:
                pass
        if _r0_ is not None:
            if agg_data_.v1 is _none:
                agg_data_.v1 = _r0_
            else:
                if agg_data_.v1 < _r0_:
                    agg_data_.v1 = _r0_

    return [
        {"b": signature_, "a_first": ((None if (agg_data_.v0 is _none) else agg_data_.v0)), "a_max": ((None if (agg_data_.v1 is _none) else agg_data_.v1))}
        for signature_, agg_data_ in signature_to_agg_data_.items()
    ]

def converter(data_):
    global __none__
    _none = __none__
    try:
        return group_by_(_none, data_)
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise

c.aggregate

from convtools import conversion as c

input_data = [
    {"a": 5, "b": "foo"},
    {"a": 10, "b": "foo"},
    {"a": 10, "b": "bar"},
    {"a": 10, "b": "bar"},
    {"a": 20, "b": "bar"},
]

# "a": list of "a" values where "b" equals "bar"
# "b": "b" value of the row where "a" is the largest
conv = c.aggregate(
    {
        "a": c.ReduceFuncs.Array(c.item("a"), where=c.item("b") == "bar"),
        "b": c.ReduceFuncs.MaxRow(
            c.item("a"),
        ).item("b", default=None),
    }
).gen_converter(debug=True)

assert conv(input_data) == {"a": [10, 10, 20], "b": "bar"}

Generated code, printed because of debug=True:

def aggregate_(_none, data_, *, __get_1_or_default=__naive_values__["__get_1_or_default"]):
    agg_data__v0 = agg_data__v1 = _none

    checksum_ = 0
    it_ = iter(data_)
    for row_ in it_:
        _r0_ = row_["a"]
        if row_["b"] == "bar":
            if agg_data__v0 is _none:
                checksum_ += 1
                agg_data__v0 = [row_["a"]]
            else:
                agg_data__v0.append(row_["a"])
        if _r0_ is not None:
            if agg_data__v1 is _none:
                checksum_ += 1
                agg_data__v1 = (_r0_, row_)
            else:
                if agg_data__v1[0] < _r0_:
                    agg_data__v1 = (_r0_, row_)
        if checksum_ == 2:
            globals()["__BROKEN_EARLY__"] = True  # DEBUG ONLY
            break
    for row_ in it_:
        _r0_ = row_["a"]
        if row_["b"] == "bar":
            agg_data__v0.append(row_["a"])
        if _r0_ is not None:
            if agg_data__v1[0] < _r0_:
                agg_data__v1 = (_r0_, row_)

    return {
        "a": ((None if (agg_data__v0 is _none) else agg_data__v0)),
        "b": __get_1_or_default(((None if (agg_data__v1 is _none) else agg_data__v1[1])), "b", None),
    }

def converter(data_):
    global __none__
    _none = __none__
    try:
        return aggregate_(_none, data_)
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise

c.ReduceFuncs

Here is the list of available reducers (each is used as c.ReduceFuncs.<name>, e.g. c.ReduceFuncs.Sum), with notes on their behavior and defaults:

* Sum - auto-replaces falsy values (e.g. None) with 0; default=0
* SumOrNone - sum or None if at least one None is encountered; default=None
* Max - max not None
* MaxRow - row with max not None
* Min - min not None
* MinRow - row with min not None
* Count
    - with 0 args: count of rows
    - with 1 arg: count of not None values
* CountDistinct - len of resulting set of values
* First - first encountered value
* Last - last encountered value
* Average(value, weight=1) - pass custom weight conversion for weighted average
* Median
* Percentile(percentile, value, interpolation="linear")
    c.ReduceFuncs.Percentile(95.0, c.item("x"))
    interpolation is one of:
      - "linear"
      - "lower"
      - "higher"
      - "midpoint"
      - "nearest"
* Mode
* TopK - c.ReduceFuncs.TopK(3, c.item("x"))
* Array
* ArrayDistinct
* ArraySorted
    c.ReduceFuncs.ArraySorted(c.item("x"), key=lambda v: v, reverse=True)

Dict reducers are in fact aggregations themselves, because values get reduced:
* Dict
    c.ReduceFuncs.Dict(c.item("key"), c.item("x"))
* DictArray - dict values are lists of encountered values
* DictSum - dict values are reduced by Sum
* DictSumOrNone
* DictMax
* DictMin
* DictCount
    - with 1 arg: dict values are counts of reduced rows
    - with 2 args: dict values are counts of not None values
* DictCountDistinct
* DictFirst
* DictLast

And lastly, you can define your own reducer by passing any reduce function
of two arguments to c.reduce (it may be slower because of the extra
function call):
  - c.reduce(lambda a, b: a + b, c.item("amount"), initial=0)

Reducers API

Every reducer accepts the following keyword arguments:

  • where - a condition to filter input values of a reducer
  • default - a value in case a reducer hasn't encountered any values

The table below gives the following info on builtin reducers:

  • how many positional arguments they can accept
  • what are their default values (returned when no rows are reduced)
  • whether they support the initial keyword argument

Reducer            0-args  1-args  2-args  default  supports initial
-----------------  ------  ------  ------  -------  ----------------
Array                      v               None     v
ArrayDistinct              v               None
ArraySorted                v               None
Average                    v               None
Count              v       v               0        v
CountDistinct              v               0
First                      v               None
Last                       v               None
Max                        v               None     v
MaxRow                     v               None
Median                     v               None
Min                        v               None     v
MinRow                     v               None
Mode                       v               None
Percentile                         v       None
Sum                        v               0        v
SumOrNone                  v               None     v
TopK                               v       None
Dict                               v       None
DictArray                          v       None
DictCount                  v       v       None
DictCountDistinct                  v       None
DictFirst                          v       None
DictLast                           v       None
DictMax                            v       None
DictMin                            v       None
DictSum                            v       None
DictSumOrNone                      v       None