Group By and Aggregate¶
Warning
Please make sure you've covered Reference / Basics first.
The syntax to define aggregations is as follows:

- c.group_by(key1, key2, ...).aggregate(result) - returns a list of results
- c.aggregate(result) - returns the result

where result is any conversion (dict, c.call_func, whatever) made up of:

- keys - key1, key2, ...
- reducers - e.g. c.ReduceFuncs.Sum(c.item("abc"))
c.group_by¶
from convtools import conversion as c

input_data = [
    {"a": 5, "b": "foo"},
    {"a": 10, "b": "foo"},
    {"a": 10, "b": "bar"},
    {"a": 10, "b": "bar"},
    {"a": 20, "b": "bar"},
]

conv = (
    c.group_by(c.item("b"))
    .aggregate(
        {
            "b": c.item("b"),
            "a_first": c.ReduceFuncs.First(c.item("a"), where=c.item("a") > 5),
            "a_max": c.ReduceFuncs.Max(c.item("a")),
        }
    )
    .gen_converter(debug=True)
)

assert conv(input_data) == [
    {"b": "foo", "a_first": 10, "a_max": 10},
    {"b": "bar", "a_first": 10, "a_max": 20},
]
class AggData_:
    __slots__ = ["v0", "v1"]

    def __init__(self, _none=__none__):
        self.v0 = _none
        self.v1 = _none

def group_by_(_none, data_):
    signature_to_agg_data_ = defaultdict(AggData_)
    for row_ in data_:
        agg_data_ = signature_to_agg_data_[row_["b"]]
        _r0_ = row_["a"]
        if row_["a"] > 5:
            if agg_data_.v0 is _none:
                agg_data_.v0 = row_["a"]
            else:
                pass
        if _r0_ is not None:
            if agg_data_.v1 is _none:
                agg_data_.v1 = _r0_
            else:
                if agg_data_.v1 < _r0_:
                    agg_data_.v1 = _r0_
    return [
        {
            "b": signature_,
            "a_first": (None if agg_data_.v0 is _none else agg_data_.v0),
            "a_max": (None if agg_data_.v1 is _none else agg_data_.v1),
        }
        for signature_, agg_data_ in signature_to_agg_data_.items()
    ]

def converter(data_):
    global __none__
    _none = __none__
    try:
        return group_by_(_none, data_)
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise
c.aggregate¶
from convtools import conversion as c

input_data = [
    {"a": 5, "b": "foo"},
    {"a": 10, "b": "foo"},
    {"a": 10, "b": "bar"},
    {"a": 10, "b": "bar"},
    {"a": 20, "b": "bar"},
]

# "a": list of "a" values where "b" equals "bar"
# "b": "b" value of the row where "a" is max
conv = c.aggregate(
    {
        "a": c.ReduceFuncs.Array(c.item("a"), where=c.item("b") == "bar"),
        "b": c.ReduceFuncs.MaxRow(
            c.item("a"),
        ).item("b", default=None),
    }
).gen_converter(debug=True)

assert conv(input_data) == {"a": [10, 10, 20], "b": "bar"}
def aggregate_(
    _none, data_, *, __get_1_or_default=__naive_values__["__get_1_or_default"]
):
    agg_data__v0 = agg_data__v1 = _none
    checksum_ = 0
    it_ = iter(data_)
    for row_ in it_:
        _r0_ = row_["a"]
        if row_["b"] == "bar":
            if agg_data__v0 is _none:
                checksum_ += 1
                agg_data__v0 = [row_["a"]]
            else:
                agg_data__v0.append(row_["a"])
        if _r0_ is not None:
            if agg_data__v1 is _none:
                checksum_ += 1
                agg_data__v1 = (_r0_, row_)
            else:
                if agg_data__v1[0] < _r0_:
                    agg_data__v1 = (_r0_, row_)
        if checksum_ == 2:
            globals()["__BROKEN_EARLY__"] = True  # DEBUG ONLY
            break
    for row_ in it_:
        _r0_ = row_["a"]
        if row_["b"] == "bar":
            agg_data__v0.append(row_["a"])
        if _r0_ is not None:
            if agg_data__v1[0] < _r0_:
                agg_data__v1 = (_r0_, row_)
    return {
        "a": (None if agg_data__v0 is _none else agg_data__v0),
        "b": __get_1_or_default(
            (None if agg_data__v1 is _none else agg_data__v1[1]), "b", None
        ),
    }

def converter(data_):
    global __none__
    _none = __none__
    try:
        return aggregate_(_none, data_)
    except __exceptions_to_dump_sources:
        __convtools__code_storage.dump_sources()
        raise
c.ReduceFuncs¶
Here is the list of available reducers, like c.ReduceFuncs.Sum, with notes on each:
* Sum - auto-replaces False values with 0; default=0
* SumOrNone - sum or None if at least one None is encountered; default=None
* Max - max not None
* MaxRow - row with max not None
* Min - min not None
* MinRow - row with min not None
* Count
- when 0-args: count of rows
- when 1-args: count of not None values
* CountDistinct - len of resulting set of values
* First - first encountered value
* Last - last encountered value
* Average(value, weight=1) - pass custom weight conversion for weighted average
* Median
* Percentile(percentile, value, interpolation="linear")
c.ReduceFuncs.Percentile(95.0, c.item("x"))
interpolation is one of:
- "linear"
- "lower"
- "higher"
- "midpoint"
- "nearest"
* Mode
* TopK - c.ReduceFuncs.TopK(3, c.item("x"))
* Array
* ArrayDistinct
* ArraySorted
c.ReduceFuncs.ArraySorted(c.item("x"), key=lambda v: v, reverse=True)
DICT REDUCERS ARE IN FACT AGGREGATIONS THEMSELVES, BECAUSE VALUES GET REDUCED:
* Dict
c.ReduceFuncs.Dict(c.item("key"), c.item("x"))
* DictArray - dict values are lists of encountered values
* DictSum - dict values are reduced by Sum
* DictSumOrNone
* DictMax
* DictMin
* DictCount
- when 1-args: dict values are counts of reduced rows
- when 2-args: dict values are counts of not None values
* DictCountDistinct
* DictFirst
* DictLast
AND LASTLY YOU CAN DEFINE YOUR OWN REDUCER BY PASSING ANY REDUCE FUNCTION
OF TWO ARGUMENTS TO ``c.reduce`` (it may be slower because of the extra
function call):
- c.reduce(lambda a, b: a + b, c.item("amount"), initial=0)
Reducers API¶
Every reducer accepts the following keyword arguments:

- where - a condition to filter the input values of a reducer
- default - a value to return in case the reducer hasn't encountered any values
The table below gives the following info on built-in reducers:

- how many positional arguments they accept
- their default values (returned when no rows are reduced)
- whether they support the initial keyword argument
Reducer | 0-args | 1-args | 2-args | default | supports initial
---|---|---|---|---|---
Array | | v | | None | v
ArrayDistinct | | v | | None |
ArraySorted | | v | | None |
Average | | v | | None |
Count | v | v | | 0 | v
CountDistinct | | v | | 0 |
First | | v | | None |
Last | | v | | None |
Max | | v | | None | v
MaxRow | | v | | None |
Median | | v | | None |
Min | | v | | None | v
MinRow | | v | | None |
Mode | | v | | None |
Percentile | | v | | None |
Sum | | v | | 0 | v
SumOrNone | | v | | None | v
TopK | | v | | None |
Dict | | | v | None |
DictArray | | | v | None |
DictCount | | v | v | None |
DictCountDistinct | | | v | None |
DictFirst | | | v | None |
DictLast | | | v | None |
DictMax | | | v | None |
DictMin | | | v | None |
DictSum | | | v | None |
DictSumOrNone | | | v | None |