Joins¶

Prerequisites

Start with Basics for conversion fundamentals and Collections for shaping or iterating inputs before joining.

Left and right references¶

c.LEFT and c.RIGHT point to the current left and right rows being compared. They behave like normal conversions, so you can use item, attr, call_method, operators, and type conversions on them.

The example below joins two collections from a tuple input. The right-side IDs are strings, so the condition casts them with .as_type(int) before comparing. It also keeps only right-side rows where age >= 18.

from convtools import conversion as c

collection_1 = [
    {"id": 1, "name": "Nick"},
    {"id": 2, "name": "Joash"},
    {"id": 3, "name": "Bob"},
]
collection_2 = [
    {"ID": "3", "age": 17, "country": "GB"},
    {"ID": "2", "age": 21, "country": "US"},
    {"ID": "1", "age": 18, "country": "CA"},
]
input_data = (collection_1, collection_2)

conv = (
    c.join(
        c.item(0),
        c.item(1),
        c.and_(
            c.LEFT.item("id") == c.RIGHT.item("ID").as_type(int),
            c.RIGHT.item("age") >= 18,
        ),
        how="left",
    )
    .pipe(
        c.list_comp(
            {
                "id": c.item(0, "id"),
                "name": c.item(0, "name"),
                "age": c.item(1, "age", default=None),
                "country": c.item(1, "country", default=None),
            }
        )
    )
    .gen_converter(debug=True)
)

assert conv(input_data) == [
    {"id": 1, "name": "Nick", "age": 18, "country": "CA"},
    {"id": 2, "name": "Joash", "age": 21, "country": "US"},
    {"id": 3, "name": "Bob", "age": None, "country": None},
]

Multi-key joins¶

Use c.and_(...) with multiple equality conditions to join on more than one key. Equi-joinable conditions are used to build the right-side hash map.

from convtools import conversion as c


input_data = {
    "orders": [
        {"store": "east", "sku": "A", "qty": 2},
        {"store": "east", "sku": "B", "qty": 1},
        {"store": "west", "sku": "A", "qty": 3},
    ],
    "prices": [
        {"store": "east", "sku": "A", "price": 10},
        {"store": "east", "sku": "B", "price": 8},
        {"store": "west", "sku": "A", "price": 12},
    ],
}

conv = (
    c.join(
        c.item("orders"),
        c.item("prices"),
        c.and_(
            c.LEFT.item("store") == c.RIGHT.item("store"),
            c.LEFT.item("sku") == c.RIGHT.item("sku"),
        ),
    )
    .pipe(
        c.list_comp(
            {
                "store": c.item(0, "store"),
                "sku": c.item(0, "sku"),
                "qty": c.item(0, "qty"),
                "price": c.item(1, "price"),
            }
        )
    )
    .gen_converter()
)

assert conv(input_data) == [
    {"store": "east", "sku": "A", "qty": 2, "price": 10},
    {"store": "east", "sku": "B", "qty": 1, "price": 8},
    {"store": "west", "sku": "A", "qty": 3, "price": 12},
]

Non-equi joins¶

Conditions do not have to be equality checks. For range matching and other non-equi joins, write the condition directly with operators on c.LEFT and c.RIGHT.

from convtools import conversion as c


input_data = {
    "people": [
        {"name": "Nick", "age": 17},
        {"name": "Ann", "age": 22},
    ],
    "age_bands": [
        {"band": "junior", "min_age": 0, "max_age": 17},
        {"band": "adult", "min_age": 18, "max_age": 64},
    ],
}

conv = (
    c.join(
        c.item("people"),
        c.item("age_bands"),
        c.and_(
            c.LEFT.item("age") >= c.RIGHT.item("min_age"),
            c.LEFT.item("age") <= c.RIGHT.item("max_age"),
        ),
    )
    .pipe(
        c.list_comp(
            {
                "name": c.item(0, "name"),
                "age": c.item(0, "age"),
                "band": c.item(1, "band"),
            }
        )
    )
    .gen_converter()
)

assert conv(input_data) == [
    {"name": "Nick", "age": 17, "band": "junior"},
    {"name": "Ann", "age": 22, "band": "adult"},
]

Duplicate matches¶

c.join emits every matching pair. If two left rows and two right rows have the same join key, the result contains four pairs for that key. This is the same many-to-many behavior as a relational join.

from convtools import conversion as c


left = [{"id": 1, "value": "a"}, {"id": 1, "value": "b"}]
right = [{"id": 1, "value": "x"}, {"id": 1, "value": "y"}]

conv = (
    c.join(c.item(0), c.item(1), c.LEFT.item("id") == c.RIGHT.item("id"))
    .as_type(list)
    .gen_converter()
)

assert conv((left, right)) == [
    ({"id": 1, "value": "a"}, {"id": 1, "value": "x"}),
    ({"id": 1, "value": "a"}, {"id": 1, "value": "y"}),
    ({"id": 1, "value": "b"}, {"id": 1, "value": "x"}),
    ({"id": 1, "value": "b"}, {"id": 1, "value": "y"}),
]

Memory and performance¶

When the condition contains equality predicates that compare left expressions to right expressions, c.join uses those predicates as hash keys and stores the right side in memory. Additional one-sided filters can be pushed to the relevant side before matching.

For non-equi conditions, or conditions that cannot be split into hash keys, c.join keeps the right side available and checks candidate pairs in a nested loop. This is flexible, but it is usually more expensive for large inputs.

left, right, and full joins may yield None for the missing side. full joins also track which right rows were already matched so they can emit unmatched right rows at the end.