Addresses

rigour.addresses

This module provides a set of tools for handling postal/geographic addresses. It includes functions for normalising addresses for comparison purposes, and for formatting addresses given in parts for display as a single string.

Postal address formatting

These helpers process real-world addresses: they compose an address from its individual parts and clean up the result.

from rigour.addresses import format_address_line

address = {
    "road": "Bahnhofstr.",
    "house_number": "10",
    "postcode": "86150",
    "city": "Augsburg",
    "state": "Bayern",
    "country": "Germany",
}
address_text = format_address_line(address, country="DE")
Acknowledgements

The address formatting database contained in rigour/data/addresses/formats.yml is derived from worldwide.yml in the OpenCageData address-formatting repository. It is used to format addresses according to the conventions of the country being encoded.

clean_address(full)

Remove common formatting errors from addresses.

Source code in rigour/addresses/cleaning.py
def clean_address(full: str) -> str:
    """Remove common formatting errors from addresses."""
    while True:
        full, count = REPL.subn(_sub_match, full)
        if count == 0:
            break
    return full.strip()
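
The loop above re-applies the substitution until a pass makes no replacement, since one fix can expose another. The following is a minimal sketch of that fixed-point pattern. FIXES, PATTERN and _fix are hypothetical stand-ins, because the real REPL pattern and _sub_match callback are not shown on this page.

```python
import re

# Hypothetical replacement table; rigour's actual rules may differ.
FIXES = {",,": ",", " ,": ",", "  ": " "}
PATTERN = re.compile("|".join(re.escape(key) for key in FIXES))


def _fix(match: re.Match) -> str:
    # Look up the replacement for whichever alternative matched.
    return FIXES[match.group(0)]


def clean_sketch(full: str) -> str:
    # Repeat the substitution until a full pass changes nothing.
    while True:
        full, count = PATTERN.subn(_fix, full)
        if count == 0:
            break
    return full.strip()


print(clean_sketch("Bahnhofstr. 10 ,, 86150  Augsburg"))
# prints: Bahnhofstr. 10, 86150 Augsburg
```

With this table, a single pass turns "10 ,," into "10,,", and only the second pass collapses the doubled comma, which is why the loop runs until the count reaches zero.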

format_address(address, country=None)

Format the given address parts into a multi-line string that matches the conventions of the country of the given address.

Parameters:

address (Dict[str, Optional[str]], required): The address parts to be combined. Common parts include:

    summary: A short description of the address.
    po_box: The PO box/mailbox number.
    street: The street or road name.
    house: The descriptive name of the house.
    house_number: The number of the house on the street.
    postal_code: The postal code or ZIP code.
    city: The city or town name.
    county: The county or district name.
    state: The state or province name.
    state_district: The state or province district name.
    state_code: The state or province code.
    country: The name of the country (words, not ISO code).
    country_code: A pre-normalized country code.

country (Optional[str], default None): ISO code for the country of the address.

Returns:

str: A multi-line string with the formatted address.

Source code in rigour/addresses/format.py
def format_address(
    address: Dict[str, Optional[str]], country: Optional[str] = None
) -> str:
    """Format the given address parts into a multi-line string that matches the
    conventions of the country of the given address.

    Args:
        address: The address parts to be combined. Common parts include:
            summary: A short description of the address.
            po_box: The PO box/mailbox number.
            street: The street or road name.
            house: The descriptive name of the house.
            house_number: The number of the house on the street.
            postal_code: The postal code or ZIP code.
            city: The city or town name.
            county: The county or district name.
            state: The state or province name.
            state_district: The state or province district name.
            state_code: The state or province code.
            country: The name of the country (words, not ISO code).
            country_code: A pre-normalized country code.
        country: ISO code for the country of the address.

    Returns:
        A multi-line string with the formatted address.
    """
    text = _format(address, country=country)
    prev: Optional[str] = None
    while prev != text:
        prev = text
        text = text.replace("\n\n", "\n").replace("\n ", "\n").strip()
    return text
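
The clean-up loop in format_address can be tried in isolation. The sketch below copies that loop into a standalone helper and feeds it a made-up string standing in for the output of _format, which is not shown on this page; the helper name collapse_lines and the sample input are assumptions for illustration.

```python
from typing import Optional


def collapse_lines(text: str) -> str:
    # Same loop as in format_address: drop empty lines and leading
    # spaces after a newline until the text stops changing.
    prev: Optional[str] = None
    while prev != text:
        prev = text
        text = text.replace("\n\n", "\n").replace("\n ", "\n").strip()
    return text


# Hypothetical raw template output with blank lines and stray indentation.
raw = "\n Bahnhofstr. 10\n\n\n 86150 Augsburg\n\nGermany\n"
print(collapse_lines(raw))
# prints:
# Bahnhofstr. 10
# 86150 Augsburg
# Germany
```

A run of three newlines needs two passes to collapse, so the loop compares the text against its previous value rather than replacing once.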

format_address_line(address, country=None)

Format the given address parts into a single-line string that matches the conventions of the country of the given address.

Parameters:

address (Dict[str, Optional[str]], required): The address parts to be combined. Common parts include:

    summary: A short description of the address.
    po_box: The PO box/mailbox number.
    street: The street or road name.
    house: The descriptive name of the house.
    house_number: The number of the house on the street.
    postal_code: The postal code or ZIP code.
    city: The city or town name.
    county: The county or district name.
    state: The state or province name.
    state_district: The state or province district name.
    state_code: The state or province code.
    country: The name of the country (words, not ISO code).
    country_code: A pre-normalized country code.

country (Optional[str], default None): ISO code for the country of the address.

Returns:

str: A single-line string with the formatted address.

Source code in rigour/addresses/format.py
def format_address_line(
    address: Dict[str, Optional[str]], country: Optional[str] = None
) -> str:
    """Format the given address parts into a single-line string that matches the
    conventions of the country of the given address.

    Args:
        address: The address parts to be combined. Common parts include:
            summary: A short description of the address.
            po_box: The PO box/mailbox number.
            street: The street or road name.
            house: The descriptive name of the house.
            house_number: The number of the house on the street.
            postal_code: The postal code or ZIP code.
            city: The city or town name.
            county: The county or district name.
            state: The state or province name.
            state_district: The state or province district name.
            state_code: The state or province code.
            country: The name of the country (words, not ISO code).
            country_code: A pre-normalized country code.
        country: ISO code for the country of the address.

    Returns:
        A single-line string with the formatted address.
    """
    line = ", ".join(_format(address, country=country).split("\n"))
    return clean_address(line)
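
The single-line form is produced by joining the formatted lines with a comma and a space before clean-up. A small sketch of just that joining step follows, using a made-up multi-line string in place of the real _format output.

```python
# Hypothetical multi-line address, standing in for the output of _format.
multi_line = "Bahnhofstr. 10\n86150 Augsburg\nGermany"

# Split on newlines and rejoin with ", " to get one display line.
single_line = ", ".join(multi_line.split("\n"))
print(single_line)
# prints: Bahnhofstr. 10, 86150 Augsburg, Germany
```

In the function itself, the joined line is then passed through clean_address, which handles artifacts such as doubled separators left by empty template lines.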

normalize_address(address, latinize=False, min_length=4)

Build a comparison key from an address.

Casefolds, replaces punctuation/symbols with whitespace, tokenises on Unicode general-category, and rejoins with single-space separators. The output is a flat lowercase token sequence suitable for substring matching, equality keys, or feeding shorten_address_keywords / remove_address_keywords. It is not a display form.

Parameters:

address (str, required): The address to normalise.

latinize (bool, default False): When True, transliterate non-ASCII tokens to ASCII via normality.ascii_text. The default, False, preserves the original script.

min_length (int, default 4): Reject the result as None if it would be shorter than this many characters. Defaults to 4 to filter out single-token noise.

Returns:

Optional[str]: Normalised address, or None when the result is shorter than min_length.

Source code in rigour/addresses/normalize.py
def normalize_address(
    address: str, latinize: bool = False, min_length: int = 4
) -> Optional[str]:
    """Build a comparison key from an address.

    Casefolds, replaces punctuation/symbols with whitespace,
    tokenises on Unicode general-category, and rejoins with
    single-space separators. The output is a flat lowercase token
    sequence suitable for substring matching, equality keys, or
    feeding :func:`shorten_address_keywords` /
    :func:`remove_address_keywords` — **not** a display form.

    Args:
        address: The address to normalise.
        latinize: When `True`, transliterate non-ASCII tokens to
            ASCII via `normality.ascii_text`. Default `False`
            preserves the original script.
        min_length: Reject the result as `None` if it would be
            shorter than this many characters. Defaults to 4 to
            filter out single-token noise.

    Returns:
        Normalised address, or `None` when the result is shorter
        than `min_length`.
    """
    tokens: List[List[str]] = []
    token: List[str] = []
    for char in address.casefold():
        if char in CHARS_ALLOWED:
            chr: Optional[str] = char
        else:
            cat = unicodedata.category(char)
            chr = TOKEN_SEP_CATEGORIES.get(cat, char)
        if chr is None:
            continue
        if chr == WS:
            if len(token):
                tokens.append(token)
            token = []
            continue
        token.append(chr)
    if len(token):
        tokens.append(token)

    parts: List[str] = []
    for token in tokens:
        token_str = "".join(token)
        if latinize:
            token_str = ascii_text(token_str)
        if len(token_str) == 0:
            continue
        parts.append(token_str)
    norm_address = WS.join(parts)
    if len(norm_address) < min_length:
        return None
    return norm_address
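
The tokenising step can be illustrated without the module's lookup tables. The sketch below is a simplified approximation: CHARS_ALLOWED and TOKEN_SEP_CATEGORIES are not shown on this page, so it assumes that every punctuation, symbol, separator and control character (Unicode categories starting with P, S, Z or C) ends a token, and it omits the latinize option.

```python
import unicodedata
from typing import List, Optional

WS = " "


def normalize_sketch(address: str, min_length: int = 4) -> Optional[str]:
    tokens: List[str] = []
    current: List[str] = []
    for char in address.casefold():
        # Assumption: P*, S*, Z* and C* categories act as token boundaries.
        if unicodedata.category(char)[0] in "PSZC":
            if current:
                tokens.append("".join(current))
                current = []
            continue
        current.append(char)
    if current:
        tokens.append("".join(current))
    result = WS.join(tokens)
    # Results that are too short carry little signal, so reject them.
    return result if len(result) >= min_length else None


print(normalize_sketch("Bahnhofstr. 10, 86150 Augsburg"))
# prints: bahnhofstr 10 86150 augsburg
print(normalize_sketch("N/A"))
# prints: None
```

The second call shows the min_length check: "N/A" reduces to "n a", which is three characters long and is therefore rejected.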

remove_address_keywords(address, latinize=False, replacement=WS)

Strip common address keywords from a normalised address.

Removes recognised forms ("street", "road", "south", territory names, ordinals, …) by substituting each match with replacement. Consecutive matches produce consecutive replacement runs — whitespace is not collapsed, so the output may contain multi-space gaps. Use normality.squash_spaces afterwards if a single-space output is wanted.

Input must already be normalised with normalize_address using the same latinize flag, because the alias table is built against that normalised form.

Parameters:

address (str, required): A pre-normalised address string.

latinize (bool, default False): Must match the flag passed to normalize_address.

replacement (str, default WS): String substituted in place of each match. Defaults to a single ASCII space.

Returns:

Optional[str]: The address with recognised keywords removed.

Source code in rigour/addresses/normalize.py
def remove_address_keywords(
    address: str, latinize: bool = False, replacement: str = WS
) -> Optional[str]:
    """Strip common address keywords from a normalised address.

    Removes recognised forms (`"street"`, `"road"`, `"south"`,
    territory names, ordinals, …) by substituting each match with
    `replacement`. Consecutive matches produce consecutive
    `replacement` runs — whitespace is **not** collapsed, so the
    output may contain multi-space gaps. Use
    `normality.squash_spaces` afterwards if a single-space
    output is wanted.

    Input must already be normalised with :func:`normalize_address`
    using the same `latinize` flag — the alias table is built
    against that normalised form.

    Args:
        address: A pre-normalised address string.
        latinize: Must match the flag passed to
            :func:`normalize_address`. Default `False`.
        replacement: String substituted in place of each match.
            Defaults to a single ASCII space.

    Returns:
        The address with recognised keywords removed.
    """
    with resource_lock:
        pattern, _ = _address_replacer(latinize=latinize)
    return pattern.sub(replacement, address)
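
The multi-space gaps mentioned in the docstring are easy to reproduce. The sketch below uses a tiny hypothetical keyword list and a plain word-boundary regex instead of the real alias pattern from _address_replacer, and it collapses whitespace with str.split as a stand-in for normality.squash_spaces.

```python
import re

# Hypothetical keyword list; the real alias table is far larger.
KEYWORDS = ["street", "road", "south"]
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")

# Each match is swapped for one space; the neighbouring spaces remain.
removed = PATTERN.sub(" ", "10 south main street")
print(repr(removed))
# prints: '10   main  '

# Collapse whitespace runs and trim the ends in a separate step.
squashed = " ".join(removed.split())
print(repr(squashed))
# prints: '10 main'
```

Keeping the collapse as a separate step lets callers choose a non-space replacement, such as a marker token, without the function second-guessing the spacing.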

shorten_address_keywords(address, latinize=False)

Shorten common address keywords in a normalised address.

Replaces recognised forms with their canonical short form ("street" → "st", "avenue" → "av", "united arab emirates" → "ae", …). Multi-token forms beat single-token components via longest-form-first ordering in the alias pattern, so country names win over their constituent words.

Input must already be normalised with normalize_address using the same latinize flag, because the alias table is built against that normalised form.

Parameters:

address (str, required): A pre-normalised address string.

latinize (bool, default False): Must match the flag passed to normalize_address.

Returns:

Optional[str]: The address with recognised keywords shortened. Tokens that don't match any alias pass through unchanged.

Source code in rigour/addresses/normalize.py
def shorten_address_keywords(address: str, latinize: bool = False) -> Optional[str]:
    """Shorten common address keywords in a normalised address.

    Replaces recognised forms with their canonical short form
    (`"street"` → `"st"`, `"avenue"` → `"av"`, `"united arab
    emirates"` → `"ae"`, …). Multi-token forms beat single-token
    components via longest-form-first ordering in the alias
    pattern, so country names win over their constituent words.

    Input must already be normalised with :func:`normalize_address`
    using the same `latinize` flag — the alias table is built
    against that normalised form.

    Args:
        address: A pre-normalised address string.
        latinize: Must match the flag passed to
            :func:`normalize_address`. Default `False`.

    Returns:
        The address with recognised keywords shortened. Tokens
        that don't match any alias pass through unchanged.
    """
    pattern, mapping = _address_replacer(latinize=latinize)

    def _sub(match: re.Match[str]) -> str:
        value = match.group(1)
        return mapping.get(value.lower(), value)

    return pattern.sub(_sub, address)
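
The longest-form-first ordering matters whenever one alias is a prefix of another. The sketch below shows the idea with a small hypothetical alias table; the entries and the way the pattern is built are assumptions for illustration, not the contents of _address_replacer.

```python
import re

# Hypothetical aliases; "north" is a prefix of "north carolina".
ALIASES = {
    "north": "n",
    "north carolina": "nc",
    "street": "st",
    "avenue": "av",
}

# Sort longest first so multi-token forms are tried before their parts.
forms = sorted(ALIASES, key=len, reverse=True)
PATTERN = re.compile(r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b")


def shorten_sketch(address: str) -> str:
    # Replace each match with its short form; unknown text is untouched.
    return PATTERN.sub(lambda m: ALIASES.get(m.group(1), m.group(1)), address)


print(shorten_sketch("12 main street raleigh north carolina"))
# prints: 12 main st raleigh nc
print(shorten_sketch("5 north avenue"))
# prints: 5 n av
```

Regex alternation takes the first alternative that matches at a given position, so listing "north" before "north carolina" would yield "n carolina" instead of "nc".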