Text

rigour.text

is_levenshtein_plausible(left, right, max_edits=env.LEVENSHTEIN_MAX_EDITS, max_percent=env.LEVENSHTEIN_MAX_PERCENT, max_length=env.MAX_NAME_LENGTH)

A sanity check to post-filter name matching results based on a budget of allowed Levenshtein distance. This basically cuts off results where the Jaro-Winkler or Metaphone comparison was too lenient.

Parameters:

Name Type Description Default
left str

A string.

required
right str

A string.

required
max_edits Optional[int]

The maximum number of edits allowed.

LEVENSHTEIN_MAX_EDITS
max_percent float

The maximum percentage of edits allowed.

LEVENSHTEIN_MAX_PERCENT

Returns:

Type Description
bool

A boolean.

Source code in rigour/text/distance.py
def is_levenshtein_plausible(
    left: str,
    right: str,
    max_edits: Optional[int] = env.LEVENSHTEIN_MAX_EDITS,
    max_percent: float = env.LEVENSHTEIN_MAX_PERCENT,
    max_length: int = env.MAX_NAME_LENGTH,
) -> bool:
    """A sanity check to post-filter name matching results based on a budget
    of allowed Levenshtein distance. This basically cuts off results where
    the Jaro-Winkler or Metaphone comparison was too lenient.

    Args:
        left: A string.
        right: A string.
        max_edits: The maximum number of edits allowed.
        max_percent: The maximum percentage of edits allowed.

    Returns:
        A boolean.
    """
    left = left[:max_length]
    right = right[:max_length]
    pct_edits = math.ceil(min(len(left), len(right)) * max_percent)
    max_edits_ = min(max_edits, pct_edits) if max_edits is not None else pct_edits
    distance = levenshtein(left, right, max_length, max_edits=max_edits_)
    return distance <= max_edits_
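
The edit budget is easier to follow with concrete numbers. The sketch below reproduces the budget arithmetic in plain Python, next to a textbook Levenshtein routine. The names `edit_budget`, `plain_levenshtein` and `plausible`, and the sample values `max_edits=4` and `max_percent=0.25`, are illustrative assumptions; they are not rigour's defaults or its internal implementation.

```python
import math
from typing import Optional


def edit_budget(left: str, right: str, max_edits: Optional[int], max_percent: float) -> int:
    """Allowed edits: a share of the shorter string, capped by max_edits."""
    pct_edits = math.ceil(min(len(left), len(right)) * max_percent)
    return pct_edits if max_edits is None else min(max_edits, pct_edits)


def plain_levenshtein(left: str, right: str) -> int:
    """Textbook dynamic-programming edit distance (no cutoff, no cache)."""
    previous = list(range(len(right) + 1))
    for i, lch in enumerate(left, start=1):
        current = [i]
        for j, rch in enumerate(right, start=1):
            cost = 0 if lch == rch else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def plausible(left: str, right: str, max_edits: Optional[int] = 4, max_percent: float = 0.25) -> bool:
    """True when the edit distance fits inside the budget (assumed sample defaults)."""
    return plain_levenshtein(left, right) <= edit_budget(left, right, max_edits, max_percent)
```

With these sample values, two 8-character names get a budget of 2 edits, so a single substituted letter passes while an unrelated name of similar length does not.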

is_nullplace(form, *, normalizer=normalize_text, normalize=False)

Check whether form is a nullplace.

Nullplaces are place names that don't refer to a specific location: "overseas", "abroad", "stateless", "international waters", etc. Useful for filtering out records where a country / address slot was populated with a placeholder rather than a real geography.

Parameters:

Name Type Description Default
form str

The string to check.

required
normalizer Normalizer

Normalizer applied to the wordlist at load time, and to form when normalize=True.

normalize_text
normalize bool

When True, run normalizer(form) before the lookup. When False (default), form is assumed to be pre-normalised.

False

Returns:

Type Description
bool

True iff the (possibly normalised) form is in the nullplace list.

Source code in rigour/text/stopwords.py
def is_nullplace(
    form: str, *, normalizer: Normalizer = normalize_text, normalize: bool = False
) -> bool:
    """Check whether `form` is a nullplace.

    Nullplaces are place names that don't refer to a specific
    location: `"overseas"`, `"abroad"`, `"stateless"`,
    `"international waters"`, etc. Useful for filtering out
    records where a country / address slot was populated with a
    placeholder rather than a real geography.

    Args:
        form: The string to check.
        normalizer: Normalizer applied to the wordlist at load
            time, and to `form` when `normalize=True`.
        normalize: When `True`, run `normalizer(form)` before the
            lookup. When `False` (default), `form` is assumed to
            be pre-normalised.

    Returns:
        `True` iff the (possibly normalised) form is in the
        nullplace list.
    """
    norm_form = normalizer(form) if normalize else form
    if norm_form is None:
        return False
    nullplaces = _load_nullplaces(normalizer)
    return norm_form in nullplaces

is_nullword(form, *, normalizer=normalize_text, normalize=False)

Check whether form is a nullword.

Nullwords are tokens that imply a missing value: "none", "not available", "n/a", "unknown", etc. Useful for filtering out records where an alias slot was populated with a placeholder rather than real data.

Parameters:

Name Type Description Default
form str

The token to check.

required
normalizer Normalizer

Normalizer applied to the wordlist at load time, and to form when normalize=True.

normalize_text
normalize bool

When True, run normalizer(form) before the lookup. When False (default), form is assumed to be pre-normalised.

False

Returns:

Type Description
bool

True iff the (possibly normalised) form is in the nullword list.

Source code in rigour/text/stopwords.py
def is_nullword(
    form: str, *, normalizer: Normalizer = normalize_text, normalize: bool = False
) -> bool:
    """Check whether `form` is a nullword.

    Nullwords are tokens that imply a missing value:
    `"none"`, `"not available"`, `"n/a"`, `"unknown"`, etc.
    Useful for filtering out records where an alias slot was
    populated with a placeholder rather than real data.

    Args:
        form: The token to check.
        normalizer: Normalizer applied to the wordlist at load
            time, and to `form` when `normalize=True`.
        normalize: When `True`, run `normalizer(form)` before the
            lookup. When `False` (default), `form` is assumed to
            be pre-normalised.

    Returns:
        `True` iff the (possibly normalised) form is in the
        nullword list.
    """
    norm_form = normalizer(form) if normalize else form
    if norm_form is None:
        return False
    nullwords = _load_nullwords(normalizer)
    return norm_form in nullwords

is_stopword(form, *, normalizer=normalize_text, normalize=False)

Check whether form is a stopword.

Stopwords are common words that carry no identifying signal in name-matching contexts ("the", "and", "of", etc.). Both the wordlist and the runtime input must be normalised with the same normalizer for the membership check to be meaningful.

Parameters:

Name Type Description Default
form str

The token to check.

required
normalizer Normalizer

Normalizer applied to the wordlist at load time, and to form when normalize=True.

normalize_text
normalize bool

When True, run normalizer(form) before the lookup. When False (default), form is assumed to be pre-normalised by the caller.

False

Returns:

Type Description
bool

True iff the (possibly normalised) form is in the stopword list.

Source code in rigour/text/stopwords.py
def is_stopword(
    form: str, *, normalizer: Normalizer = normalize_text, normalize: bool = False
) -> bool:
    """Check whether `form` is a stopword.

    Stopwords are common words that carry no identifying signal
    in name-matching contexts (`"the"`, `"and"`, `"of"`, etc.).
    Both the wordlist and the runtime input must be normalised
    with the same `normalizer` for the membership check to be
    meaningful.

    Args:
        form: The token to check.
        normalizer: Normalizer applied to the wordlist at load
            time, and to `form` when `normalize=True`.
        normalize: When `True`, run `normalizer(form)` before the
            lookup. When `False` (default), `form` is assumed to
            be pre-normalised by the caller.

    Returns:
        `True` iff the (possibly normalised) form is in the
        stopword list.
    """
    norm_form = normalizer(form) if normalize else form
    if norm_form is None:
        return False
    stopwords = _load_stopwords(normalizer)
    return norm_form in stopwords
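
The requirement that the wordlist and the runtime input share a normalizer can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `RAW_STOPWORDS` is a three-word sample rather than rigour's list, `casefold_normalizer` stands in for `normalize_text`, and `load_stopwords` only imitates the idea of a per-normalizer cached wordlist.

```python
from functools import lru_cache
from typing import Callable, FrozenSet, Optional

Normalizer = Callable[[Optional[str]], Optional[str]]

# Hypothetical raw wordlist, stored with mixed case on purpose.
RAW_STOPWORDS = ("The", "And", "Of")


def casefold_normalizer(text: Optional[str]) -> Optional[str]:
    """Stand-in normalizer: strip, casefold, and map empty to None."""
    if text is None:
        return None
    text = text.strip().casefold()
    return text or None


@lru_cache(maxsize=None)
def load_stopwords(normalizer: Normalizer) -> FrozenSet[str]:
    """Normalise the wordlist once per normalizer and cache the result."""
    normalised = (normalizer(word) for word in RAW_STOPWORDS)
    return frozenset(word for word in normalised if word is not None)


def is_stopword_sketch(
    form: str, *, normalizer: Normalizer = casefold_normalizer, normalize: bool = False
) -> bool:
    """Membership check against the normalised wordlist."""
    norm_form = normalizer(form) if normalize else form
    if norm_form is None:
        return False
    return norm_form in load_stopwords(normalizer)
```

A raw `"The"` misses the casefolded wordlist unless the caller either pre-normalises it or passes `normalize=True`, which is the failure mode the description warns about.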

jaro_winkler(left, right, max_length=env.MAX_NAME_LENGTH) cached

Compute the Jaro-Winkler similarity of two strings.

Parameters:

Name Type Description Default
left str

A string.

required
right str

A string.

required

Returns:

Type Description
float

A float between 0.0 and 1.0. Scores of 0.6 or below are returned as 0.0.

Source code in rigour/text/distance.py
@lru_cache(maxsize=MEMO_SMALL)
def jaro_winkler(left: str, right: str, max_length: int = env.MAX_NAME_LENGTH) -> float:
    """Compute the Jaro-Winkler similarity of two strings.

    Args:
        left: A string.
        right: A string.

    Returns:
        A float between 0.0 and 1.0.
    """
    score = raw_jaro_winkler(left[:max_length], right[:max_length])
    return score if score > 0.6 else 0.0

levenshtein(left, right, max_length=env.MAX_NAME_LENGTH, max_edits=None) cached

Compute the Levenshtein distance between two strings.

Parameters:

Name Type Description Default
left str

A string.

required
right str

A string.

required

Returns:

Type Description
int

The number of single-character edits, as an integer.

Source code in rigour/text/distance.py
@lru_cache(maxsize=MEMO_SMALL)
def levenshtein(
    left: str,
    right: str,
    max_length: int = env.MAX_NAME_LENGTH,
    max_edits: Optional[int] = None,
) -> int:
    """Compute the Levenshtein distance between two strings.

    Args:
        left: A string.
        right: A string.

    Returns:
        An integer of changed characters.
    """
    if left == right:
        return 0
    left = left[:max_length]
    right = right[:max_length]
    if max_edits is None:
        return raw_levenshtein(left, right)
    return raw_levenshtein_cutoff(left, right, max_edits)

levenshtein_similarity(left, right, max_edits=env.LEVENSHTEIN_MAX_EDITS, max_percent=env.LEVENSHTEIN_MAX_PERCENT, max_length=env.MAX_NAME_LENGTH)

Compute the Levenshtein similarity of two strings. The similarity is one minus the edit distance divided by the length of the longer string.

Parameters:

Name Type Description Default
left str

A string.

required
right str

A string.

required
max_edits Optional[int]

The maximum number of edits allowed.

LEVENSHTEIN_MAX_EDITS
max_percent float

The maximum fraction of the shortest string that is allowed to be edited.

LEVENSHTEIN_MAX_PERCENT

Returns:

Type Description
float

A float between 0.0 and 1.0.

Source code in rigour/text/distance.py
def levenshtein_similarity(
    left: str,
    right: str,
    max_edits: Optional[int] = env.LEVENSHTEIN_MAX_EDITS,
    max_percent: float = env.LEVENSHTEIN_MAX_PERCENT,
    max_length: int = env.MAX_NAME_LENGTH,
) -> float:
    """Compute the Levenshtein similarity of two strings. The similarity is
    the percentage distance measured against the length of the longest string.

    Args:
        left: A string.
        right: A string.
        max_edits: The maximum number of edits allowed.
        max_percent: The maximum fraction of the shortest string that is allowed to be edited.

    Returns:
        A float between 0.0 and 1.0.
    """
    left_len = len(left)
    right_len = len(right)
    if left_len == 0 or right_len == 0:
        # Do not produce matches for empty strings
        return 0.0

    # Skip results with an overall distance of more than N characters:
    pct_edits = math.ceil(min(left_len, right_len) * max_percent)
    max_edits_ = min(max_edits, pct_edits) if max_edits is not None else pct_edits
    if abs(left_len - right_len) > max_edits_:
        return 0.0

    distance = levenshtein(left, right, max_length=max_length, max_edits=max_edits_)
    if distance > max_edits_:
        return 0.0
    return 1.0 - (float(distance) / max(left_len, right_len))
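
As a worked example of the scoring arithmetic, the sketch below takes the two string lengths and an already-computed edit distance, then applies the same budget and ratio steps. The function name `similarity_from_distance` and the sample values `max_edits=4` and `max_percent=0.25` are assumptions made for illustration, not rigour's configured defaults.

```python
import math
from typing import Optional


def similarity_from_distance(
    left_len: int,
    right_len: int,
    distance: int,
    max_edits: Optional[int] = 4,
    max_percent: float = 0.25,
) -> float:
    """Turn an edit distance into a 0.0-1.0 score under an edit budget."""
    if left_len == 0 or right_len == 0:
        # Empty strings never count as a match.
        return 0.0
    pct_edits = math.ceil(min(left_len, right_len) * max_percent)
    budget = pct_edits if max_edits is None else min(max_edits, pct_edits)
    # A length gap larger than the budget can never fit, nor can a larger distance.
    if abs(left_len - right_len) > budget or distance > budget:
        return 0.0
    return 1.0 - distance / max(left_len, right_len)
```

For two 8-character strings that differ by one edit, the budget is 2 and the score is 1 - 1/8 = 0.875; with three edits the budget is exceeded and the score drops straight to 0.0.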

metaphone(token) cached

Get the metaphone phonetic representation of a token.

Thin Python-level LRU cache over the Rust implementation. The cache pays off in matching workloads where the same name tokens recur across millions of entities — a cache hit skips the FFI crossing entirely.

Source code in rigour/text/phonetics.py
@lru_cache(maxsize=MEMO_LARGE)
def metaphone(token: str) -> str:
    """Get the metaphone phonetic representation of a token.

    Thin Python-level LRU cache over the Rust implementation. The cache pays
    off in matching workloads where the same name tokens recur across millions
    of entities — a cache hit skips the FFI crossing entirely.
    """
    return _metaphone(token)
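
The caching argument can be checked with a stand-alone sketch. `_expensive_key` below is a hypothetical stand-in for a costly native call (it is not a phonetic algorithm), and the call counter only exists to make cache hits visible.

```python
from functools import lru_cache

CALLS = {"count": 0}


def _expensive_key(token: str) -> str:
    """Stand-in for a costly native call; NOT a real phonetic encoding."""
    CALLS["count"] += 1
    return token.upper()


@lru_cache(maxsize=1024)
def cached_key(token: str) -> str:
    """Thin cache wrapper: repeated tokens never reach _expensive_key."""
    return _expensive_key(token)


# Recurring tokens, as in a matching workload with repeated name parts.
for token in ["smith", "jones", "smith", "smith"]:
    cached_key(token)

info = cached_key.cache_info()
```

Four lookups over two distinct tokens reach the underlying function only twice; the other two are served from the cache.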

remove_bracketed_text(text)

Remove any text in brackets. This is meant to handle names of companies which include the jurisdiction, like: Turtle Management (Seychelles) Ltd.

Parameters:

Name Type Description Default
text str

A text including text in brackets.

required

Returns:

Type Description
str

The text with each bracketed segment replaced by whitespace.

Source code in rigour/text/cleaning.py
def remove_bracketed_text(text: str) -> str:
    """Remove any text in brackets. This is meant to handle names of companies
    which include the jurisdiction, like: Turtle Management (Seychelles) Ltd.

    Args:
        text: A text including text in brackets.

    Returns:
        Text where this has been substituted for whitespace.
    """
    return BRACKETED.sub(WS, text)
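
The definitions of `BRACKETED` and `WS` are not shown on this page. A minimal sketch of the same substitution, assuming a pattern that matches one non-nested round or square bracket group and a single-space replacement, might look like this (`BRACKETED_SKETCH` and `WS_SKETCH` are hypothetical names):

```python
import re

# Assumed pattern: one non-nested (...) or [...] group.
BRACKETED_SKETCH = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")
WS_SKETCH = " "


def remove_bracketed_sketch(text: str) -> str:
    """Replace each bracketed segment with a single space."""
    return BRACKETED_SKETCH.sub(WS_SKETCH, text)


cleaned = remove_bracketed_sketch("Turtle Management (Seychelles) Ltd.")
# The substitution leaves a run of spaces behind, so squash it for display.
tidy = " ".join(cleaned.split())
```

Because the bracketed segment becomes whitespace rather than disappearing, callers usually follow up with a whitespace-squashing step, as the last line does.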

remove_emoji(string)

Remove unicode ranges used by emoticons, symbols, flags and other visual codepoints from a piece of text. Primary use case is to remove shit emojis from the names of political office holders coming from Wikidata.

Parameters:

Name Type Description Default
string str

Text that may include emoji and pictographs.

required

Returns:

Type Description
str

Text that doesn't include those.

Source code in rigour/text/cleaning.py
def remove_emoji(string: str) -> str:
    """Remove unicode ranges used by emoticons, symbols, flags and other visual codepoints from
    a piece of text. Primary use case is to remove shit emojis from the names of political office
    holders coming from Wikidata.

    Args:
        string: Text that may include emoji and pictographs.

    Returns:
        Text that doesn't include those.
    """
    return RANGE_PATTERN.sub(r"", string)
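
`RANGE_PATTERN` itself is not reproduced here. The following sketch shows the general technique with a deliberately small character class; the three ranges are an assumption chosen for illustration and cover far less than a production pattern would.

```python
import re

# Hypothetical, intentionally narrow set of ranges.
EMOJI_SKETCH = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flag pairs)
    "]+"
)


def remove_emoji_sketch(string: str) -> str:
    """Delete every codepoint that falls inside the listed ranges."""
    return EMOJI_SKETCH.sub("", string)
```

Matched codepoints are deleted, not replaced with a space, so surrounding whitespace is left exactly as it was.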

soundex(token) cached

Get the soundex phonetic representation of a token.

Thin Python-level LRU cache over the Rust implementation. Same rationale as metaphone above.

Source code in rigour/text/phonetics.py
@lru_cache(maxsize=MEMO_LARGE)
def soundex(token: str) -> str:
    """Get the soundex phonetic representation of a token.

    Thin Python-level LRU cache over the Rust implementation. Same rationale
    as metaphone above.
    """
    return _soundex(token)

text_hash(text)

Generate a hash for the given text, ignoring whitespace and punctuation.

Parameters:

Name Type Description Default
text str

The input text to hash.

required

Returns:

Name Type Description
str str

The SHA-1 hash of the processed text.

Source code in rigour/text/checksum.py
def text_hash(text: str) -> str:
    """Generate a hash for the given text, ignoring whitespace and punctuation.

    Args:
        text (str): The input text to hash.

    Returns:
        str: The SHA-1 hash of the processed text.
    """
    text = normalize("NFKD", remove_unsafe_chars(text.lower()))
    substantial = [c for c in text if c.isalnum()]
    text = "".join(substantial)
    return sha1(text.encode(ENCODING)).hexdigest()
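
To show what "ignoring whitespace and punctuation" means in practice, here is a stand-alone approximation using only the standard library. It assumes `ENCODING` is UTF-8 and leaves out the `remove_unsafe_chars` step, so it is a sketch of the idea and may not reproduce rigour's digests for every input.

```python
from hashlib import sha1
from unicodedata import normalize


def text_hash_sketch(text: str) -> str:
    """Hash only the alphanumeric content of a lowercased, NFKD-decomposed text."""
    text = normalize("NFKD", text.lower())
    # Whitespace, punctuation and decomposed combining marks are not alphanumeric.
    substantial = "".join(c for c in text if c.isalnum())
    return sha1(substantial.encode("utf-8")).hexdigest()
```

Under this sketch `"Hello, World!"` and `"hello world"` hash identically, and so do `"Café"` and `"cafe"`, because NFKD splits the accent into a combining mark that the alphanumeric filter drops.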

rigour.text.normalize

Flag-based text normalisation — the canonical reference for the normalize_flags / cleanup parameters used across rigour.

This module exposes three things:

  • Normalize: a bit-flag set selecting individual normalisation steps (strip, casefold, NFC/NFKC/NFKD, squash spaces, name tokenisation).
  • Cleanup: an enum picking one of two fixed Unicode-category replacement profiles (Strong, Slug), or Noop to skip the step.
  • normalize: the single entry point that runs the composed pipeline on a string.

Two distinct uses of these flags

The same Normalize vocabulary shows up in two places across rigour, with different lifecycles:

  1. Input normalisation. The caller runs normalize(text, flags, cleanup) on a single runtime string and passes the result downstream. This is what this module does directly.
  2. Reference-data normalisation. A lookup/tagger function (e.g. replace_org_types_compare, or the AC tagger inside analyze_names) builds an internal regex/automaton from static YAML data (aliases, stopwords, AC patterns) and uses normalize_flags + cleanup to decide how that static data gets normalised at build time. The caller is expected to normalise its runtime input with the same flags before calling. Functions in this bucket cache one compiled automaton per distinct flag combination.

Pipeline order

Steps run in a fixed order regardless of bit-ordering in the flag value:

1. STRIP                — trim leading/trailing whitespace
2. NFKD / NFKC / NFC    — at most one is meaningful; if multiple
                          are set, Rust applies the first one
                          listed in its dispatch (NFKD)
3. CASEFOLD             — Unicode full casefold (ß → ss, not lowercase)
4. Cleanup              — category_replace, unless Cleanup.Noop
5. SQUASH_SPACES        — collapse whitespace runs, trim ends
6. NAME                 — tokenize via
                          [tokenize_name][rigour.names.tokenize.tokenize_name]
                          and rejoin with a single ASCII space

Transliteration is NOT part of this pipeline. rigour's public transliteration surface is rigour.text.translit — opportunistic, limited to Latin/Cyrillic/Greek/Armenian/Georgian/Hangul. For broader-script lossy romanisation use normality.ascii_text / normality.latinize_text.

Empty output is coalesced to None, matching the contract of the pre-flags Optional[str] normalisers.
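
The fixed ordering and the empty-to-None rule can be sketched in pure Python. This is an approximation under stated assumptions: `Flags` and `normalize_sketch` are hypothetical names, the Cleanup and NAME steps are left out, and the fallback order after NFKD (NFKC, then NFC) is a guess, since only NFKD is named as first in the dispatch.

```python
import unicodedata
from enum import IntFlag
from typing import Optional


class Flags(IntFlag):
    STRIP = 1 << 0
    SQUASH_SPACES = 1 << 1
    CASEFOLD = 1 << 2
    NFC = 1 << 3
    NFKC = 1 << 4
    NFKD = 1 << 5


def normalize_sketch(text: Optional[str], flags: Flags) -> Optional[str]:
    """Apply the selected steps in a fixed order, whatever the bit order."""
    if text is None:
        return None
    if flags & Flags.STRIP:
        text = text.strip()
    # At most one normal form is applied; NFKD wins if several are set.
    for flag, form in ((Flags.NFKD, "NFKD"), (Flags.NFKC, "NFKC"), (Flags.NFC, "NFC")):
        if flags & flag:
            text = unicodedata.normalize(form, text)
            break
    if flags & Flags.CASEFOLD:
        text = text.casefold()
    # The Cleanup (category replacement) step would run here; omitted in this sketch.
    if flags & Flags.SQUASH_SPACES:
        text = " ".join(text.split())
    # The NAME tokenisation step would run last; omitted in this sketch.
    return text or None
```

For instance, `normalize_sketch("  Straße   Berlin ", Flags.CASEFOLD | Flags.SQUASH_SPACES)` yields `"strasse berlin"`, and whitespace-only input with `Flags.STRIP` comes back as `None`.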

Common compositions

The flag sets pinned as defaults across the rigour API:

  • Normalize.CASEFOLD — production default for comparison keys that should preserve whitespace and script.
  • Normalize.CASEFOLD | Normalize.SQUASH_SPACES — adds whitespace collapsing on top. Used when input whitespace is unreliable, and by display-style replacers that need case-insensitive matching with tidied whitespace.
  • Normalize.SQUASH_SPACES — whitespace-tidy without case change. Used by display-form replacers that want to preserve caller case.
  • Normalize.CASEFOLD | Normalize.NAME — casefold and tokenise with rigour.names.tokenize.tokenize_name, yielding a stable space-separated name key for matching.

Implementation note

The actual work runs in Rust via rigour._core._normalize. This module is the idiomatic Python surface — IntFlag for the bit set, IntEnum for the variant, both crossing the FFI boundary as plain ints at ~zero marshalling cost.

Cleanup

Bases: IntEnum

Unicode-category-based cleanup variants.

Cleanup picks one of a small set of fixed category → action tables that drive the pipeline's category_replace step (pipeline step 4). The step rewrites or deletes characters based on their Unicode general category (e.g. punctuation, control, mark, symbol). The tables are deliberately closed: callers compose via the flag set, not by passing ad-hoc category maps.

Attributes:

Name Type Description
Noop

Skip the category_replace step entirely. The default for all rigour functions that expose cleanup.

Strong

Aggressive cleanup — punctuation and symbols become whitespace; controls, formats, and marks are deleted. Use when you want a matching key stripped of all decoration.

Slug

URL-slug-style cleanup. Differs from Strong in two places: preserves modifier letters (Lm) and nonspacing marks (Mn) that Strong deletes, and deletes control characters (Cc) that Strong turns into whitespace. Use for stopword keys and slug generation.

Source code in rigour/text/normalize.py
class Cleanup(IntEnum):
    """Unicode-category-based cleanup variants.

    `Cleanup` picks one of a small set of fixed category → action
    tables that drive the pipeline's `category_replace` step
    (pipeline step 4). The step rewrites or deletes characters based
    on their Unicode general category (e.g. punctuation, control,
    mark, symbol). The tables are deliberately closed — callers
    compose via the flag set, not by passing ad-hoc category maps.

    Attributes:
        Noop: Skip the `category_replace` step entirely. The default
            for all rigour functions that expose `cleanup`.
        Strong: Aggressive cleanup — punctuation and symbols become
            whitespace; controls, formats, and marks are deleted. Use
            when you want a matching key stripped of all decoration.
        Slug: URL-slug-style cleanup. Differs from `Strong` in two
            places: preserves modifier letters (Lm) and nonspacing
            marks (Mn) that `Strong` deletes, and deletes control
            characters (Cc) that `Strong` turns into whitespace. Use
            for stopword keys and slug generation.
    """

    # Values MUST match the tag encoding in rust/src/lib.rs py_normalize().
    Noop = 0
    Strong = 1
    Slug = 2

Normalize

Bases: IntFlag

Bit-flag set selecting individual normalisation steps.

Compose flags with bitwise OR and pass to normalize or to any rigour function that exposes a normalize_flags= parameter (e.g. rigour.names.org_types.replace_org_types_compare). See the module docstring for the fixed pipeline order the flags compose into and for common flag compositions.

Attributes:

Name Type Description
STRIP

Trim leading and trailing whitespace.

SQUASH_SPACES

Collapse runs of whitespace (including newlines, tabs, Unicode whitespace) into single spaces and trim the edges. Runs after the cleanup step, so whitespace introduced by earlier steps (e.g. category replacement) is collapsed as well.

CASEFOLD

Unicode full casefold (e.g. ß → ss). This is not the same as str.lower() — casefold is the correct tool for case-insensitive comparison across Unicode.

NFC

Apply Unicode Normal Form C (canonical composition). Rarely needed on its own; most callers want NFKC or NFKD. Mutually exclusive with NFKC/NFKD.

NFKC

Apply Unicode Normal Form KC (compatibility composition). Folds compatibility variants (e.g. full-width digits → ASCII) while keeping a composed form. Mutually exclusive with NFC/NFKD.

NFKD

Apply Unicode Normal Form KD (compatibility decomposition). Splits composed characters apart. Mutually exclusive with NFC/NFKC.

NAME

Run the string through tokenize_name and rejoin the tokens with a single ASCII space. Runs as the final pipeline step, so it also subsumes whitespace squashing.

Source code in rigour/text/normalize.py
class Normalize(IntFlag):
    """Bit-flag set selecting individual normalisation steps.

    Compose flags with bitwise OR and pass to [normalize][rigour.text.normalize.normalize] or to
    any rigour function that exposes a `normalize_flags=` parameter
    (e.g. :func:`rigour.names.org_types.replace_org_types_compare`).
    See the module docstring for the fixed pipeline order the flags
    compose into and for common flag compositions.

    Attributes:
        STRIP: Trim leading and trailing whitespace.
        SQUASH_SPACES: Collapse runs of whitespace (including newlines,
            tabs, Unicode whitespace) into single spaces and trim the
            edges. Runs after the cleanup step, so whitespace
            introduced by earlier steps (e.g. category replacement)
            is collapsed as well.
        CASEFOLD: Unicode full casefold (e.g. ``ß → ss``). This is
            *not* the same as `str.lower()` — casefold is the correct
            tool for case-insensitive comparison across Unicode.
        NFC: Apply Unicode Normal Form C (canonical composition).
            Rarely needed on its own; most callers want NFKC or NFKD.
            Mutually exclusive with NFKC/NFKD.
        NFKC: Apply Unicode Normal Form KC (compatibility composition).
            Folds compatibility variants (e.g. full-width digits → ASCII)
            while keeping a composed form. Mutually exclusive with
            NFC/NFKD.
        NFKD: Apply Unicode Normal Form KD (compatibility decomposition).
            Splits composed characters apart. Mutually exclusive
            with NFC/NFKC.
        NAME: Run the string through
            [tokenize_name][rigour.names.tokenize.tokenize_name]
            and rejoin the tokens with a single ASCII space. Runs
            as the final pipeline step, so it also subsumes
            whitespace squashing.
    """

    # Bit values MUST match rust/src/text/normalize.rs `bitflags! Normalize`.
    STRIP = 1 << 0
    SQUASH_SPACES = 1 << 1
    CASEFOLD = 1 << 2
    NFC = 1 << 3
    NFKC = 1 << 4
    NFKD = 1 << 5
    NAME = 1 << 6
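
Flag composition relies on standard `IntFlag` behaviour, which the sketch below demonstrates with a cut-down copy of the enum. `NormalizeSketch` is a hypothetical stand-in that reuses three of the bit values listed above; it is not imported from rigour.

```python
from enum import IntFlag


class NormalizeSketch(IntFlag):
    STRIP = 1 << 0
    SQUASH_SPACES = 1 << 1
    CASEFOLD = 1 << 2


# Compose with bitwise OR; the order of the operands does not matter.
key_flags = NormalizeSketch.CASEFOLD | NormalizeSketch.SQUASH_SPACES

# The combined value is a plain int, which is what would cross an FFI boundary.
as_int = int(key_flags)

# The int converts back to the same flag set.
round_trip = NormalizeSketch(as_int)
```

Membership tests such as `NormalizeSketch.CASEFOLD in key_flags` work on the composed value, which is convenient when a caller needs to check what a default flag set includes.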

noop_normalizer(text)

Identity normalizer that strips whitespace and rejects empty.

Default Normalizer for callers whose input is already in the desired shape; only edge whitespace is removed.

Parameters:

Name Type Description Default
text Optional[str]

A string, or None.

required

Returns:

Type Description
Optional[str]

The stripped string, or None for None input or empty / whitespace-only input.

Source code in rigour/text/normalize.py
def noop_normalizer(text: Optional[str]) -> Optional[str]:
    """Identity normalizer that strips whitespace and rejects empty.

    Default :data:`Normalizer` for callers whose input is already
    in the desired shape — only edge whitespace is removed.

    Args:
        text: A string, or `None`.

    Returns:
        The stripped string, or `None` for `None` input or
        empty / whitespace-only input.
    """
    if text is None:
        return None
    text = text.strip()
    if len(text) == 0:
        return None
    return text

normalize(text, flags=Normalize(0), cleanup=Cleanup.Noop)

Apply a composed sequence of text normalisation steps.

The pipeline order and semantics of each flag are described in the module docstring. This function is the canonical entry point; other rigour functions that take normalize_flags= + cleanup= apply the same pipeline to their internal reference data at regex/automaton build time.

Parameters:

Name Type Description Default
text Optional[str]

The text to normalise. If None, the function short-circuits to None without calling into Rust.

required
flags Normalize

Which normalisation steps to apply (see Normalize). Default Normalize(0) runs no steps — the function is effectively a type-safe identity under the default, use explicit flags to do work.

Normalize(0)
cleanup Cleanup

Which category-replacement variant to apply as the fourth pipeline step (see Cleanup). Default Cleanup.Noop skips that step.

Noop

Returns:

Type Description
Optional[str]

The normalised string, or None if text was None or if the pipeline reduced the text to an empty string. Coalescing empty output to None follows the Optional-string contract.

Source code in rigour/text/normalize.py
def normalize(
    text: Optional[str],
    flags: Normalize = Normalize(0),
    cleanup: Cleanup = Cleanup.Noop,
) -> Optional[str]:
    """Apply a composed sequence of text normalisation steps.

    The pipeline order and semantics of each flag are described in the
    module docstring. This function is the canonical entry point;
    other rigour functions that take `normalize_flags=` + `cleanup=`
    apply the same pipeline to their internal reference data at
    regex/automaton build time.

    Args:
        text: The text to normalise. If ``None``, the function
            short-circuits to ``None`` without calling into Rust.
        flags: Which normalisation steps to apply (see `Normalize`).
            Default ``Normalize(0)`` runs no steps — the function is
            effectively a type-safe identity under the default, use
            explicit flags to do work.
        cleanup: Which category-replacement variant to apply as the
            fourth pipeline step (see `Cleanup`). Default `Cleanup.Noop`
            skips that step.

    Returns:
        The normalised string, or ``None`` if `text` was ``None``
        or if the pipeline reduced the text to an empty string.
        The empty-output-to-``None`` coalescence is the
        Optional-string contract.
    """
    if text is None:
        return None
    return _normalize(text, int(flags), int(cleanup))

rigour.text.translit

Opportunistic transliteration: should_ascii + maybe_ascii.

The minimal transliteration surface rigour exposes going forward. Covers only the scripts listed in rigour.text.scripts.LATINIZE_SCRIPTS (Latin, Cyrillic, Greek, Armenian, Georgian, Hangul). Anything outside that set passes through unchanged (default) or becomes empty (drop=True).

For broader-script, lossy transliteration (Han, Arabic, Devanagari, etc.) use normality.ascii_text / normality.latinize_text — rigour deliberately does not try to duplicate that surface.

rigour.text.scripts

can_latinize(word)

Check whether every script in a word is latinizable.

Equivalent to text_scripts(word) <= LATINIZE_SCRIPTS. When True, rigour.text.translit.maybe_ascii will produce an ASCII output; when False, it returns the input unchanged. Characters with no distinguishing script (digits, punctuation, spaces, combining marks) are ignored. Empty input and pure-Common input ("123") return True vacuously.

Parameters:

Name Type Description Default
word str

A string.

required

Returns:

Type Description
bool

True iff every distinguishing script is in LATINIZE_SCRIPTS.

Source code in rigour/text/scripts.py
def can_latinize(word: str) -> bool:
    """Check whether every script in a word is latinizable.

    Equivalent to `text_scripts(word) <= LATINIZE_SCRIPTS`. When
    True, :func:`rigour.text.translit.maybe_ascii` will produce
    an ASCII output; when False, it returns the input unchanged.
    Characters with no distinguishing script (digits, punctuation,
    spaces, combining marks) are ignored. Empty input and
    pure-Common input (`"123"`) return True vacuously.

    Args:
        word: A string.

    Returns:
        True iff every distinguishing script is in
        :data:`LATINIZE_SCRIPTS`.
    """
    return text_scripts(word) <= LATINIZE_SCRIPTS

can_latinize_cp(cp) cached

Check whether a single codepoint can be latinized.

Three-valued: distinguishing-script-bearing codepoints get a True/False; others (digits, punctuation, combining marks) return None because the question doesn't apply to them.

Parameters:

Name Type Description Default
cp int

Codepoint as an integer.

required

Returns:

Type Description
Optional[bool]

True if the codepoint's script is in LATINIZE_SCRIPTS, False if it has a real script outside that set, None for non-alphanumeric or Common/Inherited/Unknown codepoints.

Source code in rigour/text/scripts.py
@lru_cache(maxsize=MEMO_MEDIUM)
def can_latinize_cp(cp: int) -> Optional[bool]:
    """Check whether a single codepoint can be latinized.

    Three-valued: distinguishing-script-bearing codepoints get a
    True/False; others (digits, punctuation, combining marks)
    return `None` because the question doesn't apply to them.

    Args:
        cp: Codepoint as an integer.

    Returns:
        `True` if the codepoint's script is in
        :data:`LATINIZE_SCRIPTS`, `False` if it has a real script
        outside that set, `None` for non-alphanumeric or
        Common/Inherited/Unknown codepoints.
    """
    char = chr(cp)
    if not char.isalnum():
        return None
    script = codepoint_script(cp)
    if script is None or script in ("Common", "Inherited"):
        return None
    return script in LATINIZE_SCRIPTS
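
The three-valued result, and the way the word-level check ignores `None`, can be imitated without ICU data. The sketch below is a crude stand-in: it guesses the script from the first word of the Unicode character name and uses `str.isalpha` to skip digits, neither of which is how rigour resolves scripts. `LATINIZE_SKETCH`, `can_latinize_cp_sketch` and `can_latinize_sketch` are hypothetical names.

```python
import unicodedata
from typing import Optional

# Name prefixes standing in for the latinizable scripts.
LATINIZE_SKETCH = {"LATIN", "CYRILLIC", "GREEK", "ARMENIAN", "GEORGIAN", "HANGUL"}


def can_latinize_cp_sketch(cp: int) -> Optional[bool]:
    """True/False for letters, None when the question does not apply."""
    char = chr(cp)
    if not char.isalpha():
        # Digits, punctuation, spaces and combining marks carry no script signal here.
        return None
    name = unicodedata.name(char, "")
    if not name:
        return None
    # Rough approximation: treat the first word of the name as the script.
    return name.split(" ", 1)[0] in LATINIZE_SKETCH


def can_latinize_sketch(word: str) -> bool:
    """A word passes unless some codepoint is explicitly not latinizable."""
    return all(can_latinize_cp_sketch(ord(c)) is not False for c in word)
```

Only an explicit `False` blocks a word, so digits and spaces in `"Москва 2024"` do not affect the outcome and an empty string passes vacuously.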

codepoint_script(cp) cached

Return the Unicode Script long name for a codepoint.

Faithful exposure of the Unicode Script property via ICU4X. Returns the pseudo-scripts "Common" (digits, punctuation, spaces) and "Inherited" (combining marks) as-is; callers that want them filtered out should use text_scripts, which already does so.

Parameters:

Name Type Description Default
cp int

Codepoint as an integer. Accepts any u32 value including surrogates and unassigned codepoints, so callers can pass ord(c) of any character without a TypeError at the FFI boundary.

required

Returns:

Type Description
Optional[str]

Script long name ("Latin", "Cyrillic", "Han", "Common", "Inherited", …), or None for unassigned or invalid codepoints (including lone surrogates).

Source code in rigour/text/scripts.py
@lru_cache(maxsize=MEMO_MEDIUM)
def codepoint_script(cp: int) -> Optional[str]:
    """Return the Unicode Script long name for a codepoint.

    Faithful exposure of the Unicode Script property via ICU4X.
    Returns the pseudo-scripts `"Common"` (digits, punctuation,
    spaces) and `"Inherited"` (combining marks) as-is — callers
    that want to filter them out should do so explicitly via
    :func:`text_scripts`, which already does.

    Args:
        cp: Codepoint as an integer. Accepts any `u32` value
            including surrogates and unassigned codepoints, so
            callers can pass `ord(c)` of any character without a
            `TypeError` at the FFI boundary.

    Returns:
        Script long name (`"Latin"`, `"Cyrillic"`, `"Han"`,
        `"Common"`, `"Inherited"`, …), or `None` for unassigned
        or invalid codepoints (including lone surrogates).
    """
    return _codepoint_script(cp)
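
Because the pseudo-scripts and `None` come back unfiltered, a caller working with raw per-codepoint labels has to drop them itself. The sketch below assumes the labels were already collected, for example from `codepoint_script(ord(c))` for each character; `real_scripts` and `PSEUDO_SCRIPTS` are hypothetical names used only for illustration.

```python
from typing import Iterable, Optional

# Pseudo-scripts named in the documentation above.
PSEUDO_SCRIPTS = frozenset({"Common", "Inherited"})


def real_scripts(labels: Iterable[Optional[str]]) -> set[str]:
    """Keep only real script names from raw per-codepoint labels.

    Drops None (unassigned or invalid codepoints) and the pseudo-scripts.
    """
    return {
        label
        for label in labels
        if label is not None and label not in PSEUDO_SCRIPTS
    }


labels = ["Latin", "Common", None, "Inherited", "Latin", "Han"]
print(sorted(real_scripts(labels)))
```

This filter looks only at the script label. `text_scripts` additionally restricts itself to letters and numbers, so the two are not interchangeable.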

common_scripts(a, b)

Return the scripts both strings have in common.

Equivalent to text_scripts(a) & text_scripts(b). This is a cheap pruning predicate: two strings that share no real script have no textual bridge unless both are individually latinizable.

The empty-result caveat is the main subtlety. An empty return is ambiguous between "scripts are disjoint" (e.g. Latin vs Han) and "one side has no real scripts" (numeric-only, punctuation-only, empty). The two cases have different matching implications: a numeric-only name like "007" can still match "Agent 007" via shared NUMERIC symbols even though common_scripts is empty. Pruning callers should treat empty-script inputs as wildcards that bypass the script gate, falling through to symbol-overlap or scoring. Callers that need to distinguish the two cases should call text_scripts on each side explicitly.

Parameters:

Name Type Description Default
a str

A string.

required
b str

Another string.

required

Returns:

Type Description
set[str]

Intersection of the two strings' real-script sets.

Source code in rigour/text/scripts.py
def common_scripts(a: str, b: str) -> set[str]:
    """Return the scripts both strings have in common.

    Equivalent to ``text_scripts(a) & text_scripts(b)``. Cheap
    pruning predicate — two strings that share no real script
    have no textual bridge unless both are individually
    latinizable.

    The empty-result caveat is the main subtlety. An empty return
    is ambiguous between *"scripts are disjoint"* (e.g. Latin vs
    Han) and *"one side has no real scripts"* (numeric-only,
    punctuation-only, empty). The two cases have different
    matching implications — a numeric-only name like `"007"` can
    still match `"Agent 007"` via shared `NUMERIC` symbols even
    though `common_scripts` is empty. Pruning callers should
    treat empty-script inputs as wildcards that bypass the script
    gate, falling through to symbol-overlap or scoring. Callers
    that need to distinguish the two cases should call
    :func:`text_scripts` on each side explicitly.

    Args:
        a: A string.
        b: Another string.

    Returns:
        Intersection of the two strings' real-script sets.
    """
    return _common_scripts(a, b)
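
The wildcard rule described above can be sketched as a small gate over two script sets. It assumes each set came from `text_scripts` on one side of the comparison; the function name `passes_script_gate` is hypothetical.

```python
def passes_script_gate(scripts_a: set[str], scripts_b: set[str]) -> bool:
    """Decide whether a candidate pair survives script-based pruning.

    A side with no real scripts (numeric-only, punctuation-only, empty)
    is treated as a wildcard and always passes, so later stages such as
    symbol overlap or scoring can still see the pair.
    """
    if not scripts_a or not scripts_b:
        return True  # wildcard: the script gate does not apply
    return bool(scripts_a & scripts_b)


print(passes_script_gate(set(), {"Latin"}))                    # "007" vs "Agent 007"
print(passes_script_gate({"Latin"}, {"Han"}))                  # disjoint real scripts
print(passes_script_gate({"Latin", "Cyrillic"}, {"Cyrillic"}))  # shared script
```

The gate deliberately ignores latinizability. A caller that wants the "both sides are individually latinizable" bridge would need to check that separately before pruning a disjoint pair.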

is_dense_script(word)

Check whether a word uses any dense (logographic or syllabic) script.

A rough proxy for "doesn't use whitespace to separate name parts". Hangul does use spaces, but it is still grouped here because it encodes syllables rather than individual sounds.

Parameters:

Name Type Description Default
word str

A string.

required

Returns:

Type Description
bool

True iff word contains any character whose script is in DENSE_SCRIPTS.

Source code in rigour/text/scripts.py
def is_dense_script(word: str) -> bool:
    """Check whether a word uses any dense (logographic or
    syllabic) script.

    Rough proxy for "doesn't use whitespace to separate name
    parts" — though Hangul actually does use spaces, it's still
    grouped here because it encodes syllables rather than
    individual sounds.

    Args:
        word: A string.

    Returns:
        True iff `word` contains any character whose script is
        in :data:`DENSE_SCRIPTS`.
    """
    return bool(text_scripts(word) & DENSE_SCRIPTS)
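
The intersection test gives "any" semantics: a single dense-script character is enough. A minimal sketch over precomputed script sets follows. `TOY_DENSE_SCRIPTS` is a hypothetical stand-in; the actual contents of `DENSE_SCRIPTS` are not shown in this section.

```python
# Hypothetical stand-in for DENSE_SCRIPTS; the real set may differ.
TOY_DENSE_SCRIPTS = frozenset({"Han", "Hangul"})


def uses_dense_script(scripts: set[str]) -> bool:
    """True when at least one of the given scripts is considered dense."""
    return bool(scripts & TOY_DENSE_SCRIPTS)


print(uses_dense_script({"Latin", "Han"}))  # mixed input still counts
print(uses_dense_script({"Latin"}))
print(uses_dense_script(set()))             # no real scripts at all
```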

is_latin(word)

Check whether a word is written in the Latin alphabet only.

Parameters:

Name Type Description Default
word str

A string.

required

Returns:

Type Description
bool

True iff the only distinguishing script in word is Latin. Pure-ASCII input short-circuits to True.

Source code in rigour/text/scripts.py
def is_latin(word: str) -> bool:
    """Check whether a word is written in the Latin alphabet only.

    Args:
        word: A string.

    Returns:
        True iff the only distinguishing script in `word` is
        Latin. Pure-ASCII input short-circuits to True.
    """
    if word.isascii():
        return True
    return text_scripts(word) <= {"Latin"}
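
The subset comparison has one consequence worth spelling out: an empty script set is a subset of every set, so non-ASCII input with no real scripts passes vacuously. The sketch below shows this on plain sets, assuming they stand for `text_scripts` results; `only_uses` is a hypothetical helper name.

```python
def only_uses(scripts: set[str], allowed: set[str]) -> bool:
    """True when every script present is in the allowed set."""
    return scripts <= allowed


print(only_uses({"Latin"}, {"Latin"}))              # plain Latin text
print(only_uses({"Latin", "Cyrillic"}, {"Latin"}))  # mixed-script text
print(only_uses(set(), {"Latin"}))                  # no real scripts: vacuously true
```

Callers that need "contains at least one Latin letter" rather than "contains nothing but Latin" should test membership instead of relying on this check.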

is_modern_alphabet(word)

Check whether a word uses only modern alphabets.

Modern alphabets (Latin, Cyrillic, Greek, Armenian, Georgian) are letter-based systems with explicit vowels that transliterate reliably to Latin without language hints. Excludes Hangul (syllabic) and the dense logographic scripts.

Parameters:

Name Type Description Default
word str

A string.

required

Returns:

Type Description
bool

True iff every distinguishing script is in MODERN_ALPHABETS. Pure-ASCII input short-circuits to True without script detection.

Source code in rigour/text/scripts.py
def is_modern_alphabet(word: str) -> bool:
    """Check whether a word uses only modern alphabets.

    Modern alphabets (Latin, Cyrillic, Greek, Armenian, Georgian)
    are letter-based systems with explicit vowels that
    transliterate reliably to Latin without language hints.
    Excludes Hangul (syllabic) and the dense logographic scripts.

    Args:
        word: A string.

    Returns:
        True iff every distinguishing script is in
        :data:`MODERN_ALPHABETS`. Pure-ASCII input short-circuits
        to True without script detection.
    """
    if word.isascii():
        return True
    return text_scripts(word) <= MODERN_ALPHABETS

text_scripts(text)

Return the set of distinct real scripts present in text.

The right primitive for "which writing systems does this string use?". Only letters (General_Category L*) and numbers (N*) contribute; Common, Inherited, and Unknown are excluded — shared characters (digits, punctuation) and combining marks don't count as their own script.

Parameters:

Name Type Description Default
text str

Any string, including empty.

required

Returns:

Type Description
set[str]

Set of script long names. Empty when the input has no letters or numbers belonging to a real script (numeric-only, punctuation-only, empty).

Source code in rigour/text/scripts.py
def text_scripts(text: str) -> set[str]:
    """Return the set of distinct real scripts present in text.

    The right primitive for *"which writing systems does this
    string use?"*. Only letters (General_Category `L*`) and
    numbers (`N*`) contribute; Common, Inherited, and Unknown are
    excluded — shared characters (digits, punctuation) and
    combining marks don't count as their own script.

    Args:
        text: Any string, including empty.

    Returns:
        Set of script long names. Empty when the input has no
        letters or numbers (numeric-only, punctuation-only,
        empty).
    """
    return _text_scripts(text)
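
To make the filtering rule concrete, here is a toy approximation built on the standard library only. It is not how rigour works: the real function reads the Unicode Script property via ICU4X, while this sketch guesses a script from the first word of each character's Unicode name and knows only a handful of scripts. `toy_text_scripts` and `TOY_SCRIPT_BY_NAME_PREFIX` are hypothetical names.

```python
import unicodedata

# Toy stand-in for the Script property: first word of the Unicode name.
# Anything not listed here is treated like Common/Unknown and skipped.
TOY_SCRIPT_BY_NAME_PREFIX = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "CJK": "Han",
    "HANGUL": "Hangul",
}


def toy_text_scripts(text: str) -> set[str]:
    """Approximate the set of real scripts used by letters and numbers."""
    scripts: set[str] = set()
    for char in text:
        # Only letters (L*) and numbers (N*) may contribute a script.
        if unicodedata.category(char)[0] not in ("L", "N"):
            continue
        prefix = unicodedata.name(char, "").split(" ", 1)[0]
        script = TOY_SCRIPT_BY_NAME_PREFIX.get(prefix)
        if script is not None:
            scripts.add(script)
    return scripts


print(sorted(toy_text_scripts("Agent 007")))       # digits add nothing
print(sorted(toy_text_scripts("Владимир Putin")))  # two real scripts
print(sorted(toy_text_scripts("007")))             # no real script
```

ASCII digits pass the category filter but carry no entry in the toy table, which mirrors how shared digits are excluded from the real result.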