Names

`rigour.names`

Name handling utilities for person and organisation names. This module contains a large (and growing) set of tools for handling names. In general, there are three types of names: people, organizations, and objects. Different normalization may be required for each of these types, including prefix removal for person names (e.g. "Mr." or "Ms.") and type normalization for organization names (e.g. "Incorporated" -> "Inc" or "Limited" -> "Ltd").

The Name class is meant to provide a structure for a name, including its original form, normalized form, metadata on the type of thing described by the name, and the language of the name. The NamePart class is used to represent individual parts of a name, such as the first name, middle name, and last name.

Falsehoods Programmers Believe About Names

`Alignment`

One unit of name-comparison evidence.

Three modes:

- Symbol-paired edge: symbol is Some and both sides carry the same Symbol. Returned by pair_symbols. Default score is 1.0; consumers may override with a category default (e.g. SYM_SCORES[NAME] = 0.9).
- Residue cluster: symbol is None, both sides non-empty. Returned by compare_parts for parts that aligned by edit distance.
- Extra: symbol is None, exactly one side is empty. Represents a part that found no counterpart on the other side; the matcher applies a side-specific weight.

qps / rps / symbol / qstr / rstr are immutable post-construction. score and weight are mutable to support the matcher's policy passes (literal-equality rescue, extras-weight override, family-name boost). Both are stored as Py<PyFloat>, so a Python-side read is an INCREF rather than a fresh allocation per access.

__hash__ and __eq__ key on (symbol, qps, rps) — NamePart already hashes by (index, form) so position is preserved. score and weight are not part of identity.
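The three modes and the identity rule can be illustrated with a small stand-in class. This is only a sketch: `AlignmentSketch` and its `mode()` helper are hypothetical names, parts and symbols are modelled as plain strings, and the real `Alignment` is implemented on the Rust side.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)
class AlignmentSketch:
    """Hypothetical stand-in for Alignment; parts are plain strings here."""

    qps: Tuple[str, ...]
    rps: Tuple[str, ...]
    symbol: Optional[str] = None
    score: float = 1.0   # mutable, not part of identity
    weight: float = 1.0  # mutable, not part of identity

    def _key(self):
        # Identity is (symbol, qps, rps) only.
        return (self.symbol, self.qps, self.rps)

    def __eq__(self, other):
        if not isinstance(other, AlignmentSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

    def mode(self) -> str:
        if self.symbol is not None:
            return "symbol"
        if self.qps and self.rps:
            return "residue"
        return "extra"


edge = AlignmentSketch(("john",), ("jon",), symbol="NAME:example")
residue = AlignmentSketch(("smith",), ("smyth",))
extra = AlignmentSketch(("k",), ())
rescored = AlignmentSketch(("smith",), ("smyth",), score=0.8)

print(edge.mode(), residue.mode(), extra.mode())
print(residue == rescored)  # score differs, identity does not
```

Because score and weight stay out of the key, a policy pass can rescore an alignment without disturbing any set or dict that already holds it.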

`module = 'rigour._core'` `class-attribute`

`qps` `property`

Query-side parts covered by this alignment.

`qstr` `property`

" ".join(p.comparable for p in qps), cached.

`rps` `property`

Result-side parts covered by this alignment.

`rstr` `property`

" ".join(p.comparable for p in rps), cached.

`score` `property`

Similarity in [0, 1]. For symbol-paired edges, defaults to 1.0; consumers override with a category default. For residue clusters, the per-cluster product. For extras, 0.0.

`symbol` `property`

Shared Symbol for symbol-paired edges; None for residue clusters and extras.

`weight` `property`

Aggregation weight in the matcher's weighted average. Defaults to 1.0; consumers override per category (SYM_WEIGHTS), for extras (nm_extra_*_name), for family-name boost (nm_family_name_weight), and for stopword down-weight.
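How score and weight combine is described here only as a "weighted average". The sketch below assumes the standard form, sum(score * weight) / sum(weight). The function name and the sample values are illustrative, not taken from the library.

```python
from typing import Iterable, Tuple


def weighted_average(evidence: Iterable[Tuple[float, float]]) -> float:
    """Aggregate (score, weight) pairs; returns 0.0 when total weight is zero."""
    pairs = list(evidence)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0.0:
        return 0.0
    return sum(score * weight for score, weight in pairs) / total_weight


# Illustrative values only: one symbol-paired edge, one residue cluster,
# and one extra (score 0.0) that has been down-weighted to 0.5.
evidence = [(1.0, 1.0), (0.8, 1.0), (0.0, 0.5)]
result = weighted_average(evidence)
print(round(result, 3))
```

Lowering the weight of the extra pulls the total less far towards zero, which is the lever the extras-weight override provides.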

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__le__(value)` `method descriptor`

Return self<=value.

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`CompareConfig`

Tunable cost / budget / clustering scalars for compare_parts.

Frozen by design: a sweep iteration constructs a fresh CompareConfig, and the matcher caches one per request. Mutability would buy nothing (the values are read once per name pair) and would cost a runtime borrow check on each Rust-side access.

The default values reproduce the constants this struct replaced; compare_parts(qry, res) with no config argument is exactly equivalent to the pre-CompareConfig call.

`module = 'rigour._core'` `class-attribute`

`budget_log_base` `property`

Logarithm base in the per-side cost-budget formula log(max(len - budget_short_floor, 1)) * budget_tolerance, where the logarithm is taken to base budget_log_base. The base controls how aggressively the budget grows with token length: smaller base = faster growth = more permissive on long names.

`budget_short_floor` `property`

Short-token floor: tokens shorter than this contribute zero to the budget, so any non-zero edit fails the cap. This is the fail-closed property: the matcher refuses to fuzzy-match on 1-2 character tokens (vessel hull suffixes, isolated initials, 2-char Chinese given names) where the typo / distinct-entity signal is too weak.

`budget_tolerance` `property`

Multiplier on the per-side cost budget. Lower is stricter (less edit tolerated before a cluster scores zero); higher is more permissive. Callers tune this per scenario — KYC at onboarding runs more permissive than payment screening.
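The three budget scalars can be seen together in a direct transcription of the formula. The function name `cost_budget` and the parameter values below are illustrative assumptions, not the library defaults.

```python
import math


def cost_budget(length: int, log_base: float, short_floor: int, tolerance: float) -> float:
    """log to base `log_base` of max(length - short_floor, 1), times tolerance."""
    return math.log(max(length - short_floor, 1), log_base) * tolerance


# Illustrative parameter values, not the library defaults.
for length in (2, 4, 8, 16):
    budget = cost_budget(length, log_base=2.0, short_floor=2, tolerance=1.0)
    print(length, round(budget, 3))
```

Any length for which length - short_floor is at most 1 yields a budget of zero, so no edit cost fits under the cap. A smaller base or a larger tolerance raises the budget for longer tokens.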

`cluster_overlap_min` `property`

Overlap fraction (matched chars / shorter-side length) above which two parts pair into a cluster. A pair below this threshold surfaces as solo records, because the matched-character evidence is too thin to claim the parts are talking about the same token. The 0.51 default (i.e. "more than half") is the lowest value where a majority of the shorter token agrees.
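The threshold test reduces to one comparison. In this sketch the function name is hypothetical, and treating "above" as a strict inequality is an assumption. The 0.51 default comes from the description above.

```python
def pairs_into_cluster(matched_chars: int, q_len: int, r_len: int,
                       overlap_min: float = 0.51) -> bool:
    """True when matched chars cover enough of the shorter part."""
    shorter = min(q_len, r_len)
    if shorter == 0:
        return False  # nothing to overlap with
    return matched_chars / shorter > overlap_min  # strict ">" is an assumption


print(pairs_into_cluster(3, 5, 7))  # 3 / 5 = 0.6
print(pairs_into_cluster(2, 4, 9))  # 2 / 4 = 0.5
```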

`cost_confusable` `property`

Substitute between a confusable pair from resources/names/compare.yml (0/o, 1/l, …). OCR / transliteration / homoglyph noise — the writer was probably aiming at the same character.

`cost_digit` `property`

Edit involving a digit on either side. Digits identify specific things — vintage years, vessel hull numbers, fund vintages — so a digit mismatch is evidence of a different entity, not a typo.

`cost_sep_drop` `property`

Token boundary lost or gained on one side. Token merge/split (vanderbilt ↔ van der bilt) is a common surface-form variant of the same name; charging it almost nothing keeps the alignment from refusing to bridge whitespace artifacts.
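To show how per-class costs like these shape an alignment, here is a weighted edit distance sketch. Every numeric value, the default cost, the two-entry confusable table, and the choice to check confusables before digits are assumptions made for illustration. The real pairs live in resources/names/compare.yml and the real algorithm may differ.

```python
from typing import Optional

# Hypothetical values and a tiny confusable table, for illustration only.
CONFUSABLE = {frozenset("0o"), frozenset("1l")}
COST_DEFAULT = 1.0
COST_CONFUSABLE = 0.3
COST_DIGIT = 2.0
COST_SEP_DROP = 0.1


def edit_cost(a: Optional[str], b: Optional[str]) -> float:
    """Cost of turning character a into b; None means insert or delete."""
    if a == b:
        return 0.0
    if a is not None and b is not None and frozenset((a, b)) in CONFUSABLE:
        return COST_CONFUSABLE
    if (a is not None and a.isdigit()) or (b is not None and b.isdigit()):
        return COST_DIGIT
    if (a == " " and b is None) or (a is None and b == " "):
        return COST_SEP_DROP
    return COST_DEFAULT


def weighted_distance(q: str, r: str) -> float:
    """Levenshtein-style dynamic programme using edit_cost for each step."""
    prev = [0.0]
    for rc in r:
        prev.append(prev[-1] + edit_cost(None, rc))
    for qc in q:
        cur = [prev[0] + edit_cost(qc, None)]
        for j, rc in enumerate(r, start=1):
            cur.append(min(
                prev[j] + edit_cost(qc, None),     # delete qc
                cur[j - 1] + edit_cost(None, rc),  # insert rc
                prev[j - 1] + edit_cost(qc, rc),   # substitute or match
            ))
        prev = cur
    return prev[-1]


print(weighted_distance("vanderbilt", "van der bilt"))
print(weighted_distance("b0b", "bob"))
print(weighted_distance("fund 1", "fund 2"))
```

With these sample costs the token split is almost free, the homoglyph swap is cheap, and the digit change is the most expensive single edit, which mirrors the ordering the three descriptions argue for.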

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`Name`

A personal, organisational, or object name.

Equality and hashing are over form. A Name's tag can change and its spans can grow without affecting either.
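A minimal model of that identity rule, using a hypothetical `NameSketch` class. The constructor signature and the "UNK" default tag are assumptions for the sketch. The casefold default for form follows the `form` property description.

```python
class NameSketch:
    """Hypothetical stand-in: identity is the normalised form only."""

    def __init__(self, original, form=None, tag="UNK"):
        self.original = original
        self.form = form if form is not None else original.casefold()
        self.tag = tag    # mutable, not part of identity
        self.spans = []   # grows over time, not part of identity

    def __eq__(self, other):
        if not isinstance(other, NameSketch):
            return NotImplemented
        return self.form == other.form

    def __hash__(self):
        return hash(self.form)


a = NameSketch("ACME Ltd")
b = NameSketch("acme ltd", tag="ORG")
print(a == b, len({a, b}))

# Changing the tag later leaves equality and hashing untouched.
a.tag = "ORG"
print(a in {b})
```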

`module = 'rigour._core'` `class-attribute`

`comparable` `property`

Space-joined part.comparable across the parts. Precomputed.

`form` `property`

Normalised form. Defaults to casefold(original) if not supplied at construction.

`norm_form` `property`

Space-joined part.form across the parts. Precomputed.

`original` `property`

Input string, verbatim.

`parts` `property`

Tokens of form, one [NamePart] per token. Exposed as a tuple so it's hashable — downstream code keys on (span.parts, span.symbol.category) etc.

`spans` `property`

Tagger output — grows over the name's lifetime via apply_phrase / apply_part.

`symbols` `property`

Aggregate view of every symbol the tagger has attached to this name. Useful when you want the symbol set regardless of which parts carry them (e.g. indexing the name's semantic annotations into a flat field).

`tag` `property`

What kind of thing the name describes. Mutable — infer_part_tags may upgrade ENT → ORG after tagging.

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__le__(value)` `method descriptor`

Return self<=value.

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`__str__()` `method descriptor`

Return str(self).

`apply_part(part, symbol)` `method descriptor`

Record that a single [NamePart] carries symbol.

The single-part variant of apply_phrase. Used for symbols that inherently apply to one token: INITIAL on a single-character Latin part, NUMERIC inferred from a part like "123456789" that the ordinal tagger didn't cover.

`apply_phrase(phrase, symbol)` `method descriptor`

Record that phrase in this name carries symbol.

The tagger's output path: when the AC automaton reports a recognised phrase (e.g. "limited liability company" → ORG_CLASS:LLP), the match is attached as a [Span] so downstream matching and inference can see which tokens the symbol covers. Every non-overlapping occurrence of phrase in the name gets its own Span.

Idempotent on (phrase, symbol): if any existing Span on this name already carries the same symbol over the same joined-form sequence, the call is a no-op. This keeps the invariant "no duplicate (phrase, symbol) Spans" intact even when more than one tagger fires on the same Name (e.g. the org and person taggers both running on an ENT input).
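The two behaviours described above (one span per non-overlapping occurrence, and no duplicates on repeat calls) can be sketched over plain token lists. The helper names and the `(start, end, symbol)` tuple representation are hypothetical, and using a set to get idempotence is a simplification of the rule stated above.

```python
from typing import List, Set, Tuple


def find_occurrences(parts: List[str], phrase: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of non-overlapping phrase matches."""
    hits = []
    i = 0
    while phrase and i + len(phrase) <= len(parts):
        if parts[i:i + len(phrase)] == phrase:
            hits.append((i, i + len(phrase)))
            i += len(phrase)  # skip past the match so occurrences never overlap
        else:
            i += 1
    return hits


def apply_phrase_sketch(spans: Set[Tuple[int, int, str]], parts: List[str],
                        phrase: str, symbol: str) -> None:
    """Attach one (start, end, symbol) record per occurrence of phrase."""
    for start, end in find_occurrences(parts, phrase.split()):
        spans.add((start, end, symbol))  # set membership makes repeats a no-op


parts = "acme limited liability company".split()
spans: Set[Tuple[int, int, str]] = set()
apply_phrase_sketch(spans, parts, "limited liability company", "ORG_CLASS:example")
apply_phrase_sketch(spans, parts, "limited liability company", "ORG_CLASS:example")
print(spans)
```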

`consolidate_names(names)` `builtin`

Drop short names that are contained in longer names.

Useful when building a matcher to prevent a scenario where a short version of a name ("John Smith") is matched against a query "John K Smith" — where the longer candidate version would have correctly disqualified the match ("John K Smith" != "John R Smith"). Keeping only the longer form forces the matcher to reckon with the full evidence.

Containment uses [Name::contains]; see there for the PER-aware subset rule. Accepts any Python iterable of Name; returns a new set.
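A rough model of the consolidation step, working on plain strings rather than Name objects. The function names are hypothetical, and the token-subset test is a simplified stand-in for the containment rule documented under `contains`.

```python
from typing import Iterable, List, Set


def contains_tokens(longer: str, shorter: str) -> bool:
    """Simplified containment: each token of `shorter` is used up in `longer`."""
    remaining: List[str] = longer.split()
    wanted = shorter.split()
    if len(remaining) <= len(wanted):
        return False  # only a strictly longer name can supersede
    for token in wanted:
        if token not in remaining:
            return False
        remaining.remove(token)
    return True


def consolidate(names: Iterable[str]) -> Set[str]:
    """Drop every name that some other name in the input contains."""
    unique = set(names)
    return {
        name for name in unique
        if not any(contains_tokens(other, name) for other in unique)
    }


print(consolidate(["john smith", "john k smith", "john r smith"]))
```

The short form disappears, so a query for "john k smith" is scored only against the two full forms and cannot match through the ambiguous short one.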

`contains(other)` `method descriptor`

True iff this name structurally contains other.

Used by matcher pipelines to detect when one name's evidence is a subset of another's — e.g. "John Smith" is contained in "John K Smith", and the longer form supersedes the shorter when consolidating candidate names before scoring (see [Name::consolidate_names]). Also backs middle-initial matching: "John Smith" contains "J. Smith" when the J carries an INITIAL symbol that self also has.

Rule: for PER names, every part of other must have a (not-necessarily-adjacent) comparable-equal counterpart in self. For non-PER names, or when the PER rule doesn't find a full subset, falls back to substring containment of norm_form. Returns False when self.tag == UNK or when the two names are equal.
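The order of checks in the rule can be sketched as follows. Parts are modelled as lists of comparable strings and tags as plain strings. Applying the PER branch whenever self is tagged PER, and consuming each matched part once, are both assumptions. The INITIAL-symbol case is left out.

```python
from typing import List


def contains_sketch(tag: str, self_parts: List[str], other_parts: List[str]) -> bool:
    """Hypothetical model of the containment rule for a name tagged `tag`."""
    if tag == "UNK" or self_parts == other_parts:
        return False
    if tag == "PER":
        remaining = list(self_parts)
        for part in other_parts:
            if part not in remaining:
                break  # no counterpart: fall through to the substring check
            remaining.remove(part)
        else:
            return True  # every part of other found a counterpart
    # Fallback: substring containment of the space-joined forms.
    return " ".join(other_parts) in " ".join(self_parts)


print(contains_sketch("PER", ["john", "k", "smith"], ["john", "smith"]))
print(contains_sketch("ORG", ["acme", "holdings", "ltd"], ["holdings", "ltd"]))
print(contains_sketch("ORG", ["john", "k", "smith"], ["john", "smith"]))
```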

`tag_text(text, tag, max_matches=1)` `method descriptor`

Tag the parts that spell out text with the given tag.

Used when external metadata tells the caller the structural role of a subset of the name's tokens. For example, an FTM firstName property of "Jean Claude" on a name "Jean Claude Juncker" marks both the jean and claude parts as GIVEN; a lastName of "Juncker" then marks the remaining part as FAMILY.

Walks self.parts looking for a contiguous (adjacency-insensitive) match of the tokenised text. On a hit, each matched part's tag is set to tag; parts that already carry a tag that conflicts under [NamePartTag::can_match] demote to AMBIGUOUS instead. Stops after max_matches successful matches.

`NamePart`

A single tagged component of a Name.

Equality and hashing are over (index, form) — the immutable identity of the part. tag can be re-written after construction without invalidating either.

`module = 'rigour._core'` `class-attribute`

`ascii` `property`

ASCII-ified form of form for admitted-script parts; None when the part is outside the admitted scripts or reduces to empty after stripping non-alphanumerics.

`comparable` `property`

Best-effort matchable form: integer string for numerics, form for non-latinize parts, ascii otherwise.
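The selection order reads naturally as a three-way branch. This sketch takes the other property values as plain arguments. The function name is hypothetical, and falling back to form when integer or ascii is None is an assumption.

```python
from typing import Optional


def comparable_sketch(form: str, numeric: bool, integer: Optional[int],
                      latinize: bool, ascii_form: Optional[str]) -> str:
    """Pick the matchable form from the other property values."""
    if numeric and integer is not None:
        return str(integer)   # integer string for numerics
    if not latinize or ascii_form is None:
        return form           # keep the token text as-is
    return ascii_form         # ASCII-ified form otherwise


# Argument values are illustrative inputs, not output of the real tokeniser.
print(comparable_sketch("007", True, 7, False, None))
print(comparable_sketch("山田", False, None, False, None))
print(comparable_sketch("müller", False, None, True, "muller"))
```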

`form` `property`

Token text, as tokenised from the parent name's form.

`index` `property`

Position of this part within the parent name's parts list.

`integer` `property`

Parsed integer value for numeric parts, or None when the part isn't numeric or doesn't fit an i64.

`latinize` `property`

True if form is in an admitted-script set (Latin, Cyrillic, Greek, Armenian, Georgian, Hangul) and thus can be meaningfully ASCII-ified.

`metaphone` `property`

Metaphone phonetic key, or None when phonetics were disabled at construction or the part doesn't qualify (non-latinize, numeric, or shorter than three characters).

`numeric` `property`

True if form is entirely numeric characters.

`tag` `property`

Structural role of this part. Set by the tagging pipeline; UNSET at construction.

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__le__(value)` `method descriptor`

Return self<=value.

`__len__()` `method descriptor`

Return len(self).

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`tag_sort(parts)` `builtin`

Sort name parts into canonical display order.

Used when rendering a name back out for humans: honorifics come first, then given names, middle, family, suffixes, legal forms, and stopwords, independent of the input word order. A tokeniser might hand the parts over as "Guttenberg zu Karl-Theodor" (order from the source data); tag_sort restores the "Karl-Theodor zu Guttenberg" shape once the parts have been tagged. The sort is stable across parts with the same tag; see NAME_TAGS_ORDER (crate::names::tag in the Rust source) for the full ordering.
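A stable sort keyed on tag rank is enough to model this. The `ORDER` list below contains only the tags named in the prose, written as plain strings. Parts are `(form, tag)` tuples, and sending unlisted tags to the end is an assumption.

```python
from typing import List, Tuple

# Order taken from the description above; the full list is NAME_TAGS_ORDER.
ORDER = ["HONORIFIC", "GIVEN", "MIDDLE", "FAMILY", "SUFFIX", "LEGAL", "STOP"]


def tag_sort_sketch(parts: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Stable sort of (form, tag) pairs; unlisted tags go last (assumption)."""
    rank = {tag: i for i, tag in enumerate(ORDER)}
    return sorted(parts, key=lambda part: rank.get(part[1], len(ORDER)))


parts = [
    ("smith", "FAMILY"),
    ("jr", "SUFFIX"),
    ("anna", "GIVEN"),
    ("maria", "GIVEN"),
    ("dr", "HONORIFIC"),
]
print(" ".join(form for form, _ in tag_sort_sketch(parts)))
```

Python's `sorted` is stable, so "anna" stays ahead of "maria" because both carry the same tag.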

`NamePartTag`

The structural role of a part within a name. A newly-constructed NamePart starts as UNSET; the tagging pipeline promotes it based on external hints (firstName, lastName, …) or pattern matches (numeric, stopword, legal form).

`AMBIGUOUS = NamePartTag.AMBIGUOUS` `class-attribute`

`FAMILY = NamePartTag.FAMILY` `class-attribute`

`GIVEN = NamePartTag.GIVEN` `class-attribute`

`HONORIFIC = NamePartTag.HONORIFIC` `class-attribute`

`LEGAL = NamePartTag.LEGAL` `class-attribute`

`MATRONYMIC = NamePartTag.MATRONYMIC` `class-attribute`

`MIDDLE = NamePartTag.MIDDLE` `class-attribute`

`NICK = NamePartTag.NICK` `class-attribute`

`NUM = NamePartTag.NUM` `class-attribute`

`PATRONYMIC = NamePartTag.PATRONYMIC` `class-attribute`

`STOP = NamePartTag.STOP` `class-attribute`

`SUFFIX = NamePartTag.SUFFIX` `class-attribute`

`TITLE = NamePartTag.TITLE` `class-attribute`

`TRIBAL = NamePartTag.TRIBAL` `class-attribute`

`UNSET = NamePartTag.UNSET` `class-attribute`

`module = 'rigour._core'` `class-attribute`

`value` `property`

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__int__()` `method descriptor`

int(self)

`__le__(value)` `method descriptor`

Return self<=value.

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__repr__()` `method descriptor`

Return repr(self).

`NameTypeTag`

What kind of thing a name describes. Drives which pipeline passes apply when a Name is analysed.

`ENT = NameTypeTag.ENT` `class-attribute`

`OBJ = NameTypeTag.OBJ` `class-attribute`

`ORG = NameTypeTag.ORG` `class-attribute`

`PER = NameTypeTag.PER` `class-attribute`

`UNK = NameTypeTag.UNK` `class-attribute`

`module = 'rigour._core'` `class-attribute`

`value` `property`

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__int__()` `method descriptor`

int(self)

`__le__(value)` `method descriptor`

Return self<=value.

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__repr__()` `method descriptor`

Return repr(self).

`Span`

A contiguous group of NameParts annotated with a Symbol: the tagger's output unit.

`module = 'rigour._core'` `class-attribute`

`comparable` `property`

Space-joined part.comparable over the covered parts, for use in matcher-side substring checks.

`parts` `property`

The NameParts covered by this span. These are the same NamePart objects that live in the parent Name's .parts, so the identity check span.parts[0] is name.parts[i] evaluates to True from Python. Exposed as a tuple, which is hashable, so downstream code can key on (span.parts, span.symbol.category) when deduplicating pairings.

`symbol` `property`

The symbol this span carries.

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__le__(value)` `method descriptor`

Return self<=value.

`__len__()` `method descriptor`

Return len(self).

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`Symbol`

A semantic interpretation applied to one or more parts of a name.

Carries a [SymbolCategory] and an id. Tagger pipelines emit Symbols during name analysis; matchers compare them between names as a coarse compatibility signal, and indexers flatten them into searchable fields. Equality and hashing are structural over (category, id).

Ids are always str — integer-sourced ids (Wikidata QIDs, ordinals, initial codepoints) are decimal-stringified at construction. Distinct Symbols with equal ids share one [Arc<str>] heap allocation via [intern].
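The structural identity and the id stringification can be modelled with a frozen dataclass. `SymbolSketch` is a hypothetical stand-in with the category as a plain string. `sys.intern` is used here only as a Python-side analogue of the shared allocation, not as a description of the Rust implementation.

```python
import sys
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SymbolSketch:
    """Hypothetical stand-in: equality and hashing over (category, id)."""

    category: str
    id: Union[str, int]

    def __post_init__(self):
        # Decimal-stringify integer ids and share one string object per id.
        object.__setattr__(self, "id", sys.intern(str(self.id)))


a = SymbolSketch("NUMERIC", 5)
b = SymbolSketch("NUMERIC", "5")
c = SymbolSketch("NAME", "5")

print(a == b, a.id is b.id)
print(a == c)
```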

`module = 'rigour._core'` `class-attribute`

`category` `property`

`id` `property`

The interned id string. Always str on the Python side — ids originally passed as int return as their decimal form.

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__init__(category, id)`

id is decimal-stringified if passed as int.

`__le__(value)` `method descriptor`

Return self<=value.

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`__str__()` `method descriptor`

Return str(self).

`SymbolCategory`

The kind of semantic annotation a [Symbol] carries. Drives how strongly a symbol match counts during scoring — an ORG_CLASS match is a strong corporate-form signal, an INITIAL match is weak evidence that needs token-level corroboration.

`DOMAIN = SymbolCategory.DOMAIN` `class-attribute`

`INITIAL = SymbolCategory.INITIAL` `class-attribute`

`LOCATION = SymbolCategory.LOCATION` `class-attribute`

`NAME = SymbolCategory.NAME` `class-attribute`

`NICK = SymbolCategory.NICK` `class-attribute`

`NUMERIC = SymbolCategory.NUMERIC` `class-attribute`

`ORG_CLASS = SymbolCategory.ORG_CLASS` `class-attribute`

`PHONETIC = SymbolCategory.PHONETIC` `class-attribute`

`SYMBOL = SymbolCategory.SYMBOL` `class-attribute`

`__module__ = 'rigour._core'` `class-attribute`

`value` `property`

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__int__()` `method descriptor`

int(self)

`__le__(value)` `method descriptor`

Return self<=value.

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__repr__()` `method descriptor`

Return repr(self).

`align_person_name_order(left, right)` `builtin`

Greedy-align two lists of name parts so comparable tokens share the same output index.

Used by the name matcher to reorder remaining tokens after symbolic tagging so a downstream per-index similarity pass compares like with like. Pairs are chosen by a length-desc, left-major walk over edit-similarity scores; ties are broken stably by input order so the output is deterministic.

Returns ([], tag_sort(right)) when left is empty, falls back to (tag_sort(left), tag_sort(right)) when no pair scores above the similarity floor, otherwise returns the greedy-aligned pair.
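One way to picture the greedy walk is the sketch below, which works on plain strings rather than NamePart objects. The `difflib` similarity, the `FLOOR` value of 0.5 and the use of `sorted` in place of `tag_sort` are all assumptions made for illustration; the library's own scoring and thresholds may differ.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

FLOOR = 0.5  # hypothetical similarity floor, not the library's value


def similarity(a: str, b: str) -> float:
    # Stand-in edit-similarity score in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def greedy_align(left: List[str], right: List[str]) -> Tuple[List[str], List[str]]:
    if not left:
        return [], sorted(right)
    # Visit left tokens longest-first; sorted() is stable, so ties keep input order.
    order = sorted(range(len(left)), key=lambda i: -len(left[i]))
    free = list(range(len(right)))  # right indices not yet paired, in input order
    pairs: List[Tuple[int, int]] = []
    unmatched_left: List[int] = []
    for i in order:
        best_j, best = None, FLOOR
        for j in free:
            score = similarity(left[i], right[j])
            if score > best:  # strict comparison keeps the earliest right token on ties
                best_j, best = j, score
        if best_j is None:
            unmatched_left.append(i)
        else:
            free.remove(best_j)
            pairs.append((i, best_j))
    if not pairs:
        # Nothing cleared the floor: fall back to a plain sort of each side.
        return sorted(left), sorted(right)
    out_left = [left[i] for i, _ in pairs] + [left[i] for i in unmatched_left]
    out_right = [right[j] for _, j in pairs] + [right[j] for j in free]
    return out_left, out_right


print(greedy_align(["smith", "john"], ["john", "smith"]))
# (['smith', 'john'], ['smith', 'john'])
```

Paired tokens come first and share an output index; leftovers from either side trail behind, so a per-index comparison afterwards sees like against like.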

`analyze_names(type_tag, names, part_tags=None, *, infer_initials=False, symbols=True, phonetics=True, numerics=True, consolidate=True, rewrite=True)`

Build a set of tagged Name objects from raw strings.

Parameters:

Name	Type	Description	Default
`type_tag`	`NameTypeTag`	The NameTypeTag for every name in this batch. Drives which prefix/org-type/tagger passes run: `PER` → person prefix strip + person tagger; `ORG`/`ENT` → org-type replacement + org prefix strip + org tagger; `OBJ` → object prefix strip ("M/V", "SS", …) but no tagger; `UNK` → no rewrites or tagging, just construction.	required
`names`	`Sequence[str]`	Raw name strings as harvested from the source entity. Empty strings and inputs that normalise to empty are dropped. Duplicates (after prenormalisation) are de-duplicated.	required
`part_tags`	`Optional[Mapping[NamePartTag, Sequence[str]]]`	Pre-classified part annotations, typically produced by an adapter that reads structured name-part properties off the source entity (e.g. firstName → `GIVEN`, lastName → `FAMILY`). Each value is applied to every constructed `Name` via `Name.tag_text`. Values can be multi-token strings — see the module docstring. Defaults to an empty mapping.	`None`
`infer_initials`	`bool`	When `True`, every single-character latin name part is tagged with an `INITIAL` symbol — useful on a free-text query side where `"J Smith"` arrives without a label on `"J"`. When `False` (default), only parts already tagged as `GIVEN` / `MIDDLE` pick up `INITIAL` symbols. Default `False` because initials are a query-side concept; the indexer and the candidate side of a matcher pass `False`, so the leaner default suits the common call. Ignored for non-person names. No-op when `symbols=False`.	`False`
`symbols`	`bool`	Master switch for symbol emission. When `True` (default), the INITIAL preamble, the AC tagger's match-and-apply pass, and NUMERIC-symbol emission all run. When `False`, no symbols are attached to the returned names — `name.symbols` is empty and `name.spans` stays empty. NamePartTag labelling (including the `NUM` / `STOP` / `LEGAL` promotions in the inference pass) still fires, and `part_tags` values are still applied via `Name.tag_text`. Useful for callers that only need tokens + part tags and don't match on symbol overlap; skipping the AC tagger is the main performance saving.	`True`
`phonetics`	`bool`	When `True` (default), each `NamePart.metaphone` is populated at construction; when `False`, the field stays `None` and the phonetics crate isn't called. Consumers that feed `part.metaphone` into downstream fields (e.g. yente's `name_phonemes` ES field) keep the default; callers that never read the property can save the per-part metaphone call.	`True`
`numerics`	`bool`	When `True` (default), numeric-looking name parts that the AC tagger's ordinal list didn't cover get a `Symbol(NUMERIC, int_value)` applied. When `False`, parts still get `NamePartTag.NUM` (cheap structural info) but no NUMERIC symbol is emitted. Callers that don't use numeric-symbol overlap for scoring can save the symbol allocation.	`True`
`consolidate`	`bool`	When `True` (default), the returned set has Name.consolidate_names applied — short names that are substrings of longer names in the same set are dropped. Indexers should pass `consolidate=False` to preserve partial-name recall (e.g. letting `"John Smith"` match `"John K Smith"` from the other side).	`True`
`rewrite`	`bool`	When `True` (default), the pre-tagger canonicalisation stages run: honorific-prefix removal for PER names (`Mr.`, `Dr.`, `Sir`), and for ORG/ENT names both article-prefix removal (`The`) and org-type compare-form rewriting (`Inc.`→`LLC`, `GmbH`→`JSC`, …). Pass `False` to keep the literal input form — the tagger still fires on the raw tokens because its alias set covers both original and canonical forms. Useful for debugging the tagger in isolation and for callers that want to display or index a name without the canonical substitutions.	`True`

Returns:

Type	Description
`Set[Name]`	A set of tagged `Name` objects, de-duplicated by normalised form. Empty if every input normalised to an empty string.

Source code in rigour/names/analyze.py

def analyze_names(
    type_tag: NameTypeTag,
    names: Sequence[str],
    part_tags: Optional[Mapping[NamePartTag, Sequence[str]]] = None,
    *,
    infer_initials: bool = False,
    symbols: bool = True,
    phonetics: bool = True,
    numerics: bool = True,
    consolidate: bool = True,
    rewrite: bool = True,
) -> Set[Name]:
    """Build a set of tagged [Name][rigour.names.Name] objects from raw strings.

    Args:
        type_tag: The [NameTypeTag][rigour.names.NameTypeTag] for
            every name in this batch. Drives which prefix/org-type/
            tagger passes run: `PER` → person prefix strip + person
            tagger; `ORG`/`ENT` → org-type replacement + org prefix
            strip + org tagger; `OBJ` → object prefix strip ("M/V",
            "SS", …) but no tagger; `UNK` → no rewrites or tagging,
            just construction.
        names: Raw name strings as harvested from the source entity.
            Empty strings and inputs that normalise to empty are
            dropped. Duplicates (after prenormalisation) are de-duplicated.
        part_tags: Pre-classified part annotations, typically produced
            by an adapter that reads structured name-part properties
            off the source entity (e.g. firstName → `GIVEN`,
            lastName → `FAMILY`). Each value is applied to every
            constructed `Name` via `Name.tag_text`. Values can be
            multi-token strings — see the module docstring. Defaults
            to an empty mapping.
        infer_initials: When `True`, every single-character latin name
            part is tagged with an `INITIAL` symbol — useful on a
            free-text query side where `"J Smith"` arrives without
            a label on `"J"`. When `False` (default), only parts
            already tagged as `GIVEN` / `MIDDLE` pick up `INITIAL`
            symbols. Default `False` because initials are a
            query-side concept; the indexer and the candidate side
            of a matcher pass `False`, so the leaner default suits
            the common call. Ignored for non-person names. No-op
            when `symbols=False`.
        symbols: Master switch for symbol emission. When `True`
            (default), the INITIAL preamble, the AC tagger's
            match-and-apply pass, and NUMERIC-symbol emission all
            run. When `False`, no symbols are attached to the
            returned names — `name.symbols` is empty and
            `name.spans` stays empty. NamePartTag labelling
            (including the `NUM` / `STOP` / `LEGAL` promotions in
            the inference pass) still fires, and `part_tags` values
            are still applied via `Name.tag_text`. Useful for
            callers that only need tokens + part tags and don't
            match on symbol overlap; skipping the AC tagger is the
            main performance saving.
        phonetics: When `True` (default), each `NamePart.metaphone`
            is populated at construction; when `False`, the field
            stays `None` and the phonetics crate isn't called.
            Consumers that feed `part.metaphone` into downstream
            fields (e.g. yente's `name_phonemes` ES field) keep the
            default; callers that never read the property can save
            the per-part metaphone call.
        numerics: When `True` (default), numeric-looking name parts
            that the AC tagger's ordinal list didn't cover get a
            `Symbol(NUMERIC, int_value)` applied. When `False`, parts
            still get `NamePartTag.NUM` (cheap structural info) but
            no NUMERIC symbol is emitted. Callers that don't use
            numeric-symbol overlap for scoring can save the symbol
            allocation.
        consolidate: When `True` (default), the returned set has
            [Name.consolidate_names][rigour.names.Name.consolidate_names]
            applied — short names that are substrings of longer names
            in the same set are dropped. **Indexers should pass
            `consolidate=False`** to preserve partial-name recall
            (e.g. letting `"John Smith"` match `"John K Smith"` from
            the other side).
        rewrite: When `True` (default), the pre-tagger canonicalisation
            stages run: honorific-prefix removal for PER names
            (`Mr.`, `Dr.`, `Sir`), and for ORG/ENT names both
            article-prefix removal (`The`) and org-type compare-form
            rewriting (`Inc.`→`LLC`, `GmbH`→`JSC`, …). Pass `False`
            to keep the literal input form — the tagger still fires
            on the raw tokens because its alias set covers both
            original and canonical forms. Useful for debugging the
            tagger in isolation and for callers that want to display
            or index a name without the canonical substitutions.

    Returns:
        A set of tagged `Name` objects, de-duplicated by normalised
        form. Empty if every input normalised to an empty string.
    """
    tag_dict: dict[NamePartTag, list[str]] | None
    if part_tags is None:
        tag_dict = None
    else:
        tag_dict = {tag: list(values) for tag, values in part_tags.items()}
    return _analyze_names(
        type_tag,
        list(names),
        tag_dict,
        infer_initials=infer_initials,
        symbols=symbols,
        phonetics=phonetics,
        numerics=numerics,
        consolidate=consolidate,
        rewrite=rewrite,
    )

`compare_parts(qry, res, config=None)` `builtin`

Score the alignment of two NamePart lists.

Callers should hand over the residue — parts that earlier stages (symbol pairing, alias tagging, identifier matching) couldn't explain by themselves — already canonicalised into positional order (tag_sort for ORG/ENT, align_person_name_order for PER). The function returns one [Alignment] per cluster, paired or solo; every input part appears exactly once across the output. Returned alignments carry symbol = None (residue distance is non-symbolic by definition).

config overrides the cost / budget / clustering scalars. Pass None (the default) to use the process-wide defaults — those match industry-typical recall-protective tuning. Sweep scripts build a fresh [CompareConfig] per iteration; matchers cache one per request.

`contains_split_phrase(string)`

Check whether string contains an alias-marker phrase.

Detects markers like "a.k.a.", "f.k.a.", "née", "alias", that signal a single string actually carries multiple distinct names. Useful for triaging input — a string with a split phrase shouldn't be treated as one atomic name. The phrase list is data-driven from resources/names/stopwords.yml::NAME_SPLIT_PHRASES, surfaced via rigour._core.name_split_phrases_list.

Parameters:

Name	Type	Description	Default
`string`	`str`	An input that may contain one or more names.	required

Returns:

Type	Description
`bool`	`True` iff at least one split-phrase marker appears in `string` as a whole word.

Source code in rigour/names/split_phrases.py

def contains_split_phrase(string: str) -> bool:
    """Check whether `string` contains an alias-marker phrase.

    Detects markers like `"a.k.a."`, `"f.k.a."`, `"née"`, `"alias"`,
    that signal a single string actually carries multiple distinct
    names. Useful for triaging input — a string with a split
    phrase shouldn't be treated as one atomic name. The phrase
    list is data-driven from
    `resources/names/stopwords.yml::NAME_SPLIT_PHRASES`,
    surfaced via `rigour._core.name_split_phrases_list`.

    Args:
        string: An input that may contain one or more names.

    Returns:
        `True` iff at least one split-phrase marker appears in
        `string` as a whole word.
    """
    return _split_phrase_regex().search(string) is not None
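A minimal sketch of whole-word phrase detection follows. The phrase list here contains only the four markers named in the docstring and is an assumption; the real list comes from `stopwords.yml`, and the library's regex may be built differently (case-insensitive matching is also assumed).

```python
import re

# Hypothetical phrase list, limited to the markers quoted in the docstring.
SPLIT_PHRASES = ["a.k.a.", "f.k.a.", "née", "alias"]

# Lookarounds replace \b because \b does not match after a trailing "." before a space.
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(p) for p in SPLIT_PHRASES) + r")(?!\w)",
    re.IGNORECASE,
)


def has_split_phrase(text: str) -> bool:
    # True when any marker occurs as a standalone word or phrase.
    return _PATTERN.search(text) is not None


print(has_split_phrase("Jane Doe a.k.a. Janie"))  # True
print(has_split_phrase("Aliasing Corp"))          # False
```

The second call shows why whole-word matching matters: "alias" inside "Aliasing" is not treated as a marker.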

`extract_org_types(name, normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop, generic=False)`

Find every organisation-type designation in a name.

Scans name for recognised aliases (LLC, Inc, GmbH, ...) and returns the matched substring and its canonical target. A poor-person's "is this a company name?" detector.

Parameters:

Name	Type	Description	Default
`name`	`str`	The text to be processed. Assumed to already be normalised with the same `normalize_flags` + `cleanup` the alias table was built from.	required
`normalize_flags`	`Normalize`	`Normalize` flag set applied to the alias list at build time. Default `Normalize.CASEFOLD`.	`CASEFOLD`
`cleanup`	`Cleanup`	`Cleanup` variant applied during alias normalisation. Default `Cleanup.Noop`.	`Noop`
`generic`	`bool`	If True, target values are the generic form (`llc`, `jsc`) instead of the type-specific compare form. Matches `replace_org_types_compare`.	`False`

Returns:

Type	Description
`List[Tuple[str, str]]`	A list of `(matched_text, target)` tuples, one per non-overlapping match. Empty if nothing matches.

Source code in rigour/names/org_types.py

def extract_org_types(
    name: str,
    normalize_flags: Normalize = Normalize.CASEFOLD,
    cleanup: Cleanup = Cleanup.Noop,
    generic: bool = False,
) -> List[Tuple[str, str]]:
    """Find every organisation-type designation in a name.

    Scans `name` for recognised aliases (LLC, Inc, GmbH, ...) and returns
    the matched substring and its canonical target. A poor-person's
    "is this a company name?" detector.

    Args:
        name: The text to be processed. Assumed to already be normalised
            with the same `normalize_flags` + `cleanup` the alias table
            was built from.
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Default
            `Normalize.CASEFOLD`.
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.
        generic: If True, target values are the generic form (``llc``,
            ``jsc``) instead of the type-specific compare form. Matches
            :func:`replace_org_types_compare`.

    Returns:
        A list of ``(matched_text, target)`` tuples, one per
        non-overlapping match. Empty if nothing matches.
    """
    return _extract_org_types(name, int(normalize_flags), int(cleanup), generic)
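To illustrate the shape of the return value, here is a small regex-based sketch. The alias table is hypothetical and tiny (the real one is built from resource files), and `extract_types` is a stand-in name, not part of rigour.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical alias table: casefolded alias -> compare form.
ALIASES: Dict[str, str] = {
    "gmbh": "gmbh",
    "aktiengesellschaft": "ag",
    "ag": "ag",
}

# Longest aliases first so a long form wins over any shorter alias at the same spot.
_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")(?!\w)"
)


def extract_types(name: str) -> List[Tuple[str, str]]:
    # `name` is assumed to be casefolded already, like the alias keys.
    return [(m.group(1), ALIASES[m.group(1)]) for m in _PATTERN.finditer(name)]


print(extract_types("siemens aktiengesellschaft"))  # [('aktiengesellschaft', 'ag')]
print(extract_types("Siemens AG"))                  # []
```

In this sketch the second call returns nothing because the input was not casefolded first, which mirrors the requirement that `name` be normalised the same way as the alias table.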

`is_name(name)`

Check whether name plausibly contains a name.

Loose filter — true iff at least one character is a Unicode letter (general category L*). Useful for rejecting purely numeric ("007") or punctuation-only ("---") inputs before handing them to the rest of the name pipeline.

Parameters:

Name	Type	Description	Default
`name`	`str`	A string.	required

Returns:

Type	Description
`bool`	`True` iff `name` contains at least one letter.

Source code in rigour/names/check.py

def is_name(name: str) -> bool:
    """Check whether `name` plausibly contains a name.

    Loose filter — true iff at least one character is a Unicode
    letter (general category `L*`). Useful for rejecting purely
    numeric (`"007"`) or punctuation-only (`"---"`) inputs before
    handing them to the rest of the name pipeline.

    Args:
        name: A string.

    Returns:
        `True` iff `name` contains at least one letter.
    """
    for char in name:
        category = unicodedata.category(char)
        if category[0] == "L":
            return True
    return False
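The filter is easy to try out in isolation. The snippet below restates the same logic compactly so it runs without rigour installed; the sample inputs are illustrative.

```python
import unicodedata


def is_name(name: str) -> bool:
    # Same check as the listing: any character whose Unicode category starts with "L".
    return any(unicodedata.category(char)[0] == "L" for char in name)


print(is_name("007"))        # False
print(is_name("---"))        # False
print(is_name("Agent 007"))  # True
```

A single letter anywhere in the string is enough to pass, so this is a coarse pre-filter rather than a validity check.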

`is_stopword(form, *, normalizer=normalize_name, normalize=False)`

Check if the given form is a stopword. The stopword list is normalized first.

Deprecated: use `rigour.text.is_stopword` instead. This function will be removed in a future version.

Parameters:

Name	Type	Description	Default
`form`	`str`	The token to check, must already be normalized.	required
`normalizer`	`Normalizer`	The normalizer to use for checking stopwords.	`normalize_name`
`normalize`	`bool`	Whether to normalize the form before checking.	`False`

Returns:

Type	Description
`bool`	True if the form is a stopword, False otherwise.

Source code in rigour/names/check.py

def is_stopword(
    form: str, *, normalizer: Normalizer = normalize_name, normalize: bool = False
) -> bool:
    """Check if the given form is a stopword. The stopword list is normalized first.

    .. deprecated::
        Use :func:`rigour.text.is_stopword` instead. This function will be removed in a future version.

    Args:
        form (str): The token to check, must already be normalized.
        normalizer (Normalizer): The normalizer to use for checking stopwords.
        normalize (bool): Whether to normalize the form before checking.

    Returns:
        bool: True if the form is a stopword, False otherwise.
    """
    warnings.warn(
        "rigour.names.is_stopword is deprecated, use rigour.text.is_stopword instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return _is_stopword(form, normalizer=normalizer, normalize=normalize)

`normalize_name(name)` `cached`

Casefold and tokenise a name into a stable matching key.

Convenience composition of `tokenize_name` over a casefolded input, rejoined with single ASCII spaces. Equivalent to calling `normalize(name, Normalize.CASEFOLD | Normalize.NAME)`; use that directly when callers want explicit flag control.

Used internally by the rigour name-matching utilities; not intended as a general-purpose public surface.

Parameters:

Name	Type	Description	Default
`name`	`Optional[str]`	A name string, or `None`.	required

Returns:

Type	Description
`Optional[str]`	Normalised name (lowercase, single-space-separated tokens), or `None` if input is `None` or normalises to empty.

Source code in rigour/names/tokenize.py

@lru_cache(maxsize=MEMO_SMALL)
def normalize_name(name: Optional[str]) -> Optional[str]:
    """Casefold and tokenise a name into a stable matching key.

    Convenience composition of :func:`tokenize_name` over a
    casefolded input, rejoined with single ASCII spaces.
    Equivalent to calling
    `normalize(name, Normalize.CASEFOLD | Normalize.NAME)` —
    use that directly when callers want explicit flag control.

    Used internally by the rigour name-matching utilities; not
    intended as a general-purpose public surface.

    Args:
        name: A name string, or `None`.

    Returns:
        Normalised name (lowercase, single-space-separated
        tokens), or `None` if input is `None` or normalises to
        empty.
    """
    if name is None:
        return None
    return normalize(name, Normalize.CASEFOLD | Normalize.NAME)

`pair_symbols(query, result)` `builtin`

Align the symbol spans of two [Name]s into coverage-maximal pairings.

Each returned pairing is a tuple of non-conflicting [Alignment]s; edges within a pairing cover disjoint parts on each side. Each Alignment has symbol = Some(_) and a placeholder score = 1.0 — consumers should override the score with a per-category default before composing the pairing total. Pairings are distinguished by their coverage and category multiset — two pairings that cover the same parts with the same category mix are collapsed to one. Distinct category choices on the same parts (e.g. a token carrying both NAME:Qvan and SYMBOL:van) surface as separate pairings.

Returns [()] (a single empty pairing) when either name has more than 64 parts, when either name has no tagger spans, or when no symbol is shared between the two sides.

`pick_case(names)`

Pick the best mix of lower- and uppercase characters from a set of names that are identical except for case. If the names differ in anything other than case, the result is undefined.

Rust-backed via `rigour._core.pick_case`. The Rust implementation returns None for empty input; this Python wrapper raises ValueError to preserve the pre-port contract that external callers rely on.

Parameters:

Name	Type	Description	Default
`names`	`List[str]`	A list of identical names in different cases.	required

Returns:

Type	Description
`str`	The best name for display.

Source code in rigour/names/pick.py

def pick_case(names: List[str]) -> str:
    """Pick the best mix of lower- and uppercase characters from a set of names
    that are identical except for case. If the names are not identical, undefined
    things happen (not recommended).

    Rust-backed via :func:`rigour._core.pick_case`. The Rust
    implementation returns `None` for empty input; this Python wrapper
    raises `ValueError` to preserve the pre-port contract that
    external callers rely on.

    Args:
        names (List[str]): A list of identical names in different cases.

    Returns:
        str: The best name for display.
    """
    from rigour._core import pick_case as _pick_case

    result = _pick_case(names)
    if result is None:
        raise ValueError("Cannot pick a name from an empty list.")
    return result

`remove_obj_prefixes(name)`

Strip vessel-class and generic-article prefixes from the head of an object name.

Drops "M/V", "SS", "The", etc. so "M/V Oceanic" → "Oceanic" doesn't penalise the shorter variant when matching vessels, vehicles, or aircraft. Driven by resources/names/stopwords.yml::OBJ_NAME_PREFIXES.

Parameters:

Name	Type	Description	Default
`name`	`str`	An object (vessel / vehicle / aircraft) name string.	required

Returns:

Type	Description
`str`	The name with any leading prefix(es) removed.

Source code in rigour/names/prefix.py

def remove_obj_prefixes(name: str) -> str:
    """Strip vessel-class and generic-article prefixes from the
    head of an object name.

    Drops `"M/V"`, `"SS"`, `"The"`, etc. so `"M/V Oceanic"` →
    `"Oceanic"` doesn't penalise the shorter variant when
    matching vessels, vehicles, or aircraft. Driven by
    `resources/names/stopwords.yml::OBJ_NAME_PREFIXES`.

    Args:
        name: An object (vessel / vehicle / aircraft) name string.

    Returns:
        The name with any leading prefix(es) removed.
    """
    return _obj_prefix_regex().sub("", name)
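The prefix-stripping helpers all follow the same pattern: an anchored regex built from a data file, applied with `sub("", name)`. A rough sketch of that pattern is shown here; the prefix list holds only the three examples quoted in the docstring, and case-insensitive matching and the requirement of trailing whitespace are assumptions.

```python
import re

# Hypothetical prefix list; the real one is read from stopwords.yml.
OBJ_PREFIXES = ["M/V", "SS", "The"]

# Anchored at the start and repeatable, so stacked prefixes are removed in one pass.
_PREFIX = re.compile(
    r"^(?:(?:" + "|".join(re.escape(p) for p in OBJ_PREFIXES) + r")\s+)+",
    re.IGNORECASE,
)


def strip_prefixes(name: str) -> str:
    # Remove every leading prefix that is followed by whitespace.
    return _PREFIX.sub("", name)


print(strip_prefixes("M/V Oceanic"))    # Oceanic
print(strip_prefixes("The SS Minnow"))  # Minnow
print(strip_prefixes("Sstar"))          # Sstar
```

Requiring whitespace after each prefix keeps names that merely begin with the same letters, such as "Sstar", intact.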

`remove_org_prefixes(name)`

Strip article-like prefixes from the head of an organisation name.

Drops "The", etc. so "The Charitable Trust" → "Charitable Trust" doesn't penalise the shorter variant when matching. Driven by resources/names/stopwords.yml::ORG_NAME_PREFIXES.

Parameters:

Name	Type	Description	Default
`name`	`str`	An organisation name string.	required

Returns:

Type	Description
`str`	The name with any leading article-prefix(es) removed.

Source code in rigour/names/prefix.py

def remove_org_prefixes(name: str) -> str:
    """Strip article-like prefixes from the head of an organisation name.

    Drops `"The"`, etc. so `"The Charitable Trust"` →
    `"Charitable Trust"` doesn't penalise the shorter variant
    when matching. Driven by
    `resources/names/stopwords.yml::ORG_NAME_PREFIXES`.

    Args:
        name: An organisation name string.

    Returns:
        The name with any leading article-prefix(es) removed.
    """
    return _org_prefix_regex().sub("", name)

`remove_org_types(name, replacement='', normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop)`

Remove organisation-type designations from a name.

Every recognised alias (LLC, Inc, GmbH, ...) in name is replaced with replacement. Useful as a preprocessing step when you want the entity name stripped of legal-form noise.

Parameters:

Name	Type	Description	Default
`name`	`str`	The text to be processed. Assumed to already be normalised with the same `normalize_flags` + `cleanup` the alias table was built from.	required
`replacement`	`str`	The string to replace each matched alias with. Default `""` (strip).	`''`
`normalize_flags`	`Normalize`	`Normalize` flag set applied to the alias list at build time. Default `Normalize.CASEFOLD`.	`CASEFOLD`
`cleanup`	`Cleanup`	`Cleanup` variant applied during alias normalisation. Default `Cleanup.Noop`.	`Noop`

Returns:

Type	Description
`str`	The text with recognised organisation types replaced. May be empty if the input consisted entirely of matched aliases and `replacement` is the empty string.

Source code in rigour/names/org_types.py

def remove_org_types(
    name: str,
    replacement: str = "",
    normalize_flags: Normalize = Normalize.CASEFOLD,
    cleanup: Cleanup = Cleanup.Noop,
) -> str:
    """Remove organisation-type designations from a name.

    Every recognised alias (LLC, Inc, GmbH, ...) in `name` is replaced
    with `replacement`. Useful as a preprocessing step when you want
    the entity name stripped of legal-form noise.

    Args:
        name: The text to be processed. Assumed to already be normalised
            with the same `normalize_flags` + `cleanup` the alias table
            was built from.
        replacement: The string to replace each matched alias with.
            Default ``""`` (strip).
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Default
            `Normalize.CASEFOLD`.
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.

    Returns:
        The text with recognised organisation types replaced. May be
        empty if the input consisted entirely of matched aliases and
        `replacement` is the empty string.
    """
    return _remove_org_types(name, int(normalize_flags), int(cleanup), replacement)

`remove_person_prefixes(name)`

Strip honorific prefixes from the head of a person name.

Drops "Mr.", "Mrs.", "Dr.", "Lady", etc. so honorifics don't contaminate part alignment in matching or token-bag comparison. The list is data-driven from resources/names/stopwords.yml::PERSON_NAME_PREFIXES, surfaced via rigour._core.person_name_prefixes_list.

Parameters:

Name	Type	Description	Default
`name`	`str`	A person name string.	required

Returns:

Type	Description
`str`	The name with any leading honorific(s) removed. Idempotent for inputs that don't start with a known prefix.

Source code in rigour/names/prefix.py

def remove_person_prefixes(name: str) -> str:
    """Strip honorific prefixes from the head of a person name.

    Drops `"Mr."`, `"Mrs."`, `"Dr."`, `"Lady"`, etc. so honorifics
    don't contaminate part alignment in matching or token-bag
    comparison. The list is data-driven from
    `resources/names/stopwords.yml::PERSON_NAME_PREFIXES`,
    surfaced via `rigour._core.person_name_prefixes_list`.

    Args:
        name: A person name string.

    Returns:
        The name with any leading honorific(s) removed. Idempotent
        for inputs that don't start with a known prefix.
    """
    return _person_prefix_regex().sub("", name)

`replace_org_types_compare(name, normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop, generic=False)`

Replace organisation types in a name with a heavily normalised form.

Country-specific entity types (e.g. GmbH, Aktiengesellschaft, ООО) are rewritten into a simplified comparison form (e.g. gmbh, ag, ooo) suitable for string-distance matching. The result is meant for comparison pipelines, not for presentation.

Parameters:

Name	Type	Description	Default
`name`	`str`	The text to be processed. Assumed to already be normalised with the same `normalize_flags` + `cleanup` the alias table was built from.	required
`normalize_flags`	`Normalize`	`Normalize` flag set applied to the alias list at build time. Default `Normalize.CASEFOLD` matches production callers (nomenklatura/yente/FTM via `prenormalize_name`).	`CASEFOLD`
`cleanup`	`Cleanup`	`Cleanup` variant applied during alias normalisation. Default `Cleanup.Noop`.	`Noop`
`generic`	`bool`	If True, substitute the generic form of the organisation type (e.g. `llc`, `jsc`) instead of the type-specific compare form. Specs without a `generic` field are left unchanged in generic mode.	`False`

Returns:

Type	Description
`str`	The text with recognised organisation types substituted. If every match would reduce the text to an empty string, the original text is returned unchanged.

Source code in rigour/names/org_types.py

def replace_org_types_compare(
    name: str,
    normalize_flags: Normalize = Normalize.CASEFOLD,
    cleanup: Cleanup = Cleanup.Noop,
    generic: bool = False,
) -> str:
    """Replace organisation types in a name with a heavily normalised form.

    Country-specific entity types (e.g. GmbH, Aktiengesellschaft, ООО) are
    rewritten into a simplified comparison form (e.g. ``gmbh``, ``ag``,
    ``ooo``) suitable for string-distance matching. The result is meant
    for comparison pipelines, not for presentation.

    Args:
        name: The text to be processed. Assumed to already be normalised
            with the same `normalize_flags` + `cleanup` the alias table
            was built from.
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Default
            `Normalize.CASEFOLD` matches production callers
            (nomenklatura/yente/FTM via `prenormalize_name`).
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.
        generic: If True, substitute the generic form of the organisation
            type (e.g. ``llc``, ``jsc``) instead of the type-specific
            compare form. Specs without a `generic` field are left
            unchanged in generic mode.

    Returns:
        The text with recognised organisation types substituted. If every
        match would reduce the text to an empty string, the original
        text is returned unchanged.
    """
    return _replace_org_types_compare(name, int(normalize_flags), int(cleanup), generic)
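
The precondition that `name` is already normalised the same way as the alias table is easy to miss. The sketch below uses a tiny hypothetical alias table and a plain regex (the real implementation lives in the Rust core and may work differently) to show why a non-casefolded input would not match a casefolded table.

```python
import re

# Hypothetical alias table: casefolded alias -> compare form.
SAMPLE_COMPARE_FORMS = {
    "aktiengesellschaft": "ag",
    "gmbh": "gmbh",
    "incorporated": "inc",
    "limited": "ltd",
}

_ALIAS_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(a) for a in sorted(SAMPLE_COMPARE_FORMS, key=len, reverse=True))
    + r")(?!\w)"
)


def replace_compare_sketch(name: str) -> str:
    # Case-sensitive on purpose: the table is casefolded, so the input must be too.
    replaced = _ALIAS_RE.sub(lambda m: SAMPLE_COMPARE_FORMS[m.group(0)], name)
    # Mirror the documented fallback: never hand back an empty string.
    return replaced if replaced.strip() else name


raw = "Siemens Aktiengesellschaft"
print(replace_compare_sketch(raw))             # unchanged, no casefolded match
print(replace_compare_sketch(raw.casefold()))  # alias rewritten
```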

`replace_org_types_display(name, normalize_flags=Normalize.CASEFOLD | Normalize.SQUASH_SPACES, cleanup=Cleanup.Noop)`

Replace organisation types in a name with their short display form.

Spelt-out legal forms are shortened into common abbreviations (e.g. "Siemens Aktiengesellschaft" → "Siemens AG"), preserving the case of non-matched portions. If the whole input is uppercase (str.isupper()), the whole output is re-uppercased.

Matches case-insensitively across Unicode by casefolding a copy of the input internally for the match step — normalize_flags must therefore include Normalize.CASEFOLD so the alias table is casefolded too. The default does this.

Parameters:

Name	Type	Description	Default
`name`	`str`	The text to be processed.	required
`normalize_flags`	`Normalize`	`Normalize` flag set applied to the alias list at build time. Must include `Normalize.CASEFOLD` for Unicode-case-insensitive matching. Default `CASEFOLD \| SQUASH_SPACES`.	`CASEFOLD \| SQUASH_SPACES`
`cleanup`	`Cleanup`	`Cleanup` variant applied during alias normalisation. Default `Cleanup.Noop`.	`Noop`

Returns:

Type	Description
`str`	The text with recognised organisation types substituted for their display form. Non-matched regions keep their original case.

Source code in rigour/names/org_types.py

def replace_org_types_display(
    name: str,
    normalize_flags: Normalize = Normalize.CASEFOLD | Normalize.SQUASH_SPACES,
    cleanup: Cleanup = Cleanup.Noop,
) -> str:
    """Replace organisation types in a name with their short display form.

    Spelt-out legal forms are shortened into common abbreviations
    (e.g. ``"Siemens Aktiengesellschaft"`` → ``"Siemens AG"``), preserving
    the case of non-matched portions. If the whole input is uppercase
    (`str.isupper()`), the whole output is re-uppercased.

    Matches case-insensitively across Unicode by casefolding a copy of
    the input internally for the match step — `normalize_flags` must
    therefore include `Normalize.CASEFOLD` so the alias table is
    casefolded too. The default does this.

    Args:
        name: The text to be processed.
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Must include
            `Normalize.CASEFOLD` for Unicode-case-insensitive matching.
            Default `CASEFOLD | SQUASH_SPACES`.
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.

    Returns:
        The text with recognised organisation types substituted for
        their display form. Non-matched regions keep their original case.
    """
    return _replace_org_types_display(name, int(normalize_flags), int(cleanup))
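
The case-handling rules above can be sketched in a few lines. This is a simplified stand-in, assuming a hypothetical three-entry alias table and using `re.IGNORECASE` in place of the internal casefolded copy; it is not the library's implementation.

```python
import re

# Hypothetical alias table: casefolded alias -> display form.
SAMPLE_DISPLAY_FORMS = {
    "aktiengesellschaft": "AG",
    "incorporated": "Inc",
    "limited": "Ltd",
}

_DISPLAY_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(a) for a in sorted(SAMPLE_DISPLAY_FORMS, key=len, reverse=True))
    + r")(?!\w)",
    re.IGNORECASE,
)


def replace_display_sketch(name: str) -> str:
    # Only matched spans are rewritten, so the rest keeps its original case.
    out = _DISPLAY_RE.sub(
        lambda m: SAMPLE_DISPLAY_FORMS[m.group(0).casefold()], name
    )
    # An all-uppercase input yields an all-uppercase output.
    return out.upper() if name.isupper() else out


print(replace_display_sketch("Siemens Aktiengesellschaft"))
print(replace_display_sketch("ACME LIMITED"))
```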

`representative_names(names, limit, cluster_threshold=0.3)`

Reduce a bag of aliases to at most limit representatives without extreme information loss.

Useful when a downstream process (e.g. building a search-index query) wants to probe the alias space broadly under a budget cap. For a person with 20 transliterations of one name and limit=5, this returns ~1-5 centroid-selected representatives rather than all 20 near-identical forms. For a person with two genuinely distinct names (Nelson Mandela / Rolihlahla Mandela), both survive — N transliterations of one name don't add recall, but a second name does.

Fast path: if the input already collapses to <= limit distinct names (after casefold-dedup via reduce_names), those names are returned as-is without clustering. Compression only runs when the input actually needs to be compressed. This means cluster_threshold has no effect when the fast path fires.

Ordering of the returned list is not guaranteed. Returned strings are originals from the input; when clustering runs, pick_name selects the best-case representative of each cluster.

Parameters:

Name	Type	Description	Default
`names`	`List[str]`	input aliases, typically all belonging to one entity.	required
`limit`	`int`	upper bound on output size.	required
`cluster_threshold`	`float`	normalized Levenshtein distance (0..1) above which two names are considered distinct names rather than variants of one. Default 0.3 keeps transliterations together while separating genuinely different names. Ignored when the fast path fires.	`0.3`

Source code in rigour/names/pick.py

def representative_names(
    names: List[str],
    limit: int,
    cluster_threshold: float = 0.3,
) -> List[str]:
    """Reduce a bag of aliases to at most `limit` representatives
    without extreme information loss.

    Useful when a downstream process (e.g. building a search-index
    query) wants to probe the alias space broadly under a budget
    cap. For a person with 20 transliterations of one name and
    `limit=5`, this returns ~1-5 centroid-selected representatives
    rather than all 20 near-identical forms. For a person with two
    genuinely distinct names (Nelson Mandela / Rolihlahla Mandela),
    both survive — N transliterations of one name don't add recall,
    but a second *name* does.

    **Fast path**: if the input already collapses to `<= limit`
    distinct names (after casefold-dedup via :func:`reduce_names`),
    those names are returned as-is without clustering. Compression
    only runs when the input actually needs to be compressed. This
    means `cluster_threshold` has no effect when the fast path
    fires.

    Ordering of the returned list is not guaranteed. Returned
    strings are originals from the input — :func:`pick_name` per
    cluster selects the best-case representative when clustering
    runs.

    Args:
        names: input aliases, typically all belonging to one entity.
        limit: upper bound on output size.
        cluster_threshold: normalized Levenshtein distance (0..1) above
            which two names are considered distinct *names* rather than
            variants of one. Default 0.3 keeps transliterations together
            while separating genuinely different names. Ignored when
            the fast path fires.
    """
    if limit <= 0 or not names:
        return []
    reduced = reduce_names(names)
    if len(reduced) <= limit:
        return list(reduced)

    # Casefolded/whitespace-normalised form of each reduced name, for
    # distance measurement. The originals are what we return.
    normed: Dict[str, str] = {}
    for n in reduced:
        nn = " ".join(n.casefold().split())
        if nn:
            normed[n] = nn

    centroid = pick_name(reduced)
    if centroid is None or centroid not in normed:
        return []

    def _dist(a: str, b: str) -> float:
        return levenshtein(a, b) / max(len(a), len(b), 1)

    # Farthest-point-first seed selection with threshold stopping: each
    # new seed must be more than `cluster_threshold` away from every
    # already-picked seed, else we've run out of distinct clusters.
    seeds: List[str] = [centroid]
    while len(seeds) < limit:
        outlier: Optional[str] = None
        outlier_d = 0.0
        for n in reduced:
            if n in seeds or n not in normed:
                continue
            nn = normed[n]
            min_d = min(_dist(nn, normed[s]) for s in seeds)
            if min_d > outlier_d:
                outlier_d = min_d
                outlier = n
        if outlier is None or outlier_d <= cluster_threshold:
            break
        seeds.append(outlier)

    if len(seeds) == 1:
        return seeds

    # Assign each reduced name to its nearest seed, then pick_name per
    # cluster so the returned rep is the best display form of its group
    # rather than whichever outlier happened to be picked as the seed.
    clusters: List[List[str]] = [[s] for s in seeds]
    for n in reduced:
        if n in seeds or n not in normed:
            continue
        nn = normed[n]
        best_i = 0
        best_d = float("inf")
        for i, s in enumerate(seeds):
            d = _dist(nn, normed[s])
            if d < best_d:
                best_d = d
                best_i = i
        clusters[best_i].append(n)

    reps: List[str] = []
    for cluster in clusters:
        rep = pick_name(cluster)
        if rep is not None:
            reps.append(rep)
    return reps
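
To get a feel for what `cluster_threshold=0.3` separates, the normalised distance used by `_dist` can be reproduced with a plain-Python Levenshtein. The `levenshtein` below is a stand-in written for this example, not the function the module imports.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance with unit costs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    # Same normalisation as `_dist` above: divide by the longer length.
    return levenshtein(a, b) / max(len(a), len(b), 1)


# A spelling variant stays well below the threshold...
variant = normalized_distance("nelson mandela", "nelson mandella")
# ...while a genuinely different name lands above it.
distinct = normalized_distance("nelson mandela", "rolihlahla mandela")
print(variant, distinct)
```

With the default threshold, the first pair would fall into one cluster and the second pair would seed two.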

`tokenize_name(text, token_min_length=1)`

Split a person or entity's name into name parts.

Unicode general-category-aware: separator categories (spaces, punctuation, math symbols) split tokens; delete categories (combining marks, modifier letters, format chars) drop; letters, numbers, and a small set of CJK modifier marks are kept.

Parameters:

Name	Type	Description	Default
`text`	`str`	The name to tokenize.	required
`token_min_length`	`int`	Drop tokens shorter than this many codepoints. Defaults to 1 (drop only zero-length).	`1`

Returns:

Type	Description
`List[str]`	Tokens in left-to-right order, with any deletion or whitespace-substitution applied. Order matches input.

Source code in rigour/names/tokenize.py

def tokenize_name(text: str, token_min_length: int = 1) -> List[str]:
    """Split a person or entity's name into name parts.

    Unicode general-category-aware: separator categories (spaces,
    punctuation, math symbols) split tokens; delete categories
    (combining marks, modifier letters, format chars) drop; letters,
    numbers, and a small set of CJK modifier marks are kept.

    Args:
        text: The name to tokenize.
        token_min_length: Drop tokens shorter than this many
            codepoints. Defaults to 1 (drop only zero-length).

    Returns:
        Tokens in left-to-right order, with any deletion or
        whitespace-substitution applied. Order matches input.
    """
    return _tokenize_name(text, token_min_length)
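
A minimal sketch of category-driven tokenisation, using only `unicodedata`, might look like the following. It is a simplification under stated assumptions: every category that is neither kept nor deleted is treated as a separator, and the CJK modifier-mark exceptions mentioned above are not modelled. `tokenize_sketch` is a hypothetical name.

```python
import unicodedata
from typing import List


def tokenize_sketch(text: str, token_min_length: int = 1) -> List[str]:
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("M") or cat in ("Lm", "Cf"):
            # Delete categories: dropped without ending the current token.
            continue
        if cat[0] in ("L", "N"):
            # Letters and numbers are kept.
            current.append(ch)
            continue
        # Assumption: everything else (spaces, punctuation, symbols) splits.
        if current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if len(t) >= token_min_length]


print(tokenize_sketch("Jean-Claude Juncker"))
print(tokenize_sketch("Jose\u0301 A. Smith", token_min_length=2))
```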

`rigour.names.analyze`

End-to-end name analysis: raw strings → tagged Name objects.

analyze_names is the unified entry point that downstream consumers (followthemoney's entity_names, in turn used by nomenklatura and yente) call once per entity to get matchable Name objects.

Rust-backed via rigour._core.analyze_names — one FFI crossing per call, regardless of how many names / part_tags the entity has. The single-call pipeline runs: prefix strip → prenormalize → org-type replacement (for ORG/ENT) → Name + NamePart construction → part tagging via Name.tag_text → tagger match-and-apply → NUMERIC / STOP / LEGAL inference → optional consolidate_names.

Part-tag value shape

part_tags values can be multi-token strings. A value like "Jean Claude" in part_tags[NamePartTag.GIVEN] for the name "Jean Claude Juncker" will tag both the "jean" and "claude" parts as GIVEN — the underlying Name.tag_text tokenises the value and walks the name parts looking for the token sequence. The tokens of the value don't need to be adjacent in the name, just present in order.
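
The in-order, not-necessarily-adjacent matching described here can be sketched as a simple forward scan. `match_in_order` is a hypothetical helper, and returning nothing on a partial match is an assumption of this sketch, not documented behaviour of `Name.tag_text`.

```python
from typing import List, Sequence


def match_in_order(name_tokens: Sequence[str], value_tokens: Sequence[str]) -> List[int]:
    """Return the indexes of name tokens hit by the value tokens, in order."""
    matched: List[int] = []
    pos = 0
    for token in value_tokens:
        # Scan forward from the last hit; gaps between hits are allowed.
        while pos < len(name_tokens) and name_tokens[pos] != token:
            pos += 1
        if pos == len(name_tokens):
            return []  # assumption: an incomplete sequence tags nothing
        matched.append(pos)
        pos += 1
    return matched


# "Jean Claude" against "Jean Claude Juncker": both given-name parts are hit.
print(match_in_order(["jean", "claude", "juncker"], ["jean", "claude"]))
# Tokens need not be adjacent, only in order.
print(match_in_order(["jean", "pierre", "claude"], ["jean", "claude"]))
```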

`analyze_names(type_tag, names, part_tags=None, *, infer_initials=False, symbols=True, phonetics=True, numerics=True, consolidate=True, rewrite=True)`

Build a set of tagged Name objects from raw strings.

Parameters:

Name	Type	Description	Default
`type_tag`	`NameTypeTag`	The NameTypeTag for every name in this batch. Drives which prefix/org-type/ tagger passes run: `PER` → person prefix strip + person tagger; `ORG`/`ENT` → org-type replacement + org prefix strip + org tagger; `OBJ` → object prefix strip ("M/V", "SS", …) but no tagger; `UNK` → no rewrites or tagging, just construction.	required
`names`	`Sequence[str]`	Raw name strings as harvested from the source entity. Empty strings and inputs that normalise to empty are dropped. Duplicates (after prenormalisation) are de-duplicated.	required
`part_tags`	`Optional[Mapping[NamePartTag, Sequence[str]]]`	Pre-classified part annotations, typically produced by an adapter that reads structured name-part properties off the source entity (e.g. firstName → `GIVEN`, lastName → `FAMILY`). Each value is applied to every constructed `Name` via `Name.tag_text`. Values can be multi-token strings — see the module docstring. Defaults to an empty mapping.	`None`
`infer_initials`	`bool`	When `True`, every single-character latin name part is tagged with an `INITIAL` symbol — useful on a free-text query side where `"J Smith"` arrives without a label on `"J"`. When `False` (default), only parts already tagged as `GIVEN` / `MIDDLE` pick up `INITIAL` symbols. Default `False` because initials are a query-side concept; the indexer and the candidate side of a matcher pass `False`, so the leaner default suits the common call. Ignored for non-person names. No-op when `symbols=False`.	`False`
`symbols`	`bool`	Master switch for symbol emission. When `True` (default), the INITIAL preamble, the AC tagger's match-and-apply pass, and NUMERIC-symbol emission all run. When `False`, no symbols are attached to the returned names — `name.symbols` is empty and `name.spans` stays empty. NamePartTag labelling (including the `NUM` / `STOP` / `LEGAL` promotions in the inference pass) still fires, and `part_tags` values are still applied via `Name.tag_text`. Useful for callers that only need tokens + part tags and don't match on symbol overlap; skipping the AC tagger is the main performance saving.	`True`
`phonetics`	`bool`	When `True` (default), each `NamePart.metaphone` is populated at construction; when `False`, the field stays `None` and the phonetics crate isn't called. Consumers that feed `part.metaphone` into downstream fields (e.g. yente's `name_phonemes` ES field) keep the default; callers that never read the property can save the per-part metaphone call.	`True`
`numerics`	`bool`	When `True` (default), numeric-looking name parts that the AC tagger's ordinal list didn't cover get a `Symbol(NUMERIC, int_value)` applied. When `False`, parts still get `NamePartTag.NUM` (cheap structural info) but no NUMERIC symbol is emitted. Callers that don't use numeric-symbol overlap for scoring can save the symbol allocation.	`True`
`consolidate`	`bool`	When `True` (default), the returned set has Name.consolidate_names applied — short names that are substrings of longer names in the same set are dropped. Indexers should pass `consolidate=False` to preserve partial-name recall (e.g. letting `"John Smith"` match `"John K Smith"` from the other side).	`True`
`rewrite`	`bool`	When `True` (default), the pre-tagger canonicalisation stages run: honorific-prefix removal for PER names (`Mr.`, `Dr.`, `Sir`), and for ORG/ENT names both article-prefix removal (`The`) and org-type compare-form rewriting (`Inc.`→`LLC`, `GmbH`→`JSC`, …). Pass `False` to keep the literal input form — the tagger still fires on the raw tokens because its alias set covers both original and canonical forms. Useful for debugging the tagger in isolation and for callers that want to display or index a name without the canonical substitutions.	`True`

Returns:

Type	Description
`Set[Name]`	A set of tagged `Name` objects, de-duplicated by normalised form. Empty if every input normalised to an empty string.

Source code in rigour/names/analyze.py

def analyze_names(
    type_tag: NameTypeTag,
    names: Sequence[str],
    part_tags: Optional[Mapping[NamePartTag, Sequence[str]]] = None,
    *,
    infer_initials: bool = False,
    symbols: bool = True,
    phonetics: bool = True,
    numerics: bool = True,
    consolidate: bool = True,
    rewrite: bool = True,
) -> Set[Name]:
    """Build a set of tagged [Name][rigour.names.Name] objects from raw strings.

    Args:
        type_tag: The [NameTypeTag][rigour.names.NameTypeTag] for
            every name in this batch. Drives which prefix/org-type/
            tagger passes run: `PER` → person prefix strip + person
            tagger; `ORG`/`ENT` → org-type replacement + org prefix
            strip + org tagger; `OBJ` → object prefix strip ("M/V",
            "SS", …) but no tagger; `UNK` → no rewrites or tagging,
            just construction.
        names: Raw name strings as harvested from the source entity.
            Empty strings and inputs that normalise to empty are
            dropped. Duplicates (after prenormalisation) are de-duplicated.
        part_tags: Pre-classified part annotations, typically produced
            by an adapter that reads structured name-part properties
            off the source entity (e.g. firstName → `GIVEN`,
            lastName → `FAMILY`). Each value is applied to every
            constructed `Name` via `Name.tag_text`. Values can be
            multi-token strings — see the module docstring. Defaults
            to an empty mapping.
        infer_initials: When `True`, every single-character latin name
            part is tagged with an `INITIAL` symbol — useful on a
            free-text query side where `"J Smith"` arrives without
            a label on `"J"`. When `False` (default), only parts
            already tagged as `GIVEN` / `MIDDLE` pick up `INITIAL`
            symbols. Default `False` because initials are a
            query-side concept; the indexer and the candidate side
            of a matcher pass `False`, so the leaner default suits
            the common call. Ignored for non-person names. No-op
            when `symbols=False`.
        symbols: Master switch for symbol emission. When `True`
            (default), the INITIAL preamble, the AC tagger's
            match-and-apply pass, and NUMERIC-symbol emission all
            run. When `False`, no symbols are attached to the
            returned names — `name.symbols` is empty and
            `name.spans` stays empty. NamePartTag labelling
            (including the `NUM` / `STOP` / `LEGAL` promotions in
            the inference pass) still fires, and `part_tags` values
            are still applied via `Name.tag_text`. Useful for
            callers that only need tokens + part tags and don't
            match on symbol overlap; skipping the AC tagger is the
            main performance saving.
        phonetics: When `True` (default), each `NamePart.metaphone`
            is populated at construction; when `False`, the field
            stays `None` and the phonetics crate isn't called.
            Consumers that feed `part.metaphone` into downstream
            fields (e.g. yente's `name_phonemes` ES field) keep the
            default; callers that never read the property can save
            the per-part metaphone call.
        numerics: When `True` (default), numeric-looking name parts
            that the AC tagger's ordinal list didn't cover get a
            `Symbol(NUMERIC, int_value)` applied. When `False`, parts
            still get `NamePartTag.NUM` (cheap structural info) but
            no NUMERIC symbol is emitted. Callers that don't use
            numeric-symbol overlap for scoring can save the symbol
            allocation.
        consolidate: When `True` (default), the returned set has
            [Name.consolidate_names][rigour.names.Name.consolidate_names]
            applied — short names that are substrings of longer names
            in the same set are dropped. **Indexers should pass
            `consolidate=False`** to preserve partial-name recall
            (e.g. letting `"John Smith"` match `"John K Smith"` from
            the other side).
        rewrite: When `True` (default), the pre-tagger canonicalisation
            stages run: honorific-prefix removal for PER names
            (`Mr.`, `Dr.`, `Sir`), and for ORG/ENT names both
            article-prefix removal (`The`) and org-type compare-form
            rewriting (`Inc.`→`LLC`, `GmbH`→`JSC`, …). Pass `False`
            to keep the literal input form — the tagger still fires
            on the raw tokens because its alias set covers both
            original and canonical forms. Useful for debugging the
            tagger in isolation and for callers that want to display
            or index a name without the canonical substitutions.

    Returns:
        A set of tagged `Name` objects, de-duplicated by normalised
        form. Empty if every input normalised to an empty string.
    """
    tag_dict: dict[NamePartTag, list[str]] | None
    if part_tags is None:
        tag_dict = None
    else:
        tag_dict = {tag: list(values) for tag, values in part_tags.items()}
    return _analyze_names(
        type_tag,
        list(names),
        tag_dict,
        infer_initials=infer_initials,
        symbols=symbols,
        phonetics=phonetics,
        numerics=numerics,
        consolidate=consolidate,
        rewrite=rewrite,
    )

`rigour.names.compare`

Residue-distance scoring for two NamePart lists.

Reach for compare_parts when a name matcher has already peeled off the parts it can explain by other means — symbol pairing, alias tagging, identifier hits — and is left with a residue that needs a fuzzy-match verdict (typo, transliteration drift, surface-form variants of the same token).

The function returns one Alignment per cluster of aligned parts (paired or solo). Every input part appears in exactly one alignment, so a caller can sum / weight / threshold the result without losing track of which inputs got accounted for. Returned alignments carry symbol = None (residue distance is non-symbolic by definition).

The cost model penalises digit mismatches more than letter mismatches, treats visually / phonetically confusable pairs (0/o, 1/l, c/k, …) as cheap edits, and charges almost nothing for token merge / split. A length-dependent budget caps the per-side similarity at zero once the total cost exceeds what's plausible for typo noise — the matcher refuses to fuzzy-match when the edit-density crosses into distinct-entity territory.

Pass a CompareConfig to override the cost / budget / clustering scalars, e.g. budget_tolerance to shift between strict (payment-screening) and permissive (KYC-onboarding) profiles, or cost_* for sweep-based calibration. The default is recall-protective and matches industry-typical tuning.

`Alignment`

One unit of name-comparison evidence.

Three modes:

Symbol-paired edge — symbol is Some and both sides carry the same Symbol. Returned by pair_symbols. Default score is 1.0; consumers may override with a category default (e.g. SYM_SCORES[NAME] = 0.9).
Residue cluster — symbol is None, both sides non-empty. Returned by compare_parts for parts that aligned by edit distance.
Extra — symbol is None, exactly one side is empty. Represents a part that found no counterpart on the other side; the matcher applies a side-specific weight.

qps / rps / symbol / qstr / rstr are immutable post-construction. score and weight are mutable to support the matcher's policy passes (literal-equality rescue, extras-weight override, family-name boost). Both stored as Py<PyFloat> so Python-side reads are an INCREF rather than a fresh allocation per access.

__hash__ and __eq__ key on (symbol, qps, rps) — NamePart already hashes by (index, form) so position is preserved. score and weight are not part of identity.
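
The identity rule and the three modes can be modelled in pure Python for illustration. `AlignmentSketch` below is a hypothetical stand-in: plain strings replace `Symbol` and `NamePart`, and nothing here reflects how the Rust-backed class is actually built.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(unsafe_hash=True)
class AlignmentSketch:
    symbol: Optional[str]
    qps: Tuple[str, ...]
    rps: Tuple[str, ...]
    # compare=False keeps these out of both __eq__ and __hash__.
    score: float = field(default=0.0, compare=False)
    weight: float = field(default=1.0, compare=False)

    @property
    def mode(self) -> str:
        if self.symbol is not None:
            return "symbol"   # symbol-paired edge
        if self.qps and self.rps:
            return "residue"  # both sides non-empty
        return "extra"        # exactly one side carries parts


a = AlignmentSketch(None, ("jon",), ("john",), score=0.8)
b = AlignmentSketch(None, ("jon",), ("john",), score=0.2, weight=0.5)
a.score = 0.9  # policy passes may rewrite score without changing identity
print(a == b, len({a, b}), a.mode)
```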

`__module__ = 'rigour._core'` `class-attribute`

`qps` `property`

Query-side parts covered by this alignment.

`qstr` `property`

" ".join(p.comparable for p in qps), cached.

`rps` `property`

Result-side parts covered by this alignment.

`rstr` `property`

" ".join(p.comparable for p in rps), cached.

`score` `property`

Similarity in [0, 1]. For symbol-paired edges, defaults to 1.0; consumers override with a category default. For residue clusters, the per-cluster product. For extras, 0.0.

`symbol` `property`

Shared Symbol for symbol-paired edges; None for residue clusters and extras.

`weight` `property`

Aggregation weight in the matcher's weighted average. Defaults to 1.0; consumers override per category (SYM_WEIGHTS), for extras (nm_extra_*_name), for family-name boost (nm_family_name_weight), and for stopword down-weight.
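
How `score` and `weight` could combine in a weighted average is sketched below. The matcher's real aggregation is not shown in this reference, so `weighted_average` is a hypothetical helper that only illustrates the arithmetic, with each alignment reduced to a `(score, weight)` pair.

```python
from typing import Iterable, Tuple


def weighted_average(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average of scores, each counted in proportion to its weight."""
    items = list(pairs)
    total_weight = sum(weight for _, weight in items)
    if total_weight <= 0.0:
        return 0.0
    return sum(score * weight for score, weight in items) / total_weight


# One perfect symbol edge plus one unmatched extra (score 0.0).
print(weighted_average([(1.0, 1.0), (0.0, 1.0)]))
# Down-weighting the extra to 0.0 removes its penalty entirely.
print(weighted_average([(1.0, 1.0), (0.0, 0.0)]))
```

This shows why overriding the weight of an extra changes the final result even though its score stays at 0.0.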

`__eq__(value)` `method descriptor`

Return self==value.

`__ge__(value)` `method descriptor`

Return self>=value.

`__gt__(value)` `method descriptor`

Return self>value.

`__hash__()` `method descriptor`

Return hash(self).

`__le__(value)` `method descriptor`

Return self<=value.

`__lt__(value)` `method descriptor`

Return self<value.

`__ne__(value)` `method descriptor`

Return self!=value.

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`CompareConfig`

Tunable cost / budget / clustering scalars for py_compare_parts.

Frozen by design: a sweep iteration constructs a fresh CompareConfig, the matcher caches one per request. Mutability would buy nothing (the values are read once per name pair) and would cost a runtime borrow check on each Rust-side access.

The default values reproduce the constants this struct replaced; compare_parts(qry, res) with no config argument is exactly equivalent to the pre-CompareConfig call.

`__module__ = 'rigour._core'` `class-attribute`

`budget_log_base` `property`

Logarithm base in the per-side cost-budget formula budget = log_b(max(len - budget_short_floor, 1)) * budget_tolerance, where b = budget_log_base. The base controls how aggressively the budget grows with token length: a smaller base means faster growth and so a more permissive budget on long names.

`budget_short_floor` `property`

Short-token floor: tokens shorter than this contribute zero to the budget, so any non-zero edit fails the cap. This is the fail-closed property: the matcher refuses to fuzzy-match on 1-2 character tokens (vessel hull suffixes, isolated initials, 2-char Chinese given names) where typo / distinct-entity signal is too weak.

`budget_tolerance` `property`

Multiplier on the per-side cost budget. Lower is stricter (less edit tolerated before a cluster scores zero); higher is more permissive. Callers tune this per scenario — KYC at onboarding runs more permissive than payment screening.
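
The three budget scalars combine as in the formula quoted under budget_log_base. The sketch below evaluates that formula directly; the numeric values passed in are made up for illustration and are not the library defaults, and `cost_budget` is a hypothetical helper.

```python
import math


def cost_budget(length: int, *, log_base: float, short_floor: int, tolerance: float) -> float:
    # log_base(max(len - short_floor, 1)) * tolerance
    effective = max(length - short_floor, 1)
    return math.log(effective, log_base) * tolerance


# A short token gets a zero budget, so any edit exceeds it (fail-closed).
print(cost_budget(2, log_base=2.0, short_floor=2, tolerance=1.0))
# A longer token earns a budget that grows logarithmically.
print(cost_budget(10, log_base=2.0, short_floor=2, tolerance=1.0))
# A smaller base grows faster; a larger tolerance scales the budget linearly.
print(cost_budget(10, log_base=1.5, short_floor=2, tolerance=1.0))
print(cost_budget(10, log_base=2.0, short_floor=2, tolerance=2.0))
```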

`cluster_overlap_min` `property`

Overlap fraction (matched chars / shorter-side length) above which two parts pair into a cluster. A pair below this threshold surfaces as solo records, because the matched-character evidence is too thin to claim the parts are talking about the same token. The 0.51 default (i.e. "more than half") is the lowest value where a majority of the shorter token agrees.
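
The overlap ratio itself is simple to compute once matched characters are known. In the sketch below, `difflib.SequenceMatcher` is used as a stand-in to count matched characters; how the library derives that count from its own alignment is not shown here, so treat `overlap_fraction` as a hypothetical approximation.

```python
from difflib import SequenceMatcher


def overlap_fraction(a: str, b: str) -> float:
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    # Sum the sizes of all matching blocks (the final sentinel has size 0).
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / shorter


THRESHOLD = 0.51  # the documented default for cluster_overlap_min

for left, right in [("smith", "smyth"), ("smith", "jones")]:
    frac = overlap_fraction(left, right)
    print(left, right, frac, "pair" if frac > THRESHOLD else "solo")
```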

`cost_confusable` `property`

Cost of a substitution between a confusable pair from resources/names/compare.yml (0/o, 1/l, …). This covers OCR, transliteration, and homoglyph noise, where the writer was probably aiming at the same character.

`cost_digit` `property`

Edit involving a digit on either side. Digits identify specific things — vintage years, vessel hull numbers, fund vintages — so a digit mismatch is evidence of a different entity, not a typo.

`cost_sep_drop` `property`

Token boundary lost or gained on one side. Token merge/split (vanderbilt ↔ van der bilt) is a common surface-form variant of the same name; charging it almost nothing keeps the alignment from refusing to bridge whitespace artifacts.

`__new__(*args, **kwargs)` `builtin`

Create and return a new object. See help(type) for accurate signature.

`__repr__()` `method descriptor`

Return repr(self).

`compare_parts(qry, res, config=None)` `builtin`

Score the alignment of two NamePart lists.

Callers should hand over the residue, i.e. the parts that earlier stages (symbol pairing, alias tagging, identifier matching) couldn't explain by themselves, already canonicalised into positional order (tag_sort for ORG/ENT, align_person_name_order for PER). The function returns one Alignment per cluster, paired or solo; every input part appears exactly once across the output. Returned alignments carry symbol = None (residue distance is non-symbolic by definition).

config overrides the cost / budget / clustering scalars. Pass None (the default) to use the process-wide defaults, which match industry-typical recall-protective tuning. Sweep scripts build a fresh CompareConfig per iteration; matchers cache one per request.
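The "every input part appears exactly once" contract can be checked mechanically. The sketch below does not call the library; `Part` and `Cluster` are hypothetical stand-ins for NamePart and Alignment that carry only the fields needed for the check, and `covers_exactly_once` is an illustrative helper rather than part of rigour.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Part:
    """Stand-in for NamePart: identity is (index, form)."""
    index: int
    form: str


@dataclass(frozen=True)
class Cluster:
    """Stand-in for an Alignment: query-side and result-side parts."""
    qps: Tuple[Part, ...]
    rps: Tuple[Part, ...]


def covers_exactly_once(qry: Sequence[Part], res: Sequence[Part],
                        clusters: Sequence[Cluster]) -> bool:
    """True when each input part occurs in exactly one cluster on its own side."""
    seen_q = Counter(p for c in clusters for p in c.qps)
    seen_r = Counter(p for c in clusters for p in c.rps)
    return seen_q == Counter(qry) and seen_r == Counter(res)


qry = [Part(0, "john"), Part(1, "smith")]
res = [Part(0, "jon"), Part(1, "smyth"), Part(2, "jr")]
clusters = [
    Cluster((qry[0],), (res[0],)),  # paired cluster
    Cluster((qry[1],), (res[1],)),  # paired cluster
    Cluster((), (res[2],)),         # solo: no counterpart on the query side
]
complete = covers_exactly_once(qry, res, clusters)
incomplete = covers_exactly_once(qry, res, clusters[:2])  # "jr" is unaccounted for
```

A check of this shape is useful in tests around a matcher, since a part that is dropped or counted twice would silently skew any score aggregated over the alignments.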

`rigour.names`

`Alignment`

`__module__ = 'rigour._core'` `class-attribute`

`qps` `property`

`qstr` `property`

`rps` `property`

`rstr` `property`

`score` `property`

`symbol` `property`

`weight` `property`

`__eq__(value)` `method descriptor`

`__ge__(value)` `method descriptor`

`__gt__(value)` `method descriptor`

`__hash__()` `method descriptor`

`__le__(value)` `method descriptor`

`__lt__(value)` `method descriptor`

`__ne__(value)` `method descriptor`

`__new__(*args, **kwargs)` `builtin`

`__repr__()` `method descriptor`

`CompareConfig`

`__module__ = 'rigour._core'` `class-attribute`

`budget_log_base` `property`

`budget_short_floor` `property`

`budget_tolerance` `property`

`cluster_overlap_min` `property`

`cost_confusable` `property`

`cost_digit` `property`

`cost_sep_drop` `property`

`__new__(*args, **kwargs)` `builtin`

`__repr__()` `method descriptor`

`Name`

__doc__ = "A personal, organisational, or object name.\n\nEquality and hashing are over `form`. A `Name`'s `tag` can change\nand `spans` grows without affecting either." `class-attribute`

`__module__ = 'rigour._core'` `class-attribute`

`comparable` `property`

`form` `property`

`norm_form` `property`

`original` `property`

`parts` `property`

`spans` `property`

`symbols` `property`

`tag` `property`

`__eq__(value)` `method descriptor`

`__ge__(value)` `method descriptor`

`__gt__(value)` `method descriptor`

`__hash__()` `method descriptor`

`__le__(value)` `method descriptor`

`__lt__(value)` `method descriptor`

`__ne__(value)` `method descriptor`

`__new__(*args, **kwargs)` `builtin`

`__repr__()` `method descriptor`

`__str__()` `method descriptor`

`apply_part(part, symbol)` `method descriptor`

`apply_phrase(phrase, symbol)` `method descriptor`

`consolidate_names(names)` `builtin`

`contains(other)` `method descriptor`

`tag_text(text, tag, max_matches=1)` `method descriptor`

`NamePart`

__doc__ = 'A single tagged component of a [`crate::names::name::Name`].\n\nEquality and hashing are over `(index, form)` — the immutable\nidentity of the part. `tag` can be re-written after construction\nwithout invalidating either.' `class-attribute`

`__module__ = 'rigour._core'` `class-attribute`

`ascii` `property`

`comparable` `property`

`form` `property`

`index` `property`

`integer` `property`

`latinize` `property`

`metaphone` `property`

`numeric` `property`

`tag` `property`

`__eq__(value)` `method descriptor`

`__ge__(value)` `method descriptor`

`__gt__(value)` `method descriptor`

`__hash__()` `method descriptor`

`__le__(value)` `method descriptor`

`__len__()` `method descriptor`

`__lt__(value)` `method descriptor`

`__ne__(value)` `method descriptor`

`__new__(*args, **kwargs)` `builtin`

`__repr__()` `method descriptor`

`tag_sort(parts)` `builtin`

`NamePartTag`