Names

rigour.names

Name handling utilities for person and organisation names. This module contains a large (and growing) set of tools for handling names. In general, there are three types of names: people, organizations, and objects. Different normalization may be required for each of these types, including prefix removal for person names (e.g. "Mr." or "Ms.") and type normalization for organization names (e.g. "Incorporated" -> "Inc" or "Limited" -> "Ltd").

The Name class is meant to provide a structure for a name, including its original form, normalized form, metadata on the type of thing described by the name, and the language of the name. The NamePart class is used to represent individual parts of a name, such as the first name, middle name, and last name.

Alignment

One unit of name-comparison evidence.

Three modes:

  • Symbol-paired edge: symbol is Some and both sides carry the same Symbol. Returned by pair_symbols. Default score is 1.0; consumers may override with a category default (e.g. SYM_SCORES[NAME] = 0.9).
  • Residue cluster: symbol is None, both sides non-empty. Returned by compare_parts for parts that aligned by edit distance.
  • Extra: symbol is None, exactly one side is empty. Represents a part that found no counterpart on the other side; the matcher applies a side-specific weight.

qps / rps / symbol / qstr / rstr are immutable post-construction. score and weight are mutable to support the matcher's policy passes (literal-equality rescue, extras-weight override, family-name boost). Both are stored as Py<PyFloat>, so Python-side reads are an INCREF rather than a fresh allocation per access.

__hash__ and __eq__ key on (symbol, qps, rps); NamePart already hashes by (index, form), so position is preserved. score and weight are not part of identity.
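
The three modes and the identity rule can be modelled in a few lines of plain Python. This is only an illustrative stand-in, not the rigour class: AlignmentSketch and its mode() helper are hypothetical names, and parts and symbols are reduced to plain strings.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)
class AlignmentSketch:
    """Hypothetical stand-in for Alignment; parts are plain strings here."""

    qps: Tuple[str, ...]
    rps: Tuple[str, ...]
    symbol: Optional[str] = None
    score: float = 1.0
    weight: float = 1.0

    def mode(self) -> str:
        # Classify the record using the three rules listed above.
        if self.symbol is not None:
            return "symbol-paired edge"
        if self.qps and self.rps:
            return "residue cluster"
        if self.qps or self.rps:
            return "extra"
        raise ValueError("an alignment needs at least one non-empty side")

    def _key(self):
        # Identity ignores the mutable score and weight.
        return (self.symbol, self.qps, self.rps)

    def __eq__(self, other):
        if not isinstance(other, AlignmentSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


edge = AlignmentSketch(("ltd",), ("limited",), symbol="ORG_CLASS:EXAMPLE")
cluster = AlignmentSketch(("jon",), ("john",), score=0.8)
extra = AlignmentSketch(("k",), (), score=0.0)
```

Because score and weight stay out of the key, a policy pass can rescore an alignment without disturbing any set or dict that already holds it.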

__module__ = 'rigour._core' class-attribute

qps property

Query-side parts covered by this alignment.

qstr property

" ".join(p.comparable for p in qps), cached.

rps property

Result-side parts covered by this alignment.

rstr property

" ".join(p.comparable for p in rps), cached.

score property

Similarity in [0, 1]. For symbol-paired edges, defaults to 1.0; consumers override with a category default. For residue clusters, the per-cluster product. For extras, 0.0.

symbol property

Shared Symbol for symbol-paired edges; None for residue clusters and extras.

weight property

Aggregation weight in the matcher's weighted average. Defaults to 1.0; consumers override per category (SYM_WEIGHTS), for extras (nm_extra_*_name), for family-name boost (nm_family_name_weight), and for stopword down-weight.
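
As a rough illustration of how score and weight combine, the sketch below computes a plain weighted average over (score, weight) pairs. The function name weighted_score and the sample numbers are hypothetical; the matcher's real aggregation may apply further policy steps that are not described here.

```python
from typing import Iterable, Tuple


def weighted_score(pairs: Iterable[Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs; 0.0 if total weight is 0."""
    pairs = list(pairs)
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in pairs) / total_weight


# Two matched parts plus one unmatched extra carrying a reduced weight
# (illustrative numbers, not library defaults).
evidence = [(1.0, 1.0), (0.9, 1.0), (0.0, 0.5)]
result = weighted_score(evidence)
```

Lowering the weight of an extra reduces how much its zero score drags the average down, which is the effect the per-category and per-side overrides are tuning.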

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__le__(value) method descriptor

Return self<=value.

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__new__(*args, **kwargs) builtin

Create and return a new object. See help(type) for accurate signature.

__repr__() method descriptor

Return repr(self).

CompareConfig

Tunable cost / budget / clustering scalars for [py_compare_parts].

Frozen by design: a sweep iteration constructs a fresh CompareConfig, and the matcher caches one per request. Mutability would buy nothing (the values are read once per name pair) and would cost a runtime borrow check on each Rust-side access.

The default values reproduce the constants this struct replaced; compare_parts(qry, res) with no config argument is exactly equivalent to the pre-CompareConfig call.

__module__ = 'rigour._core' class-attribute

budget_log_base property

Logarithm base b in the per-side cost-budget formula log_b(max(len - budget_short_floor, 1)) * budget_tolerance, where b = budget_log_base. The base controls how aggressively the budget grows with token length: a smaller base means faster growth, which is more permissive on long names.

budget_short_floor property

Short-token floor: tokens shorter than this contribute zero to the budget, so any non-zero edit fails the cap. This is the fail-closed property: the matcher refuses to fuzzy-match on 1-2 character tokens (vessel hull suffixes, isolated initials, 2-char Chinese given names) where typo / distinct-entity signal is too weak.

budget_tolerance property

Multiplier on the per-side cost budget. Lower is stricter (less edit tolerated before a cluster scores zero); higher is more permissive. Callers tune this per scenario — KYC at onboarding runs more permissive than payment screening.
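
The three budget parameters interact through the formula quoted under budget_log_base. A minimal sketch of that formula follows; side_budget is a hypothetical helper name, and the parameter values are chosen for illustration only and are not the library defaults.

```python
import math


def side_budget(length: int, log_base: float, short_floor: int, tolerance: float) -> float:
    """Per-side cost budget: log_base(max(length - short_floor, 1)) * tolerance."""
    return math.log(max(length - short_floor, 1), log_base) * tolerance


# Illustrative values only.
short = side_budget(2, log_base=2.0, short_floor=2, tolerance=1.0)
long_strict = side_budget(10, log_base=2.0, short_floor=2, tolerance=1.0)
long_loose = side_budget(10, log_base=2.0, short_floor=2, tolerance=2.0)
smaller_base = side_budget(10, log_base=1.5, short_floor=2, tolerance=1.0)
```

With these inputs the short token gets a budget of zero (the logarithm of 1), doubling the tolerance doubles the budget for the long token, and the smaller base yields a larger budget than the base-2 case.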

cluster_overlap_min property

Overlap fraction (matched chars / shorter-side length) above which two parts pair into a cluster. A pair below this threshold surfaces as solo records, because the matched-character evidence is too thin to claim the parts are talking about the same token. The 0.51 default (i.e. "more than half") is the lowest value where a majority of the shorter token agrees.
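
The overlap test can be sketched as a single ratio check. The helper name pairs_into_cluster is hypothetical, and treating a ratio exactly equal to the threshold as a pass is an assumption; the text only says "above".

```python
def pairs_into_cluster(matched_chars: int, q_len: int, r_len: int,
                       overlap_min: float = 0.51) -> bool:
    """True when matched chars cover enough of the shorter token."""
    shorter = min(q_len, r_len)
    if shorter == 0:
        return False  # nothing to overlap with
    # Assumption: a ratio equal to the threshold counts as a pass.
    return matched_chars / shorter >= overlap_min


majority = pairs_into_cluster(3, q_len=4, r_len=7)    # 3 / 4 = 0.75
exactly_half = pairs_into_cluster(2, q_len=4, r_len=7)  # 2 / 4 = 0.50
```

The second call shows why 0.51 rather than 0.5 is the default: an exact half of the shorter token is not a majority, so the pair stays unclustered.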

cost_confusable property

Substitute between a confusable pair from resources/names/compare.yml (0/o, 1/l, …). OCR / transliteration / homoglyph noise — the writer was probably aiming at the same character.

cost_digit property

Edit involving a digit on either side. Digits identify specific things — vintage years, vessel hull numbers, fund vintages — so a digit mismatch is evidence of a different entity, not a typo.

cost_sep_drop property

Token boundary lost or gained on one side. Token merge/split ("vanderbilt" vs. "van der bilt") is a common surface-form variant of the same name; charging it almost nothing keeps the alignment from refusing to bridge whitespace artifacts.

__new__(*args, **kwargs) builtin

Create and return a new object. See help(type) for accurate signature.

__repr__() method descriptor

Return repr(self).

Name

A personal, organisational, or object name.

Equality and hashing are over form. A Name's tag can change and its spans can grow without affecting either.

__module__ = 'rigour._core' class-attribute

comparable property

Space-joined part.comparable across the parts. Precomputed.

form property

Normalised form. Defaults to casefold(original) if not supplied at construction.

norm_form property

Space-joined part.form across the parts. Precomputed.

original property

Input string, verbatim.

parts property

Tokens of form, one [NamePart] per token. Exposed as a tuple so it's hashable — downstream code keys on (span.parts, span.symbol.category) etc.

spans property

Tagger output — grows over the name's lifetime via apply_phrase / apply_part.

symbols property

Aggregate view of every symbol the tagger has attached to this name. Useful when you want the symbol set regardless of which parts carry them (e.g. indexing the name's semantic annotations into a flat field).

tag property

What kind of thing the name describes. Mutable: infer_part_tags may upgrade ENT to ORG after tagging.

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__le__(value) method descriptor

Return self<=value.

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__new__(*args, **kwargs) builtin

Create and return a new object. See help(type) for accurate signature.

__repr__() method descriptor

Return repr(self).

__str__() method descriptor

Return str(self).

apply_part(part, symbol) method descriptor

Record that a single [NamePart] carries symbol.

The single-part variant of [Name::apply_phrase]. Used for symbols that inherently apply to one token: INITIAL on a single-character latin part, NUMERIC inferred from a part like "123456789" that the ordinal tagger didn't cover.

apply_phrase(phrase, symbol) method descriptor

Record that phrase in this name carries symbol.

The tagger's output path: when the AC automaton reports a recognised phrase (e.g. "limited liability company" → ORG_CLASS:LLC), the match is attached as a [Span] so downstream matching and inference can see which tokens the symbol covers. Every non-overlapping occurrence of phrase in the name gets its own Span.

Idempotent on (phrase, symbol): if any existing Span on this name already carries the same symbol over the same joined-form sequence, the call is a no-op. This keeps the invariant "no duplicate (phrase, symbol) Spans" intact even when more than one tagger fires on the same Name (e.g. the org and person taggers both running on an ENT input).
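
The two properties described above, one span per non-overlapping occurrence and no duplicates on repeat calls, can be sketched with token lists and a set. Everything here is a simplified assumption: find_occurrences and apply_phrase_sketch are hypothetical helpers, spans are modelled as (start, end, symbol) tuples, and the symbol label is a placeholder.

```python
from typing import List, Set, Tuple


def find_occurrences(parts: List[str], phrase: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for non-overlapping matches of phrase."""
    hits = []
    i = 0
    n = len(phrase)
    while n and i + n <= len(parts):
        if parts[i:i + n] == phrase:
            hits.append((i, i + n))
            i += n  # skip past the match so occurrences never overlap
        else:
            i += 1
    return hits


def apply_phrase_sketch(parts: List[str], spans: Set[Tuple[int, int, str]],
                        phrase: str, symbol: str) -> None:
    """Attach symbol to every occurrence of phrase; repeat calls add nothing."""
    for start, end in find_occurrences(parts, phrase.split()):
        spans.add((start, end, symbol))  # set membership makes this idempotent


parts = "acme limited liability company".split()
spans: Set[Tuple[int, int, str]] = set()
apply_phrase_sketch(parts, spans, "limited liability company", "ORG_CLASS:EXAMPLE")
apply_phrase_sketch(parts, spans, "limited liability company", "ORG_CLASS:EXAMPLE")
```

After both calls the set still holds a single span covering tokens 1 to 3, which mirrors the "no duplicate (phrase, symbol) Spans" invariant.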

consolidate_names(names) builtin

Drop short names that are contained in longer names.

Useful when building a matcher to prevent a scenario where a short version of a name ("John Smith") is matched against a query "John K Smith" — where the longer candidate version would have correctly disqualified the match ("John K Smith" != "John R Smith"). Keeping only the longer form forces the matcher to reckon with the full evidence.

Containment uses [Name::contains]; see there for the PER-aware subset rule. Accepts any Python iterable of Name; returns a new set.

contains(other) method descriptor

True iff this name structurally contains other.

Used by matcher pipelines to detect when one name's evidence is a subset of another's — e.g. "John Smith" is contained in "John K Smith", and the longer form supersedes the shorter when consolidating candidate names before scoring (see [Name::consolidate_names]). Also backs middle-initial matching: "John Smith" contains "J. Smith" when the J carries an INITIAL symbol that self also has.

Rule: for PER names, every part of other must have a (not-necessarily-adjacent) comparable-equal counterpart in self. For non-PER names, or when the PER rule doesn't find a full subset, falls back to substring containment of norm_form. Returns False when self.tag == UNK or when the two names are equal.
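
A simplified model of the containment rule, and of how consolidation can build on it, is sketched below. It is an approximation under stated assumptions: names are modelled as (tag, parts) tuples of plain strings, contains_sketch and consolidate_sketch are hypothetical names, part comparison is plain string equality, and the INITIAL-symbol handling mentioned above is left out.

```python
from typing import Iterable, Set, Tuple

# A name is modelled as (tag, parts); tags mirror NameTypeTag names as strings.
NameSketch = Tuple[str, Tuple[str, ...]]


def contains_sketch(self_name: NameSketch, other: NameSketch) -> bool:
    tag, parts = self_name
    _, other_parts = other
    if tag == "UNK" or parts == other_parts:
        return False
    if tag == "PER" and all(p in parts for p in other_parts):
        # Every part of other has a counterpart; adjacency is not required.
        return True
    # Fallback: substring containment of the space-joined forms.
    return " ".join(other_parts) in " ".join(parts)


def consolidate_sketch(names: Iterable[NameSketch]) -> Set[NameSketch]:
    """Keep only names that no other name in the input contains."""
    pool = set(names)
    return {n for n in pool if not any(contains_sketch(m, n) for m in pool)}


john = ("PER", ("john", "smith"))
john_k = ("PER", ("john", "k", "smith"))
kept = consolidate_sketch([john, john_k])
```

Here "john smith" is dropped because "john k smith" contains it, so a later comparison against "john r smith" has to account for the middle initial.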

tag_text(text, tag, max_matches=1) method descriptor

Tag the parts that spell out text with the given tag.

Used when external metadata tells the caller the structural role of a subset of the name's tokens. For example, an FTM firstName property of "Jean Claude" on a name "Jean Claude Juncker" marks both the jean and claude parts as GIVEN; a lastName of "Juncker" then marks the remaining part as FAMILY.

Walks self.parts looking for a contiguous (adjacency-insensitive) match of the tokenised text. On a hit, each matched part's tag is set to tag; parts that already carry a tag that conflicts under [NamePartTag::can_match] demote to AMBIGUOUS instead. Stops after max_matches successful matches.

NamePart

A single tagged component of a Name.

Equality and hashing are over (index, form) — the immutable identity of the part. tag can be re-written after construction without invalidating either.

__module__ = 'rigour._core' class-attribute

ascii property

ASCII-ified form of form for admitted-script parts; None when the part is outside the admitted scripts or reduces to empty after stripping non-alphanumerics.

comparable property

Best-effort matchable form: integer string for numerics, form for non-latinize parts, ascii otherwise.
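
The selection order can be sketched as a small function over the other properties. comparable_sketch is a hypothetical helper; falling back to form when ascii is None, and normalising numerics through int(), are assumptions of this sketch rather than documented behaviour.

```python
from typing import Optional


def comparable_sketch(form: str, numeric: bool, latinize: bool,
                      ascii_form: Optional[str]) -> str:
    """Pick the matchable form in the order described above."""
    if numeric:
        # Assumption: the "integer string" is the decimal form of the parsed value.
        return str(int(form))
    if not latinize or ascii_form is None:
        # Assumption: fall back to form when no ASCII form is available.
        return form
    return ascii_form


number = comparable_sketch("007", numeric=True, latinize=True, ascii_form="007")
han = comparable_sketch("王", numeric=False, latinize=False, ascii_form=None)
latin = comparable_sketch("müller", numeric=False, latinize=True, ascii_form="muller")
```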

form property

Token text, as tokenised from the parent name's form.

index property

Position of this part within the parent name's parts list.

integer property

Parsed integer value for numeric parts, or None when the part isn't numeric or doesn't fit an i64.

latinize property

True if form is in an admitted-script set (Latin, Cyrillic, Greek, Armenian, Georgian, Hangul) and thus can be meaningfully ASCII-ified.

metaphone property

Metaphone phonetic key, or None when phonetics were disabled at construction or the part doesn't qualify (non-latinize, numeric, or shorter than three characters).

numeric property

True if form is entirely numeric characters.

tag property

Structural role of this part. Set by the tagging pipeline; UNSET at construction.

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__le__(value) method descriptor

Return self<=value.

__len__() method descriptor

Return len(self).

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__new__(*args, **kwargs) builtin

Create and return a new object. See help(type) for accurate signature.

__repr__() method descriptor

Return repr(self).

tag_sort(parts) builtin

Sort name parts into canonical display order.

Used when rendering a name back out for humans: honorifics come first, then given names, middle, family, suffixes, legal forms, and stopwords — independent of the input word order. A tokeniser might hand the parts over as "Guttenberg zu Karl-Theodor" (order from the source data); tag_sort restores "Karl-Theodor zu Guttenberg" shape once the parts have been tagged. Sort is stable across parts with the same tag; see [crate::names::tag::NAME_TAGS_ORDER] for the full ordering.
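
A stable sort keyed on tag rank is enough to reproduce the behaviour described. The sketch below uses a hypothetical ORDER list that follows the prose above; the real NAME_TAGS_ORDER may contain more tags or arrange them differently, and parts are modelled as (form, tag) tuples.

```python
from typing import List, Tuple

# Hypothetical ordering taken from the prose; not the library's constant.
ORDER = ["HONORIFIC", "GIVEN", "MIDDLE", "FAMILY", "SUFFIX", "LEGAL", "STOP"]
RANK = {tag: i for i, tag in enumerate(ORDER)}


def tag_sort_sketch(parts: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (form, tag) pairs by tag rank; sorted() keeps ties in input order."""
    # Assumption: tags missing from ORDER sort last.
    return sorted(parts, key=lambda p: RANK.get(p[1], len(ORDER)))


parts = [("smith", "FAMILY"), ("dr", "HONORIFIC"), ("ann", "GIVEN"), ("mary", "GIVEN")]
ordered = [form for form, _ in tag_sort_sketch(parts)]
```

The two GIVEN parts keep their relative input order ("ann" before "mary") because Python's sorted() is stable, which matches the stability guarantee stated above.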

NamePartTag

The structural role of a part within a name. A newly-constructed NamePart starts as UNSET; the tagging pipeline promotes it based on external hints (firstName, lastName, …) or pattern matches (numeric, stopword, legal form).

AMBIGUOUS = NamePartTag.AMBIGUOUS class-attribute

FAMILY = NamePartTag.FAMILY class-attribute

GIVEN = NamePartTag.GIVEN class-attribute

HONORIFIC = NamePartTag.HONORIFIC class-attribute

LEGAL = NamePartTag.LEGAL class-attribute

MATRONYMIC = NamePartTag.MATRONYMIC class-attribute

MIDDLE = NamePartTag.MIDDLE class-attribute

NICK = NamePartTag.NICK class-attribute

NUM = NamePartTag.NUM class-attribute

PATRONYMIC = NamePartTag.PATRONYMIC class-attribute

STOP = NamePartTag.STOP class-attribute

SUFFIX = NamePartTag.SUFFIX class-attribute

TITLE = NamePartTag.TITLE class-attribute

TRIBAL = NamePartTag.TRIBAL class-attribute

UNSET = NamePartTag.UNSET class-attribute

__module__ = 'rigour._core' class-attribute

value property

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__int__() method descriptor

int(self)

__le__(value) method descriptor

Return self<=value.

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__repr__() method descriptor

Return repr(self).

NameTypeTag

What kind of thing a name describes. Drives which pipeline passes apply when a Name is analysed.

ENT = NameTypeTag.ENT class-attribute

OBJ = NameTypeTag.OBJ class-attribute

ORG = NameTypeTag.ORG class-attribute

PER = NameTypeTag.PER class-attribute

UNK = NameTypeTag.UNK class-attribute

__module__ = 'rigour._core' class-attribute

value property

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__int__() method descriptor

int(self)

__le__(value) method descriptor

Return self<=value.

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__repr__() method descriptor

Return repr(self).

Span

A contiguous group of NameParts annotated with a Symbol; this is the tagger's output unit.

__module__ = 'rigour._core' class-attribute

comparable property

Space-joined part.comparable over the covered parts, for use in matcher-side substring checks.

parts property

The [NamePart]s covered by this span. Same Py<NamePart> references that live in the parent [crate::names::name::Name]'s .parts, so span.parts[0] is name.parts[i] is True from Python. Exposed as a tuple — hashable, so downstream code can key on (span.parts, span.symbol.category) when deduplicating pairings.

symbol property

The symbol this span carries.

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__le__(value) method descriptor

Return self<=value.

__len__() method descriptor

Return len(self).

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__new__(*args, **kwargs) builtin

Create and return a new object. See help(type) for accurate signature.

__repr__() method descriptor

Return repr(self).

Symbol

A semantic interpretation applied to one or more parts of a name.

Carries a [SymbolCategory] and an id. Tagger pipelines emit Symbols during name analysis; matchers compare them between names as a coarse compatibility signal, and indexers flatten them into searchable fields. Equality and hashing are structural over (category, id).

Ids are always str — integer-sourced ids (Wikidata QIDs, ordinals, initial codepoints) are decimal-stringified at construction. Distinct Symbols with equal ids share one [Arc<str>] heap allocation via [intern].
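
The identity and id-normalisation rules can be illustrated with a frozen dataclass. SymbolSketch and its make() constructor are hypothetical names, categories are plain strings here, and the interning of id strings is not modelled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SymbolSketch:
    """Hypothetical stand-in: equality and hashing over (category, id)."""

    category: str
    id: str

    @classmethod
    def make(cls, category: str, raw_id: Union[str, int]) -> "SymbolSketch":
        # Integer ids are stored in their decimal string form.
        return cls(category, str(raw_id))


from_int = SymbolSketch.make("NUMERIC", 5)
from_str = SymbolSketch.make("NUMERIC", "5")
other_category = SymbolSketch.make("INITIAL", "5")
```

Because the id is normalised before storage, from_int and from_str compare equal and collapse to one entry in a set, while other_category stays distinct.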

__module__ = 'rigour._core' class-attribute

category property

id property

The interned id string. Always str on the Python side — ids originally passed as int return as their decimal form.

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__init__(category, id)

id is decimal-stringified if passed as int.

__le__(value) method descriptor

Return self<=value.

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__new__(*args, **kwargs) builtin

Create and return a new object. See help(type) for accurate signature.

__repr__() method descriptor

Return repr(self).

__str__() method descriptor

Return str(self).
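
The identity rules described for Symbol (equality and hashing over (category, id), with integer ids stringified at construction) can be illustrated with a small pure-Python stand-in. This is a sketch under assumptions, not the Rust-backed class: SymbolSketch and Category are hypothetical names, and the interning of id strings is not modelled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Category(Enum):
    """Hypothetical stand-in for SymbolCategory (two members only)."""

    NAME = "NAME"
    NUMERIC = "NUMERIC"


@dataclass(frozen=True)
class SymbolSketch:
    """Hypothetical model of a symbol: identity is (category, id)."""

    category: Category
    id: Union[str, int]

    def __post_init__(self) -> None:
        # Integer ids are stored in their decimal string form, so
        # SymbolSketch(c, 5) and SymbolSketch(c, "5") are the same value.
        object.__setattr__(self, "id", str(self.id))


a = SymbolSketch(Category.NUMERIC, 5)
b = SymbolSketch(Category.NUMERIC, "5")
print(a == b, a.id)  # True 5
```

Because the dataclass is frozen and compares field by field, two instances with the same category and id also hash alike and collapse to one entry in a set.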

SymbolCategory

The kind of semantic annotation a [Symbol] carries. Drives how strongly a symbol match counts during scoring — an ORG_CLASS match is a strong corporate-form signal, an INITIAL match is weak evidence that needs token-level corroboration.

DOMAIN = SymbolCategory.DOMAIN class-attribute

INITIAL = SymbolCategory.INITIAL class-attribute

LOCATION = SymbolCategory.LOCATION class-attribute

NAME = SymbolCategory.NAME class-attribute

NICK = SymbolCategory.NICK class-attribute

NUMERIC = SymbolCategory.NUMERIC class-attribute

ORG_CLASS = SymbolCategory.ORG_CLASS class-attribute

PHONETIC = SymbolCategory.PHONETIC class-attribute

SYMBOL = SymbolCategory.SYMBOL class-attribute

__module__ = 'rigour._core' class-attribute

value property

__eq__(value) method descriptor

Return self==value.

__ge__(value) method descriptor

Return self>=value.

__gt__(value) method descriptor

Return self>value.

__hash__() method descriptor

Return hash(self).

__int__() method descriptor

int(self)

__le__(value) method descriptor

Return self<=value.

__lt__(value) method descriptor

Return self<value.

__ne__(value) method descriptor

Return self!=value.

__repr__() method descriptor

Return repr(self).

align_person_name_order(left, right) builtin

Greedy-align two lists of name parts so comparable tokens share the same output index.

Used by the name matcher to reorder remaining tokens after symbolic tagging so a downstream per-index similarity pass compares like with like. Pairs are chosen by a length-desc, left-major walk over edit-similarity scores; ties are broken stably by input order so the output is deterministic.

Returns ([], tag_sort(right)) when left is empty, falls back to (tag_sort(left), tag_sort(right)) when no pair scores above the similarity floor, otherwise returns the greedy-aligned pair.
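
The greedy walk described above can be sketched over plain strings. This is an illustration under stated assumptions, not the library's implementation: greedy_align is a hypothetical name, difflib.SequenceMatcher stands in for the edit-similarity score, plain sorted() stands in for tag_sort, and the 0.3 floor is an arbitrary value chosen for the example.

```python
from difflib import SequenceMatcher
from typing import List, Set, Tuple


def greedy_align(
    left: List[str], right: List[str], floor: float = 0.3
) -> Tuple[List[str], List[str]]:
    """Reorder two token lists so similar tokens share an index."""
    if not left:
        # Stand-in for tag_sort(right): plain alphabetical order.
        return [], sorted(right)
    # Visit left tokens longest-first; sorted() is stable, so tokens of
    # equal length keep their input order and the result is deterministic.
    order = sorted(range(len(left)), key=lambda i: -len(left[i]))
    used: Set[int] = set()
    pairs: List[Tuple[int, int]] = []
    for i in order:
        best, best_score = None, floor
        for j, candidate in enumerate(right):
            if j in used:
                continue
            score = SequenceMatcher(None, left[i], candidate).ratio()
            if score > best_score:  # strict: earlier candidates win ties
                best, best_score = j, score
        if best is not None:
            used.add(best)
            pairs.append((i, best))
    if not pairs:
        # Nothing cleared the floor: fall back to the sorted inputs.
        return sorted(left), sorted(right)
    paired_left = {i for i, _ in pairs}
    left_out = [left[i] for i, _ in pairs]
    left_out += [tok for i, tok in enumerate(left) if i not in paired_left]
    right_out = [right[j] for _, j in pairs]
    right_out += [tok for j, tok in enumerate(right) if j not in used]
    return left_out, right_out
```

Paired tokens come first in both outputs, in the order they were matched, and unpaired tokens trail in input order, so a per-index comparison afterwards lines up like with like.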

analyze_names(type_tag, names, part_tags=None, *, infer_initials=False, symbols=True, phonetics=True, numerics=True, consolidate=True, rewrite=True)

Build a set of tagged Name objects from raw strings.

Parameters:

Name Type Description Default
type_tag NameTypeTag

The NameTypeTag for every name in this batch. Drives which prefix/org-type/ tagger passes run: PER → person prefix strip + person tagger; ORG/ENT → org-type replacement + org prefix strip + org tagger; OBJ → object prefix strip ("M/V", "SS", …) but no tagger; UNK → no rewrites or tagging, just construction.

required
names Sequence[str]

Raw name strings as harvested from the source entity. Empty strings and inputs that normalise to empty are dropped. Duplicates (after prenormalisation) are de-duplicated.

required
part_tags Optional[Mapping[NamePartTag, Sequence[str]]]

Pre-classified part annotations, typically produced by an adapter that reads structured name-part properties off the source entity (e.g. firstName → GIVEN, lastName → FAMILY). Each value is applied to every constructed Name via Name.tag_text. Values can be multi-token strings — see the module docstring. Defaults to an empty mapping.

None
infer_initials bool

When True, every single-character latin name part is tagged with an INITIAL symbol — useful on a free-text query side where "J Smith" arrives without a label on "J". When False (default), only parts already tagged as GIVEN / MIDDLE pick up INITIAL symbols. Default False because initials are a query-side concept; the indexer and the candidate side of a matcher pass False, so the leaner default suits the common call. Ignored for non-person names. No-op when symbols=False.

False
symbols bool

Master switch for symbol emission. When True (default), the INITIAL preamble, the AC tagger's match-and-apply pass, and NUMERIC-symbol emission all run. When False, no symbols are attached to the returned names — name.symbols is empty and name.spans stays empty. NamePartTag labelling (including the NUM / STOP / LEGAL promotions in the inference pass) still fires, and part_tags values are still applied via Name.tag_text. Useful for callers that only need tokens + part tags and don't match on symbol overlap; skipping the AC tagger is the main performance saving.

True
phonetics bool

When True (default), each NamePart.metaphone is populated at construction; when False, the field stays None and the phonetics crate isn't called. Consumers that feed part.metaphone into downstream fields (e.g. yente's name_phonemes ES field) keep the default; callers that never read the property can save the per-part metaphone call.

True
numerics bool

When True (default), numeric-looking name parts that the AC tagger's ordinal list didn't cover get a Symbol(NUMERIC, int_value) applied. When False, parts still get NamePartTag.NUM (cheap structural info) but no NUMERIC symbol is emitted. Callers that don't use numeric-symbol overlap for scoring can save the symbol allocation.

True
consolidate bool

When True (default), the returned set has Name.consolidate_names applied — short names that are substrings of longer names in the same set are dropped. Indexers should pass consolidate=False to preserve partial-name recall (e.g. letting "John Smith" match "John K Smith" from the other side).

True
rewrite bool

When True (default), the pre-tagger canonicalisation stages run: honorific-prefix removal for PER names (Mr., Dr., Sir), and for ORG/ENT names both article-prefix removal (The) and org-type compare-form rewriting (Inc. → LLC, GmbH → JSC, …). Pass False to keep the literal input form; the tagger still fires on the raw tokens because its alias set covers both original and canonical forms. Useful for debugging the tagger in isolation and for callers that want to display or index a name without the canonical substitutions.

True

Returns:

Type Description
Set[Name]

A set of tagged Name objects, de-duplicated by normalised form. Empty if every input normalised to an empty string.

Source code in rigour/names/analyze.py
def analyze_names(
    type_tag: NameTypeTag,
    names: Sequence[str],
    part_tags: Optional[Mapping[NamePartTag, Sequence[str]]] = None,
    *,
    infer_initials: bool = False,
    symbols: bool = True,
    phonetics: bool = True,
    numerics: bool = True,
    consolidate: bool = True,
    rewrite: bool = True,
) -> Set[Name]:
    """Build a set of tagged [Name][rigour.names.Name] objects from raw strings.

    Args:
        type_tag: The [NameTypeTag][rigour.names.NameTypeTag] for
            every name in this batch. Drives which prefix/org-type/
            tagger passes run: `PER` → person prefix strip + person
            tagger; `ORG`/`ENT` → org-type replacement + org prefix
            strip + org tagger; `OBJ` → object prefix strip ("M/V",
            "SS", …) but no tagger; `UNK` → no rewrites or tagging,
            just construction.
        names: Raw name strings as harvested from the source entity.
            Empty strings and inputs that normalise to empty are
            dropped. Duplicates (after prenormalisation) are de-duplicated.
        part_tags: Pre-classified part annotations, typically produced
            by an adapter that reads structured name-part properties
            off the source entity (e.g. firstName → `GIVEN`,
            lastName → `FAMILY`). Each value is applied to every
            constructed `Name` via `Name.tag_text`. Values can be
            multi-token strings — see the module docstring. Defaults
            to an empty mapping.
        infer_initials: When `True`, every single-character latin name
            part is tagged with an `INITIAL` symbol — useful on a
            free-text query side where `"J Smith"` arrives without
            a label on `"J"`. When `False` (default), only parts
            already tagged as `GIVEN` / `MIDDLE` pick up `INITIAL`
            symbols. Default `False` because initials are a
            query-side concept; the indexer and the candidate side
            of a matcher pass `False`, so the leaner default suits
            the common call. Ignored for non-person names. No-op
            when `symbols=False`.
        symbols: Master switch for symbol emission. When `True`
            (default), the INITIAL preamble, the AC tagger's
            match-and-apply pass, and NUMERIC-symbol emission all
            run. When `False`, no symbols are attached to the
            returned names — `name.symbols` is empty and
            `name.spans` stays empty. NamePartTag labelling
            (including the `NUM` / `STOP` / `LEGAL` promotions in
            the inference pass) still fires, and `part_tags` values
            are still applied via `Name.tag_text`. Useful for
            callers that only need tokens + part tags and don't
            match on symbol overlap; skipping the AC tagger is the
            main performance saving.
        phonetics: When `True` (default), each `NamePart.metaphone`
            is populated at construction; when `False`, the field
            stays `None` and the phonetics crate isn't called.
            Consumers that feed `part.metaphone` into downstream
            fields (e.g. yente's `name_phonemes` ES field) keep the
            default; callers that never read the property can save
            the per-part metaphone call.
        numerics: When `True` (default), numeric-looking name parts
            that the AC tagger's ordinal list didn't cover get a
            `Symbol(NUMERIC, int_value)` applied. When `False`, parts
            still get `NamePartTag.NUM` (cheap structural info) but
            no NUMERIC symbol is emitted. Callers that don't use
            numeric-symbol overlap for scoring can save the symbol
            allocation.
        consolidate: When `True` (default), the returned set has
            [Name.consolidate_names][rigour.names.Name.consolidate_names]
            applied — short names that are substrings of longer names
            in the same set are dropped. **Indexers should pass
            `consolidate=False`** to preserve partial-name recall
            (e.g. letting `"John Smith"` match `"John K Smith"` from
            the other side).
        rewrite: When `True` (default), the pre-tagger canonicalisation
            stages run: honorific-prefix removal for PER names
            (`Mr.`, `Dr.`, `Sir`), and for ORG/ENT names both
            article-prefix removal (`The`) and org-type compare-form
            rewriting (`Inc.`→`LLC`, `GmbH`→`JSC`, …). Pass `False`
            to keep the literal input form — the tagger still fires
            on the raw tokens because its alias set covers both
            original and canonical forms. Useful for debugging the
            tagger in isolation and for callers that want to display
            or index a name without the canonical substitutions.

    Returns:
        A set of tagged `Name` objects, de-duplicated by normalised
        form. Empty if every input normalised to an empty string.
    """
    tag_dict: dict[NamePartTag, list[str]] | None
    if part_tags is None:
        tag_dict = None
    else:
        tag_dict = {tag: list(values) for tag, values in part_tags.items()}
    return _analyze_names(
        type_tag,
        list(names),
        tag_dict,
        infer_initials=infer_initials,
        symbols=symbols,
        phonetics=phonetics,
        numerics=numerics,
        consolidate=consolidate,
        rewrite=rewrite,
    )
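
The effect of the consolidate flag can be pictured with a small stand-in that works on plain strings. This is a sketch only: consolidate here is a hypothetical helper, and plain substring containment is an assumption standing in for whatever comparison Name.consolidate_names performs on Name objects.

```python
from typing import Iterable, Set


def consolidate(names: Iterable[str]) -> Set[str]:
    """Drop names that are substrings of a longer name in the same set."""
    unique = set(names)
    return {
        name
        for name in unique
        # Keep a name only if no *other* name in the set contains it.
        if not any(name != other and name in other for other in unique)
    }


print(consolidate({"smith", "john smith"}))  # {'john smith'}
```

An indexer that wants the shorter form to remain searchable would skip this step, which is what passing consolidate=False is described as doing.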

compare_parts(qry, res, config=None) builtin

Score the alignment of two NamePart lists.

Callers should hand over the residue — parts that earlier stages (symbol pairing, alias tagging, identifier matching) couldn't explain by themselves — already canonicalised into positional order (tag_sort for ORG/ENT, align_person_name_order for PER). The function returns one [Alignment] per cluster, paired or solo; every input part appears exactly once across the output. Returned alignments carry symbol = None (residue distance is non-symbolic by definition).

config overrides the cost / budget / clustering scalars. Pass None (the default) to use the process-wide defaults — those match industry-typical recall-protective tuning. Sweep scripts build a fresh [CompareConfig] per iteration; matchers cache one per request.

contains_split_phrase(string)

Check whether string contains an alias-marker phrase.

Detects markers like "a.k.a.", "f.k.a.", "née", "alias", that signal a single string actually carries multiple distinct names. Useful for triaging input — a string with a split phrase shouldn't be treated as one atomic name. The phrase list is data-driven from resources/names/stopwords.yml::NAME_SPLIT_PHRASES, surfaced via rigour._core.name_split_phrases_list.

Parameters:

Name Type Description Default
string str

An input that may contain one or more names.

required

Returns:

Type Description
bool

True iff at least one split-phrase marker appears in string as a whole word.

Source code in rigour/names/split_phrases.py
def contains_split_phrase(string: str) -> bool:
    """Check whether `string` contains an alias-marker phrase.

    Detects markers like `"a.k.a."`, `"f.k.a."`, `"née"`, `"alias"`,
    that signal a single string actually carries multiple distinct
    names. Useful for triaging input — a string with a split
    phrase shouldn't be treated as one atomic name. The phrase
    list is data-driven from
    `resources/names/stopwords.yml::NAME_SPLIT_PHRASES`,
    surfaced via `rigour._core.name_split_phrases_list`.

    Args:
        string: An input that may contain one or more names.

    Returns:
        `True` iff at least one split-phrase marker appears in
        `string` as a whole word.
    """
    return _split_phrase_regex().search(string) is not None
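
A whole-word marker check of this kind can be sketched with the standard re module. The phrase list below is a hypothetical four-item subset taken from the examples in the docstring, not the contents of stopwords.yml, and has_split_phrase is an illustrative name; the real pattern is built by _split_phrase_regex and may differ (for example in case handling).

```python
import re

# Hypothetical subset of markers; the real list is data-driven.
SPLIT_PHRASES = ["a.k.a.", "f.k.a.", "née", "alias"]

# Longest phrases first so a longer marker wins over a shorter one.
_alternation = "|".join(
    re.escape(p) for p in sorted(SPLIT_PHRASES, key=len, reverse=True)
)
# The lookarounds require that the marker is not glued to a word character.
_SPLIT_RE = re.compile(r"(?<!\w)(?:" + _alternation + r")(?!\w)", re.IGNORECASE)


def has_split_phrase(text: str) -> bool:
    """Return True if a marker appears in `text` as a whole word."""
    return _SPLIT_RE.search(text) is not None


print(has_split_phrase("Jon Smith a.k.a. Johnny"))  # True
print(has_split_phrase("Aliasing Systems Ltd"))     # False
```

The second call returns False because "alias" is followed by another word character, which is the whole-word behaviour the docstring describes.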

extract_org_types(name, normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop, generic=False)

Find every organisation-type designation in a name.

Scans name for recognised aliases (LLC, Inc, GmbH, ...) and returns the matched substring and its canonical target. A poor-person's "is this a company name?" detector.

Parameters:

Name Type Description Default
name str

The text to be processed. Assumed to already be normalised with the same normalize_flags + cleanup the alias table was built from.

required
normalize_flags Normalize

Normalize flag set applied to the alias list at build time. Default Normalize.CASEFOLD.

CASEFOLD
cleanup Cleanup

Cleanup variant applied during alias normalisation. Default Cleanup.Noop.

Noop
generic bool

If True, target values are the generic form (llc, jsc) instead of the type-specific compare form. Matches replace_org_types_compare.

False

Returns:

Type Description
List[Tuple[str, str]]

A list of (matched_text, target) tuples, one per non-overlapping match. Empty if nothing matches.

Source code in rigour/names/org_types.py
def extract_org_types(
    name: str,
    normalize_flags: Normalize = Normalize.CASEFOLD,
    cleanup: Cleanup = Cleanup.Noop,
    generic: bool = False,
) -> List[Tuple[str, str]]:
    """Find every organisation-type designation in a name.

    Scans `name` for recognised aliases (LLC, Inc, GmbH, ...) and returns
    the matched substring and its canonical target. A poor-person's
    "is this a company name?" detector.

    Args:
        name: The text to be processed. Assumed to already be normalised
            with the same `normalize_flags` + `cleanup` the alias table
            was built from.
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Default
            `Normalize.CASEFOLD`.
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.
        generic: If True, target values are the generic form (``llc``,
            ``jsc``) instead of the type-specific compare form. Matches
            :func:`replace_org_types_compare`.

    Returns:
        A list of ``(matched_text, target)`` tuples, one per
        non-overlapping match. Empty if nothing matches.
    """
    return _extract_org_types(name, int(normalize_flags), int(cleanup), generic)
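
The shape of the return value, a list of (matched_text, target) tuples for non-overlapping matches, can be illustrated with a regex stand-in. Everything below is an assumption for illustration: the alias table is a made-up five-entry sample, find_org_types is a hypothetical name, and the real matcher lives in the Rust core rather than in re.

```python
import re
from typing import List, Tuple

# Hypothetical alias -> target sample; inputs are assumed to be casefolded.
ORG_ALIASES = {
    "limited liability company": "llc",
    "llc": "llc",
    "incorporated": "inc",
    "inc": "inc",
    "gmbh": "gmbh",
}

# Longest aliases first so multi-word forms are preferred.
_alternation = "|".join(
    re.escape(a) for a in sorted(ORG_ALIASES, key=len, reverse=True)
)
_ORG_RE = re.compile(r"(?<!\w)(?:" + _alternation + r")(?!\w)")


def find_org_types(name: str) -> List[Tuple[str, str]]:
    """Return (matched_text, target) for each non-overlapping alias match."""
    return [(m.group(0), ORG_ALIASES[m.group(0)]) for m in _ORG_RE.finditer(name)]


print(find_org_types("acme limited liability company"))
# [('limited liability company', 'llc')]
```

An empty list signals that no alias was found, which is how the function can serve as a rough company-name detector.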

is_name(name)

Check whether name plausibly contains a name.

Loose filter — true iff at least one character is a Unicode letter (general category L*). Useful for rejecting purely numeric ("007") or punctuation-only ("---") inputs before handing them to the rest of the name pipeline.

Parameters:

Name Type Description Default
name str

A string.

required

Returns:

Type Description
bool

True iff name contains at least one letter.

Source code in rigour/names/check.py
def is_name(name: str) -> bool:
    """Check whether `name` plausibly contains a name.

    Loose filter — true iff at least one character is a Unicode
    letter (general category `L*`). Useful for rejecting purely
    numeric (`"007"`) or punctuation-only (`"---"`) inputs before
    handing them to the rest of the name pipeline.

    Args:
        name: A string.

    Returns:
        `True` iff `name` contains at least one letter.
    """
    for char in name:
        category = unicodedata.category(char)
        if category[0] == "L":
            return True
    return False
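
The same letter test can be written compactly and tried on the inputs mentioned above. has_letter is a hypothetical name used so the snippet runs without the library; it mirrors the loop shown in the source listing.

```python
import unicodedata


def has_letter(text: str) -> bool:
    """True if any character falls in a Unicode letter category (L*)."""
    return any(unicodedata.category(ch).startswith("L") for ch in text)


print(has_letter("007"))        # False
print(has_letter("---"))        # False
print(has_letter("Agent 007"))  # True
```

The check is deliberately loose: one letter in any script is enough, so it filters out obvious non-names without rejecting unusual ones.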

is_stopword(form, *, normalizer=normalize_name, normalize=False)

Check if the given form is a stopword. The stopword list is normalized first.

Deprecated: use rigour.text.is_stopword instead. This function will be removed in a future version.

Parameters:

Name Type Description Default
form str

The token to check, must already be normalized.

required
normalizer Normalizer

The normalizer to use for checking stopwords.

normalize_name
normalize bool

Whether to normalize the form before checking.

False

Returns:

Name Type Description
bool bool

True if the form is a stopword, False otherwise.

Source code in rigour/names/check.py
def is_stopword(
    form: str, *, normalizer: Normalizer = normalize_name, normalize: bool = False
) -> bool:
    """Check if the given form is a stopword. The stopword list is normalized first.

    .. deprecated::
        Use :func:`rigour.text.is_stopword` instead. This function will be removed in a future version.

    Args:
        form (str): The token to check, must already be normalized.
        normalizer (Normalizer): The normalizer to use for checking stopwords.
        normalize (bool): Whether to normalize the form before checking.

    Returns:
        bool: True if the form is a stopword, False otherwise.
    """
    warnings.warn(
        "rigour.names.is_stopword is deprecated, use rigour.text.is_stopword instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return _is_stopword(form, normalizer=normalizer, normalize=normalize)

normalize_name(name) cached

Casefold and tokenise a name into a stable matching key.

Convenience composition of tokenize_name over a casefolded input, rejoined with single ASCII spaces. Equivalent to calling normalize(name, Normalize.CASEFOLD | Normalize.NAME); use that directly when callers want explicit flag control.

Used internally by the rigour name-matching utilities; not intended as a general-purpose public surface.

Parameters:

Name Type Description Default
name Optional[str]

A name string, or None.

required

Returns:

Type Description
Optional[str]

Normalised name (lowercase, single-space-separated tokens), or None if input is None or normalises to empty.

Source code in rigour/names/tokenize.py
@lru_cache(maxsize=MEMO_SMALL)
def normalize_name(name: Optional[str]) -> Optional[str]:
    """Casefold and tokenise a name into a stable matching key.

    Convenience composition of :func:`tokenize_name` over a
    casefolded input, rejoined with single ASCII spaces.
    Equivalent to calling
    `normalize(name, Normalize.CASEFOLD | Normalize.NAME)` —
    use that directly when callers want explicit flag control.

    Used internally by the rigour name-matching utilities; not
    intended as a general-purpose public surface.

    Args:
        name: A name string, or `None`.

    Returns:
        Normalised name (lowercase, single-space-separated
        tokens), or `None` if input is `None` or normalises to
        empty.
    """
    if name is None:
        return None
    return normalize(name, Normalize.CASEFOLD | Normalize.NAME)

pair_symbols(query, result) builtin

Align the symbol spans of two [Name]s into coverage-maximal pairings.

Each returned pairing is a tuple of non-conflicting [Alignment]s; edges within a pairing cover disjoint parts on each side. Each Alignment has symbol = Some(_) and a placeholder score = 1.0 — consumers should override the score with a per-category default before composing the pairing total. Pairings are distinguished by their coverage and category multiset — two pairings that cover the same parts with the same category mix are collapsed to one. Distinct category choices on the same parts (e.g. a token carrying both NAME:Qvan and SYMBOL:van) surface as separate pairings.

Returns [()] (a single empty pairing) when either name has more than 64 parts, when either name has no tagger spans, or when no symbol is shared between the two sides.

pick_case(names)

Pick the best mix of lower- and uppercase characters from a set of names that are identical except for case. If the names differ in anything other than case, the result is undefined.

Rust-backed via rigour._core.pick_case. The Rust implementation returns None for empty input; this Python wrapper raises ValueError to preserve the pre-port contract that external callers rely on.

Parameters:

Name Type Description Default
names List[str]

A list of identical names in different cases.

required

Returns:

Name Type Description
str str

The best name for display.

Source code in rigour/names/pick.py
def pick_case(names: List[str]) -> str:
    """Pick the best mix of lower- and uppercase characters from a set of names
    that are identical except for case. If the names are not identical, undefined
    things happen (not recommended).

    Rust-backed via :func:`rigour._core.pick_case`. The Rust
    implementation returns `None` for empty input; this Python wrapper
    raises `ValueError` to preserve the pre-port contract that
    external callers rely on.

    Args:
        names (List[str]): A list of identical names in different cases.

    Returns:
        str: The best name for display.
    """
    from rigour._core import pick_case as _pick_case

    result = _pick_case(names)
    if result is None:
        raise ValueError("Cannot pick a name from an empty list.")
    return result

remove_obj_prefixes(name)

Strip vessel-class and generic-article prefixes from the head of an object name.

Drops "M/V", "SS", "The", etc. ("M/V Oceanic" → "Oceanic") so that the prefix doesn't penalise the shorter variant when matching vessels, vehicles, or aircraft. Driven by resources/names/stopwords.yml::OBJ_NAME_PREFIXES.

Parameters:

Name Type Description Default
name str

An object (vessel / vehicle / aircraft) name string.

required

Returns:

Type Description
str

The name with any leading prefix(es) removed.

Source code in rigour/names/prefix.py
def remove_obj_prefixes(name: str) -> str:
    """Strip vessel-class and generic-article prefixes from the
    head of an object name.

    Drops `"M/V"`, `"SS"`, `"The"`, etc. so `"M/V Oceanic"` →
    `"Oceanic"` doesn't penalise the shorter variant when
    matching vessels, vehicles, or aircraft. Driven by
    `resources/names/stopwords.yml::OBJ_NAME_PREFIXES`.

    Args:
        name: An object (vessel / vehicle / aircraft) name string.

    Returns:
        The name with any leading prefix(es) removed.
    """
    return _obj_prefix_regex().sub("", name)

remove_org_prefixes(name)

Strip article-like prefixes from the head of an organisation name.

Drops "The", etc. ("The Charitable Trust" → "Charitable Trust") so that the prefix doesn't penalise the shorter variant when matching. Driven by resources/names/stopwords.yml::ORG_NAME_PREFIXES.

Parameters:

Name Type Description Default
name str

An organisation name string.

required

Returns:

Type Description
str

The name with any leading article-prefix(es) removed.

Source code in rigour/names/prefix.py
def remove_org_prefixes(name: str) -> str:
    """Strip article-like prefixes from the head of an organisation name.

    Drops `"The"`, etc. so `"The Charitable Trust"` →
    `"Charitable Trust"` doesn't penalise the shorter variant
    when matching. Driven by
    `resources/names/stopwords.yml::ORG_NAME_PREFIXES`.

    Args:
        name: An organisation name string.

    Returns:
        The name with any leading article-prefix(es) removed.
    """
    return _org_prefix_regex().sub("", name)

remove_org_types(name, replacement='', normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop)

Remove organisation-type designations from a name.

Every recognised alias (LLC, Inc, GmbH, ...) in name is replaced with replacement. Useful as a preprocessing step when you want the entity name stripped of legal-form noise.

Parameters:

Name Type Description Default
name str

The text to be processed. Assumed to already be normalised with the same normalize_flags + cleanup the alias table was built from.

required
replacement str

The string to replace each matched alias with. Default "" (strip).

''
normalize_flags Normalize

Normalize flag set applied to the alias list at build time. Default Normalize.CASEFOLD.

CASEFOLD
cleanup Cleanup

Cleanup variant applied during alias normalisation. Default Cleanup.Noop.

Noop

Returns:

Type Description
str

The text with recognised organisation types replaced. May be empty if the input consisted entirely of matched aliases and replacement is the empty string.

Source code in rigour/names/org_types.py
def remove_org_types(
    name: str,
    replacement: str = "",
    normalize_flags: Normalize = Normalize.CASEFOLD,
    cleanup: Cleanup = Cleanup.Noop,
) -> str:
    """Remove organisation-type designations from a name.

    Every recognised alias (LLC, Inc, GmbH, ...) in `name` is replaced
    with `replacement`. Useful as a preprocessing step when you want
    the entity name stripped of legal-form noise.

    Args:
        name: The text to be processed. Assumed to already be normalised
            with the same `normalize_flags` + `cleanup` the alias table
            was built from.
        replacement: The string to replace each matched alias with.
            Default ``""`` (strip).
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Default
            `Normalize.CASEFOLD`.
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.

    Returns:
        The text with recognised organisation types replaced. May be
        empty if the input consisted entirely of matched aliases and
        `replacement` is the empty string.
    """
    return _remove_org_types(name, int(normalize_flags), int(cleanup), replacement)

remove_person_prefixes(name)

Strip honorific prefixes from the head of a person name.

Drops "Mr.", "Mrs.", "Dr.", "Lady", etc. so honorifics don't contaminate part alignment in matching or token-bag comparison. The list is data-driven from resources/names/stopwords.yml::PERSON_NAME_PREFIXES, surfaced via rigour._core.person_name_prefixes_list.

Parameters:

Name Type Description Default
name str

A person name string.

required

Returns:

Type Description
str

The name with any leading honorific(s) removed. Idempotent for inputs that don't start with a known prefix.

Source code in rigour/names/prefix.py
def remove_person_prefixes(name: str) -> str:
    """Strip honorific prefixes from the head of a person name.

    Drops `"Mr."`, `"Mrs."`, `"Dr."`, `"Lady"`, etc. so honorifics
    don't contaminate part alignment in matching or token-bag
    comparison. The list is data-driven from
    `resources/names/stopwords.yml::PERSON_NAME_PREFIXES`,
    surfaced via `rigour._core.person_name_prefixes_list`.

    Args:
        name: A person name string.

    Returns:
        The name with any leading honorific(s) removed. Idempotent
        for inputs that don't start with a known prefix.
    """
    return _person_prefix_regex().sub("", name)
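
Head-anchored prefix stripping of this kind can be sketched with the standard re module. The prefix list is a hypothetical four-item subset drawn from the docstring examples, strip_person_prefixes is an illustrative name, and the real pattern returned by _person_prefix_regex may treat whitespace, punctuation, or case differently.

```python
import re

# Hypothetical subset; the real list comes from stopwords.yml.
PERSON_PREFIXES = ["Mr.", "Mrs.", "Dr.", "Lady"]

_alternation = "|".join(
    re.escape(p) for p in sorted(PERSON_PREFIXES, key=len, reverse=True)
)
# Anchored at the start; the outer group repeats so stacked honorifics
# ("Mr. Dr. ...") are removed in one pass. Each prefix must be followed
# by whitespace, so "Drake" is not mistaken for "Dr.".
_PREFIX_RE = re.compile(r"^(?:(?:" + _alternation + r")\s+)+", re.IGNORECASE)


def strip_person_prefixes(name: str) -> str:
    """Remove leading honorifics; other input is returned unchanged."""
    return _PREFIX_RE.sub("", name)


print(strip_person_prefixes("Mr. Dr. John Smith"))  # John Smith
print(strip_person_prefixes("Drake Bell"))          # Drake Bell
```

Because the pattern only matches at the head of the string, applying the function a second time to its own output changes nothing.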

replace_org_types_compare(name, normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop, generic=False)

Replace organisation types in a name with a heavily normalised form.

Country-specific entity types (e.g. GmbH, Aktiengesellschaft, ООО) are rewritten into a simplified comparison form (e.g. gmbh, ag, ooo) suitable for string-distance matching. The result is meant for comparison pipelines, not for presentation.

Parameters:

Name Type Description Default
name str

The text to be processed. Assumed to already be normalised with the same normalize_flags + cleanup the alias table was built from.

required
normalize_flags Normalize

Normalize flag set applied to the alias list at build time. Default Normalize.CASEFOLD matches production callers (nomenklatura/yente/FTM via prenormalize_name).

CASEFOLD
cleanup Cleanup

Cleanup variant applied during alias normalisation. Default Cleanup.Noop.

Noop
generic bool

If True, substitute the generic form of the organisation type (e.g. llc, jsc) instead of the type-specific compare form. Specs without a generic field are left unchanged in generic mode.

False

Returns:

Type Description
str

The text with recognised organisation types substituted. If every match would reduce the text to an empty string, the original text is returned unchanged.

Source code in rigour/names/org_types.py
def replace_org_types_compare(
    name: str,
    normalize_flags: Normalize = Normalize.CASEFOLD,
    cleanup: Cleanup = Cleanup.Noop,
    generic: bool = False,
) -> str:
    """Replace organisation types in a name with a heavily normalised form.

    Country-specific entity types (e.g. GmbH, Aktiengesellschaft, ООО) are
    rewritten into a simplified comparison form (e.g. ``gmbh``, ``ag``,
    ``ooo``) suitable for string-distance matching. The result is meant
    for comparison pipelines, not for presentation.

    Args:
        name: The text to be processed. Assumed to already be normalised
            with the same `normalize_flags` + `cleanup` the alias table
            was built from.
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Default
            `Normalize.CASEFOLD` matches production callers
            (nomenklatura/yente/FTM via `prenormalize_name`).
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.
        generic: If True, substitute the generic form of the organisation
            type (e.g. ``llc``, ``jsc``) instead of the type-specific
            compare form. Specs without a `generic` field are left
            unchanged in generic mode.

    Returns:
        The text with recognised organisation types substituted. If every
        match would reduce the text to an empty string, the original
        text is returned unchanged.
    """
    return _replace_org_types_compare(name, int(normalize_flags), int(cleanup), generic)
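
The behaviour described in the docstring (compare form, generic form, and the empty-result fallback) can be illustrated with a toy alias table. This is a sketch under stated assumptions: `ORG_TYPES` and `replace_compare` are hypothetical, the entries are made up for illustration, and the real matching happens in the Rust core against a much larger table.

```python
import re

# Hypothetical alias table: normalised alias -> (compare form, generic form).
ORG_TYPES = {
    "gmbh": ("gmbh", "llc"),
    "aktiengesellschaft": ("ag", "jsc"),
    "\u043e\u043e\u043e": ("ooo", "llc"),  # Cyrillic lowercase form of the Russian LLC type
    "stiftung": ("stiftung", None),        # no generic form in this toy table
    "company": ("", None),                 # assumed: a compare form may be empty
}

_PATTERN = re.compile(
    r"(?<!\w)(?:%s)(?!\w)"
    % "|".join(re.escape(a) for a in sorted(ORG_TYPES, key=len, reverse=True))
)


def replace_compare(name: str, generic: bool = False) -> str:
    """Expects `name` to be casefolded already, like the alias table."""

    def _sub(match):
        compare, generic_form = ORG_TYPES[match.group(0)]
        if generic:
            # Entries without a generic form are left unchanged in generic mode.
            return match.group(0) if generic_form is None else generic_form
        return compare

    out = " ".join(_PATTERN.sub(_sub, name).split())
    # Fallback: never reduce the whole text to an empty string.
    return out if out else name
```

For example, `replace_compare("siemens aktiengesellschaft")` yields `"siemens ag"`, and the same call with `generic=True` yields `"siemens jsc"`.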

replace_org_types_display(name, normalize_flags=Normalize.CASEFOLD | Normalize.SQUASH_SPACES, cleanup=Cleanup.Noop)

Replace organisation types in a name with their short display form.

Spelt-out legal forms are shortened into common abbreviations (e.g. "Siemens Aktiengesellschaft" → "Siemens AG"), preserving the case of non-matched portions. If the whole input is uppercase (str.isupper()), the whole output is re-uppercased.

Matches case-insensitively across Unicode by casefolding a copy of the input internally for the match step — normalize_flags must therefore include Normalize.CASEFOLD so the alias table is casefolded too. The default does this.

Parameters:

Name Type Description Default
name str

The text to be processed.

required
normalize_flags Normalize

Normalize flag set applied to the alias list at build time. Must include Normalize.CASEFOLD for Unicode-case-insensitive matching. Default CASEFOLD | SQUASH_SPACES.

CASEFOLD | SQUASH_SPACES
cleanup Cleanup

Cleanup variant applied during alias normalisation. Default Cleanup.Noop.

Noop

Returns:

Type Description
str

The text with recognised organisation types substituted for their display form. Non-matched regions keep their original case.

Source code in rigour/names/org_types.py
def replace_org_types_display(
    name: str,
    normalize_flags: Normalize = Normalize.CASEFOLD | Normalize.SQUASH_SPACES,
    cleanup: Cleanup = Cleanup.Noop,
) -> str:
    """Replace organisation types in a name with their short display form.

    Spelt-out legal forms are shortened into common abbreviations
    (e.g. ``"Siemens Aktiengesellschaft"`` → ``"Siemens AG"``), preserving
    the case of non-matched portions. If the whole input is uppercase
    (`str.isupper()`), the whole output is re-uppercased.

    Matches case-insensitively across Unicode by casefolding a copy of
    the input internally for the match step — `normalize_flags` must
    therefore include `Normalize.CASEFOLD` so the alias table is
    casefolded too. The default does this.

    Args:
        name: The text to be processed.
        normalize_flags: `Normalize` flag
            set applied to the alias list at build time. Must include
            `Normalize.CASEFOLD` for Unicode-case-insensitive matching.
            Default `CASEFOLD | SQUASH_SPACES`.
        cleanup: `Cleanup` variant applied
            during alias normalisation. Default `Cleanup.Noop`.

    Returns:
        The text with recognised organisation types substituted for
        their display form. Non-matched regions keep their original case.
    """
    return _replace_org_types_display(name, int(normalize_flags), int(cleanup))
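
The case-handling rules above (non-matched regions keep their case, all-uppercase input is re-uppercased) can be sketched as follows. This is an illustration only: `DISPLAY_FORMS` and `replace_display` are hypothetical, and the sketch uses `re.IGNORECASE` in place of the casefolded-copy matching the docstring describes.

```python
import re

# Hypothetical table: casefolded spelt-out form -> short display form.
DISPLAY_FORMS = {
    "aktiengesellschaft": "AG",
    "incorporated": "Inc",
    "limited": "Ltd",
}

_PATTERN = re.compile(
    r"(?<!\w)(?:%s)(?!\w)"
    % "|".join(re.escape(a) for a in sorted(DISPLAY_FORMS, key=len, reverse=True)),
    re.IGNORECASE,
)


def replace_display(name: str) -> str:
    def _sub(match):
        text = match.group(0)
        # Fall back to the matched text if casefolding yields an unknown key.
        return DISPLAY_FORMS.get(text.casefold(), text)

    out = _PATTERN.sub(_sub, name)
    # An all-uppercase input gets an all-uppercase output.
    return out.upper() if name.isupper() else out
```

With this table, `"ACME LIMITED"` becomes `"ACME LTD"` while `"acme limited"` becomes `"acme Ltd"`, since only the matched region is rewritten.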

representative_names(names, limit, cluster_threshold=0.3)

Reduce a bag of aliases to at most limit representatives without extreme information loss.

Useful when a downstream process (e.g. building a search-index query) wants to probe the alias space broadly under a budget cap. For a person with 20 transliterations of one name and limit=5, this returns ~1-5 centroid-selected representatives rather than all 20 near-identical forms. For a person with two genuinely distinct names (Nelson Mandela / Rolihlahla Mandela), both survive — N transliterations of one name don't add recall, but a second name does.

Fast path: if the input already collapses to <= limit distinct names (after casefold-dedup via reduce_names), those names are returned as-is without clustering. Compression only runs when the input actually needs to be compressed. This means cluster_threshold has no effect when the fast path fires.

Ordering of the returned list is not guaranteed. Returned strings are originals from the input: when clustering runs, pick_name selects the best-case representative of each cluster.

Parameters:

Name Type Description Default
names List[str]

input aliases, typically all belonging to one entity.

required
limit int

upper bound on output size.

required
cluster_threshold float

normalized Levenshtein distance (0..1) above which two names are considered distinct names rather than variants of one. Default 0.3 keeps transliterations together while separating genuinely different names. Ignored when the fast path fires.

0.3

Source code in rigour/names/pick.py
def representative_names(
    names: List[str],
    limit: int,
    cluster_threshold: float = 0.3,
) -> List[str]:
    """Reduce a bag of aliases to at most `limit` representatives
    without extreme information loss.

    Useful when a downstream process (e.g. building a search-index
    query) wants to probe the alias space broadly under a budget
    cap. For a person with 20 transliterations of one name and
    `limit=5`, this returns ~1-5 centroid-selected representatives
    rather than all 20 near-identical forms. For a person with two
    genuinely distinct names (Nelson Mandela / Rolihlahla Mandela),
    both survive — N transliterations of one name don't add recall,
    but a second *name* does.

    **Fast path**: if the input already collapses to `<= limit`
    distinct names (after casefold-dedup via :func:`reduce_names`),
    those names are returned as-is without clustering. Compression
    only runs when the input actually needs to be compressed. This
    means `cluster_threshold` has no effect when the fast path
    fires.

    Ordering of the returned list is not guaranteed. Returned
    strings are originals from the input — :func:`pick_name` per
    cluster selects the best-case representative when clustering
    runs.

    Args:
        names: input aliases, typically all belonging to one entity.
        limit: upper bound on output size.
        cluster_threshold: normalized Levenshtein distance (0..1) above
            which two names are considered distinct *names* rather than
            variants of one. Default 0.3 keeps transliterations together
            while separating genuinely different names. Ignored when
            the fast path fires.
    """
    if limit <= 0 or not names:
        return []
    reduced = reduce_names(names)
    if len(reduced) <= limit:
        return list(reduced)

    # Casefolded/whitespace-normalised form of each reduced name, for
    # distance measurement. The originals are what we return.
    normed: Dict[str, str] = {}
    for n in reduced:
        nn = " ".join(n.casefold().split())
        if nn:
            normed[n] = nn

    centroid = pick_name(reduced)
    if centroid is None or centroid not in normed:
        return []

    def _dist(a: str, b: str) -> float:
        return levenshtein(a, b) / max(len(a), len(b), 1)

    # Farthest-point-first seed selection with threshold stopping: each
    # new seed must be more than `cluster_threshold` away from every
    # already-picked seed, else we've run out of distinct clusters.
    seeds: List[str] = [centroid]
    while len(seeds) < limit:
        outlier: Optional[str] = None
        outlier_d = 0.0
        for n in reduced:
            if n in seeds or n not in normed:
                continue
            nn = normed[n]
            min_d = min(_dist(nn, normed[s]) for s in seeds)
            if min_d > outlier_d:
                outlier_d = min_d
                outlier = n
        if outlier is None or outlier_d <= cluster_threshold:
            break
        seeds.append(outlier)

    if len(seeds) == 1:
        return seeds

    # Assign each reduced name to its nearest seed, then pick_name per
    # cluster so the returned rep is the best display form of its group
    # rather than whichever outlier happened to be picked as the seed.
    clusters: List[List[str]] = [[s] for s in seeds]
    for n in reduced:
        if n in seeds or n not in normed:
            continue
        nn = normed[n]
        best_i = 0
        best_d = float("inf")
        for i, s in enumerate(seeds):
            d = _dist(nn, normed[s])
            if d < best_d:
                best_d = d
                best_i = i
        clusters[best_i].append(n)

    reps: List[str] = []
    for cluster in clusters:
        rep = pick_name(cluster)
        if rep is not None:
            reps.append(rep)
    return reps
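
To make the `cluster_threshold` semantics concrete, the sketch below reproduces the normalised distance from `_dist` with a plain dynamic-programming Levenshtein. The `levenshtein` here is a stand-in written for this example (the library imports its own), and `is_distinct` is a hypothetical helper.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic two-row dynamic programme over unit-cost edits.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def norm_dist(a: str, b: str) -> float:
    # Same normalisation as `_dist` in the source above.
    return levenshtein(a, b) / max(len(a), len(b), 1)


def is_distinct(a: str, b: str, cluster_threshold: float = 0.3) -> bool:
    # Casefold and squash whitespace before measuring, as the source does.
    na = " ".join(a.casefold().split())
    nb = " ".join(b.casefold().split())
    return norm_dist(na, nb) > cluster_threshold
```

A single-letter transliteration difference such as "Vladimir Putin" / "Wladimir Putin" is one edit over 14 characters, well under 0.3, whereas "Nelson Mandela" / "Rolihlahla Mandela" needs at least nine edits over 18 characters and so counts as a distinct name.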

tokenize_name(text, token_min_length=1)

Split a person or entity's name into name parts.

Unicode general-category-aware: separator categories (spaces, punctuation, math symbols) split tokens; delete categories (combining marks, modifier letters, format chars) drop; letters, numbers, and a small set of CJK modifier marks are kept.

Parameters:

Name Type Description Default
text str

The name to tokenize.

required
token_min_length int

Drop tokens shorter than this many codepoints. Defaults to 1 (drop only zero-length).

1

Returns:

Type Description
List[str]

Tokens in left-to-right order, with any deletion or whitespace-substitution applied. Order matches input.

Source code in rigour/names/tokenize.py
def tokenize_name(text: str, token_min_length: int = 1) -> List[str]:
    """Split a person or entity's name into name parts.

    Unicode general-category-aware: separator categories (spaces,
    punctuation, math symbols) split tokens; delete categories
    (combining marks, modifier letters, format chars) drop; letters,
    numbers, and a small set of CJK modifier marks are kept.

    Args:
        text: The name to tokenize.
        token_min_length: Drop tokens shorter than this many
            codepoints. Defaults to 1 (drop only zero-length).

    Returns:
        Tokens in left-to-right order, with any deletion or
        whitespace-substitution applied. Order matches input.
    """
    return _tokenize_name(text, token_min_length)
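
The category-driven splitting can be approximated with `unicodedata`. This is a simplified sketch, not the Rust tokenizer: `tokenize` and `DROP_CATEGORIES` are hypothetical, every character that is neither a letter nor a number is treated as a separator, and the special handling of CJK modifier marks is not modelled.

```python
import unicodedata

# Categories that are deleted outright: combining marks, modifier letters,
# and format characters.
DROP_CATEGORIES = {"Mn", "Mc", "Me", "Lm", "Cf"}


def tokenize(text: str, token_min_length: int = 1) -> list:
    tokens = []
    current = []
    for ch in text:
        category = unicodedata.category(ch)
        if category in DROP_CATEGORIES:
            continue  # dropped without breaking the current token
        if category[0] in "LN":
            current.append(ch)  # letters and numbers are kept
            continue
        # Anything else (spaces, punctuation, symbols) ends the token.
        if current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if len(t) >= token_min_length]
```

Because the drop check runs before the keep check, a combining accent is removed from the middle of a token rather than splitting it.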

rigour.names.analyze

End-to-end name analysis: raw strings → tagged Name objects.

analyze_names is the unified entry point that downstream consumers (followthemoney's entity_names, in turn used by nomenklatura and yente) call once per entity to get matchable Name objects.

Rust-backed via rigour._core.analyze_names — one FFI crossing per call, regardless of how many names / part_tags the entity has. The single-call pipeline runs: prefix strip → prenormalize → org-type replacement (for ORG/ENT) → Name + NamePart construction → part tagging via Name.tag_text → tagger match-and-apply → NUMERIC / STOP / LEGAL inference → optional consolidate_names.

Part-tag value shape

part_tags values can be multi-token strings. A value like "Jean Claude" in part_tags[NamePartTag.GIVEN] for the name "Jean Claude Juncker" will tag both the "jean" and "claude" parts as GIVEN — the underlying Name.tag_text tokenises the value and walks the name parts looking for the token sequence. The tokens of the value don't need to be adjacent in the name, just present in order.
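
The in-order, not-necessarily-adjacent matching described above can be sketched as a subsequence walk. `match_in_order` is a hypothetical helper; treating a partial match as no match at all is an assumption of this sketch, since the text does not say how `Name.tag_text` handles that case.

```python
def match_in_order(parts, value_tokens):
    """Return the indices of `parts` that match `value_tokens` in order.

    Returns an empty list when not every token can be found in sequence.
    """
    indices = []
    pos = 0
    for token in value_tokens:
        # Scan forward from the previous hit; gaps between hits are allowed.
        while pos < len(parts) and parts[pos] != token:
            pos += 1
        if pos == len(parts):
            return []
        indices.append(pos)
        pos += 1
    return indices
```

For the parts `["jean", "claude", "juncker"]` and the value tokens `["jean", "claude"]`, the walk selects indices 0 and 1; with an intervening part, as in `["jean", "marie", "claude"]`, it selects 0 and 2.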

analyze_names(type_tag, names, part_tags=None, *, infer_initials=False, symbols=True, phonetics=True, numerics=True, consolidate=True, rewrite=True)

Build a set of tagged Name objects from raw strings.

Parameters:

Name Type Description Default
type_tag NameTypeTag

The NameTypeTag for every name in this batch. Drives which prefix/org-type/tagger passes run: PER → person prefix strip + person tagger; ORG/ENT → org-type replacement + org prefix strip + org tagger; OBJ → object prefix strip ("M/V", "SS", …) but no tagger; UNK → no rewrites or tagging, just construction.

required
names Sequence[str]

Raw name strings as harvested from the source entity. Empty strings and inputs that normalise to empty are dropped. Duplicates (after prenormalisation) are de-duplicated.

required
part_tags Optional[Mapping[NamePartTag, Sequence[str]]]

Pre-classified part annotations, typically produced by an adapter that reads structured name-part properties off the source entity (e.g. firstName → GIVEN, lastName → FAMILY). Each value is applied to every constructed Name via Name.tag_text. Values can be multi-token strings — see the module docstring. Defaults to an empty mapping.

None
infer_initials bool

When True, every single-character latin name part is tagged with an INITIAL symbol — useful on a free-text query side where "J Smith" arrives without a label on "J". When False (default), only parts already tagged as GIVEN / MIDDLE pick up INITIAL symbols. Default False because initials are a query-side concept; the indexer and the candidate side of a matcher pass False, so the leaner default suits the common call. Ignored for non-person names. No-op when symbols=False.

False
symbols bool

Master switch for symbol emission. When True (default), the INITIAL preamble, the AC tagger's match-and-apply pass, and NUMERIC-symbol emission all run. When False, no symbols are attached to the returned names — name.symbols is empty and name.spans stays empty. NamePartTag labelling (including the NUM / STOP / LEGAL promotions in the inference pass) still fires, and part_tags values are still applied via Name.tag_text. Useful for callers that only need tokens + part tags and don't match on symbol overlap; skipping the AC tagger is the main performance saving.

True
phonetics bool

When True (default), each NamePart.metaphone is populated at construction; when False, the field stays None and the phonetics crate isn't called. Consumers that feed part.metaphone into downstream fields (e.g. yente's name_phonemes ES field) keep the default; callers that never read the property can save the per-part metaphone call.

True
numerics bool

When True (default), numeric-looking name parts that the AC tagger's ordinal list didn't cover get a Symbol(NUMERIC, int_value) applied. When False, parts still get NamePartTag.NUM (cheap structural info) but no NUMERIC symbol is emitted. Callers that don't use numeric-symbol overlap for scoring can save the symbol allocation.

True
consolidate bool

When True (default), the returned set has Name.consolidate_names applied — short names that are substrings of longer names in the same set are dropped. Indexers should pass consolidate=False to preserve partial-name recall (e.g. letting "John Smith" match "John K Smith" from the other side).

True
rewrite bool

When True (default), the pre-tagger canonicalisation stages run: honorific-prefix removal for PER names (Mr., Dr., Sir), and for ORG/ENT names both article-prefix removal (The) and org-type compare-form rewriting (Inc. → LLC, GmbH → JSC, …). Pass False to keep the literal input form; the tagger still fires on the raw tokens because its alias set covers both original and canonical forms. Useful for debugging the tagger in isolation and for callers that want to display or index a name without the canonical substitutions.

True

Returns:

Type Description
Set[Name]

A set of tagged Name objects, de-duplicated by normalised form. Empty if every input normalised to an empty string.

Source code in rigour/names/analyze.py
def analyze_names(
    type_tag: NameTypeTag,
    names: Sequence[str],
    part_tags: Optional[Mapping[NamePartTag, Sequence[str]]] = None,
    *,
    infer_initials: bool = False,
    symbols: bool = True,
    phonetics: bool = True,
    numerics: bool = True,
    consolidate: bool = True,
    rewrite: bool = True,
) -> Set[Name]:
    """Build a set of tagged [Name][rigour.names.Name] objects from raw strings.

    Args:
        type_tag: The [NameTypeTag][rigour.names.NameTypeTag] for
            every name in this batch. Drives which prefix/org-type/
            tagger passes run: `PER` → person prefix strip + person
            tagger; `ORG`/`ENT` → org-type replacement + org prefix
            strip + org tagger; `OBJ` → object prefix strip ("M/V",
            "SS", …) but no tagger; `UNK` → no rewrites or tagging,
            just construction.
        names: Raw name strings as harvested from the source entity.
            Empty strings and inputs that normalise to empty are
            dropped. Duplicates (after prenormalisation) are de-duplicated.
        part_tags: Pre-classified part annotations, typically produced
            by an adapter that reads structured name-part properties
            off the source entity (e.g. firstName → `GIVEN`,
            lastName → `FAMILY`). Each value is applied to every
            constructed `Name` via `Name.tag_text`. Values can be
            multi-token strings — see the module docstring. Defaults
            to an empty mapping.
        infer_initials: When `True`, every single-character latin name
            part is tagged with an `INITIAL` symbol — useful on a
            free-text query side where `"J Smith"` arrives without
            a label on `"J"`. When `False` (default), only parts
            already tagged as `GIVEN` / `MIDDLE` pick up `INITIAL`
            symbols. Default `False` because initials are a
            query-side concept; the indexer and the candidate side
            of a matcher pass `False`, so the leaner default suits
            the common call. Ignored for non-person names. No-op
            when `symbols=False`.
        symbols: Master switch for symbol emission. When `True`
            (default), the INITIAL preamble, the AC tagger's
            match-and-apply pass, and NUMERIC-symbol emission all
            run. When `False`, no symbols are attached to the
            returned names — `name.symbols` is empty and
            `name.spans` stays empty. NamePartTag labelling
            (including the `NUM` / `STOP` / `LEGAL` promotions in
            the inference pass) still fires, and `part_tags` values
            are still applied via `Name.tag_text`. Useful for
            callers that only need tokens + part tags and don't
            match on symbol overlap; skipping the AC tagger is the
            main performance saving.
        phonetics: When `True` (default), each `NamePart.metaphone`
            is populated at construction; when `False`, the field
            stays `None` and the phonetics crate isn't called.
            Consumers that feed `part.metaphone` into downstream
            fields (e.g. yente's `name_phonemes` ES field) keep the
            default; callers that never read the property can save
            the per-part metaphone call.
        numerics: When `True` (default), numeric-looking name parts
            that the AC tagger's ordinal list didn't cover get a
            `Symbol(NUMERIC, int_value)` applied. When `False`, parts
            still get `NamePartTag.NUM` (cheap structural info) but
            no NUMERIC symbol is emitted. Callers that don't use
            numeric-symbol overlap for scoring can save the symbol
            allocation.
        consolidate: When `True` (default), the returned set has
            [Name.consolidate_names][rigour.names.Name.consolidate_names]
            applied — short names that are substrings of longer names
            in the same set are dropped. **Indexers should pass
            `consolidate=False`** to preserve partial-name recall
            (e.g. letting `"John Smith"` match `"John K Smith"` from
            the other side).
        rewrite: When `True` (default), the pre-tagger canonicalisation
            stages run: honorific-prefix removal for PER names
            (`Mr.`, `Dr.`, `Sir`), and for ORG/ENT names both
            article-prefix removal (`The`) and org-type compare-form
            rewriting (`Inc.`→`LLC`, `GmbH`→`JSC`, …). Pass `False`
            to keep the literal input form — the tagger still fires
            on the raw tokens because its alias set covers both
            original and canonical forms. Useful for debugging the
            tagger in isolation and for callers that want to display
            or index a name without the canonical substitutions.

    Returns:
        A set of tagged `Name` objects, de-duplicated by normalised
        form. Empty if every input normalised to an empty string.
    """
    tag_dict: dict[NamePartTag, list[str]] | None
    if part_tags is None:
        tag_dict = None
    else:
        tag_dict = {tag: list(values) for tag, values in part_tags.items()}
    return _analyze_names(
        type_tag,
        list(names),
        tag_dict,
        infer_initials=infer_initials,
        symbols=symbols,
        phonetics=phonetics,
        numerics=numerics,
        consolidate=consolidate,
        rewrite=rewrite,
    )

rigour.names.compare

Residue-distance scoring for two NamePart lists.

Reach for compare_parts when a name matcher has already peeled off the parts it can explain by other means — symbol pairing, alias tagging, identifier hits — and is left with a residue that needs a fuzzy-match verdict (typo, transliteration drift, surface-form variants of the same token).

The function returns one Alignment per cluster of aligned parts (paired or solo). Every input part appears in exactly one alignment, so a caller can sum / weight / threshold the result without losing track of which inputs got accounted for. Returned alignments carry symbol = None (residue distance is non-symbolic by definition).

The cost model penalises digit mismatches more than letter mismatches, treats visually / phonetically confusable pairs (0/o, 1/l, c/k, …) as cheap edits, and charges almost nothing for token merge / split. A length-dependent budget caps the per-side similarity at zero once the total cost exceeds what's plausible for typo noise — the matcher refuses to fuzzy-match when the edit-density crosses into distinct-entity territory.

Pass a CompareConfig to override the cost / budget / clustering scalars, e.g. budget_tolerance to shift between strict (payment-screening) and permissive (KYC-onboarding) profiles, or cost_* for sweep-based calibration. The default is recall-protective and matches industry-typical tuning.

Alignment

One unit of name-comparison evidence.

Three modes:

  • Symbol-paired edge: symbol is Some and both sides carry the same Symbol. Returned by pair_symbols. Default score is 1.0; consumers may override with a category default (e.g. SYM_SCORES[NAME] = 0.9).
  • Residue cluster: symbol is None, both sides non-empty. Returned by compare_parts for parts that aligned by edit distance.
  • Extra: symbol is None, exactly one side is empty. Represents a part that found no counterpart on the other side; the matcher applies a side-specific weight.

qps / rps / symbol / qstr / rstr are immutable post-construction. score and weight are mutable to support the matcher's policy passes (literal-equality rescue, extras-weight override, family-name boost). Both stored as Py<PyFloat> so Python-side reads are an INCREF rather than a fresh allocation per access.

__hash__ and __eq__ key on (symbol, qps, rps); NamePart already hashes by (index, form), so position is preserved. score and weight are not part of identity.
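
The three modes and the identity rule can be mimicked in pure Python. `AlignmentSketch` is a hypothetical stand-in for illustration, not the Rust-backed class: parts are modelled as `(index, form)` tuples and the symbol as a plain string.

```python
class AlignmentSketch:
    """Toy model of an alignment; identity ignores score and weight."""

    def __init__(self, qps, rps, symbol=None, score=0.0, weight=1.0):
        self.qps = tuple(qps)
        self.rps = tuple(rps)
        self.symbol = symbol
        self.score = score    # mutable: policy passes may adjust it
        self.weight = weight  # mutable: policy passes may adjust it

    @property
    def mode(self) -> str:
        if self.symbol is not None:
            return "symbol-paired edge"
        if self.qps and self.rps:
            return "residue cluster"
        return "extra"  # exactly one side is empty

    def _key(self):
        return (self.symbol, self.qps, self.rps)

    def __eq__(self, other):
        if not isinstance(other, AlignmentSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())
```

Two instances that cover the same parts compare equal even when a policy pass has changed the score of one of them, so a set of alignments stays stable across rescoring.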

__module__ = 'rigour._core' class-attribute

qps property

Query-side parts covered by this alignment.

qstr property

" ".join(p.comparable for p in qps), cached.

rps property

Result-side parts covered by this alignment.

rstr property

" ".join(p.comparable for p in rps), cached.

score property

Similarity in [0, 1]. For symbol-paired edges, defaults to 1.0; consumers override with a category default. For residue clusters, the per-cluster product. For extras, 0.0.

symbol property

Shared Symbol for symbol-paired edges; None for residue clusters and extras.

weight property

Aggregation weight in the matcher's weighted average. Defaults to 1.0; consumers override per category (SYM_WEIGHTS), for extras (nm_extra_*_name), for family-name boost (nm_family_name_weight), and for stopword down-weight.
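
How `score` and `weight` might combine can be shown with a plain weighted average. The text only says "weighted average", so the exact formula and the helper name `weighted_score` are assumptions of this sketch.

```python
def weighted_score(pairs):
    """Weighted average over (score, weight) pairs; 0.0 if no weight at all."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in pairs) / total_weight
```

Under this formula, down-weighting an unmatched extra raises the overall result: a perfect match plus an extra at weight 1.0 averages to 0.5, while the same extra at weight 0.25 gives 0.8.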

CompareConfig

Tunable cost / budget / clustering scalars for compare_parts.

Frozen by design: a sweep iteration constructs a fresh CompareConfig, the matcher caches one per request. Mutability would buy nothing (the values are read once per name pair) and would cost a runtime borrow check on each Rust-side access.

The default values reproduce the constants this struct replaced; compare_parts(qry, res) with no config argument is exactly equivalent to the pre-CompareConfig call.

__module__ = 'rigour._core' class-attribute

budget_log_base property

Logarithm base b in the per-side cost-budget formula log_b(max(len - budget_short_floor, 1)) * budget_tolerance, where b = budget_log_base. The base controls how aggressively the budget grows with token length: a smaller base means faster growth and a more permissive budget on long names.

budget_short_floor property

Short-token floor: tokens shorter than this contribute zero to the budget, so any non-zero edit fails the cap. This is the fail-closed property: the matcher refuses to fuzzy-match on 1-2 character tokens (vessel hull suffixes, isolated initials, 2-char Chinese given names) where typo / distinct-entity signal is too weak.

budget_tolerance property

Multiplier on the per-side cost budget. Lower is stricter (less edit tolerated before a cluster scores zero); higher is more permissive. Callers tune this per scenario — KYC at onboarding runs more permissive than payment screening.
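
The three budget scalars combine into the per-side budget formula quoted under budget_log_base. The sketch below evaluates it directly; the default values in the signature are placeholders chosen for the example, not the library's defaults, and `per_side_budget` is a hypothetical name.

```python
import math


def per_side_budget(length, log_base=2.0, short_floor=2, tolerance=1.0):
    """log_base(max(length - short_floor, 1)) * tolerance.

    The default arguments are illustrative placeholders only.
    """
    return math.log(max(length - short_floor, 1), log_base) * tolerance
```

With these placeholder values a 10-character side gets a budget of about 3 (log2 of 8), while very short sides get a budget of zero because the logarithm's argument is clamped to 1. Lowering `log_base` or raising `tolerance` enlarges the budget.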

cluster_overlap_min property

Overlap fraction (matched chars / shorter-side length) above which two parts pair into a cluster. A pair below this threshold surfaces as solo records, because the matched-character evidence is too thin to claim the parts are talking about the same token. The 0.51 default (i.e. "more than half") is the lowest value where a majority of the shorter token agrees.

cost_confusable property

Substitute between a confusable pair from resources/names/compare.yml (0/o, 1/l, …). OCR / transliteration / homoglyph noise — the writer was probably aiming at the same character.

cost_digit property

Edit involving a digit on either side. Digits identify specific things — vintage years, vessel hull numbers, fund vintages — so a digit mismatch is evidence of a different entity, not a typo.

cost_sep_drop property

Token boundary lost or gained on one side. Token merge/split (vanderbilt ↔ van der bilt) is a common surface-form variant of the same name; charging it almost nothing keeps the alignment from refusing to bridge whitespace artifacts.
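
The relative ordering of the cost scalars can be illustrated with a weighted edit distance. Everything here is an assumption made for the example: the numeric costs, the tiny confusable set, the rule that a confusable pair takes precedence over the digit cost, and the function names. The real cost model lives in the Rust core and reads its pairs from `compare.yml`.

```python
# Hypothetical costs: digit > ordinary letter > confusable > separator.
COST_DEFAULT = 1.0
COST_DIGIT = 1.5
COST_CONFUSABLE = 0.7
COST_SEP_DROP = 0.2

CONFUSABLE = {frozenset("0o"), frozenset("1l"), frozenset("ck")}


def indel_cost(ch: str) -> float:
    if ch == " ":
        return COST_SEP_DROP
    return COST_DIGIT if ch.isdigit() else COST_DEFAULT


def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return COST_CONFUSABLE  # assumed to win over the digit cost
    if a.isdigit() or b.isdigit():
        return COST_DIGIT
    return COST_DEFAULT


def weighted_distance(q: str, r: str) -> float:
    # Two-row dynamic programme, as for Levenshtein, with per-edit costs.
    prev = [0.0]
    for ch in r:
        prev.append(prev[-1] + indel_cost(ch))
    for a in q:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(r, 1):
            cur.append(min(
                prev[j] + indel_cost(a),       # delete from q
                cur[j - 1] + indel_cost(b),    # insert from r
                prev[j - 1] + sub_cost(a, b),  # substitute or match
            ))
        prev = cur
    return prev[-1]
```

With these placeholder costs, bridging "vanderbilt" and "van der bilt" costs far less than a single letter substitution, while changing two digits in a year costs more than two letter typos would.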

compare_parts(qry, res, config=None) builtin

Score the alignment of two NamePart lists.

Callers should hand over the residue (parts that earlier stages such as symbol pairing, alias tagging, and identifier matching couldn't explain by themselves), already canonicalised into positional order (tag_sort for ORG/ENT, align_person_name_order for PER). The function returns one Alignment per cluster, paired or solo; every input part appears exactly once across the output. Returned alignments carry symbol = None (residue distance is non-symbolic by definition).

config overrides the cost / budget / clustering scalars. Pass None (the default) to use the process-wide defaults, which match industry-typical recall-protective tuning. Sweep scripts build a fresh CompareConfig per iteration; matchers cache one per request.