Names
rigour.names
Name handling utilities for person and organisation names. This module contains a large (and growing) set of tools for handling names. In general, there are three types of names: people, organizations, and objects. Different normalization may be required for each of these types, including prefix removal for person names (e.g. "Mr." or "Ms.") and type normalization for organization names (e.g. "Incorporated" -> "Inc" or "Limited" -> "Ltd").
The Name class is meant to provide a structure for a name, including its original form, normalized form,
metadata on the type of thing described by the name, and the language of the name. The NamePart class
is used to represent individual parts of a name, such as the first name, middle name, and last name.
Alignment
One unit of name-comparison evidence.
Three modes:
- Symbol-paired edge: symbol is set and both sides carry the same Symbol. Returned by pair_symbols. Default score is 1.0; consumers may override with a category default (e.g. SYM_SCORES[NAME] = 0.9).
- Residue cluster: symbol is None, both sides non-empty. Returned by compare_parts for parts that aligned by edit distance.
- Extra: symbol is None, exactly one side is empty. Represents a part that found no counterpart on the other side; the matcher applies a side-specific weight.
qps / rps / symbol / qstr / rstr are immutable
post-construction. score and weight are mutable to support
the matcher's policy passes (literal-equality rescue,
extras-weight override, family-name boost). Both stored as
Py<PyFloat> so Python-side reads are an INCREF rather than a
fresh allocation per access.
__hash__ and __eq__ key on (symbol, qps, rps) —
NamePart already hashes by (index, form) so position is
preserved. score and weight are not part of identity.
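The three modes and the identity rule can be illustrated with a small pure-Python model. This is a sketch only, not the rigour implementation: `AlignmentSketch` and `mode()` are hypothetical names, parts are plain strings instead of NamePart objects, and the symbol is a plain string.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)
class AlignmentSketch:
    """Hypothetical stand-in for Alignment; parts are plain strings here."""

    qps: Tuple[str, ...]
    rps: Tuple[str, ...]
    symbol: Optional[str] = None
    score: float = 1.0   # mutable, not part of identity
    weight: float = 1.0  # mutable, not part of identity

    def mode(self) -> str:
        if self.symbol is not None:
            return "symbol-paired edge"
        if self.qps and self.rps:
            return "residue cluster"
        # Exactly one side is expected to be empty in this case.
        return "extra"

    # Identity keys on (symbol, qps, rps) only, as described above.
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, AlignmentSketch):
            return NotImplemented
        return (self.symbol, self.qps, self.rps) == (other.symbol, other.qps, other.rps)

    def __hash__(self) -> int:
        return hash((self.symbol, self.qps, self.rps))


edge = AlignmentSketch(("ltd",), ("limited",), symbol="ORG_CLASS:LTD")
cluster = AlignmentSketch(("smith",), ("smyth",), score=0.8)
extra = AlignmentSketch(("k",), (), score=0.0)
rescored = AlignmentSketch(("smith",), ("smyth",), score=0.5)
```

Because score and weight stay out of `__eq__` and `__hash__`, `cluster` and `rescored` compare equal and collapse to one entry in a set, which is the behaviour a policy pass relies on when it rewrites a score in place.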
__module__ = 'rigour._core'
class-attribute
qps
property
Query-side parts covered by this alignment.
qstr
property
" ".join(p.comparable for p in qps), cached.
rps
property
Result-side parts covered by this alignment.
rstr
property
" ".join(p.comparable for p in rps), cached.
score
property
Similarity in [0, 1]. For symbol-paired edges, defaults
to 1.0; consumers override with a category default. For
residue clusters, the per-cluster product. For extras,
0.0.
symbol
property
Shared Symbol for symbol-paired edges; None for
residue clusters and extras.
weight
property
Aggregation weight in the matcher's weighted average.
Defaults to 1.0; consumers override per category
(SYM_WEIGHTS), for extras (nm_extra_*_name), for
family-name boost (nm_family_name_weight), and for
stopword down-weight.
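As a rough illustration of how score and weight might combine, the sketch below computes a plain weighted average over (score, weight) pairs. The function name `weighted_average` and the zero-weight fallback are assumptions made for this example; the matcher's actual aggregation may differ.

```python
from typing import Iterable, Tuple


def weighted_average(pairs: Iterable[Tuple[float, float]]) -> float:
    """Combine (score, weight) pairs into one score in [0, 1]."""
    total_weight = 0.0
    weighted_sum = 0.0
    for score, weight in pairs:
        weighted_sum += score * weight
        total_weight += weight
    # Assumption: with no weight at all there is no evidence, so return 0.0.
    return weighted_sum / total_weight if total_weight > 0 else 0.0


# A full-score edge, a half-score cluster, and an extra (score 0.0) whose
# weight was raised to 2.0 by a hypothetical extras policy.
evidence = [(1.0, 1.0), (0.5, 1.0), (0.0, 2.0)]
result = weighted_average(evidence)
```

Raising the weight of the extra pulls the average down, which is why a side-specific extras weight changes the final score even though the extra's own score is fixed at 0.0.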
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__le__(value)
method descriptor
Return self<=value.
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
CompareConfig
Tunable cost / budget / clustering scalars for compare_parts.
Frozen by design: a sweep iteration constructs a fresh
CompareConfig, the matcher caches one per request. Mutability
would buy nothing (the values are read once per name pair) and
would cost a runtime borrow check on each Rust-side access.
The default values reproduce the constants this struct replaced;
compare_parts(qry, res) with no config argument is exactly
equivalent to the pre-CompareConfig call.
__module__ = 'rigour._core'
class-attribute
budget_log_base
property
Logarithm base b in the per-side cost-budget formula
log_b(max(len - budget_short_floor, 1)) * budget_tolerance,
with b = budget_log_base. The base controls how aggressively the
budget grows with token length: a smaller base means faster
growth, which is more permissive on long names.
budget_short_floor
property
Short-token floor: tokens shorter than this contribute zero to the budget, so any non-zero edit fails the cap. This is the fail-closed property: the matcher refuses to fuzzy-match on 1-2 character tokens (vessel hull suffixes, isolated initials, 2-char Chinese given names), where the signal separating a typo from a distinct entity is too weak.
budget_tolerance
property
Multiplier on the per-side cost budget. Lower is stricter (less edit tolerated before a cluster scores zero); higher is more permissive. Callers tune this per scenario — KYC at onboarding runs more permissive than payment screening.
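The three budget parameters interact through the formula quoted under budget_log_base. A minimal sketch of that formula follows; `cost_budget` is a hypothetical helper, and the parameter values are illustrative only, not the library defaults.

```python
import math


def cost_budget(length: int, log_base: float, short_floor: int, tolerance: float) -> float:
    """log_base-logarithm of max(length - short_floor, 1), scaled by tolerance."""
    return math.log(max(length - short_floor, 1), log_base) * tolerance


# Illustrative values only; these are not the library defaults.
short = cost_budget(2, log_base=2.0, short_floor=2, tolerance=1.0)
longer = cost_budget(10, log_base=2.0, short_floor=2, tolerance=1.0)
wider_base = cost_budget(10, log_base=3.0, short_floor=2, tolerance=1.0)
```

With these values the short token gets a zero budget (the logarithm of 1), the ten-character token gets a positive one, and moving the base from 2 to 3 shrinks the budget for the same token, matching the "smaller base is more permissive" rule.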
cluster_overlap_min
property
Overlap fraction (matched chars / shorter-side length) above which two parts pair into a cluster. A pair below this threshold surfaces as solo records because the matched-character evidence is too thin to claim the parts are the same token. The 0.51 default (i.e. "more than half") is the lowest value at which a majority of the shorter token agrees.
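To make the overlap fraction concrete, the sketch below counts matched characters with the standard library's difflib.SequenceMatcher. That is a stand-in technique chosen for the example, not the alignment rigour uses; `overlap_fraction` and `pairs_into_cluster` are hypothetical names, and the strict "greater than" comparison is an assumption about how "above" is applied.

```python
from difflib import SequenceMatcher


def overlap_fraction(left: str, right: str) -> float:
    """Matched characters divided by the length of the shorter string."""
    shorter = min(len(left), len(right))
    if shorter == 0:
        return 0.0
    blocks = SequenceMatcher(None, left, right, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / shorter


CLUSTER_OVERLAP_MIN = 0.51  # the documented default


def pairs_into_cluster(left: str, right: str) -> bool:
    # Assumption: the threshold is exclusive ("above which").
    return overlap_fraction(left, right) > CLUSTER_OVERLAP_MIN
```

Under this model "smith" and "smyth" share four of five characters and pair into a cluster, while "john" and "karl" share none and would surface as solo records.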
cost_confusable
property
Substitute between a confusable pair from
resources/names/compare.yml (0/o, 1/l, …). OCR /
transliteration / homoglyph noise — the writer was probably
aiming at the same character.
cost_digit
property
Edit involving a digit on either side. Digits identify specific things — vintage years, vessel hull numbers, fund vintages — so a digit mismatch is evidence of a different entity, not a typo.
cost_sep_drop
property
Token boundary lost or gained on one side. Token merge/split
(vanderbilt ↔ van der bilt) is a common surface-form
variant of the same name; charging it almost nothing keeps
the alignment from refusing to bridge whitespace artifacts.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
Name
A personal, organisational, or object name.
Equality and hashing are over form. A Name's tag can change
and its spans can grow without affecting either.
__module__ = 'rigour._core'
class-attribute
comparable
property
Space-joined part.comparable across the parts. Precomputed.
form
property
Normalised form. Defaults to casefold(original) if not
supplied at construction.
norm_form
property
Space-joined part.form across the parts. Precomputed.
original
property
Input string, verbatim.
parts
property
Tokens of form, one NamePart per token. Exposed as a
tuple so it's hashable: downstream code keys on
(span.parts, span.symbol.category) etc.
spans
property
Tagger output — grows over the name's lifetime via
apply_phrase / apply_part.
symbols
property
Aggregate view of every symbol the tagger has attached to this name. Useful when you want the symbol set regardless of which parts carry them (e.g. indexing the name's semantic annotations into a flat field).
tag
property
What kind of thing the name describes. Mutable —
infer_part_tags may upgrade ENT → ORG after tagging.
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__le__(value)
method descriptor
Return self<=value.
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
__str__()
method descriptor
Return str(self).
apply_part(part, symbol)
method descriptor
Record that a single NamePart carries symbol.
The single-part variant of apply_phrase. Used for
symbols that inherently apply to one token: INITIAL on a
single-character Latin part, NUMERIC inferred from a part
like "123456789" that the ordinal tagger didn't cover.
apply_phrase(phrase, symbol)
method descriptor
Record that phrase in this name carries symbol.
The tagger's output path: when the AC automaton reports a
recognised phrase (e.g. "limited liability company" →
ORG_CLASS:LLP), the match is attached as a Span so
downstream matching and inference can see which tokens the
symbol covers. Every non-overlapping occurrence of phrase
in the name gets its own Span.
Idempotent on (phrase, symbol): if any existing Span on
this name already carries the same symbol over the same
joined-form sequence, the call is a no-op. This keeps the
invariant "no duplicate (phrase, symbol) Spans" intact even
when more than one tagger fires on the same Name (e.g. the
org and person taggers both running on an ENT input).
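The non-overlapping scan and the idempotence rule described above can be sketched over plain token lists. This is an illustrative model, not the library code: `apply_phrase_sketch` is a hypothetical function, spans are (part indices, symbol) tuples, and the symbol string is made up for the example.

```python
from typing import List, Tuple

SpanSketch = Tuple[Tuple[int, ...], str]  # (part indices, symbol)


def apply_phrase_sketch(parts: List[str], spans: List[SpanSketch], phrase: str, symbol: str) -> None:
    tokens = phrase.split()
    if not tokens:
        return
    # Idempotence: do nothing if this symbol already covers the same token sequence.
    for indices, existing in spans:
        if existing == symbol and [parts[i] for i in indices] == tokens:
            return
    i = 0
    while i + len(tokens) <= len(parts):
        if parts[i:i + len(tokens)] == tokens:
            spans.append((tuple(range(i, i + len(tokens))), symbol))
            i += len(tokens)  # jump past the match so occurrences never overlap
        else:
            i += 1


parts = ["bank", "of", "new", "york", "new", "york"]
spans: List[SpanSketch] = []
apply_phrase_sketch(parts, spans, "new york", "LOCATION:example")
first_pass = list(spans)
apply_phrase_sketch(parts, spans, "new york", "LOCATION:example")
```

The first call records one span per occurrence; the second call finds an existing span with the same symbol over the same tokens and leaves the list unchanged.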
consolidate_names(names)
builtin
Drop short names that are contained in longer names.
Useful when building a matcher to prevent a scenario where a short version of a name ("John Smith") is matched against a query "John K Smith" — where the longer candidate version would have correctly disqualified the match ("John K Smith" != "John R Smith"). Keeping only the longer form forces the matcher to reckon with the full evidence.
Containment uses contains; see there for the
PER-aware subset rule. Accepts any Python iterable of Name;
returns a new set.
contains(other)
method descriptor
True iff this name structurally contains other.
Used by matcher pipelines to detect when one name's evidence
is a subset of another's — e.g. "John Smith" is contained in
"John K Smith", and the longer form supersedes the shorter
when consolidating candidate names before scoring
(see consolidate_names). Also backs middle-initial
matching: "John Smith" contains "J. Smith" when the J
carries an INITIAL symbol that self also has.
Rule: for PER names, every part of other must have a
(not-necessarily-adjacent) comparable-equal counterpart in
self. For non-PER names, or when the PER rule doesn't find
a full subset, falls back to substring containment of
norm_form. Returns False when self.tag == UNK or when
the two names are equal.
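A simplified model of the containment rule and of consolidation, using plain normalised strings in place of Name objects. `contains_sketch` and `consolidate_sketch` are hypothetical helpers; token counting via Counter is an assumption about how counterparts are matched, and the UNK check and INITIAL-symbol handling are not modelled.

```python
from collections import Counter
from typing import Iterable, Set


def contains_sketch(name: str, other: str, person: bool = True) -> bool:
    if name == other:
        return False  # equal names never contain each other
    if person:
        have, need = Counter(name.split()), Counter(other.split())
        # Every token of `other` needs a counterpart in `name`, in any position.
        if all(have[token] >= count for token, count in need.items()):
            return True
    # Fallback: substring containment of the normalised form.
    return other in name


def consolidate_sketch(names: Iterable[str]) -> Set[str]:
    pool = set(names)
    return {n for n in pool if not any(contains_sketch(m, n) for m in pool)}


kept = consolidate_sketch(["john smith", "john k smith", "jane doe"])
```

Here "john smith" is dropped because "john k smith" contains it, so only the longer form remains to be scored against a query.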
tag_text(text, tag, max_matches=1)
method descriptor
Tag the parts that spell out text with the given tag.
Used when external metadata tells the caller the structural
role of a subset of the name's tokens. For example, an FTM
firstName property of "Jean Claude" on a name "Jean Claude
Juncker" marks both the jean and claude parts as
GIVEN; a lastName of "Juncker" then marks the remaining
part as FAMILY.
Walks self.parts looking for a contiguous
(adjacency-insensitive) match of the tokenised text. On a
hit, each matched part's tag is set to tag; parts that
already carry a tag that conflicts under
[NamePartTag::can_match] demote to AMBIGUOUS instead.
Stops after max_matches successful matches.
NamePart
A single tagged component of a Name.
Equality and hashing are over (index, form) — the immutable
identity of the part. tag can be re-written after construction
without invalidating either.
__module__ = 'rigour._core'
class-attribute
ascii
property
ASCII-ified form of form for admitted-script parts;
None when the part is outside the admitted scripts or
reduces to empty after stripping non-alphanumerics.
comparable
property
Best-effort matchable form: integer string for numerics,
form for non-latinize parts, ascii otherwise.
form
property
Token text, as tokenised from the parent name's form.
index
property
Position of this part within the parent name's parts list.
integer
property
Parsed integer value for numeric parts, or None when the
part isn't numeric or doesn't fit an i64.
latinize
property
True if form is in an admitted-script set (Latin,
Cyrillic, Greek, Armenian, Georgian, Hangul) and thus can
be meaningfully ASCII-ified.
metaphone
property
Metaphone phonetic key, or None when phonetics were
disabled at construction or the part doesn't qualify
(non-latinize, numeric, or shorter than three characters).
numeric
property
True if form is entirely numeric characters.
tag
property
Structural role of this part. Set by the tagging pipeline;
UNSET at construction.
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__le__(value)
method descriptor
Return self<=value.
__len__()
method descriptor
Return len(self).
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
tag_sort(parts)
builtin
Sort name parts into canonical display order.
Used when rendering a name back out for humans: honorifics
come first, then given names, middle, family, suffixes,
legal forms, and stopwords — independent of the input word
order. A tokeniser might hand the parts over as "Guttenberg
zu Karl-Theodor" (order from the source data); tag_sort
restores "Karl-Theodor zu Guttenberg" shape once the parts
have been tagged. Sort is stable across parts with the same
tag; see NAME_TAGS_ORDER for the full
ordering.
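The idea of a stable, tag-ranked sort can be sketched as follows. The `ORDER` list is an abbreviated ordering read off the prose above, not the real NAME_TAGS_ORDER; `tag_sort_sketch` is a hypothetical function, and parts are modelled as (form, tag-name) pairs.

```python
from typing import List, Tuple

# Abbreviated ordering taken from the prose above; the real list is longer.
ORDER = ["HONORIFIC", "GIVEN", "MIDDLE", "FAMILY", "SUFFIX", "LEGAL", "STOP"]
RANK = {tag: position for position, tag in enumerate(ORDER)}


def tag_sort_sketch(parts: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (form, tag) pairs by tag rank; sorted() keeps ties in input order."""
    # Assumption: tags missing from ORDER sort last.
    return sorted(parts, key=lambda part: RANK.get(part[1], len(ORDER)))


tagged = [
    ("smith", "FAMILY"),
    ("dr", "HONORIFIC"),
    ("paul", "GIVEN"),
    ("john", "GIVEN"),
    ("jr", "SUFFIX"),
]
display = " ".join(form for form, _ in tag_sort_sketch(tagged))
```

Because Python's sorted() is stable, the two GIVEN parts keep their input order ("paul" before "john") while the other parts move to their ranked positions.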
NamePartTag
The structural role of a part within a name. A newly-constructed
NamePart starts as UNSET; the tagging
pipeline promotes it based on external hints (firstName,
lastName, …) or pattern matches (numeric, stopword, legal form).
AMBIGUOUS = NamePartTag.AMBIGUOUS
class-attribute
FAMILY = NamePartTag.FAMILY
class-attribute
GIVEN = NamePartTag.GIVEN
class-attribute
HONORIFIC = NamePartTag.HONORIFIC
class-attribute
LEGAL = NamePartTag.LEGAL
class-attribute
MATRONYMIC = NamePartTag.MATRONYMIC
class-attribute
MIDDLE = NamePartTag.MIDDLE
class-attribute
NICK = NamePartTag.NICK
class-attribute
NUM = NamePartTag.NUM
class-attribute
PATRONYMIC = NamePartTag.PATRONYMIC
class-attribute
STOP = NamePartTag.STOP
class-attribute
SUFFIX = NamePartTag.SUFFIX
class-attribute
TITLE = NamePartTag.TITLE
class-attribute
TRIBAL = NamePartTag.TRIBAL
class-attribute
UNSET = NamePartTag.UNSET
class-attribute
__module__ = 'rigour._core'
class-attribute
value
property
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__int__()
method descriptor
int(self)
__le__(value)
method descriptor
Return self<=value.
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__repr__()
method descriptor
Return repr(self).
NameTypeTag
What kind of thing a name describes. Drives which pipeline passes
apply when a Name is analysed.
ENT = NameTypeTag.ENT
class-attribute
OBJ = NameTypeTag.OBJ
class-attribute
ORG = NameTypeTag.ORG
class-attribute
PER = NameTypeTag.PER
class-attribute
UNK = NameTypeTag.UNK
class-attribute
__module__ = 'rigour._core'
class-attribute
value
property
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__int__()
method descriptor
int(self)
__le__(value)
method descriptor
Return self<=value.
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__repr__()
method descriptor
Return repr(self).
Span
A contiguous group of NameParts annotated with a
Symbol; the tagger's output unit.
__module__ = 'rigour._core'
class-attribute
comparable
property
Space-joined part.comparable over the covered parts, for
use in matcher-side substring checks.
parts
property
The NameParts covered by this span. These are the same Py<NamePart>
references that live in the parent Name's
.parts, so the identity check span.parts[0] is name.parts[i] holds
from Python for the matching index i. Exposed as a tuple, which is
hashable, so downstream code can key on
(span.parts, span.symbol.category) when deduplicating pairings.
symbol
property
The symbol this span carries.
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__le__(value)
method descriptor
Return self<=value.
__len__()
method descriptor
Return len(self).
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
Symbol
A semantic interpretation applied to one or more parts of a name.
Carries a SymbolCategory and an id. Tagger pipelines emit
Symbols during name analysis; matchers compare them between
names as a coarse compatibility signal, and indexers flatten
them into searchable fields. Equality and hashing are
structural over (category, id).
Ids are always str: integer-sourced ids (Wikidata QIDs,
ordinals, initial codepoints) are decimal-stringified at
construction. Distinct Symbols with equal ids share one
Arc<str> heap allocation via interning.
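The identity and id rules can be modelled in pure Python. `SymbolSketch` is a hypothetical class, categories are plain strings here, and sys.intern stands in for the Rust-side interning as a loose analogy only.

```python
import sys
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True, init=False)
class SymbolSketch:
    """Hypothetical model: equality and hashing over (category, id)."""

    category: str
    id: str

    def __init__(self, category: str, id: Union[str, int]) -> None:
        # frozen=True blocks normal assignment, so set fields via object.__setattr__.
        object.__setattr__(self, "category", category)
        # Integer ids become their decimal string; equal ids share one str object.
        object.__setattr__(self, "id", sys.intern(str(id)))


by_int = SymbolSketch("NAME", 42)
by_str = SymbolSketch("NAME", "42")
other_category = SymbolSketch("NICK", 42)
```

In this model `by_int` and `by_str` are equal and hash alike because the integer id was stringified, while `other_category` differs from both despite sharing the same interned id string.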
__module__ = 'rigour._core'
class-attribute
category
property
id
property
The interned id string. Always str on the Python side —
ids originally passed as int return as their decimal form.
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__init__(category, id)
id is decimal-stringified if passed as int.
__le__(value)
method descriptor
Return self<=value.
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
__str__()
method descriptor
Return str(self).
SymbolCategory
The kind of semantic annotation a Symbol carries. Drives how
strongly a symbol match counts during scoring — an ORG_CLASS
match is a strong corporate-form signal, an INITIAL match is
weak evidence that needs token-level corroboration.
DOMAIN = SymbolCategory.DOMAIN
class-attribute
INITIAL = SymbolCategory.INITIAL
class-attribute
LOCATION = SymbolCategory.LOCATION
class-attribute
NAME = SymbolCategory.NAME
class-attribute
NICK = SymbolCategory.NICK
class-attribute
NUMERIC = SymbolCategory.NUMERIC
class-attribute
ORG_CLASS = SymbolCategory.ORG_CLASS
class-attribute
PHONETIC = SymbolCategory.PHONETIC
class-attribute
The kind of semantic annotation a [Symbol] carries. Drives how
strongly a symbol match counts during scoring — an ORG_CLASS
match is a strong corporate-form signal, an INITIAL match is
weak evidence that needs token-level corroboration.
SYMBOL = SymbolCategory.SYMBOL
class-attribute
The kind of semantic annotation a [Symbol] carries. Drives how
strongly a symbol match counts during scoring — an ORG_CLASS
match is a strong corporate-form signal, an INITIAL match is
weak evidence that needs token-level corroboration.
value
property
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__int__()
method descriptor
int(self)
__le__(value)
method descriptor
Return self<=value.
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__repr__()
method descriptor
Return repr(self).
align_person_name_order(left, right)
builtin
Greedy-align two lists of name parts so comparable tokens share the same output index.
Used by the name matcher to reorder remaining tokens after symbolic tagging so a downstream per-index similarity pass compares like with like. Pairs are chosen by a length-desc, left-major walk over edit-similarity scores; ties are broken stably by input order so the output is deterministic.
Returns ([], tag_sort(right)) when left is empty, falls back
to (tag_sort(left), tag_sort(right)) when no pair scores above
the similarity floor, otherwise returns the greedy-aligned pair.
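The greedy walk described above can be approximated in a few lines. This is a loose sketch under stated assumptions, not the rigour implementation: `difflib.SequenceMatcher` stands in for the edit-similarity score, `sorted` stands in for `tag_sort`, tokens are plain strings rather than name parts, and `SIMILARITY_FLOOR` is a made-up value.

```python
from difflib import SequenceMatcher

SIMILARITY_FLOOR = 0.3  # hypothetical floor, not the library's value


def sketch_align(left, right):
    """Greedy-pair tokens so that comparable tokens share an output index."""
    if not left:
        return [], sorted(right)
    remaining = list(right)
    out_left, out_right, unpaired_left = [], [], []
    # Longest left tokens first; sorted() is stable, so ties keep input order.
    for token in sorted(left, key=len, reverse=True):
        best, best_score = None, SIMILARITY_FLOOR
        for candidate in remaining:
            score = SequenceMatcher(None, token, candidate).ratio()
            if score > best_score:
                best, best_score = candidate, score
        if best is None:
            unpaired_left.append(token)
        else:
            remaining.remove(best)
            out_left.append(token)
            out_right.append(best)
    if not out_left:
        # Nothing scored above the floor: fall back to sorted inputs.
        return sorted(left), sorted(right)
    return out_left + unpaired_left, out_right + remaining


aligned = sketch_align(["john", "smith"], ["smith", "jon"])
```

Here `smith` pairs with `smith` and `john` with `jon`, so both output lists put the family name at index 0.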
analyze_names(type_tag, names, part_tags=None, *, infer_initials=False, symbols=True, phonetics=True, numerics=True, consolidate=True, rewrite=True)
Build a set of tagged Name objects from raw strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| type_tag | NameTypeTag | The NameTypeTag for every name in this batch. Drives which prefix/org-type/tagger passes run: … | required |
| names | Sequence[str] | Raw name strings as harvested from the source entity. Empty strings and inputs that normalise to empty are dropped. Duplicates (after prenormalisation) are de-duplicated. | required |
| part_tags | Optional[Mapping[NamePartTag, Sequence[str]]] | Pre-classified part annotations, typically produced by an adapter that reads structured name-part properties off the source entity (e.g. firstName → …) | None |
| infer_initials | bool | When … | False |
| symbols | bool | Master switch for symbol emission. When … | True |
| phonetics | bool | When … | True |
| numerics | bool | When … | True |
| consolidate | bool | When … | True |
| rewrite | bool | When … | True |
Returns:
| Type | Description |
|---|---|
| Set[Name] | A set of tagged … form. Empty if every input normalised to an empty string. |
Source code in rigour/names/analyze.py
compare_parts(qry, res, config=None)
builtin
Score the alignment of two NamePart lists.
Callers should hand over the residue — parts that earlier stages
(symbol pairing, alias tagging, identifier matching) couldn't
explain by themselves — already canonicalised into positional
order (tag_sort for ORG/ENT, align_person_name_order for PER).
The function returns one [Alignment] per cluster, paired or
solo; every input part appears exactly once across the output.
Returned alignments carry symbol = None (residue distance is
non-symbolic by definition).
config overrides the cost / budget / clustering scalars. Pass
None (the default) to use the process-wide defaults — those
match industry-typical recall-protective tuning. Sweep scripts
build a fresh [CompareConfig] per iteration; matchers cache one
per request.
contains_split_phrase(string)
Check whether string contains an alias-marker phrase.
Detects markers like "a.k.a.", "f.k.a.", "née", "alias",
that signal a single string actually carries multiple distinct
names. Useful for triaging input — a string with a split
phrase shouldn't be treated as one atomic name. The phrase
list is data-driven from
resources/names/stopwords.yml::NAME_SPLIT_PHRASES,
surfaced via rigour._core.name_split_phrases_list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| string | str | An input that may contain one or more names. | required |
Returns:
| Type | Description |
|---|---|
| bool | |
Source code in rigour/names/split_phrases.py
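As an illustration of the triage check described above, the sketch below tests for a marker phrase as a whole token sequence. The phrase tuple is a hypothetical four-entry sample taken from the examples in the description, not the contents of the resource file, and the matching strategy is an assumption.

```python
# Hypothetical sample; the real list is loaded from the resource file.
SKETCH_SPLIT_PHRASES = ("a.k.a.", "f.k.a.", "née", "alias")


def sketch_contains_split_phrase(string):
    """Return True if any marker phrase occurs as a whitespace-delimited token."""
    # Casefold, collapse whitespace, and pad so phrases match only at token boundaries.
    padded = " " + " ".join(string.casefold().split()) + " "
    return any(f" {phrase} " in padded for phrase in SKETCH_SPLIT_PHRASES)
```

A caller would typically route strings for which this returns True to a splitting step instead of treating them as one atomic name.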
extract_org_types(name, normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop, generic=False)
Find every organisation-type designation in a name.
Scans name for recognised aliases (LLC, Inc, GmbH, ...) and returns
the matched substring and its canonical target. A poor-person's
"is this a company name?" detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The text to be processed. Assumed to already be normalised with the same … | required |
| normalize_flags | Normalize | | CASEFOLD |
| cleanup | Cleanup | | Noop |
| generic | bool | If True, target values are the generic form ( … | False |
Returns:
| Type | Description |
|---|---|
| List[Tuple[str, str]] | A list of … non-overlapping match. Empty if nothing matches. |
Source code in rigour/names/org_types.py
is_name(name)
Check whether name plausibly contains a name.
Loose filter — true iff at least one character is a Unicode
letter (general category L*). Useful for rejecting purely
numeric ("007") or punctuation-only ("---") inputs before
handing them to the rest of the name pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | A string. | required |
Returns:
| Type | Description |
|---|---|
| bool | |
Source code in rigour/names/check.py
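The letter test described above maps directly onto the standard library. The following is a minimal sketch of that rule using `unicodedata`; `sketch_is_name` is a hypothetical helper and may differ from the actual implementation in edge cases.

```python
import unicodedata


def sketch_is_name(name):
    """True if at least one character has a Unicode general category starting with L."""
    return any(unicodedata.category(ch).startswith("L") for ch in name)
```

Purely numeric or punctuation-only inputs such as `"007"` or `"---"` contain no letter and are rejected, while any string with a single letter in any script passes.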
is_stopword(form, *, normalizer=normalize_name, normalize=False)
Check if the given form is a stopword. The stopword list is normalized first.
Deprecated: use rigour.text.is_stopword instead. This function will be removed in a future version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | str | The token to check, must already be normalized. | required |
| normalizer | Normalizer | The normalizer to use for checking stopwords. | normalize_name |
| normalize | bool | Whether to normalize the form before checking. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the form is a stopword, False otherwise. |
Source code in rigour/names/check.py
normalize_name(name)
cached
Casefold and tokenise a name into a stable matching key.
Convenience composition of tokenize_name over a
casefolded input, rejoined with single ASCII spaces.
Equivalent to calling
normalize(name, Normalize.CASEFOLD | Normalize.NAME);
use that directly when callers want explicit flag control.
Used internally by the rigour name-matching utilities; not intended as a general-purpose public surface.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | Optional[str] | A name string, or … | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | Normalised name (lowercase, single-space-separated tokens), or … empty. |
Source code in rigour/names/tokenize.py
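The casefold, tokenise, rejoin composition can be sketched with the standard library alone. This is an approximation, not the library code: a regex split on non-word characters stands in for tokenize_name, and returning None for missing or empty input is an assumption based on the Optional[str] signature.

```python
import re
from typing import Optional


def sketch_normalize_name(name: Optional[str]) -> Optional[str]:
    """Casefold, split into tokens, and rejoin with single ASCII spaces."""
    if name is None:
        return None
    # Assumption: any run of non-word characters (or underscores) separates tokens.
    tokens = [token for token in re.split(r"[\W_]+", name.casefold()) if token]
    return " ".join(tokens) or None
```

The result is meant as a matching key, so two spellings that differ only in case, punctuation, or spacing collapse to the same string.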
pair_symbols(query, result)
builtin
Align the symbol spans of two [Name]s into coverage-maximal
pairings.
Each returned pairing is a tuple of non-conflicting
[Alignment]s; edges within a pairing cover disjoint parts on
each side. Each Alignment has symbol = Some(_) and a
placeholder score = 1.0 — consumers should override the
score with a per-category default before composing the pairing
total. Pairings are distinguished by their coverage and
category multiset — two pairings that cover the same parts
with the same category mix are collapsed to one. Distinct
category choices on the same parts (e.g. a token carrying both
NAME:Qvan and SYMBOL:van) surface as separate pairings.
Returns [()] (a single empty pairing) when either name has
more than 64 parts, when either name has no tagger spans, or
when no symbol is shared between the two sides.
pick_case(names)
Pick the best mix of lower- and uppercase characters from a set of names that are identical except for case. If the names differ in anything other than case, the result is undefined.
Rust-backed via rigour._core.pick_case. The Rust
implementation returns None for empty input; this Python wrapper
raises ValueError instead, preserving the pre-port contract that
external callers rely on.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| names | List[str] | A list of identical names in different cases. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The best name for display. |
Source code in rigour/names/pick.py
remove_obj_prefixes(name)
Strip vessel-class and generic-article prefixes from the head of an object name.
Drops "M/V", "SS", "The", etc. so "M/V Oceanic" →
"Oceanic" doesn't penalise the shorter variant when
matching vessels, vehicles, or aircraft. Driven by
resources/names/stopwords.yml::OBJ_NAME_PREFIXES.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | An object (vessel / vehicle / aircraft) name string. | required |
Returns:
| Type | Description |
|---|---|
| str | The name with any leading prefix(es) removed. |
Source code in rigour/names/prefix.py
remove_org_prefixes(name)
Strip article-like prefixes from the head of an organisation name.
Drops "The", etc. so "The Charitable Trust" →
"Charitable Trust" doesn't penalise the shorter variant
when matching. Driven by
resources/names/stopwords.yml::ORG_NAME_PREFIXES.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | An organisation name string. | required |
Returns:
| Type | Description |
|---|---|
| str | The name with any leading article-prefix(es) removed. |
Source code in rigour/names/prefix.py
remove_org_types(name, replacement='', normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop)
Remove organisation-type designations from a name.
Every recognised alias (LLC, Inc, GmbH, ...) in name is replaced
with replacement. Useful as a preprocessing step when you want
the entity name stripped of legal-form noise.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The text to be processed. Assumed to already be normalised with the same … | required |
| replacement | str | The string to replace each matched alias with. Default … | '' |
| normalize_flags | Normalize | | CASEFOLD |
| cleanup | Cleanup | | Noop |
Returns:
| Type | Description |
|---|---|
| str | The text with recognised organisation types replaced. May be empty if the input consisted entirely of matched aliases and … |
Source code in rigour/names/org_types.py
remove_person_prefixes(name)
Strip honorific prefixes from the head of a person name.
Drops "Mr.", "Mrs.", "Dr.", "Lady", etc. so honorifics
don't contaminate part alignment in matching or token-bag
comparison. The list is data-driven from
resources/names/stopwords.yml::PERSON_NAME_PREFIXES,
surfaced via rigour._core.person_name_prefixes_list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | A person name string. | required |
Returns:
| Type | Description |
|---|---|
| str | The name with any leading honorific(s) removed. Idempotent for inputs that don't start with a known prefix. |
Source code in rigour/names/prefix.py
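Head-anchored prefix stripping of this kind can be illustrated with a single regular expression. The sketch below assumes a hypothetical five-entry prefix list (the real list lives in the resource file) and is not the library's implementation; the same pattern would apply to the organisation and object prefix helpers.

```python
import re

# Hypothetical sample of honorifics; the real list is data-driven.
SKETCH_PREFIXES = ("mr", "mrs", "ms", "dr", "lady")

# One or more prefixes at the head, each optionally followed by a dot, then whitespace.
_PREFIX_RE = re.compile(
    r"^(?:(?:%s)\.?\s+)+" % "|".join(SKETCH_PREFIXES),
    re.IGNORECASE,
)


def sketch_remove_person_prefixes(name):
    """Strip leading honorifics; names without a known prefix pass through unchanged."""
    return _PREFIX_RE.sub("", name)
```

Because the pattern requires whitespace after each prefix, a name such as "Drew Barrymore" is left alone even though it starts with the letters "Dr".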
replace_org_types_compare(name, normalize_flags=Normalize.CASEFOLD, cleanup=Cleanup.Noop, generic=False)
Replace organisation types in a name with a heavily normalised form.
Country-specific entity types (e.g. GmbH, Aktiengesellschaft, ООО) are
rewritten into a simplified comparison form (e.g. gmbh, ag,
ooo) suitable for string-distance matching. The result is meant
for comparison pipelines, not for presentation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The text to be processed. Assumed to already be normalised with the same … | required |
| normalize_flags | Normalize | | CASEFOLD |
| cleanup | Cleanup | | Noop |
| generic | bool | If True, substitute the generic form of the organisation type (e.g. …) | False |
Returns:
| Type | Description |
|---|---|
| str | The text with recognised organisation types substituted. If every match would reduce the text to an empty string, the original text is returned unchanged. |
Source code in rigour/names/org_types.py
replace_org_types_display(name, normalize_flags=Normalize.CASEFOLD | Normalize.SQUASH_SPACES, cleanup=Cleanup.Noop)
Replace organisation types in a name with their short display form.
Spelt-out legal forms are shortened into common abbreviations
(e.g. "Siemens Aktiengesellschaft" → "Siemens AG"), preserving
the case of non-matched portions. If the whole input is uppercase
(str.isupper()), the whole output is re-uppercased.
Matches case-insensitively across Unicode by casefolding a copy of
the input internally for the match step — normalize_flags must
therefore include Normalize.CASEFOLD so the alias table is
casefolded too. The default does this.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The text to be processed. | required |
| normalize_flags | Normalize | | CASEFOLD \| SQUASH_SPACES |
| cleanup | Cleanup | | Noop |
Returns:
| Type | Description |
|---|---|
| str | The text with recognised organisation types substituted for their display form. Non-matched regions keep their original case. |
Source code in rigour/names/org_types.py
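The display-form behaviour (case-insensitive matching, untouched case elsewhere, re-uppercasing of all-caps input) can be mimicked with a tiny alias table. Everything in this sketch is illustrative: the three-entry `SKETCH_DISPLAY_FORMS` table and the regex-based matching are assumptions, not the library's alias data or matching engine.

```python
import re

# Hypothetical alias table: casefolded long form -> short display form.
SKETCH_DISPLAY_FORMS = {
    "aktiengesellschaft": "AG",
    "limited": "Ltd",
    "incorporated": "Inc",
}

_ALIAS_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, SKETCH_DISPLAY_FORMS)) + r")\b",
    re.IGNORECASE,
)


def sketch_display(name):
    """Shorten spelt-out legal forms; keep the case of everything else."""
    out = _ALIAS_RE.sub(lambda m: SKETCH_DISPLAY_FORMS[m.group(0).casefold()], name)
    # An all-uppercase input yields an all-uppercase output.
    return out.upper() if name.isupper() else out
```

Looking matches up by their casefolded form is what lets "LIMITED", "Limited", and "limited" all resolve to the same table entry.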
representative_names(names, limit, cluster_threshold=0.3)
Reduce a bag of aliases to at most limit representatives
without extreme information loss.
Useful when a downstream process (e.g. building a search-index
query) wants to probe the alias space broadly under a budget
cap. For a person with 20 transliterations of one name and
limit=5, this returns ~1-5 centroid-selected representatives
rather than all 20 near-identical forms. For a person with two
genuinely distinct names (Nelson Mandela / Rolihlahla Mandela),
both survive — N transliterations of one name don't add recall,
but a second name does.
Fast path: if the input already collapses to <= limit
distinct names (after casefold-dedup via reduce_names),
those names are returned as-is without clustering. Compression
only runs when the input actually needs to be compressed. This
means cluster_threshold has no effect when the fast path
fires.
Ordering of the returned list is not guaranteed. Returned
strings are originals from the input; when clustering runs,
pick_name selects the best-case representative for each
cluster.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| names | List[str] | Input aliases, typically all belonging to one entity. | required |
| limit | int | Upper bound on output size. | required |
| cluster_threshold | float | Normalized Levenshtein distance (0..1) above which two names are considered distinct names rather than variants of one. Default 0.3 keeps transliterations together while separating genuinely different names. Ignored when the fast path fires. | 0.3 |
Source code in rigour/names/pick.py
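To make the fast path and the threshold semantics concrete, here is a simplified sketch. It is not the library algorithm: the greedy single-pass clustering, the centroid choice by summed distance, and keeping the largest clusters first are all assumptions, and the real function delegates the per-cluster choice to pick_name.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def norm_distance(a, b):
    """Edit distance scaled to 0..1 by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def sketch_representatives(names, limit, cluster_threshold=0.3):
    unique = {}
    for name in names:
        unique.setdefault(name.casefold(), name)
    distinct = list(unique.values())
    if len(distinct) <= limit:
        return distinct  # fast path: nothing to compress
    clusters = []
    for name in distinct:
        for cluster in clusters:
            if norm_distance(name.casefold(), cluster[0].casefold()) <= cluster_threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    clusters.sort(key=len, reverse=True)
    # Per cluster, keep the member closest to all the others.
    return [
        min(c, key=lambda n: sum(norm_distance(n.casefold(), o.casefold()) for o in c))
        for c in clusters[:limit]
    ]


aliases = ["Nelson Mandela", "Nelson Mandala", "Nelson Mandella", "Rolihlahla Mandela"]
picked = sketch_representatives(aliases, limit=2)
```

With these inputs the three near-identical spellings collapse into one cluster while the genuinely different second name forms its own, so both names survive under the budget of two.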
tokenize_name(text, token_min_length=1)
Split a person or entity's name into name parts.
Unicode general-category-aware: separator categories (spaces, punctuation, math symbols) split tokens; delete categories (combining marks, modifier letters, format chars) drop; letters, numbers, and a small set of CJK modifier marks are kept.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The name to tokenize. | required |
| token_min_length | int | Drop tokens shorter than this many codepoints. Defaults to 1 (drop only zero-length). | 1 |
Returns:
| Type | Description |
|---|---|
| List[str] | Tokens in left-to-right order, with any deletion or whitespace-substitution applied. Order matches input. |
Source code in rigour/names/tokenize.py
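A category-driven tokenizer of the kind described for tokenize_name can be sketched with `unicodedata`. The category sets below are a simplification chosen for illustration (for example, the CJK modifier-mark exception is not modelled, and every non-letter, non-number character that is not deleted is treated as a separator), so results may differ from the library in edge cases.

```python
import unicodedata

# Assumed "delete" categories: combining marks, modifier letters, format characters.
DELETE_CATEGORIES = {"Mn", "Mc", "Me", "Lm", "Cf"}


def sketch_tokenize_name(text, token_min_length=1):
    """Split on separators, drop deletable characters, keep letters and numbers."""
    tokens, current = [], []
    for ch in text:
        category = unicodedata.category(ch)
        if category in DELETE_CATEGORIES:
            continue  # dropped without breaking the token
        if category[0] in ("L", "N"):
            current.append(ch)
        elif current:
            # Anything else (spaces, punctuation, symbols) ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [token for token in tokens if len(token) >= token_min_length]
```

Raising `token_min_length` to 2 is one way to drop isolated initials before comparison, at the cost of losing single-character names.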
rigour.names.analyze
End-to-end name analysis: raw strings → tagged Name objects.
analyze_names is the unified entry point that downstream consumers
(followthemoney's entity_names, in turn used by nomenklatura and
yente) call once per entity to get matchable Name objects.
Rust-backed via rigour._core.analyze_names — one FFI crossing per
call, regardless of how many names / part_tags the entity has. The
single-call pipeline runs: prefix strip → prenormalize → org-type
replacement (for ORG/ENT) → Name + NamePart construction → part
tagging via Name.tag_text → tagger match-and-apply → NUMERIC /
STOP / LEGAL inference → optional consolidate_names.
Part-tag value shape
part_tags values can be multi-token strings. A value like
"Jean Claude" in part_tags[NamePartTag.GIVEN] for the name
"Jean Claude Juncker" will tag both the "jean" and
"claude" parts as GIVEN — the underlying Name.tag_text
tokenises the value and walks the name parts looking for the token
sequence. The tokens of the value don't need to be adjacent in the
name, just present in order.
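The in-order, not-necessarily-adjacent matching rule can be shown with a short subsequence walk. This sketch uses plain lowercase strings for the name parts and a hypothetical helper name; the real Name.tag_text operates on NamePart objects and applies tags rather than returning indices.

```python
def sketch_match_part_tag(parts, value):
    """Return the indices of name parts matched by the tokens of value, in order.

    Returns an empty list if the tokens cannot all be found in sequence.
    """
    matched, start = [], 0
    for token in value.casefold().split():
        try:
            index = parts.index(token, start)
        except ValueError:
            return []
        matched.append(index)
        start = index + 1  # later tokens must appear after this one
    return matched
```

For the parts `["jean", "claude", "juncker"]` and the value `"Jean Claude"`, the walk matches indices 0 and 1, which would both be tagged GIVEN.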
analyze_names(type_tag, names, part_tags=None, *, infer_initials=False, symbols=True, phonetics=True, numerics=True, consolidate=True, rewrite=True)
Build a set of tagged Name objects from raw strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| type_tag | NameTypeTag | The NameTypeTag for every name in this batch. Drives which prefix/org-type/tagger passes run: … | required |
| names | Sequence[str] | Raw name strings as harvested from the source entity. Empty strings and inputs that normalise to empty are dropped. Duplicates (after prenormalisation) are de-duplicated. | required |
| part_tags | Optional[Mapping[NamePartTag, Sequence[str]]] | Pre-classified part annotations, typically produced by an adapter that reads structured name-part properties off the source entity (e.g. firstName → …) | None |
| infer_initials | bool | When … | False |
| symbols | bool | Master switch for symbol emission. When … | True |
| phonetics | bool | When … | True |
| numerics | bool | When … | True |
| consolidate | bool | When … | True |
| rewrite | bool | When … | True |
Returns:
| Type | Description |
|---|---|
| Set[Name] | A set of tagged … form. Empty if every input normalised to an empty string. |
Source code in rigour/names/analyze.py
rigour.names.compare
Residue-distance scoring for two NamePart lists.
Reach for compare_parts when a name matcher has already peeled off the parts it can explain by other means — symbol pairing, alias tagging, identifier hits — and is left with a residue that needs a fuzzy-match verdict (typo, transliteration drift, surface-form variants of the same token).
The function returns one
Alignment per cluster of aligned
parts (paired or solo). Every input part appears in exactly one
alignment, so a caller can sum / weight / threshold the result
without losing track of which inputs got accounted for. Returned
alignments carry symbol = None (residue distance is non-symbolic
by definition).
The cost model penalises digit mismatches more than letter mismatches,
treats visually / phonetically confusable pairs (0/o, 1/l,
c/k, …) as cheap edits, and charges almost nothing for token
merge / split. A length-dependent budget caps the per-side similarity
at zero once the total cost exceeds what's plausible for typo noise —
the matcher refuses to fuzzy-match when the edit-density crosses into
distinct-entity territory.
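The relative ordering of edit costs in this model can be illustrated with a toy substitution-cost function. The numeric values, the three confusable pairs, and the rule that a confusable pair wins over the digit penalty are all assumptions made for the sketch; the real scalars live in CompareConfig and the real pairs in the resource file.

```python
# Hypothetical values, chosen only to show the ordering of costs.
COST_CONFUSABLE = 0.5
COST_DEFAULT = 1.0
COST_DIGIT = 1.5

# Small sample of confusable pairs mentioned in the description.
CONFUSABLE_PAIRS = {frozenset("0o"), frozenset("1l"), frozenset("ck")}


def sketch_substitution_cost(a, b):
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE_PAIRS:
        return COST_CONFUSABLE  # probably the same intended character
    if a.isdigit() or b.isdigit():
        return COST_DIGIT  # digits identify specific things
    return COST_DEFAULT
```

Under these assumptions a `0`/`o` swap is cheaper than an ordinary letter substitution, which in turn is cheaper than changing one digit into another.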
Pass a CompareConfig to override
the cost / budget / clustering scalars, e.g. budget_tolerance to
shift between strict (payment-screening) and permissive
(KYC-onboarding) profiles, or cost_* for sweep-based calibration. The
default is recall-protective and matches industry-typical tuning.
Alignment
One unit of name-comparison evidence.
Three modes:
- Symbol-paired edge: symbol is Some and both sides carry the same Symbol. Returned by pair_symbols. Default score is 1.0; consumers may override with a category default (e.g. SYM_SCORES[NAME] = 0.9).
- Residue cluster: symbol is None, both sides non-empty. Returned by compare_parts for parts that aligned by edit distance.
- Extra: symbol is None, exactly one side is empty. Represents a part that found no counterpart on the other side; the matcher applies a side-specific weight.
qps / rps / symbol / qstr / rstr are immutable
post-construction. score and weight are mutable to support
the matcher's policy passes (literal-equality rescue,
extras-weight override, family-name boost). Both stored as
Py<PyFloat> so Python-side reads are an INCREF rather than a
fresh allocation per access.
__hash__ and __eq__ key on (symbol, qps, rps) —
NamePart already hashes by (index, form) so position is
preserved. score and weight are not part of identity.
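The identity rule (equality and hashing over symbol, qps, and rps only, with mutable score and weight) can be mirrored in a plain dataclass. `SketchAlignment` is a hypothetical stand-in that uses strings and tuples in place of Symbol and NamePart objects; it is not the Rust-backed class.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(eq=True, unsafe_hash=True)
class SketchAlignment:
    """Hypothetical stand-in: identity is (symbol, qps, rps)."""

    symbol: Optional[str]
    qps: Tuple[str, ...]
    rps: Tuple[str, ...]
    # Excluded from __eq__ and __hash__, so policy passes can adjust them freely.
    score: float = field(default=1.0, compare=False)
    weight: float = field(default=1.0, compare=False)


first = SketchAlignment(None, ("smith",), ("smyth",), score=0.8)
second = SketchAlignment(None, ("smith",), ("smyth",), score=0.2)
first.score = 1.0  # e.g. a rescue pass raising the score
```

Because score and weight are left out of the comparison, `first` and `second` remain equal and collapse to a single entry in a set, even after the score changes.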
qps
property
Query-side parts covered by this alignment.
qstr
property
" ".join(p.comparable for p in qps), cached.
rps
property
Result-side parts covered by this alignment.
rstr
property
" ".join(p.comparable for p in rps), cached.
score
property
Similarity in [0, 1]. For symbol-paired edges, defaults
to 1.0; consumers override with a category default. For
residue clusters, the per-cluster product. For extras,
0.0.
symbol
property
Shared Symbol for symbol-paired edges; None for
residue clusters and extras.
weight
property
Aggregation weight in the matcher's weighted average.
Defaults to 1.0; consumers override per category
(SYM_WEIGHTS), for extras (nm_extra_*_name), for
family-name boost (nm_family_name_weight), and for
stopword down-weight.
__eq__(value)
method descriptor
Return self==value.
__ge__(value)
method descriptor
Return self>=value.
__gt__(value)
method descriptor
Return self>value.
__hash__()
method descriptor
Return hash(self).
__le__(value)
method descriptor
Return self<=value.
__lt__(value)
method descriptor
Return self<value.
__ne__(value)
method descriptor
Return self!=value.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
CompareConfig
Tunable cost / budget / clustering scalars for compare_parts.
Frozen by design: a sweep iteration constructs a fresh
CompareConfig, the matcher caches one per request. Mutability
would buy nothing (the values are read once per name pair) and
would cost a runtime borrow check on each Rust-side access.
The default values reproduce the constants this struct replaced;
compare_parts(qry, res) with no config argument is exactly
equivalent to the pre-CompareConfig call.
budget_log_base
property
Logarithm base in the per-side cost-budget formula: the
logarithm, to base budget_log_base, of
max(len - budget_short_floor, 1), multiplied by budget_tolerance.
The base controls how aggressively the budget grows with token
length: a smaller base means faster growth, which is more
permissive on long names.
budget_short_floor
property
Short-token floor: tokens shorter than this contribute zero to the budget, so any non-zero edit fails the cap. This is the fail-closed property: the matcher refuses to fuzzy-match on 1-2 character tokens (vessel hull suffixes, isolated initials, 2-char Chinese given names) where the typo / distinct-entity signal is too weak.
budget_tolerance
property
Multiplier on the per-side cost budget. Lower is stricter (less edit tolerated before a cluster scores zero); higher is more permissive. Callers tune this per scenario — KYC at onboarding runs more permissive than payment screening.
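How the three budget scalars interact can be seen by evaluating the budget formula directly. In the sketch below the default values are hypothetical (the actual defaults are not stated here), and `sketch_budget` is an illustrative helper, not a rigour function.

```python
import math


def sketch_budget(
    token_length,
    budget_log_base=2.0,   # hypothetical value
    budget_short_floor=2,  # hypothetical value
    budget_tolerance=1.0,  # hypothetical value
):
    """Per-side cost budget: log of the length above the floor, scaled by tolerance."""
    effective_length = max(token_length - budget_short_floor, 1)
    return math.log(effective_length, budget_log_base) * budget_tolerance
```

With these numbers a 2-character token gets a budget of zero, so any edit fails, while a 10-character token gets roughly 3.0; doubling `budget_tolerance` doubles that, and a larger `budget_log_base` shrinks it.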
cluster_overlap_min
property
Overlap fraction (matched chars / shorter-side length) above which two parts pair into a cluster. A pair below this threshold surfaces as solo records, because the matched-character evidence is too thin to claim the parts are talking about the same token. The 0.51 default (i.e. "more than half") is the lowest value at which a majority of the shorter token agrees.
cost_confusable
property
Substitute between a confusable pair from
resources/names/compare.yml (0/o, 1/l, …). OCR /
transliteration / homoglyph noise — the writer was probably
aiming at the same character.
cost_digit
property
Edit involving a digit on either side. Digits identify specific things — vintage years, vessel hull numbers, fund vintages — so a digit mismatch is evidence of a different entity, not a typo.
cost_sep_drop
property
Token boundary lost or gained on one side. Token merge/split
(vanderbilt ↔ van der bilt) is a common surface-form
variant of the same name; charging it almost nothing keeps
the alignment from refusing to bridge whitespace artifacts.
__new__(*args, **kwargs)
builtin
Create and return a new object. See help(type) for accurate signature.
__repr__()
method descriptor
Return repr(self).
compare_parts(qry, res, config=None)
builtin
Score the alignment of two NamePart lists.
Callers should hand over the residue — parts that earlier stages
(symbol pairing, alias tagging, identifier matching) couldn't
explain by themselves — already canonicalised into positional
order (tag_sort for ORG/ENT, align_person_name_order for PER).
The function returns one [Alignment] per cluster, paired or
solo; every input part appears exactly once across the output.
Returned alignments carry symbol = None (residue distance is
non-symbolic by definition).
config overrides the cost / budget / clustering scalars. Pass
None (the default) to use the process-wide defaults — those
match industry-typical recall-protective tuning. Sweep scripts
build a fresh [CompareConfig] per iteration; matchers cache one
per request.