Text
rigour.text
is_levenshtein_plausible(left, right, max_edits=env.LEVENSHTEIN_MAX_EDITS, max_percent=env.LEVENSHTEIN_MAX_PERCENT, max_length=env.MAX_NAME_LENGTH)
A sanity check to post-filter name matching results based on a budget of allowed Levenshtein distance. This basically cuts off results where the Jaro-Winkler or Metaphone comparison was too lenient.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | str | A string. | required |
| right | str | A string. | required |
| max_edits | Optional[int] | The maximum number of edits allowed. | LEVENSHTEIN_MAX_EDITS |
| max_percent | float | The maximum percentage of edits allowed. | LEVENSHTEIN_MAX_PERCENT |
Returns:
| Type | Description |
|---|---|
| bool | A boolean. |
Source code in rigour/text/distance.py
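As an illustration of the edit-budget idea, here is a minimal pure-Python sketch. It is not rigour's implementation: the names `levenshtein_sketch` and `plausible_sketch`, the default values, and the choice to take the percentage against the shorter string are all assumptions made for the example.

```python
import math
from typing import Optional


def levenshtein_sketch(left: str, right: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(right) + 1))
    for i, lch in enumerate(left, start=1):
        current = [i]
        for j, rch in enumerate(right, start=1):
            cost = 0 if lch == rch else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def plausible_sketch(
    left: str,
    right: str,
    max_edits: Optional[int] = 4,  # hypothetical default
    max_percent: float = 0.2,      # hypothetical default
) -> bool:
    """Accept a pair only if its edit distance fits the budget."""
    # Assumption: the percentage budget is measured on the shorter string.
    budget = math.ceil(min(len(left), len(right)) * max_percent)
    if max_edits is not None:
        budget = min(budget, max_edits)
    return levenshtein_sketch(left, right) <= budget
```

Under these assumed defaults, a one-letter spelling variant of a 14-character name passes, while two unrelated four-letter names do not.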
is_nullplace(form, *, normalizer=normalize_text, normalize=False)
Check whether form is a nullplace.
Nullplaces are place names that don't refer to a specific
location: "overseas", "abroad", "stateless",
"international waters", etc. Useful for filtering out
records where a country / address slot was populated with a
placeholder rather than a real geography.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | str | The string to check. | required |
| normalizer | Normalizer | Normalizer applied to the wordlist at load time, and to | normalize_text |
| normalize | bool | When | False |
Returns:
| Type | Description |
|---|---|
| bool | Whether form is in the nullplace list. |
Source code in rigour/text/stopwords.py
is_nullword(form, *, normalizer=normalize_text, normalize=False)
Check whether form is a nullword.
Nullwords are tokens that imply a missing value:
"none", "not available", "n/a", "unknown", etc.
Useful for filtering out records where an alias slot was
populated with a placeholder rather than real data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | str | The token to check. | required |
| normalizer | Normalizer | Normalizer applied to the wordlist at load time, and to | normalize_text |
| normalize | bool | When | False |
Returns:
| Type | Description |
|---|---|
| bool | Whether form is in the nullword list. |
Source code in rigour/text/stopwords.py
is_stopword(form, *, normalizer=normalize_text, normalize=False)
Check whether form is a stopword.
Stopwords are common words that carry no identifying signal
in name-matching contexts ("the", "and", "of", etc.).
Both the wordlist and the runtime input must be normalised
with the same normalizer for the membership check to be
meaningful.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | str | The token to check. | required |
| normalizer | Normalizer | Normalizer applied to the wordlist at load time, and to | normalize_text |
| normalize | bool | When | False |
Returns:
| Type | Description |
|---|---|
| bool | Whether form is in the stopword list. |
Source code in rigour/text/stopwords.py
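The requirement that the wordlist and the runtime input share one normalizer can be sketched as follows. Everything here is hypothetical: the tiny wordlist, the helper names, and the reading of `normalize=True` as "run the input through the normalizer before the lookup" are assumptions for the example, not rigour's code.

```python
from typing import Callable, FrozenSet, Iterable, Optional

NormalizerFn = Callable[[Optional[str]], Optional[str]]


def casefold_normalizer(text: Optional[str]) -> Optional[str]:
    """Hypothetical normalizer: strip, casefold, reject empty."""
    if text is None:
        return None
    text = text.strip().casefold()
    return text or None


def build_wordlist(words: Iterable[str], normalizer: NormalizerFn) -> FrozenSet[str]:
    """Normalise the reference words once, at load time."""
    normalised = (normalizer(word) for word in words)
    return frozenset(word for word in normalised if word is not None)


def in_wordlist(
    form: str,
    wordlist: FrozenSet[str],
    normalizer: NormalizerFn,
    normalize: bool = False,
) -> bool:
    """Membership check; the caller decides whether the input still needs normalising."""
    key = normalizer(form) if normalize else form
    return key is not None and key in wordlist


# Hypothetical three-word list, normalised with the same function used at lookup.
STOPWORDS = build_wordlist(["The", "and", "OF"], casefold_normalizer)
```

With this setup `in_wordlist("the", STOPWORDS, casefold_normalizer)` matches, while the raw input `"THE"` only matches when `normalize=True` is passed, which is the mismatch the paragraph above warns about.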
jaro_winkler(left, right, max_length=env.MAX_NAME_LENGTH)
cached
Compute the Jaro-Winkler similarity of two strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | str | A string. | required |
| right | str | A string. | required |
Returns:
| Type | Description |
|---|---|
| float | A float between 0.0 and 1.0. |
Source code in rigour/text/distance.py
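For readers unfamiliar with the measure, the textbook Jaro-Winkler definition can be sketched in pure Python. This is a reference sketch only: the function names, the prefix weight of 0.1 and the four-character prefix cap are the conventional textbook choices, assumed here, and rigour's own implementation may differ in details such as the `max_length` handling.

```python
def jaro_sketch(s1: str, s2: str) -> float:
    """Textbook Jaro similarity."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_sketch(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro_sketch(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)
```

For the classic pair "MARTHA" / "MARHTA" this yields roughly 0.961, the value usually quoted for the measure.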
levenshtein(left, right, max_length=env.MAX_NAME_LENGTH, max_edits=None)
cached
Compute the Levenshtein distance between two strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | str | A string. | required |
| right | str | A string. | required |
Returns:
| Type | Description |
|---|---|
| int | The number of changed characters, as an integer. |
Source code in rigour/text/distance.py
levenshtein_similarity(left, right, max_edits=env.LEVENSHTEIN_MAX_EDITS, max_percent=env.LEVENSHTEIN_MAX_PERCENT, max_length=env.MAX_NAME_LENGTH)
Compute the Levenshtein similarity of two strings. The similarity is the percentage distance measured against the length of the longer string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left | str | A string. | required |
| right | str | A string. | required |
| max_edits | Optional[int] | The maximum number of edits allowed. | LEVENSHTEIN_MAX_EDITS |
| max_percent | float | The maximum fraction of the shortest string that is allowed to be edited. | LEVENSHTEIN_MAX_PERCENT |
Returns:
| Type | Description |
|---|---|
| float | A float between 0.0 and 1.0. |
Source code in rigour/text/distance.py
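The relationship between an edit distance and a 0.0 to 1.0 similarity can be sketched as below. The helper name `similarity_from_distance` is hypothetical, the distance is passed in rather than computed, and the treatment of two empty strings is an assumption; the budget parameters (`max_edits`, `max_percent`) are deliberately left out.

```python
def similarity_from_distance(distance: int, left: str, right: str) -> float:
    """Turn an edit distance into a similarity relative to the longer string."""
    longest = max(len(left), len(right))
    if longest == 0:
        # Assumption: nothing to compare, so report no similarity.
        return 0.0
    return max(0.0, 1.0 - distance / longest)
```

For example, "kitten" and "sitting" are three edits apart and the longer string has seven characters, so the sketch gives 1 - 3/7, roughly 0.571.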
metaphone(token)
cached
Get the metaphone phonetic representation of a token.
Thin Python-level LRU cache over the Rust implementation. The cache pays off in matching workloads where the same name tokens recur across millions of entities — a cache hit skips the FFI crossing entirely.
Source code in rigour/text/phonetics.py
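The caching pattern described here can be sketched with `functools.lru_cache`. The function `expensive_phonetic_key` is a hypothetical stand-in for the costly call (it is not a Metaphone implementation), and the cache size is an arbitrary choice for the example.

```python
from functools import lru_cache


def expensive_phonetic_key(token: str) -> str:
    """Stand-in for a costly call across an FFI boundary; NOT real Metaphone."""
    return "".join(ch for ch in token.upper() if ch not in "AEIOU")


@lru_cache(maxsize=1024)  # hypothetical cache size
def cached_key(token: str) -> str:
    """Repeated tokens are served from the cache without calling the backend."""
    return expensive_phonetic_key(token)
```

Calling `cached_key` on a stream of tokens with many repeats and then inspecting `cached_key.cache_info()` shows how many lookups were answered from the cache.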
remove_bracketed_text(text)
Remove any text in brackets. This is meant to handle names of companies which include the jurisdiction, like: Turtle Management (Seychelles) Ltd.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | A text that includes bracketed segments. | required |
Returns:
| Type | Description |
|---|---|
| str | Text in which the bracketed segments have been replaced with whitespace. |
Source code in rigour/text/cleaning.py
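A regex-based sketch of the same idea follows. The pattern, the function name, and the decision to treat square brackets like round ones are assumptions for illustration; the actual pattern used by rigour is not shown in this reference.

```python
import re

# Assumption: non-nested (...) and [...] segments are treated as bracketed text.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def remove_bracketed_sketch(text: str) -> str:
    """Replace each bracketed segment with a single space."""
    return BRACKETED.sub(" ", text)
```

Because each segment is swapped for a space, the result may contain runs of spaces; a whitespace-squashing step afterwards tidies them.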
remove_emoji(string)
Remove Unicode ranges used by emoticons, symbols, flags and other visual codepoints from a piece of text. The primary use case is removing stray emojis from the names of political office holders coming from Wikidata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| string | str | Text that may include emoji and pictographs. | required |
Returns:
| Type | Description |
|---|---|
| str | Text that doesn't include those. |
Source code in rigour/text/cleaning.py
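Range-based removal can be sketched with a character class. The ranges below are a small, hand-picked subset chosen for the example (regional-indicator flags, the main pictograph blocks, miscellaneous symbols and dingbats, and the emoji variation selector); they are an assumption and not the list rigour uses.

```python
import re

# Assumption: an illustrative subset of emoji-related ranges, not an exhaustive list.
EMOJI = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "]"
)


def strip_emoji_sketch(text: str) -> str:
    """Delete every codepoint that falls in one of the listed ranges."""
    return EMOJI.sub("", text)
```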
soundex(token)
cached
Get the soundex phonetic representation of a token.
Thin Python-level LRU cache over the Rust implementation. Same rationale as metaphone above.
Source code in rigour/text/phonetics.py
text_hash(text)
Generate a hash for the given text, ignoring whitespace and punctuation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text to hash. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The SHA-1 hash of the processed text. |
Source code in rigour/text/checksum.py
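A minimal sketch of hashing while ignoring whitespace and punctuation, using only the standard library. The function name, the use of Unicode general categories to decide what counts as punctuation, and the choice to leave letter case untouched are assumptions; rigour's preprocessing may differ.

```python
import hashlib
import unicodedata


def text_hash_sketch(text: str) -> str:
    """SHA-1 over the text with whitespace and punctuation removed."""
    kept = [
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    # Assumption: case is preserved and the text is encoded as UTF-8.
    return hashlib.sha1("".join(kept).encode("utf-8")).hexdigest()
```

Two inputs that differ only in spacing or punctuation, such as "Hello, World!" and "HelloWorld", produce the same 40-character hex digest under this sketch.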
rigour.text.normalize
Flag-based text normalisation — the canonical reference for the
normalize_flags / cleanup parameters used across rigour.
This module exposes three things:
- Normalize: a bit-flag set selecting individual normalisation steps (strip, casefold, NFC/NFKC/NFKD, squash spaces, name tokenisation).
- Cleanup: an enum picking one of two fixed Unicode-category replacement profiles (Strong, Slug), or Noop to skip the step.
- normalize: the single entry point that runs the composed pipeline on a string.
Two distinct uses of these flags
The same Normalize vocabulary shows up in two places across rigour,
with different lifecycles:
- Input normalisation. The caller runs normalize(text, flags, cleanup) on a single runtime string and passes the result downstream. This is what this module does directly.
- Reference-data normalisation. A lookup/tagger function (e.g. replace_org_types_compare, or the AC tagger inside analyze_names) builds an internal regex/automaton from static YAML data (aliases, stopwords, AC patterns) and uses normalize_flags + cleanup to decide how that static data gets normalised at build time. The caller is expected to normalise its runtime input with the same flags before calling. Functions in this bucket cache one compiled automaton per distinct flag combination.
Pipeline order
Steps run in a fixed order regardless of bit-ordering in the flag value:
1. STRIP: trim leading/trailing whitespace
2. NFKD / NFKC / NFC: at most one is meaningful; if multiple are set, Rust applies the first one listed in its dispatch (NFKD)
3. CASEFOLD: Unicode full casefold (ß → ss, not lowercase)
4. Cleanup: category_replace, unless Cleanup.Noop
5. SQUASH_SPACES: collapse whitespace runs, trim ends
6. NAME: tokenize via tokenize_name (rigour.names.tokenize.tokenize_name) and rejoin with a single ASCII space
Transliteration is NOT part of this pipeline. rigour's public
transliteration surface is rigour.text.translit — opportunistic,
limited to Latin/Cyrillic/Greek/Armenian/Georgian/Hangul. For
broader-script lossy romanisation use
normality.ascii_text / normality.latinize_text.
Empty output is coalesced to None, matching the contract of the
pre-flags Optional[str] normalisers.
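The fixed ordering can be illustrated with a small standard-library sketch. This is not rigour's code: `SketchFlags` and `normalize_sketch` are hypothetical names, the bit values are arbitrary, the precedence of NFKC over NFC after NFKD is an assumption, and the Cleanup and NAME steps are omitted.

```python
import re
import unicodedata
from enum import IntFlag
from typing import Optional


class SketchFlags(IntFlag):
    """Hypothetical flag set; the bit values are not rigour's."""

    STRIP = 1
    SQUASH_SPACES = 2
    CASEFOLD = 4
    NFC = 8
    NFKC = 16
    NFKD = 32


def normalize_sketch(
    text: Optional[str], flags: SketchFlags = SketchFlags(0)
) -> Optional[str]:
    """Apply the selected steps in a fixed order, whatever the bit order."""
    if text is None:
        return None
    if flags & SketchFlags.STRIP:
        text = text.strip()
    # At most one Unicode normal form is applied; NFKD wins if several are set.
    for flag, form in (
        (SketchFlags.NFKD, "NFKD"),
        (SketchFlags.NFKC, "NFKC"),
        (SketchFlags.NFC, "NFC"),
    ):
        if flags & flag:
            text = unicodedata.normalize(form, text)
            break
    if flags & SketchFlags.CASEFOLD:
        text = text.casefold()
    # (Category cleanup would run here; NAME tokenisation would run last.)
    if flags & SketchFlags.SQUASH_SPACES:
        text = re.sub(r"\s+", " ", text).strip()
    # Empty output is coalesced to None.
    return text or None
```

With `STRIP | CASEFOLD | SQUASH_SPACES`, the input `"  Straße \t GmbH "` comes out as `"strasse gmbh"`, and whitespace-only input comes out as `None`.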
Common compositions
The flag sets pinned as defaults across the rigour API:
- Normalize.CASEFOLD: production default for comparison keys that should preserve whitespace and script.
- Normalize.CASEFOLD | Normalize.SQUASH_SPACES: adds whitespace collapsing on top. Used when input whitespace is unreliable, and by display-style replacers that need case-insensitive matching with tidied whitespace.
- Normalize.SQUASH_SPACES: whitespace-tidy without case change. Used by display-form replacers that want to preserve caller case.
- Normalize.CASEFOLD | Normalize.NAME: casefold and tokenise with rigour.names.tokenize.tokenize_name, yielding a stable space-separated name key for matching.
Implementation note
The actual work runs in Rust via rigour._core._normalize. This
module is the idiomatic Python surface — IntFlag for the bit set,
IntEnum for the variant, both crossing the FFI boundary as plain
ints at ~zero marshalling cost.
Cleanup
Bases: IntEnum
Unicode-category-based cleanup variants.
Cleanup picks one of a small set of fixed category → action
tables that drive the pipeline's category_replace step
(pipeline step 4). The step rewrites or deletes characters based
on their Unicode general category (e.g. punctuation, control,
mark, symbol). The tables are deliberately closed — callers
compose via the flag set, not by passing ad-hoc category maps.
Attributes:
| Name | Type | Description |
|---|---|---|
| Noop | | Skip the category_replace step. |
| Strong | | Aggressive cleanup: punctuation and symbols become whitespace; controls, formats, and marks are deleted. Use when you want a matching key stripped of all decoration. |
| Slug | | URL-slug-style cleanup. Differs from |
Source code in rigour/text/normalize.py
Normalize
Bases: IntFlag
Bit-flag set selecting individual normalisation steps.
Compose flags with bitwise OR and pass to normalize or to
any rigour function that exposes a normalize_flags= parameter
(e.g. rigour.names.org_types.replace_org_types_compare).
See the module docstring for the fixed pipeline order the flags
compose into and for common flag compositions.
Attributes:
| Name | Type | Description |
|---|---|---|
| STRIP | | Trim leading and trailing whitespace. |
| SQUASH_SPACES | | Collapse runs of whitespace (including newlines, tabs, Unicode whitespace) into single spaces and trim the edges. Runs after the cleanup step, so whitespace introduced by earlier steps (e.g. category replacement) is collapsed as well. |
| CASEFOLD | | Unicode full casefold (e.g. ß → ss). |
| NFC | | Apply Unicode Normal Form C (canonical composition). Rarely needed on its own; most callers want NFKC or NFKD. Mutually exclusive with NFKC/NFKD. |
| NFKC | | Apply Unicode Normal Form KC (compatibility composition). Folds compatibility variants (e.g. full-width digits → ASCII) while keeping a composed form. Mutually exclusive with NFC/NFKD. |
| NFKD | | Apply Unicode Normal Form KD (compatibility decomposition). Splits composed characters apart. Mutually exclusive with NFC/NFKC. |
| NAME | | Run the string through tokenize_name and rejoin the tokens with a single ASCII space. Runs as the final pipeline step, so it also subsumes whitespace squashing. |
Source code in rigour/text/normalize.py
noop_normalizer(text)
Identity normalizer that strips whitespace and rejects empty.
Default Normalizer for callers whose input is already
in the desired shape — only edge whitespace is removed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | Optional[str] | A string, or None. | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The stripped string, or None for empty / whitespace-only input. |
Source code in rigour/text/normalize.py
normalize(text, flags=Normalize(0), cleanup=Cleanup.Noop)
Apply a composed sequence of text normalisation steps.
The pipeline order and semantics of each flag are described in the
module docstring. This function is the canonical entry point;
other rigour functions that take normalize_flags= + cleanup=
apply the same pipeline to their internal reference data at
regex/automaton build time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | Optional[str] | The text to normalise. If None, returns None. | required |
| flags | Normalize | Which normalisation steps to apply (see Normalize). | Normalize(0) |
| cleanup | Cleanup | Which category-replacement variant to apply as the fourth pipeline step (see Cleanup). | Noop |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The normalised string, or None if the input was None or if the pipeline reduced the text to an empty string. The empty-output-to-None coalescing matches the pre-flags Optional-string contract. |
Source code in rigour/text/normalize.py
rigour.text.translit
Opportunistic transliteration: should_ascii + maybe_ascii.
The minimal transliteration surface rigour exposes going forward.
Covers only the scripts listed in rigour.text.scripts.LATINIZE_SCRIPTS
(Latin, Cyrillic, Greek, Armenian, Georgian, Hangul). Anything
outside that set passes through unchanged (default) or becomes
empty (drop=True).
For broader-script, lossy transliteration (Han, Arabic, Devanagari,
etc.) use normality.ascii_text / normality.latinize_text —
rigour deliberately does not try to duplicate that surface.
rigour.text.scripts
can_latinize(word)
Check whether every script in a word is latinizable.
Equivalent to text_scripts(word) <= LATINIZE_SCRIPTS. When
True, rigour.text.translit.maybe_ascii will produce
an ASCII output; when False, it returns the input unchanged.
Characters with no distinguishing script (digits, punctuation,
spaces, combining marks) are ignored. Empty input and
pure-Common input ("123") return True vacuously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| word | str | A string. | required |
Returns:
| Type | Description |
|---|---|
| bool | True iff every distinguishing script is in LATINIZE_SCRIPTS. |
Source code in rigour/text/scripts.py
can_latinize_cp(cp)
cached
Check whether a single codepoint can be latinized.
Three-valued: distinguishing-script-bearing codepoints get a
True/False; others (digits, punctuation, combining marks)
return None because the question doesn't apply to them.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cp | int | Codepoint as an integer. | required |
Returns:
| Type | Description |
|---|---|
| Optional[bool] | True if the codepoint's script is in LATINIZE_SCRIPTS, False if it is outside that set, None for Common/Inherited/Unknown codepoints. |
Source code in rigour/text/scripts.py
codepoint_script(cp)
cached
Return the Unicode Script long name for a codepoint.
Faithful exposure of the Unicode Script property via ICU4X.
Returns the pseudo-scripts "Common" (digits, punctuation,
spaces) and "Inherited" (combining marks) as-is — callers
that want to filter them out should do so explicitly via
text_scripts, which already does.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cp | int | Codepoint as an integer. Accepts any | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | Script long name, or None for invalid codepoints (including lone surrogates). |
Source code in rigour/text/scripts.py
common_scripts(a, b)
Return the scripts both strings have in common.
Equivalent to text_scripts(a) & text_scripts(b). Cheap
pruning predicate — two strings that share no real script
have no textual bridge unless both are individually
latinizable.
The empty-result caveat is the main subtlety. An empty return
is ambiguous between "scripts are disjoint" (e.g. Latin vs
Han) and "one side has no real scripts" (numeric-only,
punctuation-only, empty). The two cases have different
matching implications — a numeric-only name like "007" can
still match "Agent 007" via shared NUMERIC symbols even
though common_scripts is empty. Pruning callers should
treat empty-script inputs as wildcards that bypass the script
gate, falling through to symbol-overlap or scoring. Callers
that need to distinguish the two cases should call
text_scripts on each side explicitly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | str | A string. | required |
| b | str | Another string. | required |
Returns:
| Type | Description |
|---|---|
| set[str] | Intersection of the two strings' real-script sets. |
Source code in rigour/text/scripts.py
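The pruning guidance above can be sketched as a small gate over precomputed script sets. The function `should_prune` is hypothetical, the script sets are assumed to come from a `text_scripts`-style call made by the caller, and the latinizable set simply repeats the six scripts listed for rigour.text.translit.

```python
from typing import Set

# Scripts named in this reference as latinizable.
LATINIZABLE = frozenset(
    {"Latin", "Cyrillic", "Greek", "Armenian", "Georgian", "Hangul"}
)


def should_prune(scripts_a: Set[str], scripts_b: Set[str]) -> bool:
    """Return True when a candidate pair can be skipped on script grounds."""
    # Empty script sets (numeric-only, punctuation-only, empty) act as wildcards.
    if not scripts_a or not scripts_b:
        return False
    # A shared script means the pair is worth comparing.
    if scripts_a & scripts_b:
        return False
    # Disjoint scripts can still be bridged if both sides are latinizable.
    if scripts_a <= LATINIZABLE and scripts_b <= LATINIZABLE:
        return False
    return True
```

Under this sketch a Latin name and a Han name are pruned, a Latin name and a Cyrillic name are kept, and a numeric-only name such as "007" (empty script set) is always kept.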
is_dense_script(word)
Check whether a word uses any dense (logographic or syllabic) script.
Rough proxy for "doesn't use whitespace to separate name parts" — though Hangul actually does use spaces, it's still grouped here because it encodes syllables rather than individual sounds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| word | str | A string. | required |
Returns:
| Type | Description |
|---|---|
| bool | True iff the word uses any dense (logographic or syllabic) script. |
Source code in rigour/text/scripts.py
is_latin(word)
Check whether a word is written in the Latin alphabet only.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| word | str | A string. | required |
Returns:
| Type | Description |
|---|---|
| bool | True iff the only distinguishing script in word is Latin. Pure-ASCII input short-circuits to True. |
Source code in rigour/text/scripts.py
is_modern_alphabet(word)
Check whether a word uses only modern alphabets.
Modern alphabets (Latin, Cyrillic, Greek, Armenian, Georgian) are letter-based systems with explicit vowels that transliterate reliably to Latin without language hints. Excludes Hangul (syllabic) and the dense logographic scripts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| word | str | A string. | required |
Returns:
| Type | Description |
|---|---|
| bool | True iff every distinguishing script is a modern alphabet (Latin, Cyrillic, Greek, Armenian, Georgian). Pure-ASCII input short-circuits to True without script detection. |
Source code in rigour/text/scripts.py
text_scripts(text)
Return the set of distinct real scripts present in text.
The right primitive for "which writing systems does this
string use?". Only letters (General_Category L*) and
numbers (N*) contribute; Common, Inherited, and Unknown are
excluded — shared characters (digits, punctuation) and
combining marks don't count as their own script.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Any string, including empty. | required |
Returns:
| Type | Description |
|---|---|
| set[str] | Set of script long names. Empty when the input has no real-script characters (numeric-only, punctuation-only, empty). |
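Python's standard library does not expose the Unicode Script property, so the following is only a rough approximation of the behaviour described above: it guesses the script from the first word of each character's Unicode name. The mapping table and the function name are assumptions for illustration; rigour itself is documented as using ICU4X script data.

```python
import unicodedata
from typing import Set

# Assumption: a tiny, incomplete mapping from Unicode-name prefixes to script names.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "ARMENIAN": "Armenian",
    "GEORGIAN": "Georgian",
    "HANGUL": "Hangul",
    "CJK": "Han",
    "ARABIC": "Arabic",
    "HEBREW": "Hebrew",
}


def approx_text_scripts(text: str) -> Set[str]:
    """Approximate the set of real scripts used by the letters in a string."""
    scripts: Set[str] = set()
    for ch in text:
        # Only letters (L*) and numbers (N*) are considered at all.
        if unicodedata.category(ch)[0] not in ("L", "N"):
            continue
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        # Characters without a mapped prefix (e.g. ASCII digits) add nothing.
        if script is not None:
            scripts.add(script)
    return scripts
```

A mixed string such as "Владимир Putin 007" yields Cyrillic and Latin, while digits and punctuation alone yield an empty set, which mirrors the empty-result case discussed for common_scripts.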