Names
rigour.names
Name handling utilities for person and organization names. This module contains a large (and growing) set of tools for handling names. In general, there are three types of names: people, organizations, and objects. Different normalization may be required for each of these types, including prefix removal for person names (e.g. "Mr." or "Ms.") and type normalization for organization names (e.g. "Incorporated" -> "Inc" or "Limited" -> "Ltd").
The Name class is meant to provide a structure for a name, including its original form, normalized form, metadata on the type of thing described by the name, and the language of the name. The NamePart class is used to represent individual parts of a name, such as the first name, middle name, and last name.
Name
Bases: object
A name of a thing, such as a person, organization or object. Each name consists of a sequence of parts, each of which has a form and a tag. The form is the text of the part, and the tag is a label indicating the type of part. For example, in the name "John Smith", "John" is a given name and "Smith" is a family name. The tag for "John" would be NamePartTag.GIVEN and the tag for "Smith" would be NamePartTag.FAMILY. The form for both parts would be the text of the part itself.
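The form/tag structure described above can be illustrated with a small standalone sketch. It does not import rigour: SketchPartTag, SketchPart and join_forms are hypothetical stand-ins invented for this example, and the real NamePart and NamePartTag classes may carry more fields and members.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SketchPartTag(Enum):
    """Hypothetical stand-in for NamePartTag, limited to two members."""

    GIVEN = "given"
    FAMILY = "family"


@dataclass(frozen=True)
class SketchPart:
    """Hypothetical stand-in for NamePart: the text of a part plus its tag."""

    form: str
    tag: SketchPartTag


def join_forms(parts: List[SketchPart]) -> str:
    """Join the part forms with spaces, loosely mirroring the norm_form idea."""
    return " ".join(part.form for part in parts)


# "John Smith" as a sequence of tagged parts.
john_smith = [
    SketchPart("John", SketchPartTag.GIVEN),
    SketchPart("Smith", SketchPartTag.FAMILY),
]
```

Keeping the tag separate from the form means later steps can decide which parts are comparable without re-parsing the text.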
Source code in rigour/names/name.py
comparable
property
Return the ASCII representation of the name, if available.
norm_form
property
Return the normalized form of the name by joining name parts.
symbols
property
Return a dictionary of symbols applied to the name.
apply_part(part, symbol)
apply_phrase(phrase, symbol)
Apply a symbol to a phrase in the name.
Source code in rigour/names/name.py
contains(other)
Check if this name contains another name.
Source code in rigour/names/name.py
symbol_map()
Return a mapping of symbols to their string representations.
Source code in rigour/names/name.py
NamePart
Bases: object
A part of a name, such as a given name or family name. This object is used to compare and match names. It generates and caches representations of the name in various processing forms.
Source code in rigour/names/part.py
can_match(other)
Check if this part can match another part. This is based on the tags of the parts.
Source code in rigour/names/part.py
NamePartTag
Bases: Enum
Within a name, identify name part types.
Source code in rigour/names/tag.py
NameTypeTag
Bases: Enum
Metadata on what sort of object is described by a name
Source code in rigour/names/tag.py
Span
A span is a set of parts of a name that have been tagged with a symbol.
Source code in rigour/names/part.py
comparable
property
Return the comparison-suited string representation of the span.
Symbol
A symbol is a semantic interpretation applied to one or more parts of a name. Symbols can represent various categories such as organization classes, initials, names, ordinals, or phonetic transcriptions. Each symbol has a category and an identifier.
Source code in rigour/names/symbol.py
align_name_slop(query, result, max_slop=2)
Align name parts of companies and organizations. The idea here is to allow skipping tokens within the entity name if this improves overall match quality, but never to re-order name parts. The resulting alignment will contain the sorted name parts of both the query and the result, as well as any extra parts that were not aligned.
Note that one name part in one list may correspond to multiple name parts in the other list, so the alignment is not necessarily one-to-one.
The Levenshtein distance is used to determine the best alignment, allowing for a certain spelling variation between the names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | List[NamePart] | The name parts of the query. | required |
result | List[NamePart] | The name parts of the result. | required |
max_slop | int | The maximum number of tokens that can be skipped in the alignment. Defaults to 2. | 2 |
Returns: Alignment: An object containing the aligned name parts and any extra parts.
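To make the idea of skipping tokens without re-ordering more concrete, here is a simplified standalone sketch. It works on plain strings rather than NamePart objects, uses a greedy left-to-right scan instead of searching for the best overall alignment, and returns plain lists instead of an Alignment. The function names and the max_dist threshold are assumptions made for this example and are not part of rigour.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def align_with_slop(
    query: List[str], result: List[str], max_slop: int = 2, max_dist: int = 1
) -> Tuple[List[Tuple[str, str]], List[str], List[str]]:
    """Greedy, order-preserving alignment (hypothetical helper).

    Returns (aligned pairs, unmatched query tokens, unmatched result tokens).
    """
    pairs: List[Tuple[str, str]] = []
    extra_query: List[str] = []
    extra_result: List[str] = []
    r_idx = 0  # next result token that may still be matched
    slop = 0  # result tokens skipped so far
    for token in query:
        match = None
        # Look ahead in the result, skipping at most the remaining slop.
        for skip in range(max_slop - slop + 1):
            j = r_idx + skip
            if j >= len(result):
                break
            if levenshtein(token, result[j]) <= max_dist:
                match = j
                break
        if match is None:
            extra_query.append(token)
            continue
        extra_result.extend(result[r_idx:match])
        slop += match - r_idx
        pairs.append((token, result[match]))
        r_idx = match + 1
    extra_result.extend(result[r_idx:])
    return pairs, extra_query, extra_result
```

Because the scan only ever moves forward through the result, swapped tokens such as "smith john" against "john smith" are never paired up crosswise; the out-of-order token ends up in the extras instead.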
Source code in rigour/names/alignment.py
align_name_strict(query, result, max_slop=2)
Align name parts of companies and organizations strictly by their token sequence. This implementation does not use fuzzy matching or Levenshtein distance, but rather aligns names only if individual name parts match exactly.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | List[NamePart] | The name parts of the query. | required |
result | List[NamePart] | The name parts of the result. | required |
Returns: Alignment: An object containing the aligned name parts and any extra parts.
Source code in rigour/names/alignment.py
align_person_name_order(query, result)
Aligns the name parts of a person name for the query and result based on their tags and their string similarity such that the most similar name parts are matched.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | List[NamePart] | The name parts from the query. | required |
result | List[NamePart] | The name parts from the result. | required |
Returns:
Name | Type | Description |
---|---|---|
Alignment | Alignment | An object containing the aligned name parts and any extra parts. |
Source code in rigour/names/alignment.py
align_tag_sort(query, result)
Align name parts of companies and organizations by sorting them by their tags. This is a simple alignment that does not allow for any slop or re-ordering of name parts, but it is useful for cases where the names are already well-formed and comparable.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | List[NamePart] | The name parts of the query. | required |
result | List[NamePart] | The name parts of the result. | required |
Returns: Alignment: An object containing the aligned name parts and any extra parts.
Source code in rigour/names/alignment.py
extract_org_types(name, normalizer=_normalize_compare, generic=False)
Match any organization type designation (e.g. LLC, Inc, GmbH) in the given entity name and return the extracted type.
This can be used as a very poor man's method to determine if a given string is a company name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The text to be processed. It is assumed to be already normalized (see below). | required |
normalizer | Callable[[str \| None], str \| None] | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | _normalize_compare |
generic | bool | If True, return the generic form of the organization type (e.g. LLC, JSC) instead of the type-specific comparison form (GmbH, AB, NV). | False |
Returns:
Type | Description |
---|---|
List[Tuple[str, str]] | A list of tuples, each holding the org type as matched and its compare form. |
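The lookup-and-match approach described above can be sketched without rigour as follows. The ORG_TYPES table, its compare forms, and the sketch_normalize and sketch_extract_types helpers are all made up for illustration; the real function draws on a much larger data set and its own normalizer.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lookup table: spelling -> comparison form (values are made up).
ORG_TYPES: Dict[str, str] = {
    "GmbH": "gmbh",
    "LLC": "llc",
    "Inc": "inc",
    "Limited": "ltd",
    "Ltd": "ltd",
}


def sketch_normalize(text: str) -> str:
    """Stand-in normalizer: casefold and collapse whitespace."""
    return " ".join(text.casefold().split())


# The normalizer is applied to the lookup values, so they line up with
# input text that has been normalized the same way.
LOOKUP: Dict[str, str] = {sketch_normalize(k): v for k, v in ORG_TYPES.items()}
PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, LOOKUP), key=len, reverse=True)) + r")\b"
)


def sketch_extract_types(name: str) -> List[Tuple[str, str]]:
    """Return (matched text, compare form) pairs found in a normalized name."""
    return [(m.group(1), LOOKUP[m.group(1)]) for m in PATTERN.finditer(name)]
```

The caller is expected to normalize the input first, for example `sketch_extract_types(sketch_normalize("Siemens GmbH"))`. The word-boundary anchors keep "inc" from matching inside "lincoln".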
Source code in rigour/names/org_types.py
is_name(name)
Check if the given string is a name. The string is considered a name if it contains at least one character that is a letter (category 'L' in Unicode).
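The rule stated above can be expressed with the standard library alone. This is a minimal sketch of the described check, not the library's own code; looks_like_name is a hypothetical name chosen for the example.

```python
import unicodedata


def looks_like_name(text: str) -> bool:
    """True if at least one character falls in a Unicode letter category (L*)."""
    return any(unicodedata.category(char).startswith("L") for char in text)
```

Under this rule, strings made up only of digits, punctuation or whitespace are rejected, while a single letter in any script is enough to pass.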
Source code in rigour/names/check.py
load_person_names()
Load the person QID to name mappings from disk. This is a collection of aliases (in various alphabets) of person name parts mapped to a Wikidata QID representing that name part.
Returns:
Type | Description |
---|---|
Generator[Tuple[str, List[str]], None, None] | A generator yielding tuples of a QID and its list of names. |
Source code in rigour/names/person.py
load_person_names_mapping(normalizer=noop_normalizer)
Load the person QID to name mappings from disk. This is a collection of aliases (in various alphabets) of person name parts mapped to a Wikidata QID representing that name part.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
normalizer | Normalizer | A function to normalize names. Defaults to noop_normalizer. | noop_normalizer |
Returns:
Type | Description |
---|---|
Dict[str, Set[str]] | A dictionary mapping normalized names to sets of QIDs. |
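The shape of the returned mapping can be illustrated by inverting a few (QID, names) records by hand. The sketch below does not read rigour's data files: build_mapping is a hypothetical helper, and the identifiers in the sample records are placeholders rather than real Wikidata QIDs.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


def build_mapping(
    records: Iterable[Tuple[str, List[str]]],
    normalizer: Callable[[str], Optional[str]] = lambda name: name,
) -> Dict[str, Set[str]]:
    """Invert (QID, names) records into a normalized-name -> QIDs mapping."""
    mapping: Dict[str, Set[str]] = {}
    for qid, names in records:
        for name in names:
            norm = normalizer(name)
            if not norm:
                continue  # skip names the normalizer rejects
            mapping.setdefault(norm, set()).add(qid)
    return mapping


# Placeholder identifiers, not real Wikidata QIDs.
sample_records = [
    ("Q-EXAMPLE-1", ["John", "Jon"]),
    ("Q-EXAMPLE-2", ["john"]),
]
by_name = build_mapping(sample_records, normalizer=str.casefold)
```

With a case-folding normalizer, "John" and "john" collapse to one key that points at both identifiers, which is why the values are sets rather than single QIDs.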
Source code in rigour/names/person.py
pick_case(names)
Pick the best mix of lower- and uppercase characters from a set of names that are identical except for case.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
names | List[str] | A list of identical names in different cases. | required |
Returns:
Type | Description |
---|---|
str | The best name for display. |
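One possible way to choose between case variants is to prefer the candidate with the most words written with a leading capital and no further capitals. This heuristic is an assumption made for the sketch below and may differ from how pick_case actually scores its candidates; the helper names are hypothetical and the list is assumed to be non-empty.

```python
from typing import List


def capitalized_words(name: str) -> int:
    """Count words that start with an uppercase letter and have no other capitals."""
    count = 0
    for word in name.split():
        if word[0].isupper() and not any(char.isupper() for char in word[1:]):
            count += 1
    return count


def pick_case_sketch(names: List[str]) -> str:
    """Pick the variant with the most capitalized words (first one wins ties)."""
    return max(names, key=capitalized_words)
```

All-uppercase and all-lowercase variants both score zero under this rule, so a mixed-case spelling wins whenever one is present.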
Source code in rigour/names/pick.py
pick_name(names)
Pick the best name from a list of names. This is meant to pick a centroid name, with a bias towards names in a Latin script.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
names | List[str] | A list of names. | required |
Returns:
Type | Description |
---|---|
Optional[str] | The best name for display. |
Source code in rigour/names/pick.py
prenormalize_name(name)
reduce_names(names)
Select a reduced set of names from a list of names. This is used to prepare the set of names linked to a person, organization, or other entity for publication.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
names | List[str] | A list of names. | required |
Returns:
Type | Description |
---|---|
List[str] | The reduced list of names. |
Source code in rigour/names/pick.py
remove_org_prefixes(name)
remove_org_types(name, replacement='', normalizer=_normalize_compare)
Match any organization type designation (e.g. LLC, Inc, GmbH) in the given entity name and replace it with the given fixed string (empty by default, which signals removal).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The text to be processed. It is assumed to be already normalized (see below). | required |
normalizer | Callable[[str \| None], str \| None] | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | _normalize_compare |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with organization types replaced/removed. |
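A rough standalone sketch of replacing or removing type designations is shown here. The four-entry pattern and the strip_types_sketch helper are invented for the example, and the whitespace clean-up after substitution is a choice of this sketch rather than documented behavior of remove_org_types.

```python
import re

# Hypothetical, already-normalized type designations.
ORG_TYPE_PATTERN = re.compile(r"\b(?:gmbh|llc|inc|ltd)\b")


def strip_types_sketch(name: str, replacement: str = "") -> str:
    """Replace org type tokens in a normalized name, then tidy up whitespace."""
    replaced = ORG_TYPE_PATTERN.sub(replacement, name)
    return " ".join(replaced.split())
```

With the default empty replacement, `strip_types_sketch("siemens gmbh")` leaves only the distinctive part of the name; passing a marker string instead keeps a trace of where a type was found.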
Source code in rigour/names/org_types.py
remove_person_prefixes(name)
replace_org_types_compare(name, normalizer=_normalize_compare, generic=False)
Replace any organization type indicated in the given entity name (often as a prefix or suffix) with a heavily normalized form label. This will rewrite country-specific entity types (e.g. GmbH) into a simplified spelling suitable for comparison using string distance. The resulting text is meant to be used in comparison processes, but is no longer fit for presentation to a user.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The text to be processed. It is assumed to be already normalized (see below). | required |
normalizer | Callable[[str \| None], str \| None] | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | _normalize_compare |
generic | bool | If True, return the generic form of the organization type (e.g. LLC, JSC) instead of the type-specific comparison form (GmbH, AB, NV). | False |
Returns:
Type | Description |
---|---|
str | The text with organization types replaced. |
Source code in rigour/names/org_types.py
replace_org_types_display(name, normalizer=normalize_display)
Replace organization types in the text with their shortened form. This will perform a display-safe (light) form of normalization, useful for shortening spelt-out legal forms into common abbreviations (e.g. Siemens Aktiengesellschaft -> Siemens AG).
If the result of the replacement yields an empty string, the original text is returned as-is.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The text to be processed. It is assumed to be already normalized (see below). | required |
normalizer | Callable[[str \| None], str \| None] | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | normalize_display |
Returns:
Type | Description |
---|---|
str | The text with organization types replaced. |
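The display-safe shortening, including the fallback to the original text, might look roughly like the sketch below. The three-entry DISPLAY_FORMS table and the shorten_display_sketch helper are assumptions for illustration only; the real function relies on rigour's own data and normalizer.

```python
import re

# Hypothetical mapping of spelt-out legal forms to display abbreviations.
DISPLAY_FORMS = {
    "aktiengesellschaft": "AG",
    "limited": "Ltd",
    "incorporated": "Inc",
}
DISPLAY_PATTERN = re.compile(r"\b(" + "|".join(DISPLAY_FORMS) + r")\b", re.IGNORECASE)


def shorten_display_sketch(name: str) -> str:
    """Abbreviate known legal forms while leaving the rest of the name as-is."""
    shortened = DISPLAY_PATTERN.sub(
        lambda match: DISPLAY_FORMS[match.group(1).lower()], name
    ).strip()
    # Fall back to the original text if nothing usable is left.
    return shortened or name
```

Unlike the comparison-oriented replacement, this keeps the original casing of the distinctive part of the name, so the result remains fit for presentation to a user.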
Source code in rigour/names/org_types.py
tag_org_name(name, normalizer)
Tag the name with the organization type and symbol tags.
Source code in rigour/names/tagging.py
tag_person_name(name, normalizer, any_initials=False)
Tag a person's name with the person name part and other symbol tags.
Source code in rigour/names/tagging.py
tokenize_name(text, token_min_length=1)
Split a person or entity's name into name parts.