Place Name Normalization
Historical documents make place name identification difficult because of spelling variations, ambiguous toponyms, and evolving geographic nomenclature. This notebook establishes an authoritative gazetteer of unique places mentioned in the Sondondo sacramental records through a two-phase approach:
- Automated extraction: Identify and preliminarily resolve place mentions using the placeRecognition module
- Manual curation: Apply expert knowledge to disambiguate problematic cases and link canonical forms to authoritative geographic databases (GeoNames, Getty TGN, World Historical Gazetteer, Wikidata)
The resulting dataset (data/clean/unique_places.csv) serves as the authoritative lookup table for all place references across the project, enabling consistent spatial analysis and record linkage.
import pandas as pd
Data Preparation
We begin by loading the cleaned sacramental records, which have already undergone date normalization and attribute harmonization. Place names at this stage retain their original spelling variations as recorded in the historical documents.
# Geographic type taxonomy used when resolving places
PLACES_MAP = '../data/mappings/places_types.json'
# Cleaned sacramental records: baptisms, marriages, and burials
BAUTISMOS_HARMONIZED = pd.read_csv("../data/clean/bautismos_clean.csv")
MATRIMONIOS_HARMONIZED = pd.read_csv("../data/clean/matrimonios_clean.csv")
ENTIERROS_HARMONIZED = pd.read_csv("../data/clean/entierros_clean.csv")
BAUTISMOS_HARMONIZED
| | file | identifier | event_type | event_date | baptized_name | baptized_birth_place | baptized_birth_date | baptized_legitimacy_status | father_name | father_lastname | ... | godfather_social_condition | godmother_name | godmother_lastname | godmother_social_condition | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | baptized_lastname |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | APAucará LB L001 | B001 | Bautizo | 1790-10-04 | domingo | NaN | 1790-08-04 | Hijo legitimo | lucas | ayquipa | ... | NaN | NaN | NaN | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN | ayquipa |
| 1 | APAucará LB L001 | B002 | Bautizo | 1790-10-06 | dominga | NaN | 1790-08-04 | Hija legitima | juan | lulia | ... | NaN | NaN | NaN | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN | lulia |
| 2 | APAucará LB L001 | B003 | Bautizo | 1790-10-07 | bartola | NaN | 1790-08-04 | Hija legitima | jacinto | quispe | ... | NaN | rotonda | pocco | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN | quispe |
| 3 | APAucará LB L001 | B004 | Bautizo | 1790-10-20 | francisca | NaN | 1790-10-15 | Hija legitima | juan | cuebas | ... | NaN | ysabel | guillen | NaN | Aucara | Aucara | NaN | NaN | NaN | cuebas |
| 4 | APAucará LB L001 | B005 | Bautizo | 1790-10-20 | pedro | NaN | 1790-10-19 | Hijo legitimo | santos | manxo | ... | NaN | josefa | santiago | NaN | Aucara | Aucara | NaN | NaN | NaN | manxo |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6335 | APAucará LB L004 | B2042 | Bautizo | 1888-12-10 | leocadio | NaN | 1888-12-09 | Hijo natural, mestizo | miguel | pacheco | ... | NaN | NaN | NaN | NaN | Aucará | Aucará | Aucará | NaN | NaN | pacheco |
| 6336 | APAucará LB L004 | B2043 | Bautizo | 1888-12-11 | mariano concepcion | NaN | 1888-12-07 | Hijo legítimo, indio | facundo | vega | ... | NaN | NaN | NaN | NaN | Aucará | Aucará | Aucará | NaN | NaN | vega |
| 6337 | APAucará LB L004 | B2044 | Bautizo | 1888-12-12 | ambrosio | NaN | 1888-12-06 | Hijo legítimo, indio | ysidro | ccasane | ... | NaN | NaN | NaN | NaN | Aucará | Aucará | Mayobamba | NaN | NaN | ccasane |
| 6338 | APAucará LB L004 | B2045 | Bautizo | 1888-12-15 | francisco | NaN | 1888-11-30 | Hijo legítimo, indio | mariano | lopez | ... | Indigna de Huaicahuacho | NaN | NaN | NaN | Aucará | Aucará | Huaicahuacho | NaN | NaN | lopez |
| 6339 | APAucará LB L004 | B2046 | Bautizo | 1888-12-16 | laureana | NaN | 1888-12-01 | Hija legítima, india | bernarda | champa | ... | NaN | manuela | de la cruz | NaN | Aucará | Aucará | Chacralla | NaN | NaN | champa |
6340 rows × 27 columns
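The preview above already shows the same place written as both Aucara and Aucará. A minimal sketch for surveying such variants in a column before extraction, assuming that accent- and case-insensitive comparison is an acceptable first grouping. `fold_key` and `spelling_variants` are hypothetical helpers for illustration, not part of the project code.

```python
import unicodedata

import pandas as pd


def fold_key(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (hypothetical grouping key)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def spelling_variants(series: pd.Series) -> dict[str, list[str]]:
    """Map each folded key to its distinct spellings, keeping only keys with several."""
    values = series.dropna().astype(str)
    groups = values.groupby(values.map(fold_key)).unique()
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


# Toy column mirroring the variation visible in the preview
sample = pd.Series(["Aucara", "Aucará", "Pampamarca", None, "Aucara"])
variants = spelling_variants(sample)
```

Applied to a real column such as `BAUTISMOS_HARMONIZED['event_place']`, the returned dictionary would list only the keys written in more than one way, which gives a rough sense of how much normalization is needed.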
Automated Place Extraction
The PlaceExtractor identifies place mentions within text fields and applies initial normalization using the geographic type taxonomy defined in places_types.json. This automated process handles straightforward cases but requires manual review for ambiguous toponyms.
We systematically process all place-related columns across the three sacramental record types, applying the extractor to standardize formatting and identify geographic descriptors.
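The same column loop is repeated for each record type. One way to factor it out is sketched here, assuming only that the transform maps a Series to a Series, as `extract_places_per_row` is used in this notebook. `apply_to_place_columns` is a hypothetical helper, and a simple string strip stands in for the real extractor.

```python
import pandas as pd


def apply_to_place_columns(df, columns, transform):
    """Apply `transform` (Series -> Series) in place to each listed column present in `df`.

    Returns the names of listed columns that were not found, so they can be reported.
    """
    missing = [col for col in columns if col not in df.columns]
    for col in columns:
        if col in df.columns:
            df[col] = transform(df[col])
    return missing


# Toy data; the lambda is a stand-in for extractor.extract_places_per_row
demo = pd.DataFrame({"event_place": [" Aucara ", None]})
skipped = apply_to_place_columns(demo, ["event_place", "burial_place"], lambda s: s.str.strip())
```

Returning the missing names is a deliberate choice: selecting a DataFrame with a list that includes an absent column raises a `KeyError`, so it helps to know which columns were skipped before displaying the result.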
Extract Place Mentions
from actions.extractors import placeRecognition
extractor = placeRecognition.PlaceExtractor()
Baptismal Records
bautismos_place_columns = [
    'baptized_birth_place', 'event_place', 'event_geographic_descriptor_1',
    'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
    'event_geographic_descriptor_4'
]
for col in bautismos_place_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = extractor.extract_places_per_row(BAUTISMOS_HARMONIZED[col])
BAUTISMOS_HARMONIZED[bautismos_place_columns]
| | baptized_birth_place | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 |
|---|---|---|---|---|---|---|
| 0 | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN |
| 1 | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN |
| 2 | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN |
| 3 | NaN | Aucara | Aucara | NaN | NaN | NaN |
| 4 | NaN | Aucara | Aucara | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 6335 | NaN | Aucará | Aucará | Aucará | NaN | NaN |
| 6336 | NaN | Aucará | Aucará | Aucará | NaN | NaN |
| 6337 | NaN | Aucará | Aucará | Mayobamba | NaN | NaN |
| 6338 | NaN | Aucará | Aucará | Huaicahuacho | NaN | NaN |
| 6339 | NaN | Aucará | Aucará | Chacralla | NaN | NaN |
6340 rows × 6 columns
Marriage Records
matrimonios_place_columns = [
    'husband_birth_place', 'husband_resident_in',
    'wife_birth_place', 'wife_resident_in',
    'event_place', 'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
    'event_geographic_descriptor_3', 'event_geographic_descriptor_4',
    'event_geographic_descriptor_5', 'event_geographic_descriptor_6'
]
for col in matrimonios_place_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = extractor.extract_places_per_row(MATRIMONIOS_HARMONIZED[col])
MATRIMONIOS_HARMONIZED[matrimonios_place_columns]
| | husband_birth_place | husband_resident_in | wife_birth_place | wife_resident_in | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | event_geographic_descriptor_5 | event_geographic_descriptor_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ciudad de Huamanga | Aucara | NaN | NaN | Aucara | Aucara | Huamanga | Coracora | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | Aucara | Aucara | Colca | NaN | NaN | NaN | NaN |
| 2 | Pampamarca | NaN | Pampamarca | NaN | Aucara | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| 3 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca\|santa iglesia | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | Pampamarca\|santa iglesia | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1714 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| 1715 | Chacralla | NaN | Chacralla | NaN | Chacralla, iglesia vice-parroquial | Aucara | Chacralla | NaN | NaN | NaN | NaN |
| 1716 | Chacralla | NaN | Chacralla | NaN | Chacralla, iglesia vice-parroquial | Aucara | Chacralla | NaN | NaN | NaN | NaN |
| 1717 | NaN | Aucara | NaN | Aucara | Aucara | Aucara | Queca | NaN | NaN | NaN | NaN |
| 1718 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
1719 rows × 11 columns
Burial Records
entierros_place_columns = [
    'event_place', 'deceased_birth_place', 'burial_place', 'event_geographic_descriptor_1',
    'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
    'event_geographic_descriptor_4'
]
for col in entierros_place_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = extractor.extract_places_per_row(ENTIERROS_HARMONIZED[col])
ENTIERROS_HARMONIZED[entierros_place_columns]
| | event_place | deceased_birth_place | burial_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 |
|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 1 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 2 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 3 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 4 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2187 | Aucara | Santa Ana de Aucara | NaN | Aucara | Santa Ana de Aucara | NaN | NaN |
| 2188 | Aucara | Pampamarca | NaN | Aucara | Pampamarca | NaN | NaN |
| 2189 | Aucara | Santa Ana de Aucara | NaN | Aucara | Santa Ana de Aucara | NaN | NaN |
| 2190 | Aucara | Aucara | NaN | Aucara | NaN | NaN | NaN |
| 2191 | Aucara | Aucara | NaN | Aucara | NaN | NaN | NaN |
2192 rows × 7 columns
Consolidate and Resolve Places
The MapPlaces class aggregates all place mentions from the three datasets, identifies unique toponyms, and attempts automated resolution against authoritative gazetteers. The output is suppressed here as it produces verbose logging; results are saved for subsequent manual review.
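The aggregation step can be pictured with plain pandas. This is only a sketch of the idea of collecting distinct mentions across several frames, not the implementation of `MapPlaces`, whose internals are not shown here. `unique_mentions` is a hypothetical helper and the frames are toy data.

```python
import pandas as pd


def unique_mentions(frames):
    """Return the sorted distinct non-null place strings across all columns of all frames."""
    columns = [frame[col] for frame in frames for col in frame.columns]
    stacked = pd.concat(columns, ignore_index=True).dropna().astype(str).str.strip()
    # Discard strings that were only whitespace
    stacked = stacked[stacked != ""]
    return sorted(stacked.unique())


# Toy frames standing in for the per-record-type place columns
baptisms = pd.DataFrame({
    "event_place": ["Aucara", None],
    "event_geographic_descriptor_1": ["Pampamarca", "Aucara"],
})
burials = pd.DataFrame({"burial_place": ["Aucará", "Pampamarca "]})
mentions = unique_mentions([baptisms, burials])
```

In this sketch accented and unaccented spellings stay distinct, since deciding that they name the same place belongs to the resolution and curation steps.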
%%capture
# Keep only the place-related columns from each record type
bautismos_places = BAUTISMOS_HARMONIZED[bautismos_place_columns]
matrimonios_places = MATRIMONIOS_HARMONIZED[matrimonios_place_columns]
entierros_places = ENTIERROS_HARMONIZED[entierros_place_columns]
map_places = placeRecognition.MapPlaces([bautismos_places, matrimonios_places, entierros_places], places_map=PLACES_MAP)
all_unique_places = map_places.resolve_places()
print("All unique places extracted:")
print(all_unique_places)
# Save every unique mention for manual review
all_unique_places.to_csv("../data/interim/unique_places.csv", index=False)
# Keep only places resolved to a gazetteer URI, one row per URI, sorted by label
standardized_places = all_unique_places.loc[all_unique_places['uri'].notna()]
standardized_places.groupby('uri').first().reset_index().sort_values(by='standardize_label').to_csv("../data/interim/standardized_places.csv", index=False)
Database Integration
To prepare the gazetteer for relational database storage, we assign a unique integer identifier (place_id) to each canonical place. This serves as the primary key in the places table and enables efficient joins with sacramental records.
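# result_df holds the manually curated gazetteer; the step that builds it is not shown in this section.
# place_id is derived from the existing integer row index, so ids are unique but not necessarily contiguous.
result_df['place_id'] = result_df.index + 1
result_df = result_df.set_index('place_id')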
result_df
| place_id | manually_normalized_place | standardize_label | language | latitude | longitude | source | id | uri | country_code | part_of | part_of_uri | confidence | threshold | match_type | mentioned_as |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Acobamba | Acobamba | es | -12.07757 | -74.87127 | GeoNames | 8663907.0 | http://sws.geonames.org/8663907/ | PE | | | 100.0 | 90.0 | exact | [Acobamba] |
| 4 | Andamarca | Andamarca | es | -15.63833 | -70.58848 | GeoNames | 3947725.0 | http://sws.geonames.org/3947725/ | PE | | | 100.0 | 90.0 | exact | [Andamarca] |
| 6 | Apongo | Apongo | es | -14.01327 | -73.93247 | GeoNames | 3947431.0 | http://sws.geonames.org/3947431/ | PE | | | 100.0 | 90.0 | exact | [Apongo] |
| 8 | Aucará | Aucará | es | -14.25000 | -74.08333 | GeoNames | 3947087.0 | http://sws.geonames.org/3947087/ | PE | | | 100.0 | 90.0 | exact | [Aucara, Aucara Barrio de Mayo, Aucará, Barrio... |
| 12 | Huaycahuaycho | Huaycahuaycho | es | -14.15000 | -74.01667 | GeoNames | 3939003.0 | http://sws.geonames.org/3939003/ | PE | | | 100.0 | 90.0 | exact | [Aucara Huaycahuacho, Huaicahuacho] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85 | Soras Pata | Soras Pata | es | -14.23741 | -70.65011 | GeoNames | 13238703.0 | http://sws.geonames.org/13238703/ | PE | | | 100.0 | 90.0 | exact | [Soras] |
| 87 | Umasi | Umasi | es | -14.89142 | -70.68701 | GeoNames | 13238711.0 | http://sws.geonames.org/13238711/ | PE | | | 100.0 | 90.0 | exact | [Umaci, Umasi] |
| 88 | Urubamba | Urubamba | es | -13.30472 | -72.11583 | GeoNames | 3926438.0 | http://sws.geonames.org/3926438/ | PE | | | 100.0 | 90.0 | exact | [Urabamba] |
| 89 | Vilcashuamán | Vilcashuamán | es | -13.65361 | -73.95306 | GeoNames | 3926141.0 | http://sws.geonames.org/3926141/ | PE | | | 100.0 | 90.0 | exact | [Vilcas] |
| 90 | Villa San Juan | Villa San Juan | es | -6.37252 | -79.80292 | GeoNames | 3820188.0 | http://sws.geonames.org/3820188/ | PE | | | 100.0 | 90.0 | exact | [Villa de San Juan] |
74 rows × 15 columns
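The `mentioned_as` column is what ties the spellings found in the records to a `place_id`. A minimal sketch of such a join, assuming `mentioned_as` holds Python lists in memory (after a CSV round trip it would be a string and need parsing first). The rows are an abbreviated toy subset of the table above, and `lookup` and `linked` are illustrative names.

```python
import pandas as pd

# Abbreviated toy gazetteer indexed by place_id
gazetteer = pd.DataFrame(
    {
        "standardize_label": ["Aucará", "Umasi"],
        "mentioned_as": [["Aucara", "Aucará"], ["Umaci", "Umasi"]],
    },
    index=pd.Index([8, 87], name="place_id"),
)

# One row per (place_id, recorded spelling)
lookup = gazetteer["mentioned_as"].explode().rename("mention").reset_index()

# Toy records: "Colca" has no gazetteer entry in this subset
records = pd.DataFrame({"event_place": ["Aucara", "Umaci", "Colca"]})
linked = records.merge(lookup, how="left", left_on="event_place", right_on="mention")
```

A left join keeps every record and leaves `place_id` empty where no spelling matches, which makes unresolved mentions easy to list for further curation.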
result_df.to_csv("../data/clean/unique_places.csv", index=True)
Summary
This notebook established an authoritative gazetteer for the Sondondo sacramental records through two normalization phases, consolidated by authority control:
- Automated extraction identified place mentions across all three sacramental record types
- Manual curation resolved ambiguous toponyms and linked canonical forms to authoritative geographic databases
- Authority control produced a verified lookup table with stable identifiers and coordinates
The resulting dataset (data/clean/unique_places.csv) provides:
- Controlled vocabulary for place names, reducing spelling variations
- Geographic coordinates enabling spatial analysis
- Links to authoritative gazetteers (GeoNames, TGN, WHG, Wikidata) for interoperability
- Hierarchical geographic context (administrative divisions, place types)
- Stable identifiers for database integration
This gazetteer is essential for subsequent analyses involving geographic mobility, spatial clustering, and linking records across events.
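As an illustration of the spatial analysis the coordinates make possible, the sketch below computes a great-circle distance with the standard haversine formula. It assumes a spherical Earth with a mean radius of about 6371 km. `haversine_km` is a hypothetical helper, and the coordinates are those listed above for Aucará and Huaycahuaycho.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = radians(lon2 - lon1)
    a = sin(delta_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(delta_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Aucará (place_id 8) to Huaycahuaycho (place_id 12), coordinates from the gazetteer table
distance = haversine_km(-14.25, -74.08333, -14.15, -74.01667)
```

Straight-line distance ignores terrain, so in a mountainous region it is best read as a lower bound on travel distance between places.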