Place Name Normalization

Historical documents make place name identification difficult because of spelling variations, ambiguous toponyms, and evolving geographic nomenclature. This notebook establishes an authoritative gazetteer of unique places mentioned in the Sondondo sacramental records through a two-phase approach:

  1. Automated extraction: Identify and preliminarily resolve place mentions using the placeRecognition module
  2. Manual curation: Apply expert knowledge to disambiguate problematic cases and link canonical forms to authoritative geographic databases (GeoNames, Getty TGN, World Historical Gazetteer, Wikidata)

The resulting dataset (data/clean/unique_places.csv) serves as the authoritative lookup table for all place references across the project, enabling consistent spatial analysis and record linkage.

Data Preparation

We begin by loading the cleaned sacramental records, which have already undergone date normalization and attribute harmonization. Place names at this stage retain their original spelling variations as recorded in the historical documents.

import pandas as pd
PLACES_MAP = '../data/mappings/places_types.json'
BAUTISMOS_HARMONIZED = pd.read_csv("../data/clean/bautismos_clean.csv")
MATRIMONIOS_HARMONIZED = pd.read_csv("../data/clean/matrimonios_clean.csv")
ENTIERROS_HARMONIZED = pd.read_csv("../data/clean/entierros_clean.csv")

BAUTISMOS_HARMONIZED
      file              identifier  event_type  event_date  baptized_name       baptized_birth_place  baptized_birth_date  baptized_legitimacy_status  father_name  father_lastname  ...  godfather_social_condition  godmother_name  godmother_lastname  godmother_social_condition  event_place  event_geographic_descriptor_1  event_geographic_descriptor_2  event_geographic_descriptor_3  event_geographic_descriptor_4  baptized_lastname
0     APAucará LB L001  B001        Bautizo     1790-10-04  domingo             NaN                   1790-08-04           Hijo legitimo               lucas        ayquipa          ...  NaN                         NaN             NaN                 NaN                         Pampamarca   Aucara                         Pampamarca                     NaN                            NaN                            ayquipa
1     APAucará LB L001  B002        Bautizo     1790-10-06  dominga             NaN                   1790-08-04           Hija legitima               juan         lulia            ...  NaN                         NaN             NaN                 NaN                         Pampamarca   Aucara                         Pampamarca                     NaN                            NaN                            lulia
2     APAucará LB L001  B003        Bautizo     1790-10-07  bartola             NaN                   1790-08-04           Hija legitima               jacinto      quispe           ...  NaN                         rotonda         pocco               NaN                         Pampamarca   Aucara                         Pampamarca                     NaN                            NaN                            quispe
3     APAucará LB L001  B004        Bautizo     1790-10-20  francisca           NaN                   1790-10-15           Hija legitima               juan         cuebas           ...  NaN                         ysabel          guillen             NaN                         Aucara       Aucara                         NaN                            NaN                            NaN                            cuebas
4     APAucará LB L001  B005        Bautizo     1790-10-20  pedro               NaN                   1790-10-19           Hijo legitimo               santos       manxo            ...  NaN                         josefa          santiago            NaN                         Aucara       Aucara                         NaN                            NaN                            NaN                            manxo
...   ...               ...         ...         ...         ...                 ...                   ...                  ...                         ...          ...              ...  ...                         ...             ...                 ...                         ...          ...                            ...                            ...                            ...                            ...
6335  APAucará LB L004  B2042       Bautizo     1888-12-10  leocadio            NaN                   1888-12-09           Hijo natural, mestizo       miguel       pacheco          ...  NaN                         NaN             NaN                 NaN                         Aucará       Aucará                         Aucará                         NaN                            NaN                            pacheco
6336  APAucará LB L004  B2043       Bautizo     1888-12-11  mariano concepcion  NaN                   1888-12-07           Hijo legítimo, indio        facundo      vega             ...  NaN                         NaN             NaN                 NaN                         Aucará       Aucará                         Aucará                         NaN                            NaN                            vega
6337  APAucará LB L004  B2044       Bautizo     1888-12-12  ambrosio            NaN                   1888-12-06           Hijo legítimo, indio        ysidro       ccasane          ...  NaN                         NaN             NaN                 NaN                         Aucará       Aucará                         Mayobamba                      NaN                            NaN                            ccasane
6338  APAucará LB L004  B2045       Bautizo     1888-12-15  francisco           NaN                   1888-11-30           Hijo legítimo, indio        mariano      lopez            ...  Indigna de Huaicahuacho     NaN             NaN                 NaN                         Aucará       Aucará                         Huaicahuacho                   NaN                            NaN                            lopez
6339  APAucará LB L004  B2046       Bautizo     1888-12-16  laureana            NaN                   1888-12-01           Hija legítima, india        bernarda     champa           ...  NaN                         manuela         de la cruz          NaN                         Aucará       Aucará                         Chacralla                      NaN                            NaN                            champa

6340 rows × 27 columns
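
The display above already shows the same town written as both Aucara and Aucará. As a rough way to see how many spellings collapse to a single form before any gazetteer lookup, the following sketch groups names by an accent-free, lowercase key. The function name fold_key and the sample values are illustrative assumptions; they are not part of the placeRecognition module.

```python
import unicodedata
from collections import defaultdict


def fold_key(name):
    """Return a lowercase, accent-free key for grouping spelling variants."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold().strip()


# Illustrative values standing in for a raw place column such as event_place
sample = ["Aucara", "Aucará", "Pampamarca", "Aucará", None, "pampamarca "]

variants = defaultdict(set)
for name in sample:
    if name is not None:  # skip missing values
        variants[fold_key(name)].add(name)

# Sorted list of observed spellings per folded key
grouped = {key: sorted(forms) for key, forms in variants.items()}
for key, forms in grouped.items():
    print(key, forms)
```

A key with more than one spelling is a candidate for a shared canonical form, although folding alone cannot separate true homonyms.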

Automated Place Extraction

The PlaceExtractor identifies place mentions within text fields and applies initial normalization using the geographic type taxonomy defined in places_types.json. This automated process handles straightforward cases but requires manual review for ambiguous toponyms.

We systematically process all place-related columns across the three sacramental record types, applying the extractor to standardize formatting and identify geographic descriptors.

Extract Place Mentions

from actions.extractors import placeRecognition

extractor = placeRecognition.PlaceExtractor()

Baptismal Records

bautismos_place_columns = [
    'baptized_birth_place', 'event_place', 'event_geographic_descriptor_1',
    'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
    'event_geographic_descriptor_4'
]

for col in bautismos_place_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = extractor.extract_places_per_row(BAUTISMOS_HARMONIZED[col])

BAUTISMOS_HARMONIZED[bautismos_place_columns]
      baptized_birth_place  event_place  event_geographic_descriptor_1  event_geographic_descriptor_2  event_geographic_descriptor_3  event_geographic_descriptor_4
0     NaN                   Pampamarca   Aucara                         Pampamarca                     NaN                            NaN
1     NaN                   Pampamarca   Aucara                         Pampamarca                     NaN                            NaN
2     NaN                   Pampamarca   Aucara                         Pampamarca                     NaN                            NaN
3     NaN                   Aucara       Aucara                         NaN                            NaN                            NaN
4     NaN                   Aucara       Aucara                         NaN                            NaN                            NaN
...   ...                   ...          ...                            ...                            ...                            ...
6335  NaN                   Aucará       Aucará                         Aucará                         NaN                            NaN
6336  NaN                   Aucará       Aucará                         Aucará                         NaN                            NaN
6337  NaN                   Aucará       Aucará                         Mayobamba                      NaN                            NaN
6338  NaN                   Aucará       Aucará                         Huaicahuacho                   NaN                            NaN
6339  NaN                   Aucará       Aucará                         Chacralla                      NaN                            NaN

6340 rows × 6 columns

Marriage Records

matrimonios_place_columns = [
    'husband_birth_place', 'husband_resident_in',
    'wife_birth_place', 'wife_resident_in',
    'event_place', 'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
    'event_geographic_descriptor_3', 'event_geographic_descriptor_4',
    'event_geographic_descriptor_5', 'event_geographic_descriptor_6'
]

for col in matrimonios_place_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = extractor.extract_places_per_row(MATRIMONIOS_HARMONIZED[col])

MATRIMONIOS_HARMONIZED[matrimonios_place_columns]
husband_birth_place husband_resident_in wife_birth_place wife_resident_in event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 event_geographic_descriptor_5 event_geographic_descriptor_6
0 Ciudad de Huamanga Aucara NaN NaN Aucara Aucara Huamanga Coracora NaN NaN NaN
1 NaN NaN NaN NaN Aucara Aucara Colca NaN NaN NaN NaN
2 Pampamarca NaN Pampamarca NaN Aucara Aucara Pampamarca NaN NaN NaN NaN
3 Pampamarca NaN Pampamarca NaN Pampamarca|santa iglesia Aucara Pampamarca NaN NaN NaN NaN
4 NaN NaN NaN NaN Pampamarca|santa iglesia Aucara Pampamarca NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
1714 Pampamarca NaN Pampamarca NaN Pampamarca Aucara Pampamarca NaN NaN NaN NaN
1715 Chacralla NaN Chacralla NaN Chacralla, iglesia vice-parroquial Aucara Chacralla NaN NaN NaN NaN
1716 Chacralla NaN Chacralla NaN Chacralla, iglesia vice-parroquial Aucara Chacralla NaN NaN NaN NaN
1717 NaN Aucara NaN Aucara Aucara Aucara Queca NaN NaN NaN NaN
1718 Pampamarca NaN Pampamarca NaN Pampamarca Aucara Pampamarca NaN NaN NaN NaN

1719 rows × 11 columns

Burial Records

entierros_place_columns = [
    'event_place', 'deceased_birth_place', 'burial_place', 'event_geographic_descriptor_1',
    'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
    'event_geographic_descriptor_4'
]

for col in entierros_place_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = extractor.extract_places_per_row(ENTIERROS_HARMONIZED[col])

ENTIERROS_HARMONIZED[entierros_place_columns]
      event_place  deceased_birth_place  burial_place  event_geographic_descriptor_1  event_geographic_descriptor_2  event_geographic_descriptor_3  event_geographic_descriptor_4
0     NaN          NaN                   NaN           Aucará                         Lucanas                        NaN                            NaN
1     NaN          NaN                   NaN           Aucará                         Lucanas                        NaN                            NaN
2     NaN          NaN                   NaN           Aucará                         Lucanas                        NaN                            NaN
3     NaN          NaN                   NaN           Aucará                         Lucanas                        NaN                            NaN
4     NaN          NaN                   NaN           Aucará                         Lucanas                        NaN                            NaN
...   ...          ...                   ...           ...                            ...                            ...                            ...
2187  Aucara       Santa Ana de Aucara   NaN           Aucara                         Santa Ana de Aucara            NaN                            NaN
2188  Aucara       Pampamarca            NaN           Aucara                         Pampamarca                     NaN                            NaN
2189  Aucara       Santa Ana de Aucara   NaN           Aucara                         Santa Ana de Aucara            NaN                            NaN
2190  Aucara       Aucara                NaN           Aucara                         NaN                            NaN                            NaN
2191  Aucara       Aucara                NaN           Aucara                         NaN                            NaN                            NaN

2192 rows × 7 columns
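
The three loops above follow the same pattern and silently skip any column that is absent, while the display line that follows each loop would raise a KeyError for that same column. A minimal sketch of a shared helper that reports missing columns is given below. The names apply_to_place_columns and strip_whitespace are hypothetical; strip_whitespace merely stands in for extractor.extract_places_per_row, whose actual behavior is not reproduced here.

```python
import pandas as pd


def apply_to_place_columns(df, columns, transform):
    """Apply `transform` to each listed column present in `df`.

    Returns a modified copy plus the list of requested columns that were missing.
    """
    out = df.copy()
    missing = []
    for col in columns:
        if col in out.columns:
            out[col] = transform(out[col])
        else:
            missing.append(col)
    return out, missing


def strip_whitespace(series):
    """Stand-in transform (assumption): trims surrounding whitespace only."""
    return series.str.strip()


demo = pd.DataFrame({
    "event_place": [" Aucara ", None],
    "burial_place": ["Pampamarca", " Queca"],
})
requested = ["event_place", "burial_place", "deceased_birth_place"]

cleaned, missing = apply_to_place_columns(demo, requested, strip_whitespace)
print(missing)
print(cleaned)
```

Returning the missing columns makes a schema mismatch visible at extraction time instead of at the later column selection.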

Consolidate and Resolve Places

The MapPlaces class aggregates all place mentions from the three datasets, identifies unique toponyms, and attempts automated resolution against authoritative gazetteers. The output is suppressed here as it produces verbose logging; results are saved for subsequent manual review.

%%capture

bautismos_places = BAUTISMOS_HARMONIZED[bautismos_place_columns]
matrimonios_places = MATRIMONIOS_HARMONIZED[matrimonios_place_columns] 
entierros_places = ENTIERROS_HARMONIZED[entierros_place_columns]

map_places = placeRecognition.MapPlaces([bautismos_places, matrimonios_places, entierros_places], places_map=PLACES_MAP)
all_unique_places = map_places.resolve_places()
print("All unique places extracted:")
print(all_unique_places)
all_unique_places.to_csv("../data/interim/unique_places.csv", index=False)
# Keep only places that resolved to a gazetteer URI, one row per URI, sorted by label
standardized_places = all_unique_places.loc[all_unique_places['uri'].notna()]
(
    standardized_places.groupby('uri').first().reset_index()
    .sort_values(by='standardize_label')
    .to_csv("../data/interim/standardized_places.csv", index=False)
)
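
How MapPlaces aggregates mentions internally is not shown in this notebook. As an illustration of the aggregation idea only, the sketch below pools every place column from several frames into one sorted list of distinct strings. collect_unique_mentions is a hypothetical helper, not part of placeRecognition, and the demo frames are invented.

```python
import pandas as pd


def collect_unique_mentions(frames):
    """Pool every column of every frame and return the distinct non-null strings, sorted."""
    columns = [frame[col] for frame in frames for col in frame.columns]
    pooled = pd.concat(columns, ignore_index=True).dropna()
    return sorted(pooled.astype(str).str.strip().unique())


bautismos_demo = pd.DataFrame({
    "event_place": ["Aucara", "Pampamarca"],
    "event_geographic_descriptor_1": ["Aucara", None],
})
entierros_demo = pd.DataFrame({"burial_place": [None, "Queca "]})

mentions = collect_unique_mentions([bautismos_demo, entierros_demo])
print(mentions)
```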

Manual Curation and Authority Control

Automated place resolution, while effective for unambiguous cases, cannot handle toponymic challenges such as:

  • Homonyms: Multiple places sharing the same name (e.g., “Pampamarca” appears in multiple regions)
  • Historical name changes: Places known by different names across time periods
  • Spelling variations: Inconsistent orthography in colonial records (e.g., “Ishua” vs “Ischua”)
  • Ambiguous references: Generic descriptors that could refer to multiple locations

To address these issues, we created data/interim/unique_places_manual.csv, an authority file that maps variant forms (mentioned_as) to canonical place names with verified geographic coordinates and gazetteer identifiers.
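
The following sketch illustrates how such an authority file can be applied. It assumes one row per variant, with the columns mentioned_as and manually_normalized_place; the actual layout of the CSV may differ. The sample variants are spellings that appear elsewhere in this notebook.

```python
import pandas as pd

# Assumed layout: one row per variant spelling (illustrative rows only)
authority = pd.DataFrame({
    "mentioned_as": ["Aucara", "Aucará", "Umaci", "Umasi"],
    "manually_normalized_place": ["Aucará", "Aucará", "Umasi", "Umasi"],
})

# Variant -> canonical lookup
lookup = dict(zip(authority["mentioned_as"], authority["manually_normalized_place"]))

mentions = pd.Series(["Aucara", "Umaci", "Ishua"])
canonical = mentions.map(lookup)  # unmapped variants become NaN

# Variants that still need a row in the authority file
unmapped = mentions[canonical.isna()].tolist()
print(canonical.tolist())
print(unmapped)
```

Listing the unmapped variants gives the curator a concrete worklist instead of a silent gap.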

Validation Against Manual Normalization

We compare the automated extraction results against the manually curated authority file to identify discrepancies and ensure completeness.

uplaces = pd.read_csv('../data/interim/unique_places_manual.csv')
set_diff = set(standardized_places['standardize_label']) - set(uplaces['manually_normalized_place'])
set_diff
{'Cachihuancaray',
 'Carhuanca',
 'Carlos Fitzcarrald',
 'Ceibo Roto',
 'Champa',
 'Chaupicancha',
 'Chipao',
 'Chuschi',
 'Ciudad Libertad de las Américas',
 'Cochas',
 'Collay',
 'Coracora',
 'Dos de Mayo',
 'Huambo',
 'Huanacopampa',
 'Huancapi',
 'Huancaraylla',
 'Illapata',
 'India Muerta',
 'Indio Piro',
 'Julca',
 'Llusita',
 'Marca',
 'Paico',
 'Paire',
 'Palco',
 'Pausa',
 'Pincocalla',
 'Poma Patacollo',
 'Queca',
 'Querco',
 'San Pedro de Lloc',
 'Santa Anita - Los Ficus',
 'Santa Iglesia',
 'Santa María',
 'Taulli',
 'Yanaccollpa'}
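
The cell above checks one direction only: labels produced by the automated step that are absent from the manual file. A small sketch of a two-way comparison follows. compare_labels is a hypothetical helper and the sample labels are illustrative.

```python
def compare_labels(automated, manual):
    """Split two label collections into automated-only, manual-only, and shared labels."""
    automated, manual = set(automated), set(manual)
    return {
        "only_automated": sorted(automated - manual),
        "only_manual": sorted(manual - automated),
        "shared": sorted(automated & manual),
    }


report = compare_labels(
    automated=["Aucará", "Chipao", "Queca"],
    manual=["Aucará", "Huaycahuaycho"],
)
for group, labels in report.items():
    print(group, labels)
```

With real data, missing values should be dropped from both columns first, since NaN entries in a set do not compare equal to one another.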

Authoritative Place Resolution

The AuthoritativePlaceResolver applies the manually curated authority file to resolve all place mentions. For each canonical place name, it queries multiple authoritative gazetteers in priority order:

  1. GeoNames - comprehensive global gazetteer with detailed administrative hierarchies
  2. Getty Thesaurus of Geographic Names (TGN) - art historical geographic authority
  3. World Historical Gazetteer (WHG) - specialized in historical place names
  4. Wikidata - linked data resource with extensive geographic coverage

This process produces data/clean/unique_places.csv, the authoritative lookup table linking historical place mentions to verified geographic entities with coordinates, hierarchical context, and stable identifiers.
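
The priority order described above amounts to a fallback chain: try each gazetteer in turn and stop at the first match. The sketch below shows that control flow with stub lookups in place of real gazetteer clients. It is not the implementation of AuthoritativePlaceResolver; resolve_in_priority_order, the stub dictionaries, and the placeholder URI are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Lookup = Callable[[str], Optional[Dict[str, str]]]


def resolve_in_priority_order(
    name: str, resolvers: List[Tuple[str, Lookup]]
) -> Optional[Dict[str, str]]:
    """Return the first match found, tagged with the source that supplied it."""
    for source, lookup in resolvers:
        match = lookup(name)
        if match is not None:
            return {"source": source, **match}
    return None


# Stub lookups: plain dictionaries stand in for real gazetteer clients
geonames_stub = {"Aucará": {"uri": "http://sws.geonames.org/3947087/"}}
wikidata_stub = {"Lugar de prueba": {"uri": "placeholder:wikidata-entity"}}
empty_stub: Dict[str, Dict[str, str]] = {}

resolvers = [
    ("GeoNames", geonames_stub.get),
    ("TGN", empty_stub.get),
    ("WHG", empty_stub.get),
    ("Wikidata", wikidata_stub.get),
]

first_hit = resolve_in_priority_order("Aucará", resolvers)
fallback_hit = resolve_in_priority_order("Lugar de prueba", resolvers)
no_hit = resolve_in_priority_order("Nowhere", resolvers)
print(first_hit, fallback_hit, no_hit)
```

Recording the source alongside each match keeps the provenance of every coordinate traceable, which matches the source column in the final table.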

Note: We extended places_types.json to include “administrative division” as a recognized geographic type, enabling better classification of jurisdictional references in the historical records:

"administrative division": {
    "geonames": "A",
    "wikidata": "Q56061",
    "tgn": "administrative divisions",
    "whg": "a"
}

%%capture

manual_data = pd.read_csv('../data/interim/unique_places_manual.csv')

resolver = placeRecognition.AuthoritativePlaceResolver(data=manual_data, places_map=PLACES_MAP)
result_df = resolver.resolve_places()

print("Resolved places:")
print(result_df)

Database Integration

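To prepare the gazetteer for relational database storage, we assign a unique integer identifier (place_id) to each canonical place. The identifier is derived from the existing DataFrame index (index + 1), so it is unique but not necessarily contiguous, as the gaps in the output below show. It serves as the primary key in the places table and enables efficient joins with sacramental records.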

result_df['place_id'] = result_df.index + 1
result_df = result_df.set_index('place_id')

result_df
place_id  manually_normalized_place  standardize_label  language  latitude   longitude  source    id          uri                                country_code  part_of  part_of_uri  confidence  threshold  match_type  mentioned_as
1         Acobamba                   Acobamba           es        -12.07757  -74.87127  GeoNames  8663907.0   http://sws.geonames.org/8663907/   PE                                  100.0       90.0       exact       [Acobamba]
4         Andamarca                  Andamarca          es        -15.63833  -70.58848  GeoNames  3947725.0   http://sws.geonames.org/3947725/   PE                                  100.0       90.0       exact       [Andamarca]
6         Apongo                     Apongo             es        -14.01327  -73.93247  GeoNames  3947431.0   http://sws.geonames.org/3947431/   PE                                  100.0       90.0       exact       [Apongo]
8         Aucará                     Aucará             es        -14.25000  -74.08333  GeoNames  3947087.0   http://sws.geonames.org/3947087/   PE                                  100.0       90.0       exact       [Aucara, Aucara Barrio de Mayo, Aucará, Barrio...
12        Huaycahuaycho              Huaycahuaycho      es        -14.15000  -74.01667  GeoNames  3939003.0   http://sws.geonames.org/3939003/   PE                                  100.0       90.0       exact       [Aucara Huaycahuacho, Huaicahuacho]
...       ...                        ...                ...       ...        ...        ...       ...         ...                                ...           ...      ...          ...         ...        ...         ...
85        Soras Pata                 Soras Pata         es        -14.23741  -70.65011  GeoNames  13238703.0  http://sws.geonames.org/13238703/  PE                                  100.0       90.0       exact       [Soras]
87        Umasi                      Umasi              es        -14.89142  -70.68701  GeoNames  13238711.0  http://sws.geonames.org/13238711/  PE                                  100.0       90.0       exact       [Umaci, Umasi]
88        Urubamba                   Urubamba           es        -13.30472  -72.11583  GeoNames  3926438.0   http://sws.geonames.org/3926438/   PE                                  100.0       90.0       exact       [Urabamba]
89        Vilcashuamán               Vilcashuamán       es        -13.65361  -73.95306  GeoNames  3926141.0   http://sws.geonames.org/3926141/   PE                                  100.0       90.0       exact       [Vilcas]
90        Villa San Juan             Villa San Juan     es        -6.37252   -79.80292  GeoNames  3820188.0   http://sws.geonames.org/3820188/   PE                                  100.0       90.0       exact       [Villa de San Juan]

74 rows × 15 columns
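
The identifiers in the table above skip values (1, 4, 6, 8, 12, ...) because they follow the index the frame already had. The sketch below contrasts that behavior with a renumbering alternative on a toy frame. The toy index and the variable names are illustrative assumptions; whether gaps matter depends on how the places table is used downstream.

```python
import pandas as pd

# Toy frame whose index has gaps, mimicking a filtered resolver output
resolved = pd.DataFrame(
    {"standardize_label": ["Acobamba", "Andamarca", "Apongo"]},
    index=[0, 3, 5],
)

# Ids that follow the existing index: gaps are preserved
from_index = resolved.index + 1

# Alternative: discard the old index first, then number rows consecutively
renumbered = resolved.reset_index(drop=True)
renumbered["place_id"] = renumbered.index + 1
renumbered = renumbered.set_index("place_id")

print(list(from_index))
print(list(renumbered.index))
```

Either scheme yields a valid primary key; what matters is that identifiers stay fixed once sacramental records reference them.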

result_df.to_csv("../data/clean/unique_places.csv", index=True)
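
As a sketch of how the saved gazetteer might later be joined to sacramental records, the example below reloads a small CSV, expands the mentioned_as lists into one row per variant, and merges on the recorded spelling. It assumes that mentioned_as holds Python lists, which to_csv writes as their string representation; the records frame and its values are illustrative.

```python
import ast
import io

import pandas as pd

places = pd.DataFrame(
    {
        "standardize_label": ["Aucará", "Umasi"],
        "mentioned_as": [["Aucara", "Aucará"], ["Umaci", "Umasi"]],
    },
    index=pd.Index([8, 87], name="place_id"),
)

# Round-trip through CSV text: list cells come back as strings
buffer = io.StringIO()
places.to_csv(buffer, index=True)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Parse the stringified lists, then build one row per variant spelling
reloaded["mentioned_as"] = reloaded["mentioned_as"].map(ast.literal_eval)
lookup = reloaded.explode("mentioned_as")[["mentioned_as", "place_id"]]

records = pd.DataFrame({
    "identifier": ["B001", "B002", "B003"],
    "event_place": ["Aucara", "Umaci", "Queca"],
})
linked = records.merge(lookup, how="left", left_on="event_place", right_on="mentioned_as")
linked["place_id"] = linked["place_id"].astype("Int64")  # nullable integer keeps missing ids
print(linked[["identifier", "event_place", "place_id"]])
```

A left join keeps records whose place has no gazetteer entry, so unmatched mentions remain visible for further curation.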

Summary

This notebook established an authoritative gazetteer for the Sondondo sacramental records through two normalization phases and a final resolution step:

  1. Automated extraction identified place mentions across all three sacramental record types
  2. Manual curation resolved ambiguous toponyms and linked canonical forms to authoritative geographic databases
  3. Authority control produced a verified lookup table with stable identifiers and coordinates

The resulting dataset (data/clean/unique_places.csv) provides:

  • Controlled vocabulary for place names, reducing spelling variations
  • Geographic coordinates enabling spatial analysis
  • Links to authoritative gazetteers (GeoNames, TGN, WHG, Wikidata) for interoperability
  • Hierarchical geographic context (administrative divisions, place types)
  • Stable identifiers for database integration

This gazetteer is essential for subsequent analyses involving geographic mobility, spatial clustering, and linking records across events.