Data Cleaning

This notebook performs comprehensive data cleaning and standardization of historical sacramental records from Sondondo. The process includes column harmonization, null value handling, temporal normalization, name standardization, and geographic data extraction, preparing the datasets for subsequent person entity extraction and probabilistic record linkage.

# Libraries
import json
import pandas as pd
import numpy as np
import re
from pathlib import Path

1. Data Loading

The raw datasets are stored in the data/raw directory and include three types of sacramental records:

  • bautismos.csv: Baptism records
  • matrimonios.csv: Marriage records
  • entierros.csv: Burial records
BAUTISMOS_RAW = pd.read_csv("../data/raw/bautismos.csv")
MATRIMONIOS_RAW = pd.read_csv("../data/raw/matrimonios.csv")
ENTIERROS_RAW = pd.read_csv("../data/raw/entierros.csv")

BAUTISMOS_RAW.head()
Secuencia Unidad Documental Compuesta (a la que pertenece) Identificador (es recomendable seguir una secuencia numeral como la mostrada en los ejemplos) Título (incluir un título breve para cada documento) Folio inicial del documento (convertir como se muestra abajo) Folio final del documento (convertir como se muestra abajo) Imagen inicial (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) Imagen final (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) Tipo de evento Fecha aaaa-mm-dd ... Condición de la madrina Lugar de bautizo Notas adicionales del documento Descriptor Geográfico 1 Descriptor Geográfico 2 Descriptor Geográfico 3 Descriptor Geográfico 4 5 Características físicas (Estado de conservación de los materiales físicos) Historia de revisión (de los materiales digitalizados)
0 1.0 APAucará LB L001 B001 Bautizo. Domingo. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-04 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
1 2.0 APAucará LB L001 B002 Bautizo. Dominga. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-06 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
2 3.0 APAucará LB L001 B003 Bautizo. Bartola. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-07 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
3 4.0 APAucará LB L001 B004 Bautizo. Francisca 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Abreviatura poco visible en el margen Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
4 5.0 APAucará LB L001 B005 Bautizo. Pedro 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Margen roto y manchado de tinta Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023

5 rows × 36 columns

2. Column Harmonization

Standardize column names across datasets to ensure consistency. Different historical sources may use different field names for equivalent data (e.g., “nombre” vs “name”), so we map these variants to a unified schema.

Schema Mapping

Column mappings are defined in JSON files located in data/mappings/, which specify how raw column names should be renamed to match our standardized schema.

# Import the ColumnManager utility for applying schema mappings
from utils.ColumnManager import ColumnManager
bautismoMapping = Path("../data/mappings/bautismosMapping.json")
matrimonioMapping = Path("../data/mappings/matrimoniosMapping.json")
entierroMapping = Path("../data/mappings/entierrosMapping.json")

column_manager = ColumnManager()

BAUTISMOS_HARMONIZED = column_manager.harmonize_columns(BAUTISMOS_RAW, bautismoMapping)
MATRIMONIOS_HARMONIZED = column_manager.harmonize_columns(MATRIMONIOS_RAW, matrimonioMapping)
ENTIERROS_HARMONIZED = column_manager.harmonize_columns(ENTIERROS_RAW, entierroMapping)

BAUTISMOS_HARMONIZED.head()
id file identifier title start_folio end_folio start_image end_image event_type event_date ... godmother_social_condition event_place event_additional_notes event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 event_other record_physical_characteristics revision_history
0 1.0 APAucará LB L001 B001 Bautizo. Domingo. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-04 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
1 2.0 APAucará LB L001 B002 Bautizo. Dominga. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-06 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
2 3.0 APAucará LB L001 B003 Bautizo. Bartola. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-07 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
3 4.0 APAucará LB L001 B004 Bautizo. Francisca 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Abreviatura poco visible en el margen Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
4 5.0 APAucará LB L001 B005 Bautizo. Pedro 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Margen roto y manchado de tinta Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023

5 rows × 36 columns

Column Selection and Filtering

After harmonization, we reduce each dataset to only the columns relevant for person entity extraction and record linkage. This simplifies downstream processing and removes fields that don’t contribute to linkage (e.g., administrative notes, scribal annotations).

# Load the mapping of useful columns for each record type
# This defines which fields are relevant for entity extraction and linkage
useful_columns = json.load(open("../data/mappings/usefulColumnsMapping.json"))
# Filter baptism records to useful columns only
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED[useful_columns['bautizo']]

# Remove any columns that are entirely empty
BAUTISMOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

BAUTISMOS_HARMONIZED.columns
Index(['file', 'identifier', 'event_type', 'event_date', 'baptized_name',
       'baptized_birth_place', 'baptized_birth_date',
       'baptized_legitimacy_status', 'father_name', 'father_lastname',
       'father_social_condition', 'mother_name', 'mother_lastname',
       'mother_social_condition', 'parents_social_condition', 'godfather_name',
       'godfather_lastname', 'godfather_social_condition', 'godmother_name',
       'godmother_lastname', 'godmother_social_condition', 'event_place',
       'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
       'event_geographic_descriptor_3', 'event_geographic_descriptor_4'],
      dtype='object')
# Filter marriage records to useful columns only
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED[useful_columns['matrimonio']]

# Remove any columns that are entirely empty
MATRIMONIOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

MATRIMONIOS_HARMONIZED.columns
Index(['file', 'identifier', 'event_type', 'event_date', 'husband_name',
       'husband_lastname', 'husband_social_condition',
       'husband_marital_status', 'husband_birth_date', 'husband_birth_place',
       'husband_resident_in', 'husband_legitimacy_status',
       'husband_father_name', 'husband_father_lastname',
       'husband_father_social_condition', 'husband_mother_name',
       'husband_mother_lastname', 'husband_mother_social_condition',
       'wife_name', 'wife_lastname', 'wife_social_condition',
       'wife_marital_status', 'wife_birth_date', 'wife_birth_place',
       'wife_resident_in', 'wife_legitimacy_status', 'wife_father_name',
       'wife_father_lastname', 'wife_father_social_condition',
       'wife_mother_name', 'wife_mother_lastname',
       'wife_mother_social_condition', 'godparent_1_name',
       'godparent_1_lastname', 'godparent_1_social_condition',
       'godparent_2_name', 'godparent_2_lastname',
       'godparent_2_social_condition', 'godparent_3_name',
       'godparent_3_lastname', 'witness_1_name', 'witness_1_lastname',
       'witness_2_name', 'witness_2_lastname', 'witness_3_name',
       'witness_3_lastname', 'witness_4_name', 'witness_4_lastname',
       'event_place', 'event_geographic_descriptor_1',
       'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
       'event_geographic_descriptor_4', 'event_geographic_descriptor_5',
       'event_geographic_descriptor_6'],
      dtype='object')
# Filter burial records to useful columns only
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED[useful_columns['entierro']]

# Remove any columns that are entirely empty
ENTIERROS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

ENTIERROS_HARMONIZED.columns
Index(['file', 'identifier', 'event_type', 'event_date', 'doctrine',
       'event_place', 'deceased_name', 'deceased_lastname',
       'deceased_birth_date', 'deceased_birth_place',
       'deceased_social_condition', 'deceased_marital_status',
       'deceased_legitimacy_status', 'father_name', 'father_lastname',
       'mother_name', 'mother_lastname', 'husband_name', 'wife_name',
       'burial_place', 'event_geographic_descriptor_1',
       'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
       'event_geographic_descriptor_4'],
      dtype='object')

3. Data Quality Preprocessing

Null Value Standardization

Standardize the representation of missing data by converting various placeholder strings (e.g., ‘-’, ‘n/a’, ‘null’) to proper np.nan values. This ensures consistent missing data handling throughout the analysis pipeline.

# Define helper function to replace placeholder strings with numpy NaN
def replace_empty_with_na(df):
    """
    Replace placeholder strings with np.nan in string columns only.
    Handles common representations of missing data found in historical records.
    """
    placeholders = {'', '-', '--', 'n/a', 'na', 'null', 'None'}

    def clean_cell(val):
        if isinstance(val, str) and val.strip().lower() in placeholders:
            return np.nan
        return val

    return df.map(clean_cell)
BAUTISMOS_HARMONIZED = replace_empty_with_na(BAUTISMOS_HARMONIZED)
MATRIMONIOS_HARMONIZED = replace_empty_with_na(MATRIMONIOS_HARMONIZED)
ENTIERROS_HARMONIZED = replace_empty_with_na(ENTIERROS_HARMONIZED)

Empty Row Removal

Remove rows where all fields are empty, as these represent blank entries or page breaks in the original record books rather than actual sacramental events.

BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.dropna(how='all')
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.dropna(how='all')
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.dropna(how='all')

Event Type Corrections

Baptism Event Type Errors

Two records in the baptism dataset have incorrect event types that need correction or removal.

BAUTISMOS_HARMONIZED.loc[~(BAUTISMOS_HARMONIZED['event_type'] == 'Bautizo')]
file identifier event_type event_date baptized_name baptized_birth_place baptized_birth_date baptized_legitimacy_status father_name father_lastname ... godfather_lastname godfather_social_condition godmother_name godmother_lastname godmother_social_condition event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4
1900 APAucará LB L001 B1901 Concurso NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5414 APAucará LB L003 B1120 44444 1869-09-17 Juan Yshua 12 días Hijo natural Ygnacio Sevidola ... Origuela NaN NaN NaN NaN Aucará Aucará NaN NaN NaN

2 rows × 26 columns

Event 1900 is basically an empty record, so we will drop it. While event 5414 requires correction:

BAUTISMOS_HARMONIZED.drop(1900, inplace=True)

BAUTISMOS_HARMONIZED.loc[5414, 'event_type'] = 'Bautizo'
BAUTISMOS_HARMONIZED.iloc[5414]
file                             APAucará LB L003
identifier                                  B1121
event_type                                Bautizo
event_date                             1869-09-17
baptized_name                        Maria Isabel
baptized_birth_place                    Chacralla
baptized_birth_date                        4 días
baptized_legitimacy_status          Hija legítima
father_name                                  José
father_lastname                           Cerrano
father_social_condition                       NaN
mother_name                               Manuela
mother_lastname                            Avalos
mother_social_condition                       NaN
parents_social_condition                      NaN
godfather_name                                NaN
godfather_lastname                            NaN
godfather_social_condition                    NaN
godmother_name                            Brijida
godmother_lastname                            NaN
godmother_social_condition                    NaN
event_place                                Aucará
event_geographic_descriptor_1              Aucará
event_geographic_descriptor_2                 NaN
event_geographic_descriptor_3                 NaN
event_geographic_descriptor_4                 NaN
Name: 5415, dtype: object

4. Temporal Data Normalization

Date Standardization

Ensure all dates are in a consistent ISO format (YYYY-MM-DD) across all datasets. Historical records often contain dates in various formats (e.g., “15 de Marzo de 1850”, “3/15/1850”) that need to be parsed and standardized.

from actions.normalizers.DatesNormalizer import DateNormalizer
BAUTISMOS_HARMONIZED['event_date'] = DateNormalizer(BAUTISMOS_HARMONIZED['event_date']).normalize()
BAUTISMOS_HARMONIZED['event_date']
0       1790-10-04
1       1790-10-06
2       1790-10-07
3       1790-10-20
4       1790-10-20
           ...    
6336    1888-12-10
6337    1888-12-11
6338    1888-12-12
6339    1888-12-15
6340    1888-12-16
Name: event_date, Length: 6340, dtype: object
MATRIMONIOS_HARMONIZED['event_date'] = DateNormalizer(MATRIMONIOS_HARMONIZED['event_date']).normalize()
MATRIMONIOS_HARMONIZED['event_date']
0       1816-12-06
1       1816-12-12
2       1817-03-05
3       1817-03-10
4       1817-03-12
           ...    
1714    1907-10-27
1715    1908-01-13
1716    1908-01-15
1717    1908-02-15
1718    1908-03-17
Name: event_date, Length: 1719, dtype: object
ENTIERROS_HARMONIZED['event_date'] = DateNormalizer(ENTIERROS_HARMONIZED['event_date']).normalize()
ENTIERROS_HARMONIZED['event_date']
0       1846-10-06
1       1846-10-07
2       1846-11-02
3       1846-12-08
4       1847-02-23
           ...    
2193    1920-10-12
2194    1920-10-19
2195    1920-10-19
2196    1920-10-20
2197    1920-10-21
Name: event_date, Length: 2192, dtype: object

Birth Date Inference

When ages are recorded instead of birth dates, infer birth dates by subtracting the recorded age from the event date. This creates comparable temporal data across records that originally used different date recording conventions.

from actions.generators.AgeInferrer import AgeInferrer
BAUTISMOS_HARMONIZED['baptized_birth_date'] = AgeInferrer(BAUTISMOS_HARMONIZED['event_date']).infer_all(BAUTISMOS_HARMONIZED['baptized_birth_date'])
BAUTISMOS_HARMONIZED[['event_date', 'baptized_birth_date']]
event_date baptized_birth_date
0 1790-10-04 1790-08-04
1 1790-10-06 1790-08-04
2 1790-10-07 1790-08-04
3 1790-10-20 1790-10-15
4 1790-10-20 1790-10-19
... ... ...
6336 1888-12-10 1888-12-09
6337 1888-12-11 1888-12-07
6338 1888-12-12 1888-12-06
6339 1888-12-15 1888-11-30
6340 1888-12-16 1888-12-01

6340 rows × 2 columns

### Date Validation

# Check for chronologically inconsistent dates: event_date should be after baptized_birth_date
invalid_mask = pd.to_datetime(BAUTISMOS_HARMONIZED['event_date'], errors='coerce') < pd.to_datetime(BAUTISMOS_HARMONIZED['baptized_birth_date'], errors='coerce')
if invalid_mask.any():
    print("Found invalid records:")
    print(BAUTISMOS_HARMONIZED[invalid_mask][['event_date', 'baptized_birth_date']])
Found invalid records:
      event_date baptized_birth_date
135   1792-03-29          1792-04-08
290   1794-01-01          1794-01-27
671   1797-07-15          1797-07-24
2814  1900-04-01          1900-04-09

Unfortunately, these records were incorrectly recorded. To fix this, it is necessary to check with the original records.

MATRIMONIOS_HARMONIZED['husband_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['husband_birth_date'])
MATRIMONIOS_HARMONIZED['wife_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['wife_birth_date'])
MATRIMONIOS_HARMONIZED[['event_date', 'husband_birth_date', 'wife_birth_date']]
event_date husband_birth_date wife_birth_date
0 1816-12-06 NaN NaN
1 1816-12-12 NaN NaN
2 1817-03-05 NaN NaN
3 1817-03-10 NaN NaN
4 1817-03-12 NaN NaN
... ... ... ...
1714 1907-10-27 1882-11-01 1880-11-01
1715 1908-01-13 1880-01-19 1879-01-19
1716 1908-01-15 1886-01-19 1888-01-19
1717 1908-02-15 1883-02-20 1886-02-19
1718 1908-03-17 1880-03-23 1879-03-24

1719 rows × 3 columns

ENTIERROS_HARMONIZED['deceased_birth_date'] = AgeInferrer(ENTIERROS_HARMONIZED['event_date']).infer_all(ENTIERROS_HARMONIZED['deceased_birth_date'])
ENTIERROS_HARMONIZED[['event_date', 'deceased_birth_date']]
event_date deceased_birth_date
0 1846-10-06 NaN
1 1846-10-07 1821-10-13
2 1846-11-02 1766-11-21
3 1846-12-08 1806-12-18
4 1847-02-23 1797-03-06
... ... ...
2193 1920-10-12 1920-02-15
2194 1920-10-19 1870-10-31
2195 1920-10-19 1913-10-21
2196 1920-10-20 1919-10-21
2197 1920-10-21 1920-10-19

2192 rows × 2 columns

5. Name Standardization

General Name Normalization

Standardize name fields to ensure consistency for matching. This includes normalizing whitespace, removing punctuation, handling abbreviations, and standardizing capitalization.

from actions.normalizers.NamesNormalizer import NamesNormalizer
# Define helper function to apply name normalization to a series
def normalize_names_columns(series):
    """Apply NamesNormalizer to clean and standardize a series of name values."""
    namesManager = NamesNormalizer()
    return namesManager.clean_series(series)
names_columns = [
    'baptized_name', 
    'father_name', 'father_lastname',
    'mother_name', 'mother_lastname',
    'godfather_name', 'godfather_lastname', 
    'godmother_name', 'godmother_lastname',
]

for col in names_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = normalize_names_columns(BAUTISMOS_HARMONIZED[col])

BAUTISMOS_HARMONIZED[names_columns]
baptized_name father_name father_lastname mother_name mother_lastname godfather_name godfather_lastname godmother_name godmother_lastname
0 domingo lucas ayquipa sevastiana quispe vicente guamani NaN NaN
1 dominga juan lulia jospha gomes ignacio varientos NaN NaN
2 bartola jacinto quispe juliana chinchay NaN NaN rotonda pocco
3 francisca juan cuebas clemenzia manco NaN NaN ysabel guillen
4 pedro santos manxo baleriana arango NaN NaN josefa santiago
... ... ... ... ... ... ... ... ... ...
6336 leocadio miguel pacheco rosa huarcaya josé julián bendezú NaN NaN
6337 mariano concepcion facundo vega silvestra urbano fernando mancco NaN NaN
6338 ambrosio ysidro ccasane rita palomino juan tito NaN NaN
6339 francisco mariano lopez leocadia medina feliciano dias NaN NaN
6340 laureana bernarda champa rufina lopez NaN NaN manuela de la cruz

6340 rows × 9 columns

Dataset-Specific Name Processing

Baptisms: Surname Inference

Baptized individuals’ surnames were not directly recorded in the original registers. We infer the surname from the father’s surname (following patrilineal naming conventions), with the mother’s surname as a fallback if the father’s is unavailable.

BAUTISMOS_HARMONIZED['baptized_lastname'] = BAUTISMOS_HARMONIZED.apply(
    lambda row: row['father_lastname'] if pd.notna(row['father_lastname']) else (
        row['mother_lastname'] if pd.notna(row['mother_lastname']) else 'na'
    ),
    axis=1
)
BAUTISMOS_HARMONIZED['baptized_lastname']
0       ayquipa
1         lulia
2        quispe
3        cuebas
4         manxo
         ...   
6336    pacheco
6337       vega
6338    ccasane
6339      lopez
6340     champa
Name: baptized_lastname, Length: 6340, dtype: object

Marriages: Name Field Normalization

Marriage records contain numerous name fields for both spouses, their parents, godparents, and witnesses. All of these require standardization.

matrimonios_names_columns = [
    'husband_name', 'husband_lastname', 
       'husband_father_name', 'husband_father_lastname', 
       'husband_mother_name', 'husband_mother_lastname',
       'wife_name', 'wife_lastname',
       'wife_father_name', 'wife_father_lastname',
       'wife_mother_name', 'wife_mother_lastname', 
       'godparent_1_name', 'godparent_1_lastname',
       'godparent_2_name', 'godparent_2_lastname', 
       'godparent_3_name', 'godparent_3_lastname', 
       'witness_1_name', 'witness_1_lastname', 
       'witness_2_name', 'witness_2_lastname',
       'witness_3_name', 'witness_3_lastname', 
       'witness_4_name', 'witness_4_lastname'
]

for col in matrimonios_names_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = normalize_names_columns(MATRIMONIOS_HARMONIZED[col])

MATRIMONIOS_HARMONIZED[matrimonios_names_columns]
husband_name husband_lastname husband_father_name husband_father_lastname husband_mother_name husband_mother_lastname wife_name wife_lastname wife_father_name wife_father_lastname ... godparent_3_name godparent_3_lastname witness_1_name witness_1_lastname witness_2_name witness_2_lastname witness_3_name witness_3_lastname witness_4_name witness_4_lastname
0 josé manl manuel de la roca acencio roca leonor guerrero juana rodrigues pedro rodrigues ... NaN NaN agustin castro mariano castro juan baldes NaN NaN
1 esteban castillo matheo castillo ma maria torres ambrocia tasqui pedro tasqui ... NaN NaN pedro manco carlos canto pedro guamani NaN NaN
2 alexandro ramires leonor romani franca francisca paucar sipriana coillo cristobal coillo ... NaN NaN marcelo llamuca julian urbano antonio urbano NaN NaN
3 jose cuchu acencio cuchu baleriana antay cacimira flores NaN NaN ... NaN NaN pablo roque antonio urbano cristobal coillo NaN NaN
4 domingo tito NaN NaN marcela guauya petrona guallpatuiru agustin guallpatuiru ... NaN NaN marcelo llamuca antonio guamani mariano guallpatuiru NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1714 patrocinio chinchay miguel chinchay andrea polanco lorenza quispe gervacio quispe ... NaN NaN juan de dios barrientos manuel espinosa crisostomo pumarino NaN NaN
1715 gerónimo cucho ambrocio cucho gertrudis serrano teresa jimenes aniseto jimenes ... NaN NaN victor saravia mateo aiquipa felix cucho NaN NaN
1716 josé coro felix coro natividad cucho emilia huamani patricio huamani ... NaN NaN pablo de la cruz victor saravia marcelo ramos NaN NaN
1717 pedro gutierres ruperto gutierrez micaila oscco juliana huarcaya hilario huarcaya ... NaN NaN rafael delgado josé vivanco agustin vicente NaN NaN
1718 benito sanchez tiburcio sanchez lorenza aymi teresa poma bernardo poma ... NaN NaN santos espejo felix sanches mariano urbano NaN NaN

1719 rows × 26 columns

entierros_names_columns = [
    "deceased_name", "deceased_lastname",
    "father_name", "father_lastname",
    "mother_name", "mother_lastname",
    "husband_name", "wife_name",
]

for col in entierros_names_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = normalize_names_columns(ENTIERROS_HARMONIZED[col])

ENTIERROS_HARMONIZED[entierros_names_columns]
deceased_name deceased_lastname father_name father_lastname mother_name mother_lastname husband_name wife_name
0 julian xavies NaN NaN NaN NaN NaN mercedes lupa
1 joce raime NaN NaN NaN NaN NaN francisca cucho
2 martina condori NaN NaN NaN NaN luciano ccoyllo NaN
3 dorotea ccoyllo NaN NaN NaN NaN josé espinosa NaN
4 maría romani NaN NaN NaN NaN mariano huallpatuiro NaN
... ... ... ... ... ... ... ... ...
2193 nieves huallpatuero patrocinio huallpatuero teresa urbano NaN NaN
2194 sinforiano huamani eustaquio huamani martina llamoca NaN gregoria ccoillo
2195 salomé condori sebastian condori anacla roque NaN NaN
2196 maría chinchay abdon chinchay floriza lopez NaN NaN
2197 fortunato florez cecilio florez guadalupe ramos NaN NaN

2192 rows × 8 columns

Burials: Name Field Normalization

Burial records contain name fields for the deceased and their family members.

Burials: Full Name Splitting

In burial records, deceased spouses (husband/wife) were recorded as full names in a single field. We need to split these into separate first name and surname fields for consistency with other datasets.

# Define helper function to split full names into first name and surname
def name_splitter(name):
    """
    Split a full name into first name and surname components.
    Handles compound surnames like 'de la' and multi-word first names.
    """
    if pd.isnull(name):
        return np.nan, np.nan

    name_parts = name.split()
    if len(name_parts) > 2:
        if "de la" in name:
            return name_parts[0], " ".join(name_parts[1:])
        return " ".join(name_parts[0:2]), " ".join(name_parts[2:])
    return name_parts[0], name_parts[1] if len(name_parts) > 1 else np.nan
ENTIERROS_HARMONIZED[['husband_name', 'husband_lastname']] = ENTIERROS_HARMONIZED['husband_name'].apply(name_splitter).apply(pd.Series)
ENTIERROS_HARMONIZED[['wife_name', 'wife_lastname']] = ENTIERROS_HARMONIZED['wife_name'].apply(name_splitter).apply(pd.Series)

6. Geographic Data Extraction

Place Name Recognition

Extract place name entities from text fields using Named Entity Recognition (NER). Historical records often contain place names embedded within longer descriptive text (e.g., “natural de la villa de Sondondo”), which need to be identified and extracted for geographic analysis.

from actions.extractors import placeRecognition

extractor = placeRecognition.PlaceExtractor()
bautismos_place_columns = [
    'baptized_birth_place', 'event_place', 'event_geographic_descriptor_1',
        'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
        'event_geographic_descriptor_4'
]

BAUTISMOS_PLACES_RAW = BAUTISMOS_HARMONIZED[bautismos_place_columns]

for col in bautismos_place_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = extractor.extract_places_per_row(BAUTISMOS_HARMONIZED[col])

BAUTISMOS_HARMONIZED[bautismos_place_columns]
baptized_birth_place event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4
0 NaN Pampamarca Aucara Pampamarca NaN NaN
1 NaN Pampamarca Aucara Pampamarca NaN NaN
2 NaN Pampamarca Aucara Pampamarca NaN NaN
3 NaN Aucara Aucara NaN NaN NaN
4 NaN Aucara Aucara NaN NaN NaN
... ... ... ... ... ... ...
6336 NaN Aucará Aucará Aucará NaN NaN
6337 NaN Aucará Aucará Aucará NaN NaN
6338 NaN Aucará Aucará Mayobamba NaN NaN
6339 NaN Aucará Aucará Huaicahuacho NaN NaN
6340 NaN Aucará Aucará Chacralla NaN NaN

6340 rows × 6 columns

Marriage Place Extraction

matrimonios_place_columns = [
    'husband_birth_place',
       'husband_resident_in', 
       'wife_birth_place', 'wife_resident_in', 
       'event_place', 'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
       'event_geographic_descriptor_3', 'event_geographic_descriptor_4',
       'event_geographic_descriptor_5', 'event_geographic_descriptor_6'
]

MATRIMONIOS_PLACES_RAW = MATRIMONIOS_HARMONIZED[matrimonios_place_columns]

for col in matrimonios_place_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = extractor.extract_places_per_row(MATRIMONIOS_HARMONIZED[col])

MATRIMONIOS_HARMONIZED[matrimonios_place_columns]
husband_birth_place husband_resident_in wife_birth_place wife_resident_in event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 event_geographic_descriptor_5 event_geographic_descriptor_6
0 Ciudad de Huamanga Aucara NaN NaN Aucara Aucara Huamanga Coracora NaN NaN NaN
1 NaN NaN NaN NaN Aucara Aucara Colca NaN NaN NaN NaN
2 Pampamarca NaN Pampamarca NaN Aucara Aucara Pampamarca NaN NaN NaN NaN
3 Pampamarca NaN Pampamarca NaN Pampamarca|santa iglesia Aucara Pampamarca NaN NaN NaN NaN
4 NaN NaN NaN NaN Pampamarca|santa iglesia Aucara Pampamarca NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
1714 Pampamarca NaN Pampamarca NaN Pampamarca Aucara Pampamarca NaN NaN NaN NaN
1715 Chacralla NaN Chacralla NaN Chacralla, iglesia vice-parroquial Aucara Chacralla NaN NaN NaN NaN
1716 Chacralla NaN Chacralla NaN Chacralla, iglesia vice-parroquial Aucara Chacralla NaN NaN NaN NaN
1717 NaN Aucara NaN Aucara Aucara Aucara Queca NaN NaN NaN NaN
1718 Pampamarca NaN Pampamarca NaN Pampamarca Aucara Pampamarca NaN NaN NaN NaN

1719 rows × 11 columns

Burial Place Extraction

entierros_place_columns = [
    'event_place', 'deceased_birth_place', 'burial_place', 'event_geographic_descriptor_1',
    'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
    'event_geographic_descriptor_4'
]

ENTIERROS_PLACES_RAW = ENTIERROS_HARMONIZED[entierros_place_columns]

for col in entierros_place_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = extractor.extract_places_per_row(ENTIERROS_HARMONIZED[col])

ENTIERROS_HARMONIZED[entierros_place_columns]
event_place deceased_birth_place burial_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4
0 NaN NaN NaN Aucará Lucanas NaN NaN
1 NaN NaN NaN Aucará Lucanas NaN NaN
2 NaN NaN NaN Aucará Lucanas NaN NaN
3 NaN NaN NaN Aucará Lucanas NaN NaN
4 NaN NaN NaN Aucará Lucanas NaN NaN
... ... ... ... ... ... ... ...
2193 Aucara Santa Ana de Aucara NaN Aucara Santa Ana de Aucara NaN NaN
2194 Aucara Pampamarca NaN Aucara Pampamarca NaN NaN
2195 Aucara Santa Ana de Aucara NaN Aucara Santa Ana de Aucara NaN NaN
2196 Aucara Aucara NaN Aucara NaN NaN NaN
2197 Aucara Aucara NaN Aucara NaN NaN NaN

2192 rows × 7 columns

### Raw Place Export

# Extract all unique place names from all datasets and save for subsequent normalization
# in the place mapping notebook (2_placeMapping.ipynb)
from utils import UniqueValues

uniqueValues = UniqueValues.UniqueValuesExtractor(
    [BAUTISMOS_PLACES_RAW, MATRIMONIOS_PLACES_RAW, ENTIERROS_PLACES_RAW]
).get_unique_values(return_dataframe=True)

uniqueValues.to_csv("../data/raw/raw_places.csv", index=False) # type: ignore

7. Export Cleaned Data

All cleaned datasets are exported to the data/clean/ directory for use in subsequent analysis notebooks (term extraction, person entity creation, and record linkage).

clean_data_folder = Path("../data/clean/")

# Fill NaN values for consistency
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.fillna(value=np.nan)
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.fillna(value=np.nan)
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.fillna(value=np.nan)

BAUTISMOS_HARMONIZED.to_csv(clean_data_folder / "bautismos_clean.csv", index=False)
MATRIMONIOS_HARMONIZED.to_csv(clean_data_folder / "matrimonios_clean.csv", index=False)
ENTIERROS_HARMONIZED.to_csv(clean_data_folder / "entierros_clean.csv", index=False)

Summary

This notebook successfully cleaned and standardized three sacramental record datasets:

Key transformations applied: 1. Column Harmonization: Unified schema across heterogeneous historical sources 2. Null Standardization: Converted diverse missing data representations to np.nan 3. Temporal Normalization: Standardized dates to ISO format and inferred birth dates from ages 4. Name Standardization: Applied consistent name cleaning and inferred missing surnames 5. Geographic Extraction: Identified place name entities using NER 6. Data Export: Saved cleaned datasets to data/clean/ for downstream analysis

Outputs: - bautismos_clean.csv: {len(BAUTISMOS_HARMONIZED)} baptism records - matrimonios_clean.csv: {len(MATRIMONIOS_HARMONIZED)} marriage records - entierros_clean.csv: {len(ENTIERROS_HARMONIZED)} burial records - raw_places.csv: Raw place names for normalization in notebook 2

The cleaned data is now ready for term extraction analysis (notebook 3), place name normalization (notebook 2), and person entity creation (notebook 4).