Data Cleaning

This notebook performs comprehensive data cleaning and standardization of historical sacramental records from Sondondo. The process includes column harmonization, null value handling, temporal normalization, name standardization, and geographic data extraction, preparing the datasets for subsequent person entity extraction and probabilistic record linkage.

# Libraries
import json
import pandas as pd
import numpy as np
import re
from pathlib import Path

1. Data Loading

The raw datasets are stored in the data/raw directory and include three types of sacramental records:

bautismos.csv: Baptism records
matrimonios.csv: Marriage records
entierros.csv: Burial records

BAUTISMOS_RAW = pd.read_csv("../data/raw/bautismos.csv")
MATRIMONIOS_RAW = pd.read_csv("../data/raw/matrimonios.csv")
ENTIERROS_RAW = pd.read_csv("../data/raw/entierros.csv")

BAUTISMOS_RAW.head()

	Secuencia	Unidad Documental Compuesta (a la que pertenece)	Identificador (es recomendable seguir una secuencia numeral como la mostrada en los ejemplos)	Título (incluir un título breve para cada documento)	Folio inicial del documento (convertir como se muestra abajo)	Folio final del documento (convertir como se muestra abajo)	Imagen inicial (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes)	Imagen final (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes)	Tipo de evento	Fecha aaaa-mm-dd	...	Condición de la madrina	Lugar de bautizo	Notas adicionales del documento	Descriptor Geográfico 1	Descriptor Geográfico 2	Descriptor Geográfico 3	Descriptor Geográfico 4	5	Características físicas (Estado de conservación de los materiales físicos)	Historia de revisión (de los materiales digitalizados)
0	1.0	APAucará LB L001	B001	Bautizo. Domingo. Tributarios	3r	3r	IMG_7000a	IMG_7000a	Bautizo	1790-10-04	...	NaN	Pampamarca, iglesia	NaN	Aucara	Pampamarca	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
1	2.0	APAucará LB L001	B002	Bautizo. Dominga. Tributarios	3r	3r	IMG_7000a	IMG_7000a	Bautizo	1790-10-06	...	NaN	Pampamarca, iglesia	NaN	Aucara	Pampamarca	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
2	3.0	APAucará LB L001	B003	Bautizo. Bartola. Tributarios	3r	3r	IMG_7000a	IMG_7000a	Bautizo	1790-10-07	...	NaN	Pampamarca, iglesia	NaN	Aucara	Pampamarca	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
3	4.0	APAucará LB L001	B004	Bautizo. Francisca	3v	3v	IMG_7000b	IMG_7000b	Bautizo	1790-10-20	...	NaN	Aucara, iglesia	Abreviatura poco visible en el margen	Aucara	NaN	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
4	5.0	APAucará LB L001	B005	Bautizo. Pedro	3v	3v	IMG_7000b	IMG_7000b	Bautizo	1790-10-20	...	NaN	Aucara, iglesia	Margen roto y manchado de tinta	Aucara	NaN	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023

5 rows × 36 columns

2. Column Harmonization

Standardize column names across datasets to ensure consistency. Different historical sources may use different field names for equivalent data (e.g., “nombre” vs “name”), so we map these variants to a unified schema.

Schema Mapping

Column mappings are defined in JSON files located in data/mappings/, which specify how raw column names should be renamed to match our standardized schema.

# Import the ColumnManager utility for applying schema mappings
from utils.ColumnManager import ColumnManager

bautismoMapping = Path("../data/mappings/bautismosMapping.json")
matrimonioMapping = Path("../data/mappings/matrimoniosMapping.json")
entierroMapping = Path("../data/mappings/entierrosMapping.json")

column_manager = ColumnManager()

BAUTISMOS_HARMONIZED = column_manager.harmonize_columns(BAUTISMOS_RAW, bautismoMapping)
MATRIMONIOS_HARMONIZED = column_manager.harmonize_columns(MATRIMONIOS_RAW, matrimonioMapping)
ENTIERROS_HARMONIZED = column_manager.harmonize_columns(ENTIERROS_RAW, entierroMapping)

BAUTISMOS_HARMONIZED.head()

	id	file	identifier	title	start_folio	end_folio	start_image	end_image	event_type	event_date	...	godmother_social_condition	event_place	event_additional_notes	event_geographic_descriptor_1	event_geographic_descriptor_2	event_geographic_descriptor_3	event_geographic_descriptor_4	event_other	record_physical_characteristics	revision_history
0	1.0	APAucará LB L001	B001	Bautizo. Domingo. Tributarios	3r	3r	IMG_7000a	IMG_7000a	Bautizo	1790-10-04	...	NaN	Pampamarca, iglesia	NaN	Aucara	Pampamarca	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
1	2.0	APAucará LB L001	B002	Bautizo. Dominga. Tributarios	3r	3r	IMG_7000a	IMG_7000a	Bautizo	1790-10-06	...	NaN	Pampamarca, iglesia	NaN	Aucara	Pampamarca	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
2	3.0	APAucará LB L001	B003	Bautizo. Bartola. Tributarios	3r	3r	IMG_7000a	IMG_7000a	Bautizo	1790-10-07	...	NaN	Pampamarca, iglesia	NaN	Aucara	Pampamarca	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
3	4.0	APAucará LB L001	B004	Bautizo. Francisca	3v	3v	IMG_7000b	IMG_7000b	Bautizo	1790-10-20	...	NaN	Aucara, iglesia	Abreviatura poco visible en el margen	Aucara	NaN	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023
4	5.0	APAucará LB L001	B005	Bautizo. Pedro	3v	3v	IMG_7000b	IMG_7000b	Bautizo	1790-10-20	...	NaN	Aucara, iglesia	Margen roto y manchado de tinta	Aucara	NaN	NaN	NaN	NaN	Regular	Registrado por Edwin Gonzales en 2023

5 rows × 36 columns

Column Selection and Filtering

After harmonization, we reduce each dataset to only the columns relevant for person entity extraction and record linkage. This simplifies downstream processing and removes fields that don’t contribute to linkage (e.g., administrative notes, scribal annotations).

# Load the mapping of useful columns for each record type
# This defines which fields are relevant for entity extraction and linkage
useful_columns = json.load(open("../data/mappings/usefulColumnsMapping.json"))

# Filter baptism records to useful columns only
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED[useful_columns['bautizo']]

# Remove any columns that are entirely empty
BAUTISMOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

BAUTISMOS_HARMONIZED.columns

Index(['file', 'identifier', 'event_type', 'event_date', 'baptized_name',
       'baptized_birth_place', 'baptized_birth_date',
       'baptized_legitimacy_status', 'father_name', 'father_lastname',
       'father_social_condition', 'mother_name', 'mother_lastname',
       'mother_social_condition', 'parents_social_condition', 'godfather_name',
       'godfather_lastname', 'godfather_social_condition', 'godmother_name',
       'godmother_lastname', 'godmother_social_condition', 'event_place',
       'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
       'event_geographic_descriptor_3', 'event_geographic_descriptor_4'],
      dtype='object')

# Filter marriage records to useful columns only
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED[useful_columns['matrimonio']]

# Remove any columns that are entirely empty
MATRIMONIOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

MATRIMONIOS_HARMONIZED.columns

Index(['file', 'identifier', 'event_type', 'event_date', 'husband_name',
       'husband_lastname', 'husband_social_condition',
       'husband_marital_status', 'husband_birth_date', 'husband_birth_place',
       'husband_resident_in', 'husband_legitimacy_status',
       'husband_father_name', 'husband_father_lastname',
       'husband_father_social_condition', 'husband_mother_name',
       'husband_mother_lastname', 'husband_mother_social_condition',
       'wife_name', 'wife_lastname', 'wife_social_condition',
       'wife_marital_status', 'wife_birth_date', 'wife_birth_place',
       'wife_resident_in', 'wife_legitimacy_status', 'wife_father_name',
       'wife_father_lastname', 'wife_father_social_condition',
       'wife_mother_name', 'wife_mother_lastname',
       'wife_mother_social_condition', 'godparent_1_name',
       'godparent_1_lastname', 'godparent_1_social_condition',
       'godparent_2_name', 'godparent_2_lastname',
       'godparent_2_social_condition', 'godparent_3_name',
       'godparent_3_lastname', 'witness_1_name', 'witness_1_lastname',
       'witness_2_name', 'witness_2_lastname', 'witness_3_name',
       'witness_3_lastname', 'witness_4_name', 'witness_4_lastname',
       'event_place', 'event_geographic_descriptor_1',
       'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
       'event_geographic_descriptor_4', 'event_geographic_descriptor_5',
       'event_geographic_descriptor_6'],
      dtype='object')

# Filter burial records to useful columns only
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED[useful_columns['entierro']]

# Remove any columns that are entirely empty
ENTIERROS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

ENTIERROS_HARMONIZED.columns

Index(['file', 'identifier', 'event_type', 'event_date', 'doctrine',
       'event_place', 'deceased_name', 'deceased_lastname',
       'deceased_birth_date', 'deceased_birth_place',
       'deceased_social_condition', 'deceased_marital_status',
       'deceased_legitimacy_status', 'father_name', 'father_lastname',
       'mother_name', 'mother_lastname', 'husband_name', 'wife_name',
       'burial_place', 'event_geographic_descriptor_1',
       'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
       'event_geographic_descriptor_4'],
      dtype='object')

3. Data Quality Preprocessing

Null Value Standardization

Standardize the representation of missing data by converting various placeholder strings (e.g., ‘-’, ‘n/a’, ‘null’) to proper np.nan values. This ensures consistent missing data handling throughout the analysis pipeline.

# Define helper function to replace placeholder strings with numpy NaN
def replace_empty_with_na(df):
    """
    Replace placeholder strings with np.nan in string columns only.
    Handles common representations of missing data found in historical records.
    """
    placeholders = {'', '-', '--', 'n/a', 'na', 'null', 'None'}

    def clean_cell(val):
        if isinstance(val, str) and val.strip().lower() in placeholders:
            return np.nan
        return val

    return df.map(clean_cell)

BAUTISMOS_HARMONIZED = replace_empty_with_na(BAUTISMOS_HARMONIZED)
MATRIMONIOS_HARMONIZED = replace_empty_with_na(MATRIMONIOS_HARMONIZED)
ENTIERROS_HARMONIZED = replace_empty_with_na(ENTIERROS_HARMONIZED)

Empty Row Removal

Remove rows where all fields are empty, as these represent blank entries or page breaks in the original record books rather than actual sacramental events.

BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.dropna(how='all')
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.dropna(how='all')
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.dropna(how='all')

Event Type Corrections

Baptism Event Type Errors

Two records in the baptism dataset have incorrect event types that need correction or removal.

BAUTISMOS_HARMONIZED.loc[~(BAUTISMOS_HARMONIZED['event_type'] == 'Bautizo')]

	file	identifier	event_type	event_date	baptized_name	baptized_birth_place	baptized_birth_date	baptized_legitimacy_status	father_name	father_lastname	...	godfather_lastname	godfather_social_condition	godmother_name	godmother_lastname	godmother_social_condition	event_place	event_geographic_descriptor_1	event_geographic_descriptor_2	event_geographic_descriptor_3	event_geographic_descriptor_4
1900	APAucará LB L001	B1901	Concurso	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5414	APAucará LB L003	B1120	44444	1869-09-17	Juan	Yshua	12 días	Hijo natural	Ygnacio	Sevidola	...	Origuela	NaN	NaN	NaN	NaN	Aucará	Aucará	NaN	NaN	NaN

2 rows × 26 columns

Event 1900 is basically an empty record, so we will drop it. While event 5414 requires correction:

BAUTISMOS_HARMONIZED.drop(1900, inplace=True)

BAUTISMOS_HARMONIZED.loc[5414, 'event_type'] = 'Bautizo'
BAUTISMOS_HARMONIZED.iloc[5414]

file                             APAucará LB L003
identifier                                  B1121
event_type                                Bautizo
event_date                             1869-09-17
baptized_name                        Maria Isabel
baptized_birth_place                    Chacralla
baptized_birth_date                        4 días
baptized_legitimacy_status          Hija legítima
father_name                                  José
father_lastname                           Cerrano
father_social_condition                       NaN
mother_name                               Manuela
mother_lastname                            Avalos
mother_social_condition                       NaN
parents_social_condition                      NaN
godfather_name                                NaN
godfather_lastname                            NaN
godfather_social_condition                    NaN
godmother_name                            Brijida
godmother_lastname                            NaN
godmother_social_condition                    NaN
event_place                                Aucará
event_geographic_descriptor_1              Aucará
event_geographic_descriptor_2                 NaN
event_geographic_descriptor_3                 NaN
event_geographic_descriptor_4                 NaN
Name: 5415, dtype: object

4. Temporal Data Normalization

Date Standardization

Ensure all dates are in a consistent ISO format (YYYY-MM-DD) across all datasets. Historical records often contain dates in various formats (e.g., “15 de Marzo de 1850”, “3/15/1850”) that need to be parsed and standardized.

from actions.normalizers.DatesNormalizer import DateNormalizer

BAUTISMOS_HARMONIZED['event_date'] = DateNormalizer(BAUTISMOS_HARMONIZED['event_date']).normalize()
BAUTISMOS_HARMONIZED['event_date']

0       1790-10-04
1       1790-10-06
2       1790-10-07
3       1790-10-20
4       1790-10-20
           ...    
6336    1888-12-10
6337    1888-12-11
6338    1888-12-12
6339    1888-12-15
6340    1888-12-16
Name: event_date, Length: 6340, dtype: object

MATRIMONIOS_HARMONIZED['event_date'] = DateNormalizer(MATRIMONIOS_HARMONIZED['event_date']).normalize()
MATRIMONIOS_HARMONIZED['event_date']

0       1816-12-06
1       1816-12-12
2       1817-03-05
3       1817-03-10
4       1817-03-12
           ...    
1714    1907-10-27
1715    1908-01-13
1716    1908-01-15
1717    1908-02-15
1718    1908-03-17
Name: event_date, Length: 1719, dtype: object

ENTIERROS_HARMONIZED['event_date'] = DateNormalizer(ENTIERROS_HARMONIZED['event_date']).normalize()
ENTIERROS_HARMONIZED['event_date']

0       1846-10-06
1       1846-10-07
2       1846-11-02
3       1846-12-08
4       1847-02-23
           ...    
2193    1920-10-12
2194    1920-10-19
2195    1920-10-19
2196    1920-10-20
2197    1920-10-21
Name: event_date, Length: 2192, dtype: object

Birth Date Inference

When ages are recorded instead of birth dates, infer birth dates by subtracting the recorded age from the event date. This creates comparable temporal data across records that originally used different date recording conventions.

from actions.generators.AgeInferrer import AgeInferrer

BAUTISMOS_HARMONIZED['baptized_birth_date'] = AgeInferrer(BAUTISMOS_HARMONIZED['event_date']).infer_all(BAUTISMOS_HARMONIZED['baptized_birth_date'])
BAUTISMOS_HARMONIZED[['event_date', 'baptized_birth_date']]

	event_date	baptized_birth_date
0	1790-10-04	1790-08-04
1	1790-10-06	1790-08-04
2	1790-10-07	1790-08-04
3	1790-10-20	1790-10-15
4	1790-10-20	1790-10-19
...	...	...
6336	1888-12-10	1888-12-09
6337	1888-12-11	1888-12-07
6338	1888-12-12	1888-12-06
6339	1888-12-15	1888-11-30
6340	1888-12-16	1888-12-01

6340 rows × 2 columns

### Date Validation

# Check for chronologically inconsistent dates: event_date should be after baptized_birth_date
invalid_mask = pd.to_datetime(BAUTISMOS_HARMONIZED['event_date'], errors='coerce') < pd.to_datetime(BAUTISMOS_HARMONIZED['baptized_birth_date'], errors='coerce')
if invalid_mask.any():
    print("Found invalid records:")
    print(BAUTISMOS_HARMONIZED[invalid_mask][['event_date', 'baptized_birth_date']])

Found invalid records:
      event_date baptized_birth_date
135   1792-03-29          1792-04-08
290   1794-01-01          1794-01-27
671   1797-07-15          1797-07-24
2814  1900-04-01          1900-04-09

Unfortunately, these records were incorrectly recorded. To fix this, it is necessary to check with the original records.

MATRIMONIOS_HARMONIZED['husband_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['husband_birth_date'])
MATRIMONIOS_HARMONIZED['wife_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['wife_birth_date'])
MATRIMONIOS_HARMONIZED[['event_date', 'husband_birth_date', 'wife_birth_date']]

	event_date	husband_birth_date	wife_birth_date
0	1816-12-06	NaN	NaN
1	1816-12-12	NaN	NaN
2	1817-03-05	NaN	NaN
3	1817-03-10	NaN	NaN
4	1817-03-12	NaN	NaN
...	...	...	...
1714	1907-10-27	1882-11-01	1880-11-01
1715	1908-01-13	1880-01-19	1879-01-19
1716	1908-01-15	1886-01-19	1888-01-19
1717	1908-02-15	1883-02-20	1886-02-19
1718	1908-03-17	1880-03-23	1879-03-24

1719 rows × 3 columns

ENTIERROS_HARMONIZED['deceased_birth_date'] = AgeInferrer(ENTIERROS_HARMONIZED['event_date']).infer_all(ENTIERROS_HARMONIZED['deceased_birth_date'])
ENTIERROS_HARMONIZED[['event_date', 'deceased_birth_date']]

	event_date	deceased_birth_date
0	1846-10-06	NaN
1	1846-10-07	1821-10-13
2	1846-11-02	1766-11-21
3	1846-12-08	1806-12-18
4	1847-02-23	1797-03-06
...	...	...
2193	1920-10-12	1920-02-15
2194	1920-10-19	1870-10-31
2195	1920-10-19	1913-10-21
2196	1920-10-20	1919-10-21
2197	1920-10-21	1920-10-19

2192 rows × 2 columns

5. Name Standardization

General Name Normalization

Standardize name fields to ensure consistency for matching. This includes normalizing whitespace, removing punctuation, handling abbreviations, and standardizing capitalization.

from actions.normalizers.NamesNormalizer import NamesNormalizer

# Define helper function to apply name normalization to a series
def normalize_names_columns(series):
    """Apply NamesNormalizer to clean and standardize a series of name values."""
    namesManager = NamesNormalizer()
    return namesManager.clean_series(series)

names_columns = [
    'baptized_name', 
    'father_name', 'father_lastname',
    'mother_name', 'mother_lastname',
    'godfather_name', 'godfather_lastname', 
    'godmother_name', 'godmother_lastname',
]

for col in names_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = normalize_names_columns(BAUTISMOS_HARMONIZED[col])

BAUTISMOS_HARMONIZED[names_columns]

	baptized_name	father_name	father_lastname	mother_name	mother_lastname	godfather_name	godfather_lastname	godmother_name	godmother_lastname
0	domingo	lucas	ayquipa	sevastiana	quispe	vicente	guamani	NaN	NaN
1	dominga	juan	lulia	jospha	gomes	ignacio	varientos	NaN	NaN
2	bartola	jacinto	quispe	juliana	chinchay	NaN	NaN	rotonda	pocco
3	francisca	juan	cuebas	clemenzia	manco	NaN	NaN	ysabel	guillen
4	pedro	santos	manxo	baleriana	arango	NaN	NaN	josefa	santiago
...	...	...	...	...	...	...	...	...	...
6336	leocadio	miguel	pacheco	rosa	huarcaya	josé julián	bendezú	NaN	NaN
6337	mariano concepcion	facundo	vega	silvestra	urbano	fernando	mancco	NaN	NaN
6338	ambrosio	ysidro	ccasane	rita	palomino	juan	tito	NaN	NaN
6339	francisco	mariano	lopez	leocadia	medina	feliciano	dias	NaN	NaN
6340	laureana	bernarda	champa	rufina	lopez	NaN	NaN	manuela	de la cruz

6340 rows × 9 columns

Dataset-Specific Name Processing

Baptisms: Surname Inference

Baptized individuals’ surnames were not directly recorded in the original registers. We infer the surname from the father’s surname (following patrilineal naming conventions), with the mother’s surname as a fallback if the father’s is unavailable.

BAUTISMOS_HARMONIZED['baptized_lastname'] = BAUTISMOS_HARMONIZED.apply(
    lambda row: row['father_lastname'] if pd.notna(row['father_lastname']) else (
        row['mother_lastname'] if pd.notna(row['mother_lastname']) else 'na'
    ),
    axis=1
)
BAUTISMOS_HARMONIZED['baptized_lastname']

0       ayquipa
1         lulia
2        quispe
3        cuebas
4         manxo
         ...   
6336    pacheco
6337       vega
6338    ccasane
6339      lopez
6340     champa
Name: baptized_lastname, Length: 6340, dtype: object

Marriages: Name Field Normalization

Marriage records contain numerous name fields for both spouses, their parents, godparents, and witnesses. All of these require standardization.

matrimonios_names_columns = [
    'husband_name', 'husband_lastname', 
       'husband_father_name', 'husband_father_lastname', 
       'husband_mother_name', 'husband_mother_lastname',
       'wife_name', 'wife_lastname',
       'wife_father_name', 'wife_father_lastname',
       'wife_mother_name', 'wife_mother_lastname', 
       'godparent_1_name', 'godparent_1_lastname',
       'godparent_2_name', 'godparent_2_lastname', 
       'godparent_3_name', 'godparent_3_lastname', 
       'witness_1_name', 'witness_1_lastname', 
       'witness_2_name', 'witness_2_lastname',
       'witness_3_name', 'witness_3_lastname', 
       'witness_4_name', 'witness_4_lastname'
]

for col in matrimonios_names_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = normalize_names_columns(MATRIMONIOS_HARMONIZED[col])

MATRIMONIOS_HARMONIZED[matrimonios_names_columns]

	husband_name	husband_lastname	husband_father_name	husband_father_lastname	husband_mother_name	husband_mother_lastname	wife_name	wife_lastname	wife_father_name	wife_father_lastname	...	godparent_3_name	godparent_3_lastname	witness_1_name	witness_1_lastname	witness_2_name	witness_2_lastname	witness_3_name	witness_3_lastname	witness_4_name	witness_4_lastname
0	josé manl manuel	de la roca	acencio	roca	leonor	guerrero	juana	rodrigues	pedro	rodrigues	...	NaN	NaN	agustin	castro	mariano	castro	juan	baldes	NaN	NaN
1	esteban	castillo	matheo	castillo	ma maria	torres	ambrocia	tasqui	pedro	tasqui	...	NaN	NaN	pedro	manco	carlos	canto	pedro	guamani	NaN	NaN
2	alexandro	ramires	leonor	romani	franca francisca	paucar	sipriana	coillo	cristobal	coillo	...	NaN	NaN	marcelo	llamuca	julian	urbano	antonio	urbano	NaN	NaN
3	jose	cuchu	acencio	cuchu	baleriana	antay	cacimira	flores	NaN	NaN	...	NaN	NaN	pablo	roque	antonio	urbano	cristobal	coillo	NaN	NaN
4	domingo	tito	NaN	NaN	marcela	guauya	petrona	guallpatuiru	agustin	guallpatuiru	...	NaN	NaN	marcelo	llamuca	antonio	guamani	mariano	guallpatuiru	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1714	patrocinio	chinchay	miguel	chinchay	andrea	polanco	lorenza	quispe	gervacio	quispe	...	NaN	NaN	juan de dios	barrientos	manuel	espinosa	crisostomo	pumarino	NaN	NaN
1715	gerónimo	cucho	ambrocio	cucho	gertrudis	serrano	teresa	jimenes	aniseto	jimenes	...	NaN	NaN	victor	saravia	mateo	aiquipa	felix	cucho	NaN	NaN
1716	josé	coro	felix	coro	natividad	cucho	emilia	huamani	patricio	huamani	...	NaN	NaN	pablo	de la cruz	victor	saravia	marcelo	ramos	NaN	NaN
1717	pedro	gutierres	ruperto	gutierrez	micaila	oscco	juliana	huarcaya	hilario	huarcaya	...	NaN	NaN	rafael	delgado	josé	vivanco	agustin	vicente	NaN	NaN
1718	benito	sanchez	tiburcio	sanchez	lorenza	aymi	teresa	poma	bernardo	poma	...	NaN	NaN	santos	espejo	felix	sanches	mariano	urbano	NaN	NaN

1719 rows × 26 columns

entierros_names_columns = [
    "deceased_name", "deceased_lastname",
    "father_name", "father_lastname",
    "mother_name", "mother_lastname",
    "husband_name", "wife_name",
]

for col in entierros_names_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = normalize_names_columns(ENTIERROS_HARMONIZED[col])

ENTIERROS_HARMONIZED[entierros_names_columns]

	deceased_name	deceased_lastname	father_name	father_lastname	mother_name	mother_lastname	husband_name	wife_name
0	julian	xavies	NaN	NaN	NaN	NaN	NaN	mercedes lupa
1	joce	raime	NaN	NaN	NaN	NaN	NaN	francisca cucho
2	martina	condori	NaN	NaN	NaN	NaN	luciano ccoyllo	NaN
3	dorotea	ccoyllo	NaN	NaN	NaN	NaN	josé espinosa	NaN
4	maría	romani	NaN	NaN	NaN	NaN	mariano huallpatuiro	NaN
...	...	...	...	...	...	...	...	...
2193	nieves	huallpatuero	patrocinio	huallpatuero	teresa	urbano	NaN	NaN
2194	sinforiano	huamani	eustaquio	huamani	martina	llamoca	NaN	gregoria ccoillo
2195	salomé	condori	sebastian	condori	anacla	roque	NaN	NaN
2196	maría	chinchay	abdon	chinchay	floriza	lopez	NaN	NaN
2197	fortunato	florez	cecilio	florez	guadalupe	ramos	NaN	NaN

2192 rows × 8 columns

Burials: Name Field Normalization

Burial records contain name fields for the deceased and their family members.

Burials: Full Name Splitting

In burial records, deceased spouses (husband/wife) were recorded as full names in a single field. We need to split these into separate first name and surname fields for consistency with other datasets.

# Define helper function to split full names into first name and surname
def name_splitter(name):
    """
    Split a full name into first name and surname components.
    Handles compound surnames like 'de la' and multi-word first names.
    """
    if pd.isnull(name):
        return np.nan, np.nan

    name_parts = name.split()
    if len(name_parts) > 2:
        if "de la" in name:
            return name_parts[0], " ".join(name_parts[1:])
        return " ".join(name_parts[0:2]), " ".join(name_parts[2:])
    return name_parts[0], name_parts[1] if len(name_parts) > 1 else np.nan

ENTIERROS_HARMONIZED[['husband_name', 'husband_lastname']] = ENTIERROS_HARMONIZED['husband_name'].apply(name_splitter).apply(pd.Series)
ENTIERROS_HARMONIZED[['wife_name', 'wife_lastname']] = ENTIERROS_HARMONIZED['wife_name'].apply(name_splitter).apply(pd.Series)

6. Geographic Data Extraction

Place Name Recognition

Extract place name entities from text fields using Named Entity Recognition (NER). Historical records often contain place names embedded within longer descriptive text (e.g., “natural de la villa de Sondondo”), which need to be identified and extracted for geographic analysis.

from actions.extractors import placeRecognition

extractor = placeRecognition.PlaceExtractor()

bautismos_place_columns = [
    'baptized_birth_place', 'event_place', 'event_geographic_descriptor_1',
        'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
        'event_geographic_descriptor_4'
]

BAUTISMOS_PLACES_RAW = BAUTISMOS_HARMONIZED[bautismos_place_columns]

for col in bautismos_place_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = extractor.extract_places_per_row(BAUTISMOS_HARMONIZED[col])

BAUTISMOS_HARMONIZED[bautismos_place_columns]

	baptized_birth_place	event_place	event_geographic_descriptor_1	event_geographic_descriptor_2	event_geographic_descriptor_3	event_geographic_descriptor_4
0	NaN	Pampamarca	Aucara	Pampamarca	NaN	NaN
1	NaN	Pampamarca	Aucara	Pampamarca	NaN	NaN
2	NaN	Pampamarca	Aucara	Pampamarca	NaN	NaN
3	NaN	Aucara	Aucara	NaN	NaN	NaN
4	NaN	Aucara	Aucara	NaN	NaN	NaN
...	...	...	...	...	...	...
6336	NaN	Aucará	Aucará	Aucará	NaN	NaN
6337	NaN	Aucará	Aucará	Aucará	NaN	NaN
6338	NaN	Aucará	Aucará	Mayobamba	NaN	NaN
6339	NaN	Aucará	Aucará	Huaicahuacho	NaN	NaN
6340	NaN	Aucará	Aucará	Chacralla	NaN	NaN

6340 rows × 6 columns

Marriage Place Extraction

matrimonios_place_columns = [
    'husband_birth_place',
       'husband_resident_in', 
       'wife_birth_place', 'wife_resident_in', 
       'event_place', 'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
       'event_geographic_descriptor_3', 'event_geographic_descriptor_4',
       'event_geographic_descriptor_5', 'event_geographic_descriptor_6'
]

MATRIMONIOS_PLACES_RAW = MATRIMONIOS_HARMONIZED[matrimonios_place_columns]

for col in matrimonios_place_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = extractor.extract_places_per_row(MATRIMONIOS_HARMONIZED[col])

MATRIMONIOS_HARMONIZED[matrimonios_place_columns]

	husband_birth_place	husband_resident_in	wife_birth_place	wife_resident_in	event_place	event_geographic_descriptor_1	event_geographic_descriptor_2	event_geographic_descriptor_3	event_geographic_descriptor_4	event_geographic_descriptor_5	event_geographic_descriptor_6
0	Ciudad de Huamanga	Aucara	NaN	NaN	Aucara	Aucara	Huamanga	Coracora	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	Aucara	Aucara	Colca	NaN	NaN	NaN	NaN
2	Pampamarca	NaN	Pampamarca	NaN	Aucara	Aucara	Pampamarca	NaN	NaN	NaN	NaN
3	Pampamarca	NaN	Pampamarca	NaN	Pampamarca\|santa iglesia	Aucara	Pampamarca	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	Pampamarca\|santa iglesia	Aucara	Pampamarca	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...
1714	Pampamarca	NaN	Pampamarca	NaN	Pampamarca	Aucara	Pampamarca	NaN	NaN	NaN	NaN
1715	Chacralla	NaN	Chacralla	NaN	Chacralla, iglesia vice-parroquial	Aucara	Chacralla	NaN	NaN	NaN	NaN
1716	Chacralla	NaN	Chacralla	NaN	Chacralla, iglesia vice-parroquial	Aucara	Chacralla	NaN	NaN	NaN	NaN
1717	NaN	Aucara	NaN	Aucara	Aucara	Aucara	Queca	NaN	NaN	NaN	NaN
1718	Pampamarca	NaN	Pampamarca	NaN	Pampamarca	Aucara	Pampamarca	NaN	NaN	NaN	NaN

1719 rows × 11 columns

Burial Place Extraction

entierros_place_columns = [
    'event_place', 'deceased_birth_place', 'burial_place', 'event_geographic_descriptor_1',
    'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
    'event_geographic_descriptor_4'
]

ENTIERROS_PLACES_RAW = ENTIERROS_HARMONIZED[entierros_place_columns]

for col in entierros_place_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = extractor.extract_places_per_row(ENTIERROS_HARMONIZED[col])

ENTIERROS_HARMONIZED[entierros_place_columns]

	event_place	deceased_birth_place	burial_place	event_geographic_descriptor_1	event_geographic_descriptor_2	event_geographic_descriptor_3	event_geographic_descriptor_4
0	NaN	NaN	NaN	Aucará	Lucanas	NaN	NaN
1	NaN	NaN	NaN	Aucará	Lucanas	NaN	NaN
2	NaN	NaN	NaN	Aucará	Lucanas	NaN	NaN
3	NaN	NaN	NaN	Aucará	Lucanas	NaN	NaN
4	NaN	NaN	NaN	Aucará	Lucanas	NaN	NaN
...	...	...	...	...	...	...	...
2193	Aucara	Santa Ana de Aucara	NaN	Aucara	Santa Ana de Aucara	NaN	NaN
2194	Aucara	Pampamarca	NaN	Aucara	Pampamarca	NaN	NaN
2195	Aucara	Santa Ana de Aucara	NaN	Aucara	Santa Ana de Aucara	NaN	NaN
2196	Aucara	Aucara	NaN	Aucara	NaN	NaN	NaN
2197	Aucara	Aucara	NaN	Aucara	NaN	NaN	NaN

2192 rows × 7 columns

### Raw Place Export

# Extract all unique place names from all datasets and save for subsequent normalization
# in the place mapping notebook (2_placeMapping.ipynb)
from utils import UniqueValues

uniqueValues = UniqueValues.UniqueValuesExtractor(
    [BAUTISMOS_PLACES_RAW, MATRIMONIOS_PLACES_RAW, ENTIERROS_PLACES_RAW]
).get_unique_values(return_dataframe=True)

uniqueValues.to_csv("../data/raw/raw_places.csv", index=False) # type: ignore

7. Export Cleaned Data

All cleaned datasets are exported to the data/clean/ directory for use in subsequent analysis notebooks (term extraction, person entity creation, and record linkage).

clean_data_folder = Path("../data/clean/")

# Fill NaN values for consistency
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.fillna(value=np.nan)
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.fillna(value=np.nan)
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.fillna(value=np.nan)

BAUTISMOS_HARMONIZED.to_csv(clean_data_folder / "bautismos_clean.csv", index=False)
MATRIMONIOS_HARMONIZED.to_csv(clean_data_folder / "matrimonios_clean.csv", index=False)
ENTIERROS_HARMONIZED.to_csv(clean_data_folder / "entierros_clean.csv", index=False)

Summary

This notebook successfully cleaned and standardized three sacramental record datasets:

Key transformations applied: 1. Column Harmonization: Unified schema across heterogeneous historical sources 2. Null Standardization: Converted diverse missing data representations to np.nan 3. Temporal Normalization: Standardized dates to ISO format and inferred birth dates from ages 4. Name Standardization: Applied consistent name cleaning and inferred missing surnames 5. Geographic Extraction: Identified place name entities using NER 6. Data Export: Saved cleaned datasets to data/clean/ for downstream analysis

Outputs: - bautismos_clean.csv: {len(BAUTISMOS_HARMONIZED)} baptism records - matrimonios_clean.csv: {len(MATRIMONIOS_HARMONIZED)} marriage records - entierros_clean.csv: {len(ENTIERROS_HARMONIZED)} burial records - raw_places.csv: Raw place names for normalization in notebook 2

The cleaned data is now ready for term extraction analysis (notebook 3), place name normalization (notebook 2), and person entity creation (notebook 4).