# Libraries
import json
import pandas as pd
import numpy as np
import re
from pathlib import PathData Cleaning
This notebook performs comprehensive data cleaning and standardization of historical sacramental records from Sondondo. The process includes column harmonization, null value handling, temporal normalization, name standardization, and geographic data extraction, preparing the datasets for subsequent person entity extraction and probabilistic record linkage.
1. Data Loading
The raw datasets are stored in the data/raw directory and include three types of sacramental records:
bautismos.csv: Baptism recordsmatrimonios.csv: Marriage recordsentierros.csv: Burial records
BAUTISMOS_RAW = pd.read_csv("../data/raw/bautismos.csv")
MATRIMONIOS_RAW = pd.read_csv("../data/raw/matrimonios.csv")
ENTIERROS_RAW = pd.read_csv("../data/raw/entierros.csv")
BAUTISMOS_RAW.head()| Secuencia | Unidad Documental Compuesta (a la que pertenece) | Identificador (es recomendable seguir una secuencia numeral como la mostrada en los ejemplos) | Título (incluir un título breve para cada documento) | Folio inicial del documento (convertir como se muestra abajo) | Folio final del documento (convertir como se muestra abajo) | Imagen inicial (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) | Imagen final (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) | Tipo de evento | Fecha aaaa-mm-dd | ... | Condición de la madrina | Lugar de bautizo | Notas adicionales del documento | Descriptor Geográfico 1 | Descriptor Geográfico 2 | Descriptor Geográfico 3 | Descriptor Geográfico 4 | 5 | Características físicas (Estado de conservación de los materiales físicos) | Historia de revisión (de los materiales digitalizados) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | APAucará LB L001 | B001 | Bautizo. Domingo. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-04 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 1 | 2.0 | APAucará LB L001 | B002 | Bautizo. Dominga. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-06 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 2 | 3.0 | APAucará LB L001 | B003 | Bautizo. Bartola. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-07 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 3 | 4.0 | APAucará LB L001 | B004 | Bautizo. Francisca | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Abreviatura poco visible en el margen | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 4 | 5.0 | APAucará LB L001 | B005 | Bautizo. Pedro | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Margen roto y manchado de tinta | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
5 rows × 36 columns
2. Column Harmonization
Standardize column names across datasets to ensure consistency. Different historical sources may use different field names for equivalent data (e.g., “nombre” vs “name”), so we map these variants to a unified schema.
Schema Mapping
Column mappings are defined in JSON files located in data/mappings/, which specify how raw column names should be renamed to match our standardized schema.
# Import the ColumnManager utility for applying schema mappings
from utils.ColumnManager import ColumnManagerbautismoMapping = Path("../data/mappings/bautismosMapping.json")
matrimonioMapping = Path("../data/mappings/matrimoniosMapping.json")
entierroMapping = Path("../data/mappings/entierrosMapping.json")
column_manager = ColumnManager()
BAUTISMOS_HARMONIZED = column_manager.harmonize_columns(BAUTISMOS_RAW, bautismoMapping)
MATRIMONIOS_HARMONIZED = column_manager.harmonize_columns(MATRIMONIOS_RAW, matrimonioMapping)
ENTIERROS_HARMONIZED = column_manager.harmonize_columns(ENTIERROS_RAW, entierroMapping)
BAUTISMOS_HARMONIZED.head()| id | file | identifier | title | start_folio | end_folio | start_image | end_image | event_type | event_date | ... | godmother_social_condition | event_place | event_additional_notes | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | event_other | record_physical_characteristics | revision_history | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | APAucará LB L001 | B001 | Bautizo. Domingo. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-04 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 1 | 2.0 | APAucará LB L001 | B002 | Bautizo. Dominga. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-06 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 2 | 3.0 | APAucará LB L001 | B003 | Bautizo. Bartola. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-07 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 3 | 4.0 | APAucará LB L001 | B004 | Bautizo. Francisca | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Abreviatura poco visible en el margen | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 4 | 5.0 | APAucará LB L001 | B005 | Bautizo. Pedro | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Margen roto y manchado de tinta | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
5 rows × 36 columns
Column Selection and Filtering
After harmonization, we reduce each dataset to only the columns relevant for person entity extraction and record linkage. This simplifies downstream processing and removes fields that don’t contribute to linkage (e.g., administrative notes, scribal annotations).
# Load the mapping of useful columns for each record type
# This defines which fields are relevant for entity extraction and linkage
useful_columns = json.load(open("../data/mappings/usefulColumnsMapping.json"))# Filter baptism records to useful columns only
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED[useful_columns['bautizo']]
# Remove any columns that are entirely empty
BAUTISMOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)
BAUTISMOS_HARMONIZED.columnsIndex(['file', 'identifier', 'event_type', 'event_date', 'baptized_name',
'baptized_birth_place', 'baptized_birth_date',
'baptized_legitimacy_status', 'father_name', 'father_lastname',
'father_social_condition', 'mother_name', 'mother_lastname',
'mother_social_condition', 'parents_social_condition', 'godfather_name',
'godfather_lastname', 'godfather_social_condition', 'godmother_name',
'godmother_lastname', 'godmother_social_condition', 'event_place',
'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
'event_geographic_descriptor_3', 'event_geographic_descriptor_4'],
dtype='object')
# Filter marriage records to useful columns only
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED[useful_columns['matrimonio']]
# Remove any columns that are entirely empty
MATRIMONIOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)
MATRIMONIOS_HARMONIZED.columnsIndex(['file', 'identifier', 'event_type', 'event_date', 'husband_name',
'husband_lastname', 'husband_social_condition',
'husband_marital_status', 'husband_birth_date', 'husband_birth_place',
'husband_resident_in', 'husband_legitimacy_status',
'husband_father_name', 'husband_father_lastname',
'husband_father_social_condition', 'husband_mother_name',
'husband_mother_lastname', 'husband_mother_social_condition',
'wife_name', 'wife_lastname', 'wife_social_condition',
'wife_marital_status', 'wife_birth_date', 'wife_birth_place',
'wife_resident_in', 'wife_legitimacy_status', 'wife_father_name',
'wife_father_lastname', 'wife_father_social_condition',
'wife_mother_name', 'wife_mother_lastname',
'wife_mother_social_condition', 'godparent_1_name',
'godparent_1_lastname', 'godparent_1_social_condition',
'godparent_2_name', 'godparent_2_lastname',
'godparent_2_social_condition', 'godparent_3_name',
'godparent_3_lastname', 'witness_1_name', 'witness_1_lastname',
'witness_2_name', 'witness_2_lastname', 'witness_3_name',
'witness_3_lastname', 'witness_4_name', 'witness_4_lastname',
'event_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4', 'event_geographic_descriptor_5',
'event_geographic_descriptor_6'],
dtype='object')
# Filter burial records to useful columns only
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED[useful_columns['entierro']]
# Remove any columns that are entirely empty
ENTIERROS_HARMONIZED.dropna(axis=1, how='all', inplace=True)
ENTIERROS_HARMONIZED.columnsIndex(['file', 'identifier', 'event_type', 'event_date', 'doctrine',
'event_place', 'deceased_name', 'deceased_lastname',
'deceased_birth_date', 'deceased_birth_place',
'deceased_social_condition', 'deceased_marital_status',
'deceased_legitimacy_status', 'father_name', 'father_lastname',
'mother_name', 'mother_lastname', 'husband_name', 'wife_name',
'burial_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4'],
dtype='object')
3. Data Quality Preprocessing
Null Value Standardization
Standardize the representation of missing data by converting various placeholder strings (e.g., ‘-’, ‘n/a’, ‘null’) to proper np.nan values. This ensures consistent missing data handling throughout the analysis pipeline.
# Define helper function to replace placeholder strings with numpy NaN
def replace_empty_with_na(df):
"""
Replace placeholder strings with np.nan in string columns only.
Handles common representations of missing data found in historical records.
"""
placeholders = {'', '-', '--', 'n/a', 'na', 'null', 'None'}
def clean_cell(val):
if isinstance(val, str) and val.strip().lower() in placeholders:
return np.nan
return val
return df.map(clean_cell)BAUTISMOS_HARMONIZED = replace_empty_with_na(BAUTISMOS_HARMONIZED)
MATRIMONIOS_HARMONIZED = replace_empty_with_na(MATRIMONIOS_HARMONIZED)
ENTIERROS_HARMONIZED = replace_empty_with_na(ENTIERROS_HARMONIZED)Empty Row Removal
Remove rows where all fields are empty, as these represent blank entries or page breaks in the original record books rather than actual sacramental events.
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.dropna(how='all')
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.dropna(how='all')
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.dropna(how='all')Event Type Corrections
Baptism Event Type Errors
Two records in the baptism dataset have incorrect event types that need correction or removal.
BAUTISMOS_HARMONIZED.loc[~(BAUTISMOS_HARMONIZED['event_type'] == 'Bautizo')]| file | identifier | event_type | event_date | baptized_name | baptized_birth_place | baptized_birth_date | baptized_legitimacy_status | father_name | father_lastname | ... | godfather_lastname | godfather_social_condition | godmother_name | godmother_lastname | godmother_social_condition | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1900 | APAucará LB L001 | B1901 | Concurso | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5414 | APAucará LB L003 | B1120 | 44444 | 1869-09-17 | Juan | Yshua | 12 días | Hijo natural | Ygnacio | Sevidola | ... | Origuela | NaN | NaN | NaN | NaN | Aucará | Aucará | NaN | NaN | NaN |
2 rows × 26 columns
Event 1900 is basically an empty record, so we will drop it. While event 5414 requires correction:
BAUTISMOS_HARMONIZED.drop(1900, inplace=True)
BAUTISMOS_HARMONIZED.loc[5414, 'event_type'] = 'Bautizo'
BAUTISMOS_HARMONIZED.iloc[5414]file APAucará LB L003
identifier B1121
event_type Bautizo
event_date 1869-09-17
baptized_name Maria Isabel
baptized_birth_place Chacralla
baptized_birth_date 4 días
baptized_legitimacy_status Hija legítima
father_name José
father_lastname Cerrano
father_social_condition NaN
mother_name Manuela
mother_lastname Avalos
mother_social_condition NaN
parents_social_condition NaN
godfather_name NaN
godfather_lastname NaN
godfather_social_condition NaN
godmother_name Brijida
godmother_lastname NaN
godmother_social_condition NaN
event_place Aucará
event_geographic_descriptor_1 Aucará
event_geographic_descriptor_2 NaN
event_geographic_descriptor_3 NaN
event_geographic_descriptor_4 NaN
Name: 5415, dtype: object
4. Temporal Data Normalization
Date Standardization
Ensure all dates are in a consistent ISO format (YYYY-MM-DD) across all datasets. Historical records often contain dates in various formats (e.g., “15 de Marzo de 1850”, “3/15/1850”) that need to be parsed and standardized.
from actions.normalizers.DatesNormalizer import DateNormalizerBAUTISMOS_HARMONIZED['event_date'] = DateNormalizer(BAUTISMOS_HARMONIZED['event_date']).normalize()
BAUTISMOS_HARMONIZED['event_date']0 1790-10-04
1 1790-10-06
2 1790-10-07
3 1790-10-20
4 1790-10-20
...
6336 1888-12-10
6337 1888-12-11
6338 1888-12-12
6339 1888-12-15
6340 1888-12-16
Name: event_date, Length: 6340, dtype: object
MATRIMONIOS_HARMONIZED['event_date'] = DateNormalizer(MATRIMONIOS_HARMONIZED['event_date']).normalize()
MATRIMONIOS_HARMONIZED['event_date']0 1816-12-06
1 1816-12-12
2 1817-03-05
3 1817-03-10
4 1817-03-12
...
1714 1907-10-27
1715 1908-01-13
1716 1908-01-15
1717 1908-02-15
1718 1908-03-17
Name: event_date, Length: 1719, dtype: object
ENTIERROS_HARMONIZED['event_date'] = DateNormalizer(ENTIERROS_HARMONIZED['event_date']).normalize()
ENTIERROS_HARMONIZED['event_date']0 1846-10-06
1 1846-10-07
2 1846-11-02
3 1846-12-08
4 1847-02-23
...
2193 1920-10-12
2194 1920-10-19
2195 1920-10-19
2196 1920-10-20
2197 1920-10-21
Name: event_date, Length: 2192, dtype: object
Birth Date Inference
When ages are recorded instead of birth dates, infer birth dates by subtracting the recorded age from the event date. This creates comparable temporal data across records that originally used different date recording conventions.
from actions.generators.AgeInferrer import AgeInferrerBAUTISMOS_HARMONIZED['baptized_birth_date'] = AgeInferrer(BAUTISMOS_HARMONIZED['event_date']).infer_all(BAUTISMOS_HARMONIZED['baptized_birth_date'])
BAUTISMOS_HARMONIZED[['event_date', 'baptized_birth_date']]| event_date | baptized_birth_date | |
|---|---|---|
| 0 | 1790-10-04 | 1790-08-04 |
| 1 | 1790-10-06 | 1790-08-04 |
| 2 | 1790-10-07 | 1790-08-04 |
| 3 | 1790-10-20 | 1790-10-15 |
| 4 | 1790-10-20 | 1790-10-19 |
| ... | ... | ... |
| 6336 | 1888-12-10 | 1888-12-09 |
| 6337 | 1888-12-11 | 1888-12-07 |
| 6338 | 1888-12-12 | 1888-12-06 |
| 6339 | 1888-12-15 | 1888-11-30 |
| 6340 | 1888-12-16 | 1888-12-01 |
6340 rows × 2 columns
### Date Validation
# Check for chronologically inconsistent dates: event_date should be after baptized_birth_date
invalid_mask = pd.to_datetime(BAUTISMOS_HARMONIZED['event_date'], errors='coerce') < pd.to_datetime(BAUTISMOS_HARMONIZED['baptized_birth_date'], errors='coerce')
if invalid_mask.any():
print("Found invalid records:")
print(BAUTISMOS_HARMONIZED[invalid_mask][['event_date', 'baptized_birth_date']])Found invalid records:
event_date baptized_birth_date
135 1792-03-29 1792-04-08
290 1794-01-01 1794-01-27
671 1797-07-15 1797-07-24
2814 1900-04-01 1900-04-09
Unfortunately, these records were incorrectly recorded. To fix this, it is necessary to check with the original records.
MATRIMONIOS_HARMONIZED['husband_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['husband_birth_date'])
MATRIMONIOS_HARMONIZED['wife_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['wife_birth_date'])
MATRIMONIOS_HARMONIZED[['event_date', 'husband_birth_date', 'wife_birth_date']]| event_date | husband_birth_date | wife_birth_date | |
|---|---|---|---|
| 0 | 1816-12-06 | NaN | NaN |
| 1 | 1816-12-12 | NaN | NaN |
| 2 | 1817-03-05 | NaN | NaN |
| 3 | 1817-03-10 | NaN | NaN |
| 4 | 1817-03-12 | NaN | NaN |
| ... | ... | ... | ... |
| 1714 | 1907-10-27 | 1882-11-01 | 1880-11-01 |
| 1715 | 1908-01-13 | 1880-01-19 | 1879-01-19 |
| 1716 | 1908-01-15 | 1886-01-19 | 1888-01-19 |
| 1717 | 1908-02-15 | 1883-02-20 | 1886-02-19 |
| 1718 | 1908-03-17 | 1880-03-23 | 1879-03-24 |
1719 rows × 3 columns
ENTIERROS_HARMONIZED['deceased_birth_date'] = AgeInferrer(ENTIERROS_HARMONIZED['event_date']).infer_all(ENTIERROS_HARMONIZED['deceased_birth_date'])
ENTIERROS_HARMONIZED[['event_date', 'deceased_birth_date']]| event_date | deceased_birth_date | |
|---|---|---|
| 0 | 1846-10-06 | NaN |
| 1 | 1846-10-07 | 1821-10-13 |
| 2 | 1846-11-02 | 1766-11-21 |
| 3 | 1846-12-08 | 1806-12-18 |
| 4 | 1847-02-23 | 1797-03-06 |
| ... | ... | ... |
| 2193 | 1920-10-12 | 1920-02-15 |
| 2194 | 1920-10-19 | 1870-10-31 |
| 2195 | 1920-10-19 | 1913-10-21 |
| 2196 | 1920-10-20 | 1919-10-21 |
| 2197 | 1920-10-21 | 1920-10-19 |
2192 rows × 2 columns
5. Name Standardization
General Name Normalization
Standardize name fields to ensure consistency for matching. This includes normalizing whitespace, removing punctuation, handling abbreviations, and standardizing capitalization.
from actions.normalizers.NamesNormalizer import NamesNormalizer# Define helper function to apply name normalization to a series
def normalize_names_columns(series):
"""Apply NamesNormalizer to clean and standardize a series of name values."""
namesManager = NamesNormalizer()
return namesManager.clean_series(series)names_columns = [
'baptized_name',
'father_name', 'father_lastname',
'mother_name', 'mother_lastname',
'godfather_name', 'godfather_lastname',
'godmother_name', 'godmother_lastname',
]
for col in names_columns:
if col in BAUTISMOS_HARMONIZED.columns:
BAUTISMOS_HARMONIZED[col] = normalize_names_columns(BAUTISMOS_HARMONIZED[col])
BAUTISMOS_HARMONIZED[names_columns]| baptized_name | father_name | father_lastname | mother_name | mother_lastname | godfather_name | godfather_lastname | godmother_name | godmother_lastname | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | domingo | lucas | ayquipa | sevastiana | quispe | vicente | guamani | NaN | NaN |
| 1 | dominga | juan | lulia | jospha | gomes | ignacio | varientos | NaN | NaN |
| 2 | bartola | jacinto | quispe | juliana | chinchay | NaN | NaN | rotonda | pocco |
| 3 | francisca | juan | cuebas | clemenzia | manco | NaN | NaN | ysabel | guillen |
| 4 | pedro | santos | manxo | baleriana | arango | NaN | NaN | josefa | santiago |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6336 | leocadio | miguel | pacheco | rosa | huarcaya | josé julián | bendezú | NaN | NaN |
| 6337 | mariano concepcion | facundo | vega | silvestra | urbano | fernando | mancco | NaN | NaN |
| 6338 | ambrosio | ysidro | ccasane | rita | palomino | juan | tito | NaN | NaN |
| 6339 | francisco | mariano | lopez | leocadia | medina | feliciano | dias | NaN | NaN |
| 6340 | laureana | bernarda | champa | rufina | lopez | NaN | NaN | manuela | de la cruz |
6340 rows × 9 columns
Dataset-Specific Name Processing
Baptisms: Surname Inference
Baptized individuals’ surnames were not directly recorded in the original registers. We infer the surname from the father’s surname (following patrilineal naming conventions), with the mother’s surname as a fallback if the father’s is unavailable.
BAUTISMOS_HARMONIZED['baptized_lastname'] = BAUTISMOS_HARMONIZED.apply(
lambda row: row['father_lastname'] if pd.notna(row['father_lastname']) else (
row['mother_lastname'] if pd.notna(row['mother_lastname']) else 'na'
),
axis=1
)
BAUTISMOS_HARMONIZED['baptized_lastname']0 ayquipa
1 lulia
2 quispe
3 cuebas
4 manxo
...
6336 pacheco
6337 vega
6338 ccasane
6339 lopez
6340 champa
Name: baptized_lastname, Length: 6340, dtype: object
Marriages: Name Field Normalization
Marriage records contain numerous name fields for both spouses, their parents, godparents, and witnesses. All of these require standardization.
matrimonios_names_columns = [
'husband_name', 'husband_lastname',
'husband_father_name', 'husband_father_lastname',
'husband_mother_name', 'husband_mother_lastname',
'wife_name', 'wife_lastname',
'wife_father_name', 'wife_father_lastname',
'wife_mother_name', 'wife_mother_lastname',
'godparent_1_name', 'godparent_1_lastname',
'godparent_2_name', 'godparent_2_lastname',
'godparent_3_name', 'godparent_3_lastname',
'witness_1_name', 'witness_1_lastname',
'witness_2_name', 'witness_2_lastname',
'witness_3_name', 'witness_3_lastname',
'witness_4_name', 'witness_4_lastname'
]
for col in matrimonios_names_columns:
if col in MATRIMONIOS_HARMONIZED.columns:
MATRIMONIOS_HARMONIZED[col] = normalize_names_columns(MATRIMONIOS_HARMONIZED[col])
MATRIMONIOS_HARMONIZED[matrimonios_names_columns]| husband_name | husband_lastname | husband_father_name | husband_father_lastname | husband_mother_name | husband_mother_lastname | wife_name | wife_lastname | wife_father_name | wife_father_lastname | ... | godparent_3_name | godparent_3_lastname | witness_1_name | witness_1_lastname | witness_2_name | witness_2_lastname | witness_3_name | witness_3_lastname | witness_4_name | witness_4_lastname | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | josé manl manuel | de la roca | acencio | roca | leonor | guerrero | juana | rodrigues | pedro | rodrigues | ... | NaN | NaN | agustin | castro | mariano | castro | juan | baldes | NaN | NaN |
| 1 | esteban | castillo | matheo | castillo | ma maria | torres | ambrocia | tasqui | pedro | tasqui | ... | NaN | NaN | pedro | manco | carlos | canto | pedro | guamani | NaN | NaN |
| 2 | alexandro | ramires | leonor | romani | franca francisca | paucar | sipriana | coillo | cristobal | coillo | ... | NaN | NaN | marcelo | llamuca | julian | urbano | antonio | urbano | NaN | NaN |
| 3 | jose | cuchu | acencio | cuchu | baleriana | antay | cacimira | flores | NaN | NaN | ... | NaN | NaN | pablo | roque | antonio | urbano | cristobal | coillo | NaN | NaN |
| 4 | domingo | tito | NaN | NaN | marcela | guauya | petrona | guallpatuiru | agustin | guallpatuiru | ... | NaN | NaN | marcelo | llamuca | antonio | guamani | mariano | guallpatuiru | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1714 | patrocinio | chinchay | miguel | chinchay | andrea | polanco | lorenza | quispe | gervacio | quispe | ... | NaN | NaN | juan de dios | barrientos | manuel | espinosa | crisostomo | pumarino | NaN | NaN |
| 1715 | gerónimo | cucho | ambrocio | cucho | gertrudis | serrano | teresa | jimenes | aniseto | jimenes | ... | NaN | NaN | victor | saravia | mateo | aiquipa | felix | cucho | NaN | NaN |
| 1716 | josé | coro | felix | coro | natividad | cucho | emilia | huamani | patricio | huamani | ... | NaN | NaN | pablo | de la cruz | victor | saravia | marcelo | ramos | NaN | NaN |
| 1717 | pedro | gutierres | ruperto | gutierrez | micaila | oscco | juliana | huarcaya | hilario | huarcaya | ... | NaN | NaN | rafael | delgado | josé | vivanco | agustin | vicente | NaN | NaN |
| 1718 | benito | sanchez | tiburcio | sanchez | lorenza | aymi | teresa | poma | bernardo | poma | ... | NaN | NaN | santos | espejo | felix | sanches | mariano | urbano | NaN | NaN |
1719 rows × 26 columns
entierros_names_columns = [
"deceased_name", "deceased_lastname",
"father_name", "father_lastname",
"mother_name", "mother_lastname",
"husband_name", "wife_name",
]
for col in entierros_names_columns:
if col in ENTIERROS_HARMONIZED.columns:
ENTIERROS_HARMONIZED[col] = normalize_names_columns(ENTIERROS_HARMONIZED[col])
ENTIERROS_HARMONIZED[entierros_names_columns]| deceased_name | deceased_lastname | father_name | father_lastname | mother_name | mother_lastname | husband_name | wife_name | |
|---|---|---|---|---|---|---|---|---|
| 0 | julian | xavies | NaN | NaN | NaN | NaN | NaN | mercedes lupa |
| 1 | joce | raime | NaN | NaN | NaN | NaN | NaN | francisca cucho |
| 2 | martina | condori | NaN | NaN | NaN | NaN | luciano ccoyllo | NaN |
| 3 | dorotea | ccoyllo | NaN | NaN | NaN | NaN | josé espinosa | NaN |
| 4 | maría | romani | NaN | NaN | NaN | NaN | mariano huallpatuiro | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2193 | nieves | huallpatuero | patrocinio | huallpatuero | teresa | urbano | NaN | NaN |
| 2194 | sinforiano | huamani | eustaquio | huamani | martina | llamoca | NaN | gregoria ccoillo |
| 2195 | salomé | condori | sebastian | condori | anacla | roque | NaN | NaN |
| 2196 | maría | chinchay | abdon | chinchay | floriza | lopez | NaN | NaN |
| 2197 | fortunato | florez | cecilio | florez | guadalupe | ramos | NaN | NaN |
2192 rows × 8 columns
Burials: Name Field Normalization
Burial records contain name fields for the deceased and their family members.
Burials: Full Name Splitting
In burial records, deceased spouses (husband/wife) were recorded as full names in a single field. We need to split these into separate first name and surname fields for consistency with other datasets.
# Define helper function to split full names into first name and surname
def name_splitter(name):
"""
Split a full name into first name and surname components.
Handles compound surnames like 'de la' and multi-word first names.
"""
if pd.isnull(name):
return np.nan, np.nan
name_parts = name.split()
if len(name_parts) > 2:
if "de la" in name:
return name_parts[0], " ".join(name_parts[1:])
return " ".join(name_parts[0:2]), " ".join(name_parts[2:])
return name_parts[0], name_parts[1] if len(name_parts) > 1 else np.nanENTIERROS_HARMONIZED[['husband_name', 'husband_lastname']] = ENTIERROS_HARMONIZED['husband_name'].apply(name_splitter).apply(pd.Series)
ENTIERROS_HARMONIZED[['wife_name', 'wife_lastname']] = ENTIERROS_HARMONIZED['wife_name'].apply(name_splitter).apply(pd.Series)6. Geographic Data Extraction
Place Name Recognition
Extract place name entities from text fields using Named Entity Recognition (NER). Historical records often contain place names embedded within longer descriptive text (e.g., “natural de la villa de Sondondo”), which need to be identified and extracted for geographic analysis.
from actions.extractors import placeRecognition
extractor = placeRecognition.PlaceExtractor()bautismos_place_columns = [
'baptized_birth_place', 'event_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4'
]
BAUTISMOS_PLACES_RAW = BAUTISMOS_HARMONIZED[bautismos_place_columns]
for col in bautismos_place_columns:
if col in BAUTISMOS_HARMONIZED.columns:
BAUTISMOS_HARMONIZED[col] = extractor.extract_places_per_row(BAUTISMOS_HARMONIZED[col])
BAUTISMOS_HARMONIZED[bautismos_place_columns]| baptized_birth_place | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | |
|---|---|---|---|---|---|---|
| 0 | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN |
| 1 | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN |
| 2 | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN |
| 3 | NaN | Aucara | Aucara | NaN | NaN | NaN |
| 4 | NaN | Aucara | Aucara | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 6336 | NaN | Aucará | Aucará | Aucará | NaN | NaN |
| 6337 | NaN | Aucará | Aucará | Aucará | NaN | NaN |
| 6338 | NaN | Aucará | Aucará | Mayobamba | NaN | NaN |
| 6339 | NaN | Aucará | Aucará | Huaicahuacho | NaN | NaN |
| 6340 | NaN | Aucará | Aucará | Chacralla | NaN | NaN |
6340 rows × 6 columns
Marriage Place Extraction
matrimonios_place_columns = [
'husband_birth_place',
'husband_resident_in',
'wife_birth_place', 'wife_resident_in',
'event_place', 'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
'event_geographic_descriptor_3', 'event_geographic_descriptor_4',
'event_geographic_descriptor_5', 'event_geographic_descriptor_6'
]
MATRIMONIOS_PLACES_RAW = MATRIMONIOS_HARMONIZED[matrimonios_place_columns]
for col in matrimonios_place_columns:
if col in MATRIMONIOS_HARMONIZED.columns:
MATRIMONIOS_HARMONIZED[col] = extractor.extract_places_per_row(MATRIMONIOS_HARMONIZED[col])
MATRIMONIOS_HARMONIZED[matrimonios_place_columns]| husband_birth_place | husband_resident_in | wife_birth_place | wife_resident_in | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | event_geographic_descriptor_5 | event_geographic_descriptor_6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ciudad de Huamanga | Aucara | NaN | NaN | Aucara | Aucara | Huamanga | Coracora | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | Aucara | Aucara | Colca | NaN | NaN | NaN | NaN |
| 2 | Pampamarca | NaN | Pampamarca | NaN | Aucara | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| 3 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca|santa iglesia | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | Pampamarca|santa iglesia | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1714 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
| 1715 | Chacralla | NaN | Chacralla | NaN | Chacralla, iglesia vice-parroquial | Aucara | Chacralla | NaN | NaN | NaN | NaN |
| 1716 | Chacralla | NaN | Chacralla | NaN | Chacralla, iglesia vice-parroquial | Aucara | Chacralla | NaN | NaN | NaN | NaN |
| 1717 | NaN | Aucara | NaN | Aucara | Aucara | Aucara | Queca | NaN | NaN | NaN | NaN |
| 1718 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | Aucara | Pampamarca | NaN | NaN | NaN | NaN |
1719 rows × 11 columns
Burial Place Extraction
entierros_place_columns = [
'event_place', 'deceased_birth_place', 'burial_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4'
]
ENTIERROS_PLACES_RAW = ENTIERROS_HARMONIZED[entierros_place_columns]
for col in entierros_place_columns:
if col in ENTIERROS_HARMONIZED.columns:
ENTIERROS_HARMONIZED[col] = extractor.extract_places_per_row(ENTIERROS_HARMONIZED[col])
ENTIERROS_HARMONIZED[entierros_place_columns]| event_place | deceased_birth_place | burial_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | |
|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 1 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 2 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 3 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| 4 | NaN | NaN | NaN | Aucará | Lucanas | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2193 | Aucara | Santa Ana de Aucara | NaN | Aucara | Santa Ana de Aucara | NaN | NaN |
| 2194 | Aucara | Pampamarca | NaN | Aucara | Pampamarca | NaN | NaN |
| 2195 | Aucara | Santa Ana de Aucara | NaN | Aucara | Santa Ana de Aucara | NaN | NaN |
| 2196 | Aucara | Aucara | NaN | Aucara | NaN | NaN | NaN |
| 2197 | Aucara | Aucara | NaN | Aucara | NaN | NaN | NaN |
2192 rows × 7 columns
### Raw Place Export
# Extract all unique place names from all datasets and save for subsequent normalization
# in the place mapping notebook (2_placeMapping.ipynb)
from utils import UniqueValues
uniqueValues = UniqueValues.UniqueValuesExtractor(
[BAUTISMOS_PLACES_RAW, MATRIMONIOS_PLACES_RAW, ENTIERROS_PLACES_RAW]
).get_unique_values(return_dataframe=True)
uniqueValues.to_csv("../data/raw/raw_places.csv", index=False) # type: ignore7. Export Cleaned Data
All cleaned datasets are exported to the data/clean/ directory for use in subsequent analysis notebooks (term extraction, person entity creation, and record linkage).
clean_data_folder = Path("../data/clean/")
# Fill NaN values for consistency
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.fillna(value=np.nan)
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.fillna(value=np.nan)
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.fillna(value=np.nan)
BAUTISMOS_HARMONIZED.to_csv(clean_data_folder / "bautismos_clean.csv", index=False)
MATRIMONIOS_HARMONIZED.to_csv(clean_data_folder / "matrimonios_clean.csv", index=False)
ENTIERROS_HARMONIZED.to_csv(clean_data_folder / "entierros_clean.csv", index=False)Summary
This notebook successfully cleaned and standardized three sacramental record datasets:
Key transformations applied: 1. Column Harmonization: Unified schema across heterogeneous historical sources 2. Null Standardization: Converted diverse missing data representations to np.nan 3. Temporal Normalization: Standardized dates to ISO format and inferred birth dates from ages 4. Name Standardization: Applied consistent name cleaning and inferred missing surnames 5. Geographic Extraction: Identified place name entities using NER 6. Data Export: Saved cleaned datasets to data/clean/ for downstream analysis
Outputs: - bautismos_clean.csv: {len(BAUTISMOS_HARMONIZED)} baptism records - matrimonios_clean.csv: {len(MATRIMONIOS_HARMONIZED)} marriage records - entierros_clean.csv: {len(ENTIERROS_HARMONIZED)} burial records - raw_places.csv: Raw place names for normalization in notebook 2
The cleaned data is now ready for term extraction analysis (notebook 3), place name normalization (notebook 2), and person entity creation (notebook 4).