Data Cleaning

This notebook handles the data cleaning process for the datasets, documenting the process at the same time.

# Libraries
import json
import pandas as pd
import numpy as np
import re
from pathlib import Path

Raw Datasets

The raw datasets are stored in the data/raw directory. The datasets include:

  • bautismos.csv: Baptism records
  • matrimonios.csv: Marriage records
  • entierros.csv: Burial records
BAUTISMOS_RAW = pd.read_csv("../data/raw/bautismos.csv")
MATRIMONIOS_RAW = pd.read_csv("../data/raw/matrimonios.csv")
ENTIERROS_RAW = pd.read_csv("../data/raw/entierros.csv")

BAUTISMOS_RAW.head()
Secuencia Unidad Documental Compuesta (a la que pertenece) Identificador (es recomendable seguir una secuencia numeral como la mostrada en los ejemplos) Título (incluir un título breve para cada documento) Folio inicial del documento (convertir como se muestra abajo) Folio final del documento (convertir como se muestra abajo) Imagen inicial (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) Imagen final (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) Tipo de evento Fecha aaaa-mm-dd ... Condición de la madrina Lugar de bautizo Notas adicionales del documento Descriptor Geográfico 1 Descriptor Geográfico 2 Descriptor Geográfico 3 Descriptor Geográfico 4 5 Características físicas (Estado de conservación de los materiales físicos) Historia de revisión (de los materiales digitalizados)
0 1.0 APAucará LB L001 B001 Bautizo. Domingo. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-04 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
1 2.0 APAucará LB L001 B002 Bautizo. Dominga. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-06 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
2 3.0 APAucará LB L001 B003 Bautizo. Bartola. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-07 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
3 4.0 APAucará LB L001 B004 Bautizo. Francisca 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Abreviatura poco visible en el margen Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
4 5.0 APAucará LB L001 B005 Bautizo. Pedro 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Margen roto y manchado de tinta Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023

5 rows × 36 columns

Column Harmonization

Rename the columns in the datasets to ensure consistency across different data sources.

Values are mapped using the json files located in the data/mappings directory.

from utils.ColumnManager import ColumnManager
bautismoMapping = Path("../data/mappings/bautismosMapping.json")
matrimonioMapping = Path("../data/mappings/matrimoniosMapping.json")
entierroMapping = Path("../data/mappings/entierrosMapping.json")

column_manager = ColumnManager()

BAUTISMOS_HARMONIZED = column_manager.harmonize_columns(BAUTISMOS_RAW, bautismoMapping)
MATRIMONIOS_HARMONIZED = column_manager.harmonize_columns(MATRIMONIOS_RAW, matrimonioMapping)
ENTIERROS_HARMONIZED = column_manager.harmonize_columns(ENTIERROS_RAW, entierroMapping)

BAUTISMOS_HARMONIZED.head()
id file identifier title start_folio end_folio start_image end_image event_type event_date ... godmother_social_condition event_place event_additional_notes event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 event_other record_physical_characteristics revision_history
0 1.0 APAucará LB L001 B001 Bautizo. Domingo. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-04 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
1 2.0 APAucará LB L001 B002 Bautizo. Dominga. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-06 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
2 3.0 APAucará LB L001 B003 Bautizo. Bartola. Tributarios 3r 3r IMG_7000a IMG_7000a Bautizo 1790-10-07 ... NaN Pampamarca, iglesia NaN Aucara Pampamarca NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
3 4.0 APAucará LB L001 B004 Bautizo. Francisca 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Abreviatura poco visible en el margen Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023
4 5.0 APAucará LB L001 B005 Bautizo. Pedro 3v 3v IMG_7000b IMG_7000b Bautizo 1790-10-20 ... NaN Aucara, iglesia Margen roto y manchado de tinta Aucara NaN NaN NaN NaN Regular Registrado por Edwin Gonzales en 2023

5 rows × 36 columns

Reduce DataFrames to their relevant columns

useful_columns = json.load(open("../data/mappings/usefulColumnsMapping.json"))
# Bautismos useful columns

BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED[useful_columns['bautizo']]

# remove empty columns
BAUTISMOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

BAUTISMOS_HARMONIZED.columns
Index(['file', 'identifier', 'event_type', 'event_date', 'baptized_name',
       'baptized_birth_place', 'baptized_birth_date',
       'baptized_legitimacy_status', 'father_name', 'father_lastname',
       'father_social_condition', 'mother_name', 'mother_lastname',
       'mother_social_condition', 'parents_social_condition', 'godfather_name',
       'godfather_lastname', 'godfather_social_condition', 'godmother_name',
       'godmother_lastname', 'godmother_social_condition', 'event_place',
       'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
       'event_geographic_descriptor_3', 'event_geographic_descriptor_4'],
      dtype='object')
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED[useful_columns['matrimonio']]

# remove empty columns
MATRIMONIOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

MATRIMONIOS_HARMONIZED.columns
Index(['file', 'identifier', 'event_type', 'event_date', 'groom_name',
       'groom_lastname', 'groom_social_condition', 'groom_marital_status',
       'groom_birth_date', 'groom_birth_place', 'groom_resident_in',
       'groom_legitimacy_status', 'groom_father_name', 'groom_father_lastname',
       'groom_father_social_condition', 'groom_mother_name',
       'groom_mother_lastname', 'groom_mother_social_condition', 'bride_name',
       'bride_lastname', 'bride_social_condition', 'bride_marital_status',
       'bride_birth_date', 'bride_birth_place', 'bride_resident_in',
       'bride_legitimacy_status', 'bride_father_name', 'bride_father_lastname',
       'bride_father_social_condition', 'bride_mother_name',
       'bride_mother_lastname', 'bride_mother_social_condition',
       'godparent_1_name', 'godparent_1_lastname',
       'godparent_1_social_condition', 'godparent_2_name',
       'godparent_2_lastname', 'godparent_2_social_condition',
       'godparent_3_name', 'godparent_3_lastname', 'witness_1_name',
       'witness_1_lastname', 'witness_2_name', 'witness_2_lastname',
       'witness_3_name', 'witness_3_lastname', 'witness_4_name',
       'witness_4_lastname', 'event_place', 'event_geographic_descriptor_1',
       'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
       'event_geographic_descriptor_4', 'event_geographic_descriptor_5',
       'event_geographic_descriptor_6'],
      dtype='object')
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED[useful_columns['entierro']]

# remove empty columns
ENTIERROS_HARMONIZED.dropna(axis=1, how='all', inplace=True)

ENTIERROS_HARMONIZED.columns
Index(['file', 'identifier', 'event_type', 'event_date', 'doctrine',
       'event_place', 'deceased_name', 'deceased_lastname',
       'deceased_birth_date', 'deceased_birth_place',
       'deceased_social_condition', 'deceased_marital_status',
       'deceased_legitimacy_status', 'father_name', 'father_lastname',
       'mother_name', 'mother_lastname', 'husband_name', 'wife_name',
       'burial_place', 'event_geographic_descriptor_1',
       'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
       'event_geographic_descriptor_4'],
      dtype='object')

Replace empty or null values with ‘na’

## replace cells with no textual information with numpy na
def replace_empty_with_na(df):
    """
    Replace placeholder strings with np.nan in string columns only.
    """
    placeholders = {'', '-', '--', 'n/a', 'na', 'null', 'None'}

    def clean_cell(val):
        if isinstance(val, str) and val.strip().lower() in placeholders:
            return np.nan
        return val

    return df.map(clean_cell)
BAUTISMOS_HARMONIZED = replace_empty_with_na(BAUTISMOS_HARMONIZED)
MATRIMONIOS_HARMONIZED = replace_empty_with_na(MATRIMONIOS_HARMONIZED)
ENTIERROS_HARMONIZED = replace_empty_with_na(ENTIERROS_HARMONIZED)

Remove all empty rows from the DataFrames

BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.dropna(how='all')
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.dropna(how='all')
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.dropna(how='all')

Particular cases

Baptism event type error correction

In the bautismos dataset, two records have an incorrect event type:

BAUTISMOS_HARMONIZED.loc[~(BAUTISMOS_HARMONIZED['event_type'] == 'Bautizo')]
file identifier event_type event_date baptized_name baptized_birth_place baptized_birth_date baptized_legitimacy_status father_name father_lastname ... godfather_lastname godfather_social_condition godmother_name godmother_lastname godmother_social_condition event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4
1900 APAucará LB L001 B1901 Concurso NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5414 APAucará LB L003 B1120 44444 1869-09-17 Juan Yshua 12 días Hijo natural Ygnacio Sevidola ... Origuela NaN NaN NaN NaN Aucará Aucará NaN NaN NaN

2 rows × 26 columns

Event 1900 is basically an empty record, so we will drop it. While event 5414 requires correction:

BAUTISMOS_HARMONIZED.drop(1900, inplace=True)

BAUTISMOS_HARMONIZED.loc[5414, 'event_type'] = 'Bautizo'
BAUTISMOS_HARMONIZED.iloc[5414]
file                             APAucará LB L003
identifier                                  B1121
event_type                                Bautizo
event_date                             1869-09-17
baptized_name                        Maria Isabel
baptized_birth_place                    Chacralla
baptized_birth_date                        4 días
baptized_legitimacy_status          Hija legítima
father_name                                  José
father_lastname                           Cerrano
father_social_condition                       NaN
mother_name                               Manuela
mother_lastname                            Avalos
mother_social_condition                       NaN
parents_social_condition                      NaN
godfather_name                                NaN
godfather_lastname                            NaN
godfather_social_condition                    NaN
godmother_name                            Brijida
godmother_lastname                            NaN
godmother_social_condition                    NaN
event_place                                Aucará
event_geographic_descriptor_1              Aucará
event_geographic_descriptor_2                 NaN
event_geographic_descriptor_3                 NaN
event_geographic_descriptor_4                 NaN
Name: 5415, dtype: object

Dates Normalization

Ensure dates are in a consistent format across all datasets. The dates should be in the format YYYY-MM-DD.

from actions.normalizers.DatesNormalizer import DateNormalizer
BAUTISMOS_HARMONIZED['event_date'] = DateNormalizer(BAUTISMOS_HARMONIZED['event_date']).normalize()
BAUTISMOS_HARMONIZED['event_date']
0       1790-10-04
1       1790-10-06
2       1790-10-07
3       1790-10-20
4       1790-10-20
           ...    
6336    1888-12-10
6337    1888-12-11
6338    1888-12-12
6339    1888-12-15
6340    1888-12-16
Name: event_date, Length: 6340, dtype: object
MATRIMONIOS_HARMONIZED['event_date'] = DateNormalizer(MATRIMONIOS_HARMONIZED['event_date']).normalize()
MATRIMONIOS_HARMONIZED['event_date']
0       1816-12-06
1       1816-12-12
2       1817-03-05
3       1817-03-10
4       1817-03-12
           ...    
1714    1907-10-27
1715    1908-01-13
1716    1908-01-15
1717    1908-02-15
1718    1908-03-17
Name: event_date, Length: 1719, dtype: object
ENTIERROS_HARMONIZED['event_date'] = DateNormalizer(ENTIERROS_HARMONIZED['event_date']).normalize()
ENTIERROS_HARMONIZED['event_date']
0       1846-10-06
1       1846-10-07
2       1846-11-02
3       1846-12-08
4       1847-02-23
           ...    
2193    1920-10-12
2194    1920-10-19
2195    1920-10-19
2196    1920-10-20
2197    1920-10-21
Name: event_date, Length: 2192, dtype: object

Age Inferring

Infer the age of individuals based on their birthdates and the date of the event (baptism, marriage, burial).

from actions.generators.AgeInferrer import AgeInferrer
BAUTISMOS_HARMONIZED['baptized_birth_date'] = AgeInferrer(BAUTISMOS_HARMONIZED['event_date']).infer_all(BAUTISMOS_HARMONIZED['baptized_birth_date'])
BAUTISMOS_HARMONIZED[['event_date', 'baptized_birth_date']]
event_date baptized_birth_date
0 1790-10-04 1790-08-04
1 1790-10-06 1790-08-04
2 1790-10-07 1790-08-04
3 1790-10-20 1790-10-15
4 1790-10-20 1790-10-19
... ... ...
6336 1888-12-10 1888-12-09
6337 1888-12-11 1888-12-07
6338 1888-12-12 1888-12-06
6339 1888-12-15 1888-11-30
6340 1888-12-16 1888-12-01

6340 rows × 2 columns

# Inconsistent dates: event_date should be after baptized_birth_date
invalid_mask = pd.to_datetime(BAUTISMOS_HARMONIZED['event_date'], errors='coerce') < pd.to_datetime(BAUTISMOS_HARMONIZED['baptized_birth_date'], errors='coerce')
if invalid_mask.any():
    print("Found invalid records:")
    print(BAUTISMOS_HARMONIZED[invalid_mask][['event_date', 'baptized_birth_date']])
Found invalid records:
      event_date baptized_birth_date
135   1792-03-29          1792-04-08
290   1794-01-01          1794-01-27
671   1797-07-15          1797-07-24
2814  1900-04-01          1900-04-09

Unfortunately, these records were incorrectly recorded. To fix this, it is necessary to check with the original records.

MATRIMONIOS_HARMONIZED['groom_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['groom_birth_date'])
MATRIMONIOS_HARMONIZED['bride_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['bride_birth_date'])
MATRIMONIOS_HARMONIZED[['event_date', 'groom_birth_date', 'bride_birth_date']]
event_date groom_birth_date bride_birth_date
0 1816-12-06 NaN NaN
1 1816-12-12 NaN NaN
2 1817-03-05 NaN NaN
3 1817-03-10 NaN NaN
4 1817-03-12 NaN NaN
... ... ... ...
1714 1907-10-27 1882-11-01 1880-11-01
1715 1908-01-13 1880-01-19 1879-01-19
1716 1908-01-15 1886-01-19 1888-01-19
1717 1908-02-15 1883-02-20 1886-02-19
1718 1908-03-17 1880-03-23 1879-03-24

1719 rows × 3 columns

ENTIERROS_HARMONIZED['deceased_birth_date'] = AgeInferrer(ENTIERROS_HARMONIZED['event_date']).infer_all(ENTIERROS_HARMONIZED['deceased_birth_date'])
ENTIERROS_HARMONIZED[['event_date', 'deceased_birth_date']]
event_date deceased_birth_date
0 1846-10-06 NaN
1 1846-10-07 1821-10-13
2 1846-11-02 1766-11-21
3 1846-12-08 1806-12-18
4 1847-02-23 1797-03-06
... ... ...
2193 1920-10-12 1920-02-15
2194 1920-10-19 1870-10-31
2195 1920-10-19 1913-10-21
2196 1920-10-20 1919-10-21
2197 1920-10-21 1920-10-19

2192 rows × 2 columns

Names Normalization

Standardize names across datasets to ensure consistency. This includes normalizing first names, last names, and any other relevant name fields.

from actions.normalizers.NamesNormalizer import NamesNormalizer
def normalize_names_columns(series):
    namesManager = NamesNormalizer()
    return namesManager.clean_series(series)
names_columns = [
    'baptized_name', 
    'father_name', 'father_lastname',
    'mother_name', 'mother_lastname',
    'godfather_name', 'godfather_lastname', 
    'godmother_name', 'godmother_lastname',
]

for col in names_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = normalize_names_columns(BAUTISMOS_HARMONIZED[col])

BAUTISMOS_HARMONIZED[names_columns]
baptized_name father_name father_lastname mother_name mother_lastname godfather_name godfather_lastname godmother_name godmother_lastname
0 domingo lucas ayquipa sevastiana quispe vicente guamani NaN NaN
1 dominga juan lulia jospha gomes ignacio varientos NaN NaN
2 bartola jacinto quispe juliana chinchay NaN NaN rotonda pocco
3 francisca juan cuebas clemenzia manco NaN NaN ysabel guillen
4 pedro santos manxo baleriana arango NaN NaN josefa santiago
... ... ... ... ... ... ... ... ... ...
6336 leocadio miguel pacheco rosa huarcaya josé julián bendezú NaN NaN
6337 mariano concepcion facundo vega silvestra urbano fernando mancco NaN NaN
6338 ambrosio ysidro ccasane rita palomino juan tito NaN NaN
6339 francisco mariano lopez leocadia medina feliciano dias NaN NaN
6340 laureana bernarda champa rufina lopez NaN NaN manuela de la cruz

6340 rows × 9 columns

Baptisms special case

Because baptized surname wasn’t recorded, we need to infer it from the father’s surname.

BAUTISMOS_HARMONIZED['baptized_lastname'] = BAUTISMOS_HARMONIZED['father_lastname']
matrimonios_names_columns = [
    'groom_name', 'groom_lastname', 
       'groom_father_name', 'groom_father_lastname', 
       'groom_mother_name', 'groom_mother_lastname',
       'bride_name', 'bride_lastname',
       'bride_father_name', 'bride_father_lastname',
       'bride_mother_name', 'bride_mother_lastname', 
       'godparent_1_name', 'godparent_1_lastname',
       'godparent_2_name', 'godparent_2_lastname', 
       'godparent_3_name', 'godparent_3_lastname', 
       'witness_1_name', 'witness_1_lastname', 
       'witness_2_name', 'witness_2_lastname',
       'witness_3_name', 'witness_3_lastname', 
       'witness_4_name', 'witness_4_lastname'
]

for col in matrimonios_names_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = normalize_names_columns(MATRIMONIOS_HARMONIZED[col])

MATRIMONIOS_HARMONIZED[matrimonios_names_columns]
groom_name groom_lastname groom_father_name groom_father_lastname groom_mother_name groom_mother_lastname bride_name bride_lastname bride_father_name bride_father_lastname ... godparent_3_name godparent_3_lastname witness_1_name witness_1_lastname witness_2_name witness_2_lastname witness_3_name witness_3_lastname witness_4_name witness_4_lastname
0 josé manl manuel de la roca acencio roca leonor guerrero juana rodrigues pedro rodrigues ... NaN NaN agustin castro mariano castro juan baldes NaN NaN
1 esteban castillo matheo castillo ma maria torres ambrocia tasqui pedro tasqui ... NaN NaN pedro manco carlos canto pedro guamani NaN NaN
2 alexandro ramires leonor romani franca francisca paucar sipriana coillo cristobal coillo ... NaN NaN marcelo llamuca julian urbano antonio urbano NaN NaN
3 jose cuchu acencio cuchu baleriana antay cacimira flores NaN NaN ... NaN NaN pablo roque antonio urbano cristobal coillo NaN NaN
4 domingo tito NaN NaN marcela guauya petrona guallpatuiru agustin guallpatuiru ... NaN NaN marcelo llamuca antonio guamani mariano guallpatuiru NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1714 patrocinio chinchay miguel chinchay andrea polanco lorenza quispe gervacio quispe ... NaN NaN juan de dios barrientos manuel espinosa crisostomo pumarino NaN NaN
1715 gerónimo cucho ambrocio cucho gertrudis serrano teresa jimenes aniseto jimenes ... NaN NaN victor saravia mateo aiquipa felix cucho NaN NaN
1716 josé coro felix coro natividad cucho emilia huamani patricio huamani ... NaN NaN pablo de la cruz victor saravia marcelo ramos NaN NaN
1717 pedro gutierres ruperto gutierrez micaila oscco juliana huarcaya hilario huarcaya ... NaN NaN rafael delgado josé vivanco agustin vicente NaN NaN
1718 benito sanchez tiburcio sanchez lorenza aymi teresa poma bernardo poma ... NaN NaN santos espejo felix sanches mariano urbano NaN NaN

1719 rows × 26 columns

entierros_names_columns = [
    "deceased_name", "deceased_lastname",
    "father_name", "father_lastname",
    "mother_name", "mother_lastname",
    "husband_name", "wife_name",
]

for col in entierros_names_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = normalize_names_columns(ENTIERROS_HARMONIZED[col])

ENTIERROS_HARMONIZED[entierros_names_columns]
deceased_name deceased_lastname father_name father_lastname mother_name mother_lastname husband_name wife_name
0 julian xavies NaN NaN NaN NaN NaN mercedes lupa
1 joce raime NaN NaN NaN NaN NaN francisca cucho
2 martina condori NaN NaN NaN NaN luciano ccoyllo NaN
3 dorotea ccoyllo NaN NaN NaN NaN josé espinosa NaN
4 maría romani NaN NaN NaN NaN mariano huallpatuiro NaN
... ... ... ... ... ... ... ... ...
2193 nieves huallpatuero patrocinio huallpatuero teresa urbano NaN NaN
2194 sinforiano huamani eustaquio huamani martina llamoca NaN gregoria ccoillo
2195 salomé condori sebastian condori anacla roque NaN NaN
2196 maría chinchay abdon chinchay floriza lopez NaN NaN
2197 fortunato florez cecilio florez guadalupe ramos NaN NaN

2192 rows × 8 columns

Entierros special case

Deceased wife or husband were recorded as full name. Therefore, we need to split the full name into first name and last name.

def name_splitter(name):
    if pd.isnull(name):
        return np.nan, np.nan

    name_parts = name.split()
    if len(name_parts) > 2:
        if "de la" in name:
            return name_parts[0], " ".join(name_parts[1:])
        return " ".join(name_parts[0:2]), " ".join(name_parts[2:])
    return name_parts[0], name_parts[1] if len(name_parts) > 1 else np.nan
ENTIERROS_HARMONIZED[['husband_name', 'husband_lastname']] = ENTIERROS_HARMONIZED['husband_name'].apply(name_splitter).apply(pd.Series)
ENTIERROS_HARMONIZED[['wife_name', 'wife_lastname']] = ENTIERROS_HARMONIZED['wife_name'].apply(name_splitter).apply(pd.Series)

Place Recognition

This process involves NER recognition to extract places entities from the text fields in the datasets.

from actions.extractors import placeRecognition

extractor = placeRecognition.PlaceExtractor()
bautismos_place_columns = [
    'baptized_birth_place', 'event_place', 'event_geographic_descriptor_1',
        'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
        'event_geographic_descriptor_4'
]

BAUTISMOS_PLACES_RAW = BAUTISMOS_HARMONIZED[bautismos_place_columns]

for col in bautismos_place_columns:
    if col in BAUTISMOS_HARMONIZED.columns:
        BAUTISMOS_HARMONIZED[col] = extractor.extract_places_per_row(BAUTISMOS_HARMONIZED[col])

BAUTISMOS_HARMONIZED[bautismos_place_columns]
baptized_birth_place event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4
0 NaN Pampamarca NaN Pampamarca NaN NaN
1 NaN Pampamarca NaN Pampamarca NaN NaN
2 NaN Pampamarca NaN Pampamarca NaN NaN
3 NaN Aucara NaN NaN NaN NaN
4 NaN Aucara NaN NaN NaN NaN
... ... ... ... ... ... ...
6336 NaN Aucará NaN NaN NaN NaN
6337 NaN Aucará NaN NaN NaN NaN
6338 NaN Aucará NaN Mayobamba NaN NaN
6339 NaN Aucará NaN Huaicahuacho NaN NaN
6340 NaN Aucará NaN NaN NaN NaN

6340 rows × 6 columns

matrimonios_place_columns = [
    'groom_birth_place',
       'groom_resident_in', 
       'bride_birth_place', 'bride_resident_in', 
       'event_place', 'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
       'event_geographic_descriptor_3', 'event_geographic_descriptor_4',
       'event_geographic_descriptor_5', 'event_geographic_descriptor_6'
]

MATRIMONIOS_PLACES_RAW = MATRIMONIOS_HARMONIZED[matrimonios_place_columns]

for col in matrimonios_place_columns:
    if col in MATRIMONIOS_HARMONIZED.columns:
        MATRIMONIOS_HARMONIZED[col] = extractor.extract_places_per_row(MATRIMONIOS_HARMONIZED[col])

MATRIMONIOS_HARMONIZED[matrimonios_place_columns]
groom_birth_place groom_resident_in bride_birth_place bride_resident_in event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 event_geographic_descriptor_5 event_geographic_descriptor_6
0 Ciudad de Huamanga Aucara NaN NaN Aucara NaN Huamanga NaN NaN NaN NaN
1 NaN NaN NaN NaN Aucara NaN Colca NaN NaN NaN NaN
2 Pampamarca NaN Pampamarca NaN Aucara NaN Pampamarca NaN NaN NaN NaN
3 Pampamarca NaN Pampamarca NaN Pampamarca|santa iglesia NaN Pampamarca NaN NaN NaN NaN
4 NaN NaN NaN NaN Pampamarca|santa iglesia NaN Pampamarca NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
1714 Pampamarca NaN Pampamarca NaN Pampamarca NaN Pampamarca NaN NaN NaN NaN
1715 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1716 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1717 NaN NaN NaN NaN Aucara NaN NaN NaN NaN NaN NaN
1718 Pampamarca NaN Pampamarca NaN Pampamarca NaN Pampamarca NaN NaN NaN NaN

1719 rows × 11 columns

entierros_place_columns = [
    'event_place', 'deceased_birth_place', 'burial_place', 'event_geographic_descriptor_1',
    'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
    'event_geographic_descriptor_4'
]

ENTIERROS_PLACES_RAW = ENTIERROS_HARMONIZED[entierros_place_columns]

for col in entierros_place_columns:
    if col in ENTIERROS_HARMONIZED.columns:
        ENTIERROS_HARMONIZED[col] = extractor.extract_places_per_row(ENTIERROS_HARMONIZED[col])

ENTIERROS_HARMONIZED[entierros_place_columns]
event_place deceased_birth_place burial_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4
0 NaN NaN NaN NaN Lucanas NaN NaN
1 NaN NaN NaN NaN Lucanas NaN NaN
2 NaN NaN NaN NaN Lucanas NaN NaN
3 NaN NaN NaN NaN Lucanas NaN NaN
4 NaN NaN NaN NaN Lucanas NaN NaN
... ... ... ... ... ... ... ...
2193 Aucara Santa Ana de Aucara NaN NaN Santa Ana de Aucara NaN NaN
2194 Aucara Pampamarca NaN NaN Pampamarca NaN NaN
2195 Aucara Santa Ana de Aucara NaN NaN Santa Ana de Aucara NaN NaN
2196 Aucara NaN NaN NaN NaN NaN NaN
2197 Aucara NaN NaN NaN NaN NaN NaN

2192 rows × 7 columns

# save raw places

from utils import UniqueValues

uniqueValues = UniqueValues.UniqueValuesExtractor(
    [BAUTISMOS_PLACES_RAW, MATRIMONIOS_PLACES_RAW, ENTIERROS_PLACES_RAW]
).get_unique_values(return_dataframe=True)

uniqueValues.to_csv("../data/raw/raw_places.csv", index=False) # type: ignore # 

Cleaning Audit

Basic audit to ensure that the data cleaning process has been successful.

Save Cleaned Data

clean_data_folder = Path("../data/clean/")

# Fill NaN values for consistency
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.fillna(value=np.nan)
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.fillna(value=np.nan)
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.fillna(value=np.nan)

BAUTISMOS_HARMONIZED.to_csv(clean_data_folder / "bautismos_clean.csv", index=False)
MATRIMONIOS_HARMONIZED.to_csv(clean_data_folder / "matrimonios_clean.csv", index=False)
ENTIERROS_HARMONIZED.to_csv(clean_data_folder / "entierros_clean.csv", index=False)
BAUTISMOS_HARMONIZED.head()
file identifier event_type event_date baptized_name baptized_birth_place baptized_birth_date baptized_legitimacy_status father_name father_lastname ... godfather_social_condition godmother_name godmother_lastname godmother_social_condition event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 baptized_lastname
0 APAucará LB L001 B001 Bautizo 1790-10-04 domingo NaN 1790-08-04 Hijo legitimo lucas ayquipa ... NaN NaN NaN NaN Pampamarca NaN Pampamarca NaN NaN ayquipa
1 APAucará LB L001 B002 Bautizo 1790-10-06 dominga NaN 1790-08-04 Hija legitima juan lulia ... NaN NaN NaN NaN Pampamarca NaN Pampamarca NaN NaN lulia
2 APAucará LB L001 B003 Bautizo 1790-10-07 bartola NaN 1790-08-04 Hija legitima jacinto quispe ... NaN rotonda pocco NaN Pampamarca NaN Pampamarca NaN NaN quispe
3 APAucará LB L001 B004 Bautizo 1790-10-20 francisca NaN 1790-10-15 Hija legitima juan cuebas ... NaN ysabel guillen NaN Aucara NaN NaN NaN NaN cuebas
4 APAucará LB L001 B005 Bautizo 1790-10-20 pedro NaN 1790-10-19 Hijo legitimo santos manxo ... NaN josefa santiago NaN Aucara NaN NaN NaN NaN manxo

5 rows × 27 columns

MATRIMONIOS_HARMONIZED.head()
file identifier event_type event_date groom_name groom_lastname groom_social_condition groom_marital_status groom_birth_date groom_birth_place ... witness_3_lastname witness_4_name witness_4_lastname event_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 event_geographic_descriptor_5 event_geographic_descriptor_6
0 APAucará LM L001 M001 Matrimonio 1816-12-06 josé manl manuel de la roca don, vecinos de este pueblo [Aucara] soltero NaN Ciudad de Huamanga ... baldes NaN NaN Aucara NaN Huamanga NaN NaN NaN NaN
1 APAucará LM L001 M002 Matrimonio 1816-12-12 esteban castillo vecinos de esta doctrina [Aucara] soltero NaN NaN ... guamani NaN NaN Aucara NaN Colca NaN NaN NaN NaN
2 APAucará LM L001 M003 Matrimonio 1817-03-05 alexandro ramires vecinos de dicho pueblo [Aucara] soltero NaN Pampamarca ... urbano NaN NaN Aucara NaN Pampamarca NaN NaN NaN NaN
3 APAucará LM L001 M004 Matrimonio 1817-03-10 jose cuchu vecinos de dicho [Pampamarca] soltero NaN Pampamarca ... coillo NaN NaN Pampamarca|santa iglesia NaN Pampamarca NaN NaN NaN NaN
4 APAucará LM L001 M005 Matrimonio 1817-03-12 domingo tito vecinos de dicho [Pampamarca] soltero NaN NaN ... guallpatuiru NaN NaN Pampamarca|santa iglesia NaN Pampamarca NaN NaN NaN NaN

5 rows × 55 columns

ENTIERROS_HARMONIZED.head()
file identifier event_type event_date doctrine event_place deceased_name deceased_lastname deceased_birth_date deceased_birth_place ... mother_lastname husband_name wife_name burial_place event_geographic_descriptor_1 event_geographic_descriptor_2 event_geographic_descriptor_3 event_geographic_descriptor_4 husband_lastname wife_lastname
0 APAucará LD L001 E001 Entierro 1846-10-06 Parroquia de Aucará NaN julian xavies NaN NaN ... NaN NaN mercedes NaN NaN Lucanas NaN NaN NaN lupa
1 APAucará LD L001 E002 Entierro 1846-10-07 Parroquia de Aucará NaN joce raime 1821-10-13 NaN ... NaN NaN francisca NaN NaN Lucanas NaN NaN NaN cucho
2 APAucará LD L001 E003 Entierro 1846-11-02 Parroquia de Aucará NaN martina condori 1766-11-21 NaN ... NaN luciano NaN NaN NaN Lucanas NaN NaN ccoyllo NaN
3 APAucará LD L001 E004 Entierro 1846-12-08 Parroquia de Aucará NaN dorotea ccoyllo 1806-12-18 NaN ... NaN josé NaN NaN NaN Lucanas NaN NaN espinosa NaN
4 APAucará LD L001 E005 Entierro 1847-02-23 Parroquia de Aucará NaN maría romani 1797-03-06 NaN ... NaN mariano NaN NaN NaN Lucanas NaN NaN huallpatuiro NaN

5 rows × 26 columns