# Libraries
import json
import pandas as pd
import numpy as np
import re
from pathlib import PathData Cleaning
This notebook handles the data cleaning process for the datasets, documenting the process at the same time.
Raw Datasets
The raw datasets are stored in the data/raw directory. The datasets include:
bautismos.csv: Baptism recordsmatrimonios.csv: Marriage recordsentierros.csv: Burial records
BAUTISMOS_RAW = pd.read_csv("../data/raw/bautismos.csv")
MATRIMONIOS_RAW = pd.read_csv("../data/raw/matrimonios.csv")
ENTIERROS_RAW = pd.read_csv("../data/raw/entierros.csv")
BAUTISMOS_RAW.head()| Secuencia | Unidad Documental Compuesta (a la que pertenece) | Identificador (es recomendable seguir una secuencia numeral como la mostrada en los ejemplos) | Título (incluir un título breve para cada documento) | Folio inicial del documento (convertir como se muestra abajo) | Folio final del documento (convertir como se muestra abajo) | Imagen inicial (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) | Imagen final (estos valores serán añadidos cuando comienze el proceso de revisión de imágenes) | Tipo de evento | Fecha aaaa-mm-dd | ... | Condición de la madrina | Lugar de bautizo | Notas adicionales del documento | Descriptor Geográfico 1 | Descriptor Geográfico 2 | Descriptor Geográfico 3 | Descriptor Geográfico 4 | 5 | Características físicas (Estado de conservación de los materiales físicos) | Historia de revisión (de los materiales digitalizados) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | APAucará LB L001 | B001 | Bautizo. Domingo. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-04 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 1 | 2.0 | APAucará LB L001 | B002 | Bautizo. Dominga. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-06 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 2 | 3.0 | APAucará LB L001 | B003 | Bautizo. Bartola. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-07 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 3 | 4.0 | APAucará LB L001 | B004 | Bautizo. Francisca | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Abreviatura poco visible en el margen | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 4 | 5.0 | APAucará LB L001 | B005 | Bautizo. Pedro | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Margen roto y manchado de tinta | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
5 rows × 36 columns
Column Harmonization
Rename the columns in the datasets to ensure consistency across different data sources.
Values are mapped using the json files located in the data/mappings directory.
from utils.ColumnManager import ColumnManagerbautismoMapping = Path("../data/mappings/bautismosMapping.json")
matrimonioMapping = Path("../data/mappings/matrimoniosMapping.json")
entierroMapping = Path("../data/mappings/entierrosMapping.json")
column_manager = ColumnManager()
BAUTISMOS_HARMONIZED = column_manager.harmonize_columns(BAUTISMOS_RAW, bautismoMapping)
MATRIMONIOS_HARMONIZED = column_manager.harmonize_columns(MATRIMONIOS_RAW, matrimonioMapping)
ENTIERROS_HARMONIZED = column_manager.harmonize_columns(ENTIERROS_RAW, entierroMapping)
BAUTISMOS_HARMONIZED.head()| id | file | identifier | title | start_folio | end_folio | start_image | end_image | event_type | event_date | ... | godmother_social_condition | event_place | event_additional_notes | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | event_other | record_physical_characteristics | revision_history | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | APAucará LB L001 | B001 | Bautizo. Domingo. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-04 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 1 | 2.0 | APAucará LB L001 | B002 | Bautizo. Dominga. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-06 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 2 | 3.0 | APAucará LB L001 | B003 | Bautizo. Bartola. Tributarios | 3r | 3r | IMG_7000a | IMG_7000a | Bautizo | 1790-10-07 | ... | NaN | Pampamarca, iglesia | NaN | Aucara | Pampamarca | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 3 | 4.0 | APAucará LB L001 | B004 | Bautizo. Francisca | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Abreviatura poco visible en el margen | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
| 4 | 5.0 | APAucará LB L001 | B005 | Bautizo. Pedro | 3v | 3v | IMG_7000b | IMG_7000b | Bautizo | 1790-10-20 | ... | NaN | Aucara, iglesia | Margen roto y manchado de tinta | Aucara | NaN | NaN | NaN | NaN | Regular | Registrado por Edwin Gonzales en 2023 |
5 rows × 36 columns
Reduce DataFrames to their relevant columns
useful_columns = json.load(open("../data/mappings/usefulColumnsMapping.json"))# Bautismos useful columns
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED[useful_columns['bautizo']]
# remove empty columns
BAUTISMOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)
BAUTISMOS_HARMONIZED.columnsIndex(['file', 'identifier', 'event_type', 'event_date', 'baptized_name',
'baptized_birth_place', 'baptized_birth_date',
'baptized_legitimacy_status', 'father_name', 'father_lastname',
'father_social_condition', 'mother_name', 'mother_lastname',
'mother_social_condition', 'parents_social_condition', 'godfather_name',
'godfather_lastname', 'godfather_social_condition', 'godmother_name',
'godmother_lastname', 'godmother_social_condition', 'event_place',
'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
'event_geographic_descriptor_3', 'event_geographic_descriptor_4'],
dtype='object')
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED[useful_columns['matrimonio']]
# remove empty columns
MATRIMONIOS_HARMONIZED.dropna(axis=1, how='all', inplace=True)
MATRIMONIOS_HARMONIZED.columnsIndex(['file', 'identifier', 'event_type', 'event_date', 'groom_name',
'groom_lastname', 'groom_social_condition', 'groom_marital_status',
'groom_birth_date', 'groom_birth_place', 'groom_resident_in',
'groom_legitimacy_status', 'groom_father_name', 'groom_father_lastname',
'groom_father_social_condition', 'groom_mother_name',
'groom_mother_lastname', 'groom_mother_social_condition', 'bride_name',
'bride_lastname', 'bride_social_condition', 'bride_marital_status',
'bride_birth_date', 'bride_birth_place', 'bride_resident_in',
'bride_legitimacy_status', 'bride_father_name', 'bride_father_lastname',
'bride_father_social_condition', 'bride_mother_name',
'bride_mother_lastname', 'bride_mother_social_condition',
'godparent_1_name', 'godparent_1_lastname',
'godparent_1_social_condition', 'godparent_2_name',
'godparent_2_lastname', 'godparent_2_social_condition',
'godparent_3_name', 'godparent_3_lastname', 'witness_1_name',
'witness_1_lastname', 'witness_2_name', 'witness_2_lastname',
'witness_3_name', 'witness_3_lastname', 'witness_4_name',
'witness_4_lastname', 'event_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4', 'event_geographic_descriptor_5',
'event_geographic_descriptor_6'],
dtype='object')
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED[useful_columns['entierro']]
# remove empty columns
ENTIERROS_HARMONIZED.dropna(axis=1, how='all', inplace=True)
ENTIERROS_HARMONIZED.columnsIndex(['file', 'identifier', 'event_type', 'event_date', 'doctrine',
'event_place', 'deceased_name', 'deceased_lastname',
'deceased_birth_date', 'deceased_birth_place',
'deceased_social_condition', 'deceased_marital_status',
'deceased_legitimacy_status', 'father_name', 'father_lastname',
'mother_name', 'mother_lastname', 'husband_name', 'wife_name',
'burial_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4'],
dtype='object')
Replace empty or null values with ‘na’
## replace cells with no textual information with numpy na
def replace_empty_with_na(df):
"""
Replace placeholder strings with np.nan in string columns only.
"""
placeholders = {'', '-', '--', 'n/a', 'na', 'null', 'None'}
def clean_cell(val):
if isinstance(val, str) and val.strip().lower() in placeholders:
return np.nan
return val
return df.map(clean_cell)BAUTISMOS_HARMONIZED = replace_empty_with_na(BAUTISMOS_HARMONIZED)
MATRIMONIOS_HARMONIZED = replace_empty_with_na(MATRIMONIOS_HARMONIZED)
ENTIERROS_HARMONIZED = replace_empty_with_na(ENTIERROS_HARMONIZED)Remove all empty rows from the DataFrames
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.dropna(how='all')
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.dropna(how='all')
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.dropna(how='all')Particular cases
Baptism event type error correction
In the bautismos dataset, two records have an incorrect event type:
BAUTISMOS_HARMONIZED.loc[~(BAUTISMOS_HARMONIZED['event_type'] == 'Bautizo')]| file | identifier | event_type | event_date | baptized_name | baptized_birth_place | baptized_birth_date | baptized_legitimacy_status | father_name | father_lastname | ... | godfather_lastname | godfather_social_condition | godmother_name | godmother_lastname | godmother_social_condition | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1900 | APAucará LB L001 | B1901 | Concurso | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5414 | APAucará LB L003 | B1120 | 44444 | 1869-09-17 | Juan | Yshua | 12 días | Hijo natural | Ygnacio | Sevidola | ... | Origuela | NaN | NaN | NaN | NaN | Aucará | Aucará | NaN | NaN | NaN |
2 rows × 26 columns
Event 1900 is basically an empty record, so we will drop it. While event 5414 requires correction:
BAUTISMOS_HARMONIZED.drop(1900, inplace=True)
BAUTISMOS_HARMONIZED.loc[5414, 'event_type'] = 'Bautizo'
BAUTISMOS_HARMONIZED.iloc[5414]file APAucará LB L003
identifier B1121
event_type Bautizo
event_date 1869-09-17
baptized_name Maria Isabel
baptized_birth_place Chacralla
baptized_birth_date 4 días
baptized_legitimacy_status Hija legítima
father_name José
father_lastname Cerrano
father_social_condition NaN
mother_name Manuela
mother_lastname Avalos
mother_social_condition NaN
parents_social_condition NaN
godfather_name NaN
godfather_lastname NaN
godfather_social_condition NaN
godmother_name Brijida
godmother_lastname NaN
godmother_social_condition NaN
event_place Aucará
event_geographic_descriptor_1 Aucará
event_geographic_descriptor_2 NaN
event_geographic_descriptor_3 NaN
event_geographic_descriptor_4 NaN
Name: 5415, dtype: object
Dates Normalization
Ensure dates are in a consistent format across all datasets. The dates should be in the format YYYY-MM-DD.
from actions.normalizers.DatesNormalizer import DateNormalizerBAUTISMOS_HARMONIZED['event_date'] = DateNormalizer(BAUTISMOS_HARMONIZED['event_date']).normalize()
BAUTISMOS_HARMONIZED['event_date']0 1790-10-04
1 1790-10-06
2 1790-10-07
3 1790-10-20
4 1790-10-20
...
6336 1888-12-10
6337 1888-12-11
6338 1888-12-12
6339 1888-12-15
6340 1888-12-16
Name: event_date, Length: 6340, dtype: object
MATRIMONIOS_HARMONIZED['event_date'] = DateNormalizer(MATRIMONIOS_HARMONIZED['event_date']).normalize()
MATRIMONIOS_HARMONIZED['event_date']0 1816-12-06
1 1816-12-12
2 1817-03-05
3 1817-03-10
4 1817-03-12
...
1714 1907-10-27
1715 1908-01-13
1716 1908-01-15
1717 1908-02-15
1718 1908-03-17
Name: event_date, Length: 1719, dtype: object
ENTIERROS_HARMONIZED['event_date'] = DateNormalizer(ENTIERROS_HARMONIZED['event_date']).normalize()
ENTIERROS_HARMONIZED['event_date']0 1846-10-06
1 1846-10-07
2 1846-11-02
3 1846-12-08
4 1847-02-23
...
2193 1920-10-12
2194 1920-10-19
2195 1920-10-19
2196 1920-10-20
2197 1920-10-21
Name: event_date, Length: 2192, dtype: object
Age Inferring
Infer the age of individuals based on their birthdates and the date of the event (baptism, marriage, burial).
from actions.generators.AgeInferrer import AgeInferrerBAUTISMOS_HARMONIZED['baptized_birth_date'] = AgeInferrer(BAUTISMOS_HARMONIZED['event_date']).infer_all(BAUTISMOS_HARMONIZED['baptized_birth_date'])
BAUTISMOS_HARMONIZED[['event_date', 'baptized_birth_date']]| event_date | baptized_birth_date | |
|---|---|---|
| 0 | 1790-10-04 | 1790-08-04 |
| 1 | 1790-10-06 | 1790-08-04 |
| 2 | 1790-10-07 | 1790-08-04 |
| 3 | 1790-10-20 | 1790-10-15 |
| 4 | 1790-10-20 | 1790-10-19 |
| ... | ... | ... |
| 6336 | 1888-12-10 | 1888-12-09 |
| 6337 | 1888-12-11 | 1888-12-07 |
| 6338 | 1888-12-12 | 1888-12-06 |
| 6339 | 1888-12-15 | 1888-11-30 |
| 6340 | 1888-12-16 | 1888-12-01 |
6340 rows × 2 columns
# Inconsistent dates: event_date should be after baptized_birth_date
invalid_mask = pd.to_datetime(BAUTISMOS_HARMONIZED['event_date'], errors='coerce') < pd.to_datetime(BAUTISMOS_HARMONIZED['baptized_birth_date'], errors='coerce')
if invalid_mask.any():
print("Found invalid records:")
print(BAUTISMOS_HARMONIZED[invalid_mask][['event_date', 'baptized_birth_date']])Found invalid records:
event_date baptized_birth_date
135 1792-03-29 1792-04-08
290 1794-01-01 1794-01-27
671 1797-07-15 1797-07-24
2814 1900-04-01 1900-04-09
Unfortunately, these records were incorrectly recorded. To fix this, it is necessary to check with the original records.
MATRIMONIOS_HARMONIZED['groom_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['groom_birth_date'])
MATRIMONIOS_HARMONIZED['bride_birth_date'] = AgeInferrer(MATRIMONIOS_HARMONIZED['event_date']).infer_all(MATRIMONIOS_HARMONIZED['bride_birth_date'])
MATRIMONIOS_HARMONIZED[['event_date', 'groom_birth_date', 'bride_birth_date']]| event_date | groom_birth_date | bride_birth_date | |
|---|---|---|---|
| 0 | 1816-12-06 | NaN | NaN |
| 1 | 1816-12-12 | NaN | NaN |
| 2 | 1817-03-05 | NaN | NaN |
| 3 | 1817-03-10 | NaN | NaN |
| 4 | 1817-03-12 | NaN | NaN |
| ... | ... | ... | ... |
| 1714 | 1907-10-27 | 1882-11-01 | 1880-11-01 |
| 1715 | 1908-01-13 | 1880-01-19 | 1879-01-19 |
| 1716 | 1908-01-15 | 1886-01-19 | 1888-01-19 |
| 1717 | 1908-02-15 | 1883-02-20 | 1886-02-19 |
| 1718 | 1908-03-17 | 1880-03-23 | 1879-03-24 |
1719 rows × 3 columns
ENTIERROS_HARMONIZED['deceased_birth_date'] = AgeInferrer(ENTIERROS_HARMONIZED['event_date']).infer_all(ENTIERROS_HARMONIZED['deceased_birth_date'])
ENTIERROS_HARMONIZED[['event_date', 'deceased_birth_date']]| event_date | deceased_birth_date | |
|---|---|---|
| 0 | 1846-10-06 | NaN |
| 1 | 1846-10-07 | 1821-10-13 |
| 2 | 1846-11-02 | 1766-11-21 |
| 3 | 1846-12-08 | 1806-12-18 |
| 4 | 1847-02-23 | 1797-03-06 |
| ... | ... | ... |
| 2193 | 1920-10-12 | 1920-02-15 |
| 2194 | 1920-10-19 | 1870-10-31 |
| 2195 | 1920-10-19 | 1913-10-21 |
| 2196 | 1920-10-20 | 1919-10-21 |
| 2197 | 1920-10-21 | 1920-10-19 |
2192 rows × 2 columns
Names Normalization
Standardize names across datasets to ensure consistency. This includes normalizing first names, last names, and any other relevant name fields.
from actions.normalizers.NamesNormalizer import NamesNormalizerdef normalize_names_columns(series):
namesManager = NamesNormalizer()
return namesManager.clean_series(series)names_columns = [
'baptized_name',
'father_name', 'father_lastname',
'mother_name', 'mother_lastname',
'godfather_name', 'godfather_lastname',
'godmother_name', 'godmother_lastname',
]
for col in names_columns:
if col in BAUTISMOS_HARMONIZED.columns:
BAUTISMOS_HARMONIZED[col] = normalize_names_columns(BAUTISMOS_HARMONIZED[col])
BAUTISMOS_HARMONIZED[names_columns]| baptized_name | father_name | father_lastname | mother_name | mother_lastname | godfather_name | godfather_lastname | godmother_name | godmother_lastname | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | domingo | lucas | ayquipa | sevastiana | quispe | vicente | guamani | NaN | NaN |
| 1 | dominga | juan | lulia | jospha | gomes | ignacio | varientos | NaN | NaN |
| 2 | bartola | jacinto | quispe | juliana | chinchay | NaN | NaN | rotonda | pocco |
| 3 | francisca | juan | cuebas | clemenzia | manco | NaN | NaN | ysabel | guillen |
| 4 | pedro | santos | manxo | baleriana | arango | NaN | NaN | josefa | santiago |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6336 | leocadio | miguel | pacheco | rosa | huarcaya | josé julián | bendezú | NaN | NaN |
| 6337 | mariano concepcion | facundo | vega | silvestra | urbano | fernando | mancco | NaN | NaN |
| 6338 | ambrosio | ysidro | ccasane | rita | palomino | juan | tito | NaN | NaN |
| 6339 | francisco | mariano | lopez | leocadia | medina | feliciano | dias | NaN | NaN |
| 6340 | laureana | bernarda | champa | rufina | lopez | NaN | NaN | manuela | de la cruz |
6340 rows × 9 columns
Baptisms special case
Because baptized surname wasn’t recorded, we need to infer it from the father’s surname.
BAUTISMOS_HARMONIZED['baptized_lastname'] = BAUTISMOS_HARMONIZED['father_lastname']matrimonios_names_columns = [
'groom_name', 'groom_lastname',
'groom_father_name', 'groom_father_lastname',
'groom_mother_name', 'groom_mother_lastname',
'bride_name', 'bride_lastname',
'bride_father_name', 'bride_father_lastname',
'bride_mother_name', 'bride_mother_lastname',
'godparent_1_name', 'godparent_1_lastname',
'godparent_2_name', 'godparent_2_lastname',
'godparent_3_name', 'godparent_3_lastname',
'witness_1_name', 'witness_1_lastname',
'witness_2_name', 'witness_2_lastname',
'witness_3_name', 'witness_3_lastname',
'witness_4_name', 'witness_4_lastname'
]
for col in matrimonios_names_columns:
if col in MATRIMONIOS_HARMONIZED.columns:
MATRIMONIOS_HARMONIZED[col] = normalize_names_columns(MATRIMONIOS_HARMONIZED[col])
MATRIMONIOS_HARMONIZED[matrimonios_names_columns]| groom_name | groom_lastname | groom_father_name | groom_father_lastname | groom_mother_name | groom_mother_lastname | bride_name | bride_lastname | bride_father_name | bride_father_lastname | ... | godparent_3_name | godparent_3_lastname | witness_1_name | witness_1_lastname | witness_2_name | witness_2_lastname | witness_3_name | witness_3_lastname | witness_4_name | witness_4_lastname | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | josé manl manuel | de la roca | acencio | roca | leonor | guerrero | juana | rodrigues | pedro | rodrigues | ... | NaN | NaN | agustin | castro | mariano | castro | juan | baldes | NaN | NaN |
| 1 | esteban | castillo | matheo | castillo | ma maria | torres | ambrocia | tasqui | pedro | tasqui | ... | NaN | NaN | pedro | manco | carlos | canto | pedro | guamani | NaN | NaN |
| 2 | alexandro | ramires | leonor | romani | franca francisca | paucar | sipriana | coillo | cristobal | coillo | ... | NaN | NaN | marcelo | llamuca | julian | urbano | antonio | urbano | NaN | NaN |
| 3 | jose | cuchu | acencio | cuchu | baleriana | antay | cacimira | flores | NaN | NaN | ... | NaN | NaN | pablo | roque | antonio | urbano | cristobal | coillo | NaN | NaN |
| 4 | domingo | tito | NaN | NaN | marcela | guauya | petrona | guallpatuiru | agustin | guallpatuiru | ... | NaN | NaN | marcelo | llamuca | antonio | guamani | mariano | guallpatuiru | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1714 | patrocinio | chinchay | miguel | chinchay | andrea | polanco | lorenza | quispe | gervacio | quispe | ... | NaN | NaN | juan de dios | barrientos | manuel | espinosa | crisostomo | pumarino | NaN | NaN |
| 1715 | gerónimo | cucho | ambrocio | cucho | gertrudis | serrano | teresa | jimenes | aniseto | jimenes | ... | NaN | NaN | victor | saravia | mateo | aiquipa | felix | cucho | NaN | NaN |
| 1716 | josé | coro | felix | coro | natividad | cucho | emilia | huamani | patricio | huamani | ... | NaN | NaN | pablo | de la cruz | victor | saravia | marcelo | ramos | NaN | NaN |
| 1717 | pedro | gutierres | ruperto | gutierrez | micaila | oscco | juliana | huarcaya | hilario | huarcaya | ... | NaN | NaN | rafael | delgado | josé | vivanco | agustin | vicente | NaN | NaN |
| 1718 | benito | sanchez | tiburcio | sanchez | lorenza | aymi | teresa | poma | bernardo | poma | ... | NaN | NaN | santos | espejo | felix | sanches | mariano | urbano | NaN | NaN |
1719 rows × 26 columns
entierros_names_columns = [
"deceased_name", "deceased_lastname",
"father_name", "father_lastname",
"mother_name", "mother_lastname",
"husband_name", "wife_name",
]
for col in entierros_names_columns:
if col in ENTIERROS_HARMONIZED.columns:
ENTIERROS_HARMONIZED[col] = normalize_names_columns(ENTIERROS_HARMONIZED[col])
ENTIERROS_HARMONIZED[entierros_names_columns]| deceased_name | deceased_lastname | father_name | father_lastname | mother_name | mother_lastname | husband_name | wife_name | |
|---|---|---|---|---|---|---|---|---|
| 0 | julian | xavies | NaN | NaN | NaN | NaN | NaN | mercedes lupa |
| 1 | joce | raime | NaN | NaN | NaN | NaN | NaN | francisca cucho |
| 2 | martina | condori | NaN | NaN | NaN | NaN | luciano ccoyllo | NaN |
| 3 | dorotea | ccoyllo | NaN | NaN | NaN | NaN | josé espinosa | NaN |
| 4 | maría | romani | NaN | NaN | NaN | NaN | mariano huallpatuiro | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2193 | nieves | huallpatuero | patrocinio | huallpatuero | teresa | urbano | NaN | NaN |
| 2194 | sinforiano | huamani | eustaquio | huamani | martina | llamoca | NaN | gregoria ccoillo |
| 2195 | salomé | condori | sebastian | condori | anacla | roque | NaN | NaN |
| 2196 | maría | chinchay | abdon | chinchay | floriza | lopez | NaN | NaN |
| 2197 | fortunato | florez | cecilio | florez | guadalupe | ramos | NaN | NaN |
2192 rows × 8 columns
Entierros special case
Deceased wife or husband were recorded as full name. Therefore, we need to split the full name into first name and last name.
def name_splitter(name):
if pd.isnull(name):
return np.nan, np.nan
name_parts = name.split()
if len(name_parts) > 2:
if "de la" in name:
return name_parts[0], " ".join(name_parts[1:])
return " ".join(name_parts[0:2]), " ".join(name_parts[2:])
return name_parts[0], name_parts[1] if len(name_parts) > 1 else np.nanENTIERROS_HARMONIZED[['husband_name', 'husband_lastname']] = ENTIERROS_HARMONIZED['husband_name'].apply(name_splitter).apply(pd.Series)
ENTIERROS_HARMONIZED[['wife_name', 'wife_lastname']] = ENTIERROS_HARMONIZED['wife_name'].apply(name_splitter).apply(pd.Series)Place Recognition
This process involves NER recognition to extract places entities from the text fields in the datasets.
from actions.extractors import placeRecognition
extractor = placeRecognition.PlaceExtractor()bautismos_place_columns = [
'baptized_birth_place', 'event_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4'
]
BAUTISMOS_PLACES_RAW = BAUTISMOS_HARMONIZED[bautismos_place_columns]
for col in bautismos_place_columns:
if col in BAUTISMOS_HARMONIZED.columns:
BAUTISMOS_HARMONIZED[col] = extractor.extract_places_per_row(BAUTISMOS_HARMONIZED[col])
BAUTISMOS_HARMONIZED[bautismos_place_columns]| baptized_birth_place | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | |
|---|---|---|---|---|---|---|
| 0 | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN |
| 1 | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN |
| 2 | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN |
| 3 | NaN | Aucara | NaN | NaN | NaN | NaN |
| 4 | NaN | Aucara | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 6336 | NaN | Aucará | NaN | NaN | NaN | NaN |
| 6337 | NaN | Aucará | NaN | NaN | NaN | NaN |
| 6338 | NaN | Aucará | NaN | Mayobamba | NaN | NaN |
| 6339 | NaN | Aucará | NaN | Huaicahuacho | NaN | NaN |
| 6340 | NaN | Aucará | NaN | NaN | NaN | NaN |
6340 rows × 6 columns
matrimonios_place_columns = [
'groom_birth_place',
'groom_resident_in',
'bride_birth_place', 'bride_resident_in',
'event_place', 'event_geographic_descriptor_1', 'event_geographic_descriptor_2',
'event_geographic_descriptor_3', 'event_geographic_descriptor_4',
'event_geographic_descriptor_5', 'event_geographic_descriptor_6'
]
MATRIMONIOS_PLACES_RAW = MATRIMONIOS_HARMONIZED[matrimonios_place_columns]
for col in matrimonios_place_columns:
if col in MATRIMONIOS_HARMONIZED.columns:
MATRIMONIOS_HARMONIZED[col] = extractor.extract_places_per_row(MATRIMONIOS_HARMONIZED[col])
MATRIMONIOS_HARMONIZED[matrimonios_place_columns]| groom_birth_place | groom_resident_in | bride_birth_place | bride_resident_in | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | event_geographic_descriptor_5 | event_geographic_descriptor_6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ciudad de Huamanga | Aucara | NaN | NaN | Aucara | NaN | Huamanga | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | Aucara | NaN | Colca | NaN | NaN | NaN | NaN |
| 2 | Pampamarca | NaN | Pampamarca | NaN | Aucara | NaN | Pampamarca | NaN | NaN | NaN | NaN |
| 3 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca|santa iglesia | NaN | Pampamarca | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | Pampamarca|santa iglesia | NaN | Pampamarca | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1714 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN | NaN | NaN |
| 1715 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1716 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1717 | NaN | NaN | NaN | NaN | Aucara | NaN | NaN | NaN | NaN | NaN | NaN |
| 1718 | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN | NaN | NaN |
1719 rows × 11 columns
entierros_place_columns = [
'event_place', 'deceased_birth_place', 'burial_place', 'event_geographic_descriptor_1',
'event_geographic_descriptor_2', 'event_geographic_descriptor_3',
'event_geographic_descriptor_4'
]
ENTIERROS_PLACES_RAW = ENTIERROS_HARMONIZED[entierros_place_columns]
for col in entierros_place_columns:
if col in ENTIERROS_HARMONIZED.columns:
ENTIERROS_HARMONIZED[col] = extractor.extract_places_per_row(ENTIERROS_HARMONIZED[col])
ENTIERROS_HARMONIZED[entierros_place_columns]| event_place | deceased_birth_place | burial_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | |
|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | Lucanas | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | Lucanas | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | Lucanas | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | Lucanas | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | Lucanas | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2193 | Aucara | Santa Ana de Aucara | NaN | NaN | Santa Ana de Aucara | NaN | NaN |
| 2194 | Aucara | Pampamarca | NaN | NaN | Pampamarca | NaN | NaN |
| 2195 | Aucara | Santa Ana de Aucara | NaN | NaN | Santa Ana de Aucara | NaN | NaN |
| 2196 | Aucara | NaN | NaN | NaN | NaN | NaN | NaN |
| 2197 | Aucara | NaN | NaN | NaN | NaN | NaN | NaN |
2192 rows × 7 columns
# save raw places
from utils import UniqueValues
uniqueValues = UniqueValues.UniqueValuesExtractor(
[BAUTISMOS_PLACES_RAW, MATRIMONIOS_PLACES_RAW, ENTIERROS_PLACES_RAW]
).get_unique_values(return_dataframe=True)
uniqueValues.to_csv("../data/raw/raw_places.csv", index=False) # type: ignore # Cleaning Audit
Basic audit to ensure that the data cleaning process has been successful.
Save Cleaned Data
clean_data_folder = Path("../data/clean/")
# Fill NaN values for consistency
BAUTISMOS_HARMONIZED = BAUTISMOS_HARMONIZED.fillna(value=np.nan)
MATRIMONIOS_HARMONIZED = MATRIMONIOS_HARMONIZED.fillna(value=np.nan)
ENTIERROS_HARMONIZED = ENTIERROS_HARMONIZED.fillna(value=np.nan)
BAUTISMOS_HARMONIZED.to_csv(clean_data_folder / "bautismos_clean.csv", index=False)
MATRIMONIOS_HARMONIZED.to_csv(clean_data_folder / "matrimonios_clean.csv", index=False)
ENTIERROS_HARMONIZED.to_csv(clean_data_folder / "entierros_clean.csv", index=False)BAUTISMOS_HARMONIZED.head()| file | identifier | event_type | event_date | baptized_name | baptized_birth_place | baptized_birth_date | baptized_legitimacy_status | father_name | father_lastname | ... | godfather_social_condition | godmother_name | godmother_lastname | godmother_social_condition | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | baptized_lastname | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | APAucará LB L001 | B001 | Bautizo | 1790-10-04 | domingo | NaN | 1790-08-04 | Hijo legitimo | lucas | ayquipa | ... | NaN | NaN | NaN | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN | ayquipa |
| 1 | APAucará LB L001 | B002 | Bautizo | 1790-10-06 | dominga | NaN | 1790-08-04 | Hija legitima | juan | lulia | ... | NaN | NaN | NaN | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN | lulia |
| 2 | APAucará LB L001 | B003 | Bautizo | 1790-10-07 | bartola | NaN | 1790-08-04 | Hija legitima | jacinto | quispe | ... | NaN | rotonda | pocco | NaN | Pampamarca | NaN | Pampamarca | NaN | NaN | quispe |
| 3 | APAucará LB L001 | B004 | Bautizo | 1790-10-20 | francisca | NaN | 1790-10-15 | Hija legitima | juan | cuebas | ... | NaN | ysabel | guillen | NaN | Aucara | NaN | NaN | NaN | NaN | cuebas |
| 4 | APAucará LB L001 | B005 | Bautizo | 1790-10-20 | pedro | NaN | 1790-10-19 | Hijo legitimo | santos | manxo | ... | NaN | josefa | santiago | NaN | Aucara | NaN | NaN | NaN | NaN | manxo |
5 rows × 27 columns
MATRIMONIOS_HARMONIZED.head()| file | identifier | event_type | event_date | groom_name | groom_lastname | groom_social_condition | groom_marital_status | groom_birth_date | groom_birth_place | ... | witness_3_lastname | witness_4_name | witness_4_lastname | event_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | event_geographic_descriptor_5 | event_geographic_descriptor_6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | APAucará LM L001 | M001 | Matrimonio | 1816-12-06 | josé manl manuel | de la roca | don, vecinos de este pueblo [Aucara] | soltero | NaN | Ciudad de Huamanga | ... | baldes | NaN | NaN | Aucara | NaN | Huamanga | NaN | NaN | NaN | NaN |
| 1 | APAucará LM L001 | M002 | Matrimonio | 1816-12-12 | esteban | castillo | vecinos de esta doctrina [Aucara] | soltero | NaN | NaN | ... | guamani | NaN | NaN | Aucara | NaN | Colca | NaN | NaN | NaN | NaN |
| 2 | APAucará LM L001 | M003 | Matrimonio | 1817-03-05 | alexandro | ramires | vecinos de dicho pueblo [Aucara] | soltero | NaN | Pampamarca | ... | urbano | NaN | NaN | Aucara | NaN | Pampamarca | NaN | NaN | NaN | NaN |
| 3 | APAucará LM L001 | M004 | Matrimonio | 1817-03-10 | jose | cuchu | vecinos de dicho [Pampamarca] | soltero | NaN | Pampamarca | ... | coillo | NaN | NaN | Pampamarca|santa iglesia | NaN | Pampamarca | NaN | NaN | NaN | NaN |
| 4 | APAucará LM L001 | M005 | Matrimonio | 1817-03-12 | domingo | tito | vecinos de dicho [Pampamarca] | soltero | NaN | NaN | ... | guallpatuiru | NaN | NaN | Pampamarca|santa iglesia | NaN | Pampamarca | NaN | NaN | NaN | NaN |
5 rows × 55 columns
ENTIERROS_HARMONIZED.head()| file | identifier | event_type | event_date | doctrine | event_place | deceased_name | deceased_lastname | deceased_birth_date | deceased_birth_place | ... | mother_lastname | husband_name | wife_name | burial_place | event_geographic_descriptor_1 | event_geographic_descriptor_2 | event_geographic_descriptor_3 | event_geographic_descriptor_4 | husband_lastname | wife_lastname | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | APAucará LD L001 | E001 | Entierro | 1846-10-06 | Parroquia de Aucará | NaN | julian | xavies | NaN | NaN | ... | NaN | NaN | mercedes | NaN | NaN | Lucanas | NaN | NaN | NaN | lupa |
| 1 | APAucará LD L001 | E002 | Entierro | 1846-10-07 | Parroquia de Aucará | NaN | joce | raime | 1821-10-13 | NaN | ... | NaN | NaN | francisca | NaN | NaN | Lucanas | NaN | NaN | NaN | cucho |
| 2 | APAucará LD L001 | E003 | Entierro | 1846-11-02 | Parroquia de Aucará | NaN | martina | condori | 1766-11-21 | NaN | ... | NaN | luciano | NaN | NaN | NaN | Lucanas | NaN | NaN | ccoyllo | NaN |
| 3 | APAucará LD L001 | E004 | Entierro | 1846-12-08 | Parroquia de Aucará | NaN | dorotea | ccoyllo | 1806-12-18 | NaN | ... | NaN | josé | NaN | NaN | NaN | Lucanas | NaN | NaN | espinosa | NaN |
| 4 | APAucará LD L001 | E005 | Entierro | 1847-02-23 | Parroquia de Aucará | NaN | maría | romani | 1797-03-06 | NaN | ... | NaN | mariano | NaN | NaN | NaN | Lucanas | NaN | NaN | huallpatuiro | NaN |
5 rows × 26 columns