Personas Creation

Our approach aligns closely with the input preparation model proposed by David W. Embley (2021). We structure our data around personas, defined as “each mention instance of a person in a document” (p. 66), as a foundational step toward probabilistic record linkage (PRL). Each persona is created by ingesting available individual metadata (such as name, last name, birth date), associating the person with a sacramental event (baptism, marriage, or burial). The relationship between personas is established by their participation at the event (e.g., as father, mother, godfather, witness).

Personas Data Structure

The personas data structure is very straightforward:

event_idno: unique semantically meaningful identifier for the event
persona_idno: unique semantically meaningful identifier for the persona
persona_type: role of the persona in the event (e.g., baptized, father, mother, witness)
name: first name of the persona
last_name: last name of the persona
birth_date: birth date of the persona
birth_place: birth place of the persona
resident_in: persona residence at the time of the event
gender: inferred gender of the persona
social_condition: harmonized social condition of the persona
legitimacy_status: harmonized legitimacy status of the persona
marital_status: harmonized marital status of the persona

Identification of individuals is done by parsing one or a list of dataframes with the clean data, and processing the data using the Persona class. Results are stored in data/interim/personas_extracted.csv for testing, and in data/clean/personas.csv for production.

Data Extraction

We begin by loading the cleaned sacramental records and extracting persona instances using the PersonaExtractor class. This process creates individual persona records for each person mentioned in the historical documents, preserving their role in each event.

import pandas as pd
from actions.extractors import Persona

bautismos = pd.read_csv("../data/clean/bautismos_clean.csv")
entierros = pd.read_csv("../data/clean/entierros_clean.csv")
matrimonios = pd.read_csv("../data/clean/matrimonios_clean.csv")

extractor = Persona.PersonaExtractor([bautismos, matrimonios, entierros])
personas = extractor.extract_personas()

personas.describe(include='all')

	event_idno	original_identifier	persona_type	name	birth_place	birth_date	legitimacy_status	lastname	persona_idno	social_condition	marital_status	resident_in	death_place	death_date	gender
count	47072	47072	47072	46999	5378	8596	11866	46762	47072	9643	4275	925	1513	2114	47072
unique	10180	10179	14	4286	53	7001	2	2616	47072	7	3	19	7	1813	6
top	matrimonio-490	APAucará-LM-L001_M490	mother	mariano	pampamarca	1901-09-04	legitimo	quispe	persona-1	indio	soltero	pampamarca	aucará	1871-11-04	male
freq	12	12	7614	1556	1919	8	9104	2712	1	5654	2779	292	1016	7	20150

Initial Exploration

Before conducting detailed quality assessment, we examine the basic structure and distribution of persona types in the extracted dataset.

personas.to_csv("../data/clean/personas.csv", index=False)

personas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47072 entries, 0 to 47071
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   event_idno           47072 non-null  object
 1   original_identifier  47072 non-null  object
 2   persona_type         47072 non-null  object
 3   name                 46999 non-null  object
 4   birth_place          5378 non-null   object
 5   birth_date           8596 non-null   object
 6   legitimacy_status    11866 non-null  object
 7   lastname             46762 non-null  object
 8   persona_idno         47072 non-null  object
 9   social_condition     9643 non-null   object
 10  marital_status       4275 non-null   object
 11  resident_in          925 non-null    object
 12  death_place          1513 non-null   object
 13  death_date           2114 non-null   object
 14  gender               47072 non-null  object
dtypes: object(15)
memory usage: 5.4+ MB

Quality Assessment

To evaluate the suitability of the extracted personas for probabilistic record linkage, we assess data completeness across multiple dimensions: names, parental linkages, temporal information, spatial attributes, and social/legal status markers.

Names are critical identifiers for record linkage. We assess the completeness of both first names and surnames across all persona types.

Name Completeness

name_completeness = personas.loc[(personas['name'].isna()) | (personas['lastname'].isna())]
name_completeness.info()

<class 'pandas.core.frame.DataFrame'>
Index: 383 entries, 62 to 46928
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   event_idno           383 non-null    object
 1   original_identifier  383 non-null    object
 2   persona_type         383 non-null    object
 3   name                 310 non-null    object
 4   birth_place          35 non-null     object
 5   birth_date           52 non-null     object
 6   legitimacy_status    41 non-null     object
 7   lastname             73 non-null     object
 8   persona_idno         383 non-null    object
 9   social_condition     55 non-null     object
 10  marital_status       20 non-null     object
 11  resident_in          0 non-null      object
 12  death_place          26 non-null     object
 13  death_date           53 non-null     object
 14  gender               383 non-null    object
dtypes: object(15)
memory usage: 47.9+ KB

name_completeness['persona_type'].value_counts()

persona_type
mother               83
deceased             53
father_of_husband    45
father_of_wife       45
father               44
godmother            27
godfather            25
godparent            13
wife                 13
witness              10
husband               9
mother_of_wife        7
mother_of_husband     5
baptized              4
Name: count, dtype: int64

# Percentage of missing names
total_personas = len(personas)
missing_names = len(name_completeness)
percentage_missing_names = (missing_names / total_personas) * 100
print(f"Percentage of personas with missing names: {percentage_missing_names:.2f}%")

Percentage of personas with missing names: 0.81%

missing_firstnames = personas.loc[personas['name'].isna()]
percentage_missing_firstnames = (len(missing_firstnames) / total_personas) * 100
print(f"Percentage of personas with missing firstnames: {percentage_missing_firstnames:.2f}%")

Percentage of personas with missing firstnames: 0.16%

missing_surnames = personas.loc[personas['lastname'].isna()]
percentage_missing_surnames = (len(missing_surnames) / total_personas) * 100
print(f"Percentage of personas with missing lastnames: {percentage_missing_surnames:.2f}%")

Percentage of personas with missing lastnames: 0.66%

Parental Linkage Completeness

For personas identified as children (baptized, deceased children, etc.), we assess whether their parents are properly linked in the dataset. The expectation differs by legitimacy status: legitimate children should have both parents recorded, while illegitimate children require at least one parent.

legitimate_sons = personas.loc[personas['legitimacy_status'] == 'legitimo']
ilegitimate_sons = personas.loc[personas['legitimacy_status'] == 'ilegitimo']

sons_types = legitimate_sons['persona_type'].unique().tolist()

# filter ilegitimate sons by sons types to avoid including other ilegitimate personas
ilegitimate_sons = ilegitimate_sons.loc[ilegitimate_sons['persona_type'].isin(sons_types)]

print("Legitimate Sons Persona Types and Counts:")
print(legitimate_sons['persona_type'].value_counts())
print("\nIlegitimate Sons Persona Types and Counts:")
print(ilegitimate_sons['persona_type'].value_counts())

Legitimate Sons Persona Types and Counts:
persona_type
baptized    5483
wife        1267
husband     1248
deceased    1106
Name: count, dtype: int64

Ilegitimate Sons Persona Types and Counts:
persona_type
baptized    810
deceased    201
wife        170
husband     165
Name: count, dtype: int64

def check_parents_completeness(sons_df, personas_df, legitimacy='leg'):
    # Get unique event_idno from sons
    event_ids = sons_df['event_idno'].unique()
    
    # Filter personas to only relevant events
    relevant_personas = personas_df[personas_df['event_idno'].isin(event_ids)]
    
    # Check for father and mother presence by event
    events_with_father = set(relevant_personas[relevant_personas['persona_type'].str.contains('father', na=False)]['event_idno'])
    events_with_mother = set(relevant_personas[relevant_personas['persona_type'].str.contains('mother', na=False)]['event_idno'])
    
    if legitimacy == 'leg':
        # For legitimate sons, both parents should be present
        # Incomplete if missing father OR missing mother
        events_missing_father = set(event_ids) - events_with_father
        events_missing_mother = set(event_ids) - events_with_mother
        incomplete_events = events_missing_father | events_missing_mother
    elif legitimacy == 'ileg':
        # For illegitimate sons, at least one parent should be present
        # Incomplete if missing BOTH father AND mother
        incomplete_events = set(event_ids) - events_with_father - events_with_mother
    else:
        raise ValueError("Legitimacy must be 'leg' or 'ileg'")
    
    # Get sons with incomplete parents
    incomplete_sons = sons_df[sons_df['event_idno'].isin(incomplete_events)]
    return incomplete_sons

incomplete_legit_parents = check_parents_completeness(legitimate_sons, personas)
incomplete_ilegit_parents = check_parents_completeness(ilegitimate_sons, personas, legitimacy='ileg')

print(f"Number of legitimate sons with incomplete parents: {len(incomplete_legit_parents)}")
print(f"Number of ilegitimate sons with incomplete parents: {len(incomplete_ilegit_parents)}")

Number of legitimate sons with incomplete parents: 42
Number of ilegitimate sons with incomplete parents: 9

The results show excellent parental linkage quality:

Legitimate personas (9,110 total): Only 0.46% (42 cases) have incomplete parental records
Illegitimate personas (1,310 total): Only 0.68% (9 cases) lack at least one parent

This high completeness rate indicates that the extraction process successfully preserved family relationships recorded in the sacramental registers.

Temporal Completeness

We assess the availability of birth and death dates, which are essential for temporal reasoning in record linkage.

nobirthdate = personas.loc[personas['birth_date'].isna()]
personas_size = len(personas)
nobirthdate_size = len(nobirthdate)
percentage_nobirthdate = (nobirthdate_size / personas_size) * 100
print(f"Percentage of personas with missing birth dates: {percentage_nobirthdate:.2f}%")
nobirthdate['persona_type'].value_counts()

Percentage of personas with missing birth dates: 81.74%

persona_type
mother               7614
father               7369
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 1470
mother_of_wife       1459
husband              1441
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
baptized              978
deceased               86
Name: count, dtype: int64

nodeathdate = personas.loc[personas['death_date'].isna()]
nodeathdate_size = len(nodeathdate)
percentage_nodeathdate = (nodeathdate_size / personas_size) * 100
print(f"Percentage of personas with missing death dates: {percentage_nodeathdate:.2f}%")
nodeathdate['persona_type'].value_counts()

Percentage of personas with missing death dates: 95.51%

persona_type
mother               7614
father               7369
baptized             6340
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 2060
husband              2051
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
deceased                6
Name: count, dtype: int64

Spatial Completeness

Birth and death places provide geographic context for mobility analysis and help disambiguate between individuals with similar names.

nonbirthplace = personas.loc[personas['birth_place'].isna()]
nonbirthplace_size = len(nonbirthplace)
percentage_nonbirthplace = (nonbirthplace_size / personas_size) * 100
print(f"Percentage of personas with missing birth places: {percentage_nonbirthplace:.2f}%")
nonbirthplace['persona_type'].value_counts()

Percentage of personas with missing birth places: 88.57%

persona_type
mother               7614
father               7369
baptized             4532
witness              4249
godparent            3260
godmother            3251
godfather            3012
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
wife                 1154
husband              1114
deceased              393
Name: count, dtype: int64

nodeathplace = personas.loc[personas['death_place'].isna()]
nodeathplace_size = len(nodeathplace)
percentage_nodeathplace = (nodeathplace_size / personas_size) * 100
print(f"Percentage of personas with missing death places: {percentage_nodeathplace:.2f}%")
nodeathplace['persona_type'].value_counts()

Percentage of personas with missing death places: 96.79%

persona_type
mother               7614
father               7369
baptized             6340
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 2060
husband              2051
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
deceased              607
Name: count, dtype: int64

# personas with both birth and death places present
birth_and_death_places = personas.loc[personas['birth_place'].notna() & personas['death_place'].notna()]
birth_and_death_places_size = len(birth_and_death_places)
percentage_birth_and_death_places = (birth_and_death_places_size / personas_size) * 100
print(f"Percentage of personas with both birth and death places present: {percentage_birth_and_death_places:.2f}%")
birth_and_death_places['persona_type'].value_counts()

Percentage of personas with both birth and death places present: 2.59%

persona_type
deceased    1219
Name: count, dtype: int64

Attribute Completeness

We examine the completeness of harmonized social and legal status attributes (legitimacy, marital status, social condition), which provide contextual information that can strengthen or weaken linkage hypotheses.

legitimacy_missing = personas.loc[personas['legitimacy_status'].isna()]
legitimacy_missing_size = len(legitimacy_missing)
percentage_legitimacy_missing = (legitimacy_missing_size / personas_size) * 100
print(f"Percentage of personas with missing legitimacy status: {percentage_legitimacy_missing:.2f}%")
legitimacy_missing['persona_type'].value_counts()

Percentage of personas with missing legitimacy status: 74.79%

persona_type
mother               7608
father               7365
witness              4249
godparent            3258
godmother            3248
godfather            3008
mother_of_wife       1113
father_of_wife       1093
mother_of_husband    1085
father_of_husband    1058
deceased              813
husband               638
wife                  623
baptized               47
Name: count, dtype: int64

marital_status_missing = personas.loc[personas['marital_status'].isna()]
marital_status_missing_size = len(marital_status_missing)
percentage_marital_status_missing = (marital_status_missing_size / personas_size) * 100
print(f"Percentage of personas with missing marital status: {percentage_marital_status_missing:.2f}%")
marital_status_missing['persona_type'].value_counts()

Percentage of personas with missing marital status: 90.92%

persona_type
mother               7550
father               7362
baptized             6340
witness              4249
godmother            3250
godparent            3241
godfather            3011
mother_of_wife       1459
mother_of_husband    1438
father_of_wife       1438
father_of_husband    1410
deceased              917
wife                  586
husband               546
Name: count, dtype: int64

social_condition_missing = personas.loc[personas['social_condition'].isna()]
social_condition_missing_size = len(social_condition_missing)
percentage_social_condition_missing = (social_condition_missing_size / personas_size) * 100
print(f"Percentage of personas with missing social condition: {percentage_social_condition_missing:.2f}%")
social_condition_missing['persona_type'].value_counts()

Percentage of personas with missing social condition: 79.51%

persona_type
mother               6715
father               6564
baptized             4338
witness              4249
godparent            2967
godfather            2922
godmother            2901
wife                 1196
husband              1145
mother_of_wife        955
father_of_wife        938
mother_of_husband     938
father_of_husband     917
deceased              684
Name: count, dtype: int64

Aggregate Completeness Metrics

Beyond individual field completeness, we calculate overall completeness scores to understand the general quality of persona records. We compute both a simple completeness score (proportion of non-null fields) and a weighted score that prioritizes critical fields for record linkage.

Simple Completeness Score

The simple score treats all fields equally, providing a general measure of data richness.

personas['completeness_score'] = personas.notna().sum(axis=1) / len(personas.columns)
personas.groupby('persona_type')['completeness_score'].mean().sort_values(ascending=False)

persona_type
deceased             0.821415
husband              0.656038
wife                 0.651748
baptized             0.629243
mother_of_husband    0.506092
mother_of_wife       0.505186
father_of_husband    0.504492
father_of_wife       0.503755
mother               0.474424
father               0.473651
godmother            0.473372
godparent            0.472822
godfather            0.468216
witness              0.466510
Name: completeness_score, dtype: float64

Weighted Completeness Score

The weighted score assigns higher importance to fields crucial for record linkage (names, dates) and lower weights to supplementary attributes (social condition, residence). This reflects the differential utility of fields in matching algorithms.

weights = {
    'name': 0.15,
    'lastname': 0.15,
    'birth_date': 0.125,
    'death_date': 0.125,
    'birth_place': 0.10,
    'death_place': 0.10,
    'legitimacy_status': 0.05,
    'marital_status': 0.05,
    'social_condition': 0.05,
    'gender': 0.05,
    'resident_in': 0.05
}

def weighted_completeness(row, weights):
    score = 0.0
    for col, w in weights.items():
        if pd.notna(row[col]):
            score += w
    return score

personas['weighted_completeness'] = personas.apply(weighted_completeness, axis=1, weights=weights)
personas.groupby('persona_type')['weighted_completeness'].mean().sort_values(ascending=False)

persona_type
deceased             0.836722
baptized             0.549558
husband              0.536738
wife                 0.531650
mother_of_husband    0.379222
mother_of_wife       0.378410
father_of_husband    0.375177
father_of_wife       0.374687
mother               0.354728
father               0.354641
godparent            0.354218
godmother            0.354199
godfather            0.350332
witness              0.349647
Name: weighted_completeness, dtype: float64

Comparison of Completeness Metrics

Visualizing the relationship between simple and weighted completeness reveals how different persona types vary in their possession of high-priority fields.

# plot correlation between completeness_score and weighted_completeness
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(data=personas, x='completeness_score', y='weighted_completeness', hue='persona_type')
plt.title('Correlation between Completeness Score and Weighted Completeness')
plt.xlabel('Completeness Score')
plt.ylabel('Weighted Completeness')
plt.legend(title='Persona Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

Summary and Implications

The quality assessment reveals that the persona extraction process successfully preserved historical information with high fidelity:

Name completeness is excellent for most persona types, with missing names concentrated in expected categories (e.g., godparents, witnesses)
Parental linkages are nearly complete (>99%) for both legitimate and illegitimate children, enabling family reconstruction
Temporal and spatial data show variable completeness depending on persona type, reflecting the original documentary practices
Weighted completeness scores indicate that core linkage fields (names, dates) are well-populated across persona types

These results suggest that the dataset is well-suited for probabilistic record linkage, with sufficient information density to support robust matching algorithms while retaining the historical nuances present in the original sacramental registers.