Personas Creation

Our approach aligns closely with the input preparation model proposed by David W. Embley (2021). We structure our data around personas, defined as “each mention instance of a person in a document” (p. 66), as a foundational step toward probabilistic record linkage (PRL). Each persona is created by ingesting the individual metadata available (such as first name, surname, and birth date) and associating the person with a sacramental event (baptism, marriage, or burial). Relationships between personas are established by their participation in the same event (e.g., as father, mother, godfather, or witness).

Personas Data Structure

The personas data structure is flat, with one row per persona and the following fields:

  • event_idno: unique semantically meaningful identifier for the event
  • persona_idno: unique semantically meaningful identifier for the persona
  • persona_type: role of the persona in the event (e.g., baptized, father, mother, witness)
  • name: first name of the persona
  • lastname: last name of the persona
  • birth_date: birth date of the persona
  • birth_place: birth place of the persona
  • resident_in: persona residence at the time of the event
  • gender: inferred gender of the persona
  • social_condition: harmonized social condition of the persona
  • legitimacy_status: harmonized legitimacy status of the persona
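  • marital_status: harmonized marital status of the persona
  • death_date: death date of the persona
  • death_place: death place of the persona
  • original_identifier: identifier of the source record (e.g., APAucará-LM-L001_M490)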

Individuals are identified by parsing one dataframe, or a list of dataframes, containing the clean data and processing it with the PersonaExtractor class from the Persona module. Results are stored in data/interim/personas_extracted.csv for testing, and in data/clean/personas.csv for production.
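As a minimal sketch of the structure listed above, one persona record could be modelled as a dataclass. The class name `PersonaRecord` and all sample values are hypothetical illustrations and are not part of the project's code; the surname field is spelled `lastname`, as in the extracted dataframe shown later in this document.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PersonaRecord:
    """Hypothetical container for one mention of a person in one event."""
    event_idno: str
    persona_idno: str
    persona_type: str
    name: Optional[str] = None
    lastname: Optional[str] = None
    birth_date: Optional[str] = None
    birth_place: Optional[str] = None
    resident_in: Optional[str] = None
    gender: Optional[str] = None
    social_condition: Optional[str] = None
    legitimacy_status: Optional[str] = None
    marital_status: Optional[str] = None


# Two invented personas taken from the same (invented) baptism event.
child = PersonaRecord("bautismo-1", "persona-1", "baptized",
                      name="maria", lastname="quispe",
                      legitimacy_status="legitimo")
father = PersonaRecord("bautismo-1", "persona-2", "father",
                       name="mariano", lastname="quispe")

# Personas are related through the event they share.
same_event = child.event_idno == father.event_idno

# Plain dictionaries, ready to be loaded into a dataframe.
rows = [asdict(child), asdict(father)]
```

Keeping every descriptive field optional mirrors the registers themselves, where most attributes are recorded only for some roles.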

Data Extraction

We begin by loading the cleaned sacramental records and extracting persona instances using the PersonaExtractor class. This process creates individual persona records for each person mentioned in the historical documents, preserving their role in each event.

import pandas as pd
from actions.extractors import Persona

# Load the cleaned baptism, burial, and marriage registers
bautismos = pd.read_csv("../data/clean/bautismos_clean.csv")
entierros = pd.read_csv("../data/clean/entierros_clean.csv")
matrimonios = pd.read_csv("../data/clean/matrimonios_clean.csv")

# Extract one persona row per person mentioned in each event
extractor = Persona.PersonaExtractor([bautismos, matrimonios, entierros])
personas = extractor.extract_personas()

personas.describe(include='all')
            event_idno    original_identifier  persona_type     name  birth_place
count            47072                  47072         47072    46999         5378
unique           10180                  10179            14     4286           53
top     matrimonio-490  APAucará-LM-L001_M490        mother  mariano   pampamarca
freq                12                     12          7614     1556         1919

        birth_date  legitimacy_status  lastname  persona_idno  social_condition
count         8596              11866     46762         47072              9643
unique        7001                  2      2616         47072                 7
top     1901-09-04           legitimo    quispe     persona-1             indio
freq             8               9104      2712             1              5654

        marital_status  resident_in  death_place  death_date  gender
count             4275          925         1513        2114   47072
unique               3           19            7        1813       6
top            soltero   pampamarca       aucará  1871-11-04    male
freq              2779          292         1016           7   20150
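The extraction step described above turns one row per event into one row per person mentioned. The following is only a sketch of that reshaping idea under stated assumptions: the wide column names (`baptized_name`, `father_lastname`, and so on), the sample values, and the helper `to_personas` are invented for illustration and are not the layout of the cleaned CSV files or the implementation of PersonaExtractor.

```python
import pandas as pd

# Invented wide-format events; the column names are assumptions for illustration.
events = pd.DataFrame({
    "event_idno": ["bautismo-1", "bautismo-2"],
    "baptized_name": ["maria", "juan"],
    "baptized_lastname": ["quispe", "huaman"],
    "father_name": ["mariano", None],
    "father_lastname": ["quispe", None],
    "mother_name": ["juana", "rosa"],
    "mother_lastname": ["ccori", "huaman"],
})


def to_personas(events: pd.DataFrame, roles: list[str]) -> pd.DataFrame:
    """Reshape one row per event into one row per person mentioned."""
    parts = []
    for role in roles:
        part = pd.DataFrame({
            "event_idno": events["event_idno"],
            "persona_type": role,
            "name": events[f"{role}_name"],
            "lastname": events[f"{role}_lastname"],
        })
        # Keep a role only when at least one name component was recorded.
        parts.append(part.dropna(subset=["name", "lastname"], how="all"))
    personas = pd.concat(parts, ignore_index=True)
    # Sequential identifiers, following the "persona-N" pattern seen above.
    personas["persona_idno"] = [f"persona-{i + 1}" for i in range(len(personas))]
    return personas


personas_demo = to_personas(events, ["baptized", "father", "mother"])
```

The role travels with each row as `persona_type`, and the shared `event_idno` is what later allows parents, godparents, and witnesses to be related to the principal of the event.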

Initial Exploration

Before conducting a detailed quality assessment, we save the extracted personas to the production path and inspect the basic structure of the dataset: its columns, non-null counts, and data types.

personas.to_csv("../data/clean/personas.csv", index=False)
personas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47072 entries, 0 to 47071
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   event_idno           47072 non-null  object
 1   original_identifier  47072 non-null  object
 2   persona_type         47072 non-null  object
 3   name                 46999 non-null  object
 4   birth_place          5378 non-null   object
 5   birth_date           8596 non-null   object
 6   legitimacy_status    11866 non-null  object
 7   lastname             46762 non-null  object
 8   persona_idno         47072 non-null  object
 9   social_condition     9643 non-null   object
 10  marital_status       4275 non-null   object
 11  resident_in          925 non-null    object
 12  death_place          1513 non-null   object
 13  death_date           2114 non-null   object
 14  gender               47072 non-null  object
dtypes: object(15)
memory usage: 5.4+ MB

Quality Assessment

To evaluate the suitability of the extracted personas for probabilistic record linkage, we assess data completeness across multiple dimensions: names, parental linkages, temporal information, spatial attributes, and social/legal status markers.

Name Completeness

Names are critical identifiers for record linkage. We assess the completeness of both first names and surnames across all persona types.

# Rows lacking a first name, a surname, or both
name_completeness = personas.loc[(personas['name'].isna()) | (personas['lastname'].isna())]
name_completeness.info()
<class 'pandas.core.frame.DataFrame'>
Index: 383 entries, 62 to 46928
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   event_idno           383 non-null    object
 1   original_identifier  383 non-null    object
 2   persona_type         383 non-null    object
 3   name                 310 non-null    object
 4   birth_place          35 non-null     object
 5   birth_date           52 non-null     object
 6   legitimacy_status    41 non-null     object
 7   lastname             73 non-null     object
 8   persona_idno         383 non-null    object
 9   social_condition     55 non-null     object
 10  marital_status       20 non-null     object
 11  resident_in          0 non-null      object
 12  death_place          26 non-null     object
 13  death_date           53 non-null     object
 14  gender               383 non-null    object
dtypes: object(15)
memory usage: 47.9+ KB
name_completeness['persona_type'].value_counts()
persona_type
mother               83
deceased             53
father_of_husband    45
father_of_wife       45
father               44
godmother            27
godfather            25
godparent            13
wife                 13
witness              10
husband               9
mother_of_wife        7
mother_of_husband     5
baptized              4
Name: count, dtype: int64
# Percentage of missing names
total_personas = len(personas)
missing_names = len(name_completeness)
percentage_missing_names = (missing_names / total_personas) * 100
print(f"Percentage of personas with missing names: {percentage_missing_names:.2f}%")
Percentage of personas with missing names: 0.81%
missing_firstnames = personas.loc[personas['name'].isna()]
percentage_missing_firstnames = (len(missing_firstnames) / total_personas) * 100
print(f"Percentage of personas with missing firstnames: {percentage_missing_firstnames:.2f}%")
Percentage of personas with missing firstnames: 0.16%
missing_surnames = personas.loc[personas['lastname'].isna()]
percentage_missing_surnames = (len(missing_surnames) / total_personas) * 100
print(f"Percentage of personas with missing lastnames: {percentage_missing_surnames:.2f}%")
Percentage of personas with missing lastnames: 0.66%
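The three percentages above can also be obtained in a single pass. The sketch below uses an invented four-row sample rather than the project data; `isna().mean()` returns the share of missing values per column, which is then scaled to a percentage.

```python
import pandas as pd

# Invented sample; None marks a value missing from the register.
sample = pd.DataFrame({
    "name": ["maria", None, "juan", "rosa"],
    "lastname": ["quispe", "huaman", None, None],
    "gender": ["female", "male", "male", "female"],
})

# Share of missing values per column, expressed as a percentage.
missing_pct = sample.isna().mean().mul(100).round(2)

# Share of rows missing either name component (the "missing names" figure).
either_missing = sample[["name", "lastname"]].isna().any(axis=1)
either_missing_pct = round(either_missing.mean() * 100, 2)
```

Applied to the full dataframe, the same two expressions would cover every column at once, which avoids repeating the count-and-divide pattern for each field.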

Parental Linkage Completeness

For personas with a recorded legitimacy status (the baptized, the deceased, and the spouses in marriage records), we assess whether their parents appear in the same event. The expectation differs by legitimacy status: legitimate children should have both parents recorded, while illegitimate children should have at least one.

legitimate_sons = personas.loc[personas['legitimacy_status'] == 'legitimo']
ilegitimate_sons = personas.loc[personas['legitimacy_status'] == 'ilegitimo']

sons_types = legitimate_sons['persona_type'].unique().tolist()

# filter illegitimate sons by sons types to avoid including other illegitimate personas
ilegitimate_sons = ilegitimate_sons.loc[ilegitimate_sons['persona_type'].isin(sons_types)]

print("Legitimate Sons Persona Types and Counts:")
print(legitimate_sons['persona_type'].value_counts())
print("\nIllegitimate Sons Persona Types and Counts:")
print(ilegitimate_sons['persona_type'].value_counts())
Legitimate Sons Persona Types and Counts:
persona_type
baptized    5483
wife        1267
husband     1248
deceased    1106
Name: count, dtype: int64

Illegitimate Sons Persona Types and Counts:
persona_type
baptized    810
deceased    201
wife        170
husband     165
Name: count, dtype: int64
def check_parents_completeness(sons_df, personas_df, legitimacy='leg'):
    """Return the rows of sons_df whose event lacks the expected parent personas."""
    # Get unique event_idno from sons
    event_ids = sons_df['event_idno'].unique()

    # Filter personas to only relevant events
    relevant_personas = personas_df[personas_df['event_idno'].isin(event_ids)]

    # Check for father and mother presence by event.
    # Note: the substring match also accepts 'godfather'/'godmother' and the
    # 'father_of_*'/'mother_of_*' roles, so a godparent in the same event
    # also counts as a parent.
    events_with_father = set(relevant_personas[relevant_personas['persona_type'].str.contains('father', na=False)]['event_idno'])
    events_with_mother = set(relevant_personas[relevant_personas['persona_type'].str.contains('mother', na=False)]['event_idno'])

    if legitimacy == 'leg':
        # For legitimate sons, both parents should be present
        # Incomplete if missing father OR missing mother
        events_missing_father = set(event_ids) - events_with_father
        events_missing_mother = set(event_ids) - events_with_mother
        incomplete_events = events_missing_father | events_missing_mother
    elif legitimacy == 'ileg':
        # For illegitimate sons, at least one parent should be present
        # Incomplete if missing BOTH father AND mother
        incomplete_events = set(event_ids) - events_with_father - events_with_mother
    else:
        raise ValueError("Legitimacy must be 'leg' or 'ileg'")

    # Get sons with incomplete parents
    incomplete_sons = sons_df[sons_df['event_idno'].isin(incomplete_events)]
    return incomplete_sons

incomplete_legit_parents = check_parents_completeness(legitimate_sons, personas)
incomplete_ilegit_parents = check_parents_completeness(ilegitimate_sons, personas, legitimacy='ileg')

print(f"Number of legitimate sons with incomplete parents: {len(incomplete_legit_parents)}")
print(f"Number of illegitimate sons with incomplete parents: {len(incomplete_ilegit_parents)}")
Number of legitimate sons with incomplete parents: 42
Number of illegitimate sons with incomplete parents: 9

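The results indicate high parental linkage completeness:

  • Legitimate personas (9,104 total): only 0.46% (42 cases) have incomplete parental records
  • Illegitimate personas (1,346 total): only 0.67% (9 cases) have no parent recorded

These figures should be read as a lower bound on incompleteness: because the check matches the substrings 'father' and 'mother' at the event level, a godfather or godmother in the same event also counts as a parent. With that caveat, the high completeness rate suggests that the extraction process preserved the family relationships recorded in the sacramental registers.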
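Because `check_parents_completeness` relies on `str.contains`, roles such as godfather and godmother also satisfy the father and mother tests. A stricter variant could match exact role names per child role. The sketch below is an assumption-laden illustration, not the project's method: the mapping `PARENT_ROLES`, the helper `parents_recorded`, and the sample event are hypothetical, and the pairing of the deceased with the plain father and mother roles is a guess based on the persona types listed above.

```python
import pandas as pd

# Assumed parent roles per child role; godparents are deliberately excluded.
PARENT_ROLES = {
    "baptized": {"father": "father", "mother": "mother"},
    "deceased": {"father": "father", "mother": "mother"},
    "husband": {"father": "father_of_husband", "mother": "mother_of_husband"},
    "wife": {"father": "father_of_wife", "mother": "mother_of_wife"},
}


def parents_recorded(child_type: str, event_roles: set[str]) -> tuple[bool, bool]:
    """Return (father_present, mother_present) using exact role names."""
    roles = PARENT_ROLES[child_type]
    return roles["father"] in event_roles, roles["mother"] in event_roles


# Invented event: a baptism with a mother and godparents but no father.
demo = pd.DataFrame({
    "event_idno": ["bautismo-1"] * 4,
    "persona_type": ["baptized", "mother", "godfather", "godmother"],
})
event_roles = set(demo["persona_type"])

# Exact matching reports the father as absent.
strict = parents_recorded("baptized", event_roles)

# Substring matching, as in check_parents_completeness, also hits "godfather".
loose_father = bool(demo["persona_type"].str.contains("father").any())
```

In this invented event the strict check flags the missing father while the substring check does not, which is why counts produced by the two approaches may differ on the real data.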

Temporal Completeness

We assess the availability of birth and death dates, which are essential for temporal reasoning in record linkage.

nobirthdate = personas.loc[personas['birth_date'].isna()]
personas_size = len(personas)
nobirthdate_size = len(nobirthdate)
percentage_nobirthdate = (nobirthdate_size / personas_size) * 100
print(f"Percentage of personas with missing birth dates: {percentage_nobirthdate:.2f}%")
nobirthdate['persona_type'].value_counts()
Percentage of personas with missing birth dates: 81.74%
persona_type
mother               7614
father               7369
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 1470
mother_of_wife       1459
husband              1441
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
baptized              978
deceased               86
Name: count, dtype: int64
nodeathdate = personas.loc[personas['death_date'].isna()]
nodeathdate_size = len(nodeathdate)
percentage_nodeathdate = (nodeathdate_size / personas_size) * 100
print(f"Percentage of personas with missing death dates: {percentage_nodeathdate:.2f}%")
nodeathdate['persona_type'].value_counts()
Percentage of personas with missing death dates: 95.51%
persona_type
mother               7614
father               7369
baptized             6340
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 2060
husband              2051
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
deceased                6
Name: count, dtype: int64
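Temporal reasoning requires the date strings to be parsed before they can be compared. The sketch below assumes the ISO format visible in the output above (for example 1901-09-04) and uses an invented three-row sample. Note that pandas nanosecond timestamps cannot represent dates before 1677-09-21, so with `errors="coerce"` earlier dates would become missing values rather than raise an error.

```python
import pandas as pd

# Invented sample in the ISO format seen in the extracted data.
sample = pd.DataFrame({
    "persona_idno": ["persona-1", "persona-2", "persona-3"],
    "birth_date": ["1871-11-04", "1901-09-04", None],
    "death_date": ["1901-09-04", "1871-11-04", "1880-01-15"],
})

# errors="coerce" turns missing or malformed strings into NaT instead of raising.
birth = pd.to_datetime(sample["birth_date"], format="%Y-%m-%d", errors="coerce")
death = pd.to_datetime(sample["death_date"], format="%Y-%m-%d", errors="coerce")

# A record is comparable only when both dates are known.
comparable = birth.notna() & death.notna()

# Flag records whose death precedes their birth, a likely transcription error.
implausible = comparable & (death < birth)
```

Given how sparse both date fields are, a check of this kind can only apply to the small subset of personas, mostly the deceased, for whom both dates are recorded.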

Spatial Completeness

Birth and death places provide geographic context for mobility analysis and help disambiguate between individuals with similar names.

nonbirthplace = personas.loc[personas['birth_place'].isna()]
nonbirthplace_size = len(nonbirthplace)
percentage_nonbirthplace = (nonbirthplace_size / personas_size) * 100
print(f"Percentage of personas with missing birth places: {percentage_nonbirthplace:.2f}%")
nonbirthplace['persona_type'].value_counts()
Percentage of personas with missing birth places: 88.57%
persona_type
mother               7614
father               7369
baptized             4532
witness              4249
godparent            3260
godmother            3251
godfather            3012
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
wife                 1154
husband              1114
deceased              393
Name: count, dtype: int64
nodeathplace = personas.loc[personas['death_place'].isna()]
nodeathplace_size = len(nodeathplace)
percentage_nodeathplace = (nodeathplace_size / personas_size) * 100
print(f"Percentage of personas with missing death places: {percentage_nodeathplace:.2f}%")
nodeathplace['persona_type'].value_counts()
Percentage of personas with missing death places: 96.79%
persona_type
mother               7614
father               7369
baptized             6340
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 2060
husband              2051
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
deceased              607
Name: count, dtype: int64
# personas with both birth and death places present
birth_and_death_places = personas.loc[personas['birth_place'].notna() & personas['death_place'].notna()]
birth_and_death_places_size = len(birth_and_death_places)
percentage_birth_and_death_places = (birth_and_death_places_size / personas_size) * 100
print(f"Percentage of personas with both birth and death places present: {percentage_birth_and_death_places:.2f}%")
birth_and_death_places['persona_type'].value_counts()
Percentage of personas with both birth and death places present: 2.59%
persona_type
deceased    1219
Name: count, dtype: int64

Attribute Completeness

We examine the completeness of harmonized social and legal status attributes (legitimacy, marital status, social condition), which provide contextual information that can strengthen or weaken linkage hypotheses.

legitimacy_missing = personas.loc[personas['legitimacy_status'].isna()]
legitimacy_missing_size = len(legitimacy_missing)
percentage_legitimacy_missing = (legitimacy_missing_size / personas_size) * 100
print(f"Percentage of personas with missing legitimacy status: {percentage_legitimacy_missing:.2f}%")
legitimacy_missing['persona_type'].value_counts()
Percentage of personas with missing legitimacy status: 74.79%
persona_type
mother               7608
father               7365
witness              4249
godparent            3258
godmother            3248
godfather            3008
mother_of_wife       1113
father_of_wife       1093
mother_of_husband    1085
father_of_husband    1058
deceased              813
husband               638
wife                  623
baptized               47
Name: count, dtype: int64
marital_status_missing = personas.loc[personas['marital_status'].isna()]
marital_status_missing_size = len(marital_status_missing)
percentage_marital_status_missing = (marital_status_missing_size / personas_size) * 100
print(f"Percentage of personas with missing marital status: {percentage_marital_status_missing:.2f}%")
marital_status_missing['persona_type'].value_counts()
Percentage of personas with missing marital status: 90.92%
persona_type
mother               7550
father               7362
baptized             6340
witness              4249
godmother            3250
godparent            3241
godfather            3011
mother_of_wife       1459
mother_of_husband    1438
father_of_wife       1438
father_of_husband    1410
deceased              917
wife                  586
husband               546
Name: count, dtype: int64
social_condition_missing = personas.loc[personas['social_condition'].isna()]
social_condition_missing_size = len(social_condition_missing)
percentage_social_condition_missing = (social_condition_missing_size / personas_size) * 100
print(f"Percentage of personas with missing social condition: {percentage_social_condition_missing:.2f}%")
social_condition_missing['persona_type'].value_counts()
Percentage of personas with missing social condition: 79.51%
persona_type
mother               6715
father               6564
baptized             4338
witness              4249
godparent            2967
godfather            2922
godmother            2901
wife                 1196
husband              1145
mother_of_wife        955
father_of_wife        938
mother_of_husband     938
father_of_husband     917
deceased              684
Name: count, dtype: int64
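The tables above report raw counts of missing values per persona type, which are hard to compare because the types differ greatly in size. A missing rate per type and field puts them on a common scale. The sketch below uses an invented sample; the same expression could be applied to any list of columns in the personas dataframe.

```python
import pandas as pd

# Invented sample with two persona types and two status fields.
sample = pd.DataFrame({
    "persona_type": ["baptized", "baptized", "witness", "witness"],
    "legitimacy_status": ["legitimo", "ilegitimo", None, None],
    "marital_status": [None, None, "soltero", None],
})

fields = ["legitimacy_status", "marital_status"]

# Boolean missing-value mask, averaged within each persona type:
# 0.0 means always recorded, 1.0 means never recorded.
missing_rate = sample[fields].isna().groupby(sample["persona_type"]).mean()
```

Read row by row, such a table shows at a glance which attributes are recorded for which roles.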

Aggregate Completeness Metrics

Beyond individual field completeness, we calculate overall completeness scores to understand the general quality of persona records. We compute both a simple completeness score (proportion of non-null fields) and a weighted score that prioritizes critical fields for record linkage.

Simple Completeness Score

The simple score treats all fields equally, providing a general measure of data richness.

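# Score over the original columns only, so re-running the cell does not count the score columns themselves
score_fields = [c for c in personas.columns if c not in ('completeness_score', 'weighted_completeness')]
personas['completeness_score'] = personas[score_fields].notna().sum(axis=1) / len(score_fields)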
personas.groupby('persona_type')['completeness_score'].mean().sort_values(ascending=False)
persona_type
deceased             0.821415
husband              0.656038
wife                 0.651748
baptized             0.629243
mother_of_husband    0.506092
mother_of_wife       0.505186
father_of_husband    0.504492
father_of_wife       0.503755
mother               0.474424
father               0.473651
godmother            0.473372
godparent            0.472822
godfather            0.468216
witness              0.466510
Name: completeness_score, dtype: float64
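Five of the fifteen columns (event_idno, original_identifier, persona_type, persona_idno, gender) are non-null in every row according to the `info()` output, so the simple score cannot fall below 5/15 ≈ 0.33, and the witness mean of about 0.467 corresponds to roughly 7 of 15 fields. One possible variant, sketched here on an invented sample with a reduced set of columns, scores only the descriptive fields; the names `always_present` and `descriptive_score` are illustrative assumptions.

```python
import pandas as pd

# Invented sample with identifier columns and a few descriptive fields.
sample = pd.DataFrame({
    "event_idno": ["bautismo-1", "bautismo-1"],
    "persona_idno": ["persona-1", "persona-2"],
    "persona_type": ["baptized", "witness"],
    "name": ["maria", "pedro"],
    "lastname": ["quispe", None],
    "birth_date": ["1901-09-04", None],
    "birth_place": [None, None],
})

# Columns that are always filled by construction say nothing about data richness.
always_present = ["event_idno", "persona_idno", "persona_type"]
descriptive = [c for c in sample.columns if c not in always_present]

# Share of descriptive fields that are non-null, per row.
descriptive_score = sample[descriptive].notna().mean(axis=1)
```

Dropping the constant columns widens the spread between well-documented and poorly documented personas, which can make differences between persona types easier to see.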

Weighted Completeness Score

The weighted score assigns higher importance to fields crucial for record linkage (names, dates) and lower weights to supplementary attributes (social condition, residence). This reflects the differential utility of fields in matching algorithms.

weights = {
    'name': 0.15,
    'lastname': 0.15,
    'birth_date': 0.125,
    'death_date': 0.125,
    'birth_place': 0.10,
    'death_place': 0.10,
    'legitimacy_status': 0.05,
    'marital_status': 0.05,
    'social_condition': 0.05,
    'gender': 0.05,
    'resident_in': 0.05
}

# Vectorized equivalent of summing the weights of the non-null fields in each row
weight_series = pd.Series(weights)
personas['weighted_completeness'] = personas[weight_series.index].notna().mul(weight_series, axis=1).sum(axis=1)
personas.groupby('persona_type')['weighted_completeness'].mean().sort_values(ascending=False)
persona_type
deceased             0.836722
baptized             0.549558
husband              0.536738
wife                 0.531650
mother_of_husband    0.379222
mother_of_wife       0.378410
father_of_husband    0.375177
father_of_wife       0.374687
mother               0.354728
father               0.354641
godparent            0.354218
godmother            0.354199
godfather            0.350332
witness              0.349647
Name: weighted_completeness, dtype: float64
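A worked example helps in reading these scores. The weights sum to 1, so each score lies between 0 and 1, and a persona for whom only the first name, surname, and gender are known scores 0.15 + 0.15 + 0.05 = 0.35, which is close to the witness and godparent means reported above. The sketch below reproduces that arithmetic; the helper `weighted_score` and the sample record are illustrative, not part of the project code.

```python
import math

weights = {
    "name": 0.15,
    "lastname": 0.15,
    "birth_date": 0.125,
    "death_date": 0.125,
    "birth_place": 0.10,
    "death_place": 0.10,
    "legitimacy_status": 0.05,
    "marital_status": 0.05,
    "social_condition": 0.05,
    "gender": 0.05,
    "resident_in": 0.05,
}

# The weights sum to 1, so every score falls between 0 and 1.
total_weight = sum(weights.values())


def weighted_score(record: dict, weights: dict) -> float:
    """Sum the weights of the fields that are present (not None) in the record."""
    return sum(w for field, w in weights.items() if record.get(field) is not None)


# Invented witness: only first name, surname, and gender are known.
witness = {"name": "pedro", "lastname": "huaman", "gender": "male"}
witness_score = weighted_score(witness, weights)

# Tolerant comparison, since the score is a sum of floating-point weights.
is_expected = math.isclose(witness_score, 0.35)
```

Seen this way, a score near 0.35 signals a persona described by names and gender alone, while each known date adds 0.125 and each known place adds 0.10.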

Comparison of Completeness Metrics

Visualizing the relationship between simple and weighted completeness reveals how different persona types vary in their possession of high-priority fields.

# plot correlation between completeness_score and weighted_completeness
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(data=personas, x='completeness_score', y='weighted_completeness', hue='persona_type')
plt.title('Correlation between Completeness Score and Weighted Completeness')
plt.xlabel('Completeness Score')
plt.ylabel('Weighted Completeness')
plt.legend(title='Persona Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
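Since both scores take only a limited number of distinct values, many points in the scatter plot overlap. A numeric summary can complement the figure. The sketch below, on invented values, computes the Pearson correlation between the two scores and the mean of each score per persona type; the variable names are illustrative.

```python
import pandas as pd

# Invented scores, loosely shaped like the per-type means reported above.
scores = pd.DataFrame({
    "persona_type": ["deceased", "baptized", "witness", "witness"],
    "completeness_score": [0.80, 0.60, 0.47, 0.47],
    "weighted_completeness": [0.85, 0.55, 0.35, 0.35],
})

# Pearson correlation between the two scores across all rows.
pearson_r = scores["completeness_score"].corr(scores["weighted_completeness"])

# One point per persona type is easier to read than thousands of overlapping dots.
by_type = scores.groupby("persona_type")[
    ["completeness_score", "weighted_completeness"]
].mean()
```

Plotting `by_type` instead of the full dataframe would yield one marker per persona type.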

Summary and Implications

The quality assessment shows that the extracted personas are largely complete in the fields that matter most for linkage, and sparse elsewhere:

  • Name completeness is high: only 0.81% of personas lack a first name or surname, and these cases are concentrated among parents (mothers, fathers, and the fathers of spouses) and the deceased rather than among godparents or witnesses
  • Parental linkages are nearly complete (>99%) for both legitimate and illegitimate children, enabling family reconstruction
  • Temporal and spatial data are sparse overall (81.74% of personas lack a birth date and 88.57% a birth place) and vary strongly by persona type, reflecting the original documentary practices: dates and places are recorded mainly for the principal of each event
  • Weighted completeness scores range from about 0.35 for witnesses and godparents to 0.84 for the deceased, indicating that names are well populated across persona types while dates and places are not

These results suggest that the dataset is suitable for probabilistic record linkage, provided that matching relies primarily on names and event-based relationships and treats dates, places, and status attributes as supporting evidence where they exist.