Personas Creation

Our approach aligns closely with the input preparation model proposed by David W. Embley (2021). We structure our data around personas, defined as “each mention instance of a person in a document” (p. 66), as a foundational step toward probabilistic record linkage (PRL). Each persona is created from the individual metadata available in a record (such as name, last name, and birth date) and is associated with a sacramental event (baptism, marriage, or burial). Relationships between personas are established through their roles in the same event (e.g., father, mother, godfather, witness).
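To make the idea concrete, the following minimal sketch turns one event record into personas and then pairs the personas that share an event. It is an illustration only, not the project's extractor: the `<role>_name` and `<role>_lastname` input columns, the sample values, and the helper names `personas_from_record` and `co_participants` are all assumptions.

```python
from itertools import combinations

# Hypothetical clean baptism record; column names and values are assumptions.
record = {
    "event_idno": "bautismo-1",
    "baptized_name": "maria",
    "baptized_lastname": "quispe",
    "father_name": "mariano",
    "father_lastname": "quispe",
    "mother_name": "juana",
    "mother_lastname": "huaman",
    "godfather_name": None,
    "godfather_lastname": None,
}

ROLES = ("baptized", "father", "mother", "godfather")


def personas_from_record(record, roles=ROLES):
    """Return one persona dict per role that is actually mentioned."""
    personas = []
    for role in roles:
        name = record.get(f"{role}_name")
        last_name = record.get(f"{role}_lastname")
        if not name and not last_name:
            continue  # role not mentioned in this record
        personas.append({
            "event_idno": record["event_idno"],
            "persona_type": role,
            "name": name,
            "last_name": last_name,
        })
    return personas


def co_participants(personas):
    """Pair the roles of personas that take part in the same event."""
    return [
        (a["persona_type"], b["persona_type"])
        for a, b in combinations(personas, 2)
        if a["event_idno"] == b["event_idno"]
    ]


personas = personas_from_record(record)
links = co_participants(personas)
```

Each mention becomes its own persona, even when two mentions might later turn out to be the same individual; deciding that is left to the record linkage step.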

Personas Data Structure

Each persona is stored as a flat record with the following fields:

  • event_idno: unique semantically meaningful identifier for the event
  • persona_idno: unique semantically meaningful identifier for the persona
  • persona_type: role of the persona in the event (e.g., baptized, father, mother, witness)
  • name: first name of the persona
  • last_name: last name of the persona
  • birth_date: birth date of the persona
  • birth_place: birth place of the persona
  • resident_in: persona residence at the time of the event
  • gender: inferred gender of the persona
  • social_condition: harmonized social condition of the persona
  • legitimacy_status: harmonized legitimacy status of the persona
  • marital_status: harmonized marital status of the persona
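One way to picture a single persona record is as a typed structure. The sketch below is an assumption for illustration: the class name `PersonaRecord`, the choice of which fields are optional, and the sample values are not taken from the project code.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class PersonaRecord:
    # Identifiers and role are assumed to be always present.
    event_idno: str
    persona_idno: str
    persona_type: str
    # Descriptive fields are assumed optional, since a record may omit them.
    name: Optional[str] = None
    last_name: Optional[str] = None
    birth_date: Optional[str] = None
    birth_place: Optional[str] = None
    resident_in: Optional[str] = None
    gender: Optional[str] = None
    social_condition: Optional[str] = None
    legitimacy_status: Optional[str] = None
    marital_status: Optional[str] = None


# Illustrative values only.
example = PersonaRecord(
    event_idno="bautismo-1",
    persona_idno="persona-1",
    persona_type="father",
    name="mariano",
    last_name="quispe",
    gender="male",
)
row = asdict(example)  # plain dict, ready to become one dataframe row
```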

Personas are extracted by passing one dataframe, or a list of dataframes, containing the clean data to Persona.PersonaExtractor. Results are stored in data/interim/personas_extracted.csv for testing and in data/clean/personas.csv for production.

import pandas as pd
from actions.extractors import Persona

# Load the cleaned records for each sacramental event type
bautismos = pd.read_csv("../data/clean/bautismos_clean.csv")
entierros = pd.read_csv("../data/clean/entierros_clean.csv")
matrimonios = pd.read_csv("../data/clean/matrimonios_clean.csv")

# Extract the personas mentioned across all three event types
extractor = Persona.PersonaExtractor([bautismos, matrimonios, entierros])
personas = extractor.extract_personas()

personas.describe(include='all')
            event_idno    original_identifier  persona_type     name  birth_place  birth_date  legitimacy_status  lastname  persona_idno  social_condition  marital_status  resident_in  gender
count            47071                  47071         47071    46999         2330        8595              11865     46480         47071              9643            4275          399   47071
unique           10180                  10179            16     4286           85        7000                  2      2615         47071                 7               3           18       6
top     matrimonio-490  APAucará-LM-L001_M490        mother  mariano   Pampamarca  1901-09-04           legitimo    quispe     persona-1             indio         soltero   Pampamarca    male
freq                12                     12          7614     1556         1902           8               9104      2705             1              5654            2779          292   20150
personas.to_csv("../data/clean/personas.csv", index=False)
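The summary above shows that identifiers are always filled while descriptive fields are sparse (birth_place, for example, has 2330 values across 47071 personas). A quick check of this kind can be sketched as follows. The small dataframe is an illustrative stand-in, not real data, and the checks are suggestions rather than part of the project's pipeline.

```python
import pandas as pd

# Tiny stand-in for the personas table; values are illustrative only.
personas = pd.DataFrame({
    "event_idno": ["bautismo-1", "bautismo-1", "matrimonio-1"],
    "persona_idno": ["persona-1", "persona-2", "persona-3"],
    "persona_type": ["baptized", "father", "witness"],
    "birth_date": ["1901-09-04", None, None],
})

# Every persona should have a unique identifier and belong to an event.
ids_unique = personas["persona_idno"].is_unique
events_present = personas["event_idno"].notna().all()

# Share of non-missing values per column, in percent.
completeness = personas.notna().mean().mul(100).round(1)
print(completeness)
```

Running the same three checks on the full table before writing data/clean/personas.csv would flag duplicated identifiers or orphaned personas early.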