import pandas as pd
from actions.extractors import Persona
Personas Creation
Our approach closely follows the input preparation model proposed by David W. Embley (2021). We structure our data around personas, defined as “each mention instance of a person in a document” (p. 66), as a foundational step toward probabilistic record linkage (PRL). Each persona is created by ingesting the available individual metadata (such as name, last name, and birth date) and associating the person with a sacramental event (baptism, marriage, or burial). Relationships between personas are established through their roles in the shared event (e.g., father, mother, godfather, witness).
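As a rough illustration of the persona concept, the sketch below turns a single event record into one persona row per person mentioned. The input record layout, the ROLE_COLUMNS mapping, and the personas_from_event helper are hypothetical and are not the project's extractor. Only the output field names (event_idno, persona_idno, persona_type, name) and the persona-N identifier pattern are taken from this notebook.

```python
from itertools import count

# Hypothetical baptism record; these input column names are illustrative only.
event = {
    "event_idno": "bautismo-1",
    "baptized_name": "maria",
    "father_name": "juan",
    "mother_name": "rosa",
    "godfather_name": None,  # no godfather recorded in this example
}

# Assumed mapping from persona role to the input column holding that person's name.
ROLE_COLUMNS = {
    "baptized": "baptized_name",
    "father": "father_name",
    "mother": "mother_name",
    "godfather": "godfather_name",
}


def personas_from_event(event, role_columns, id_counter):
    """Return one dict per person mentioned in the event, tagged with its role."""
    rows = []
    for role, column in role_columns.items():
        name = event.get(column)
        if not name:
            continue  # skip roles with no person mentioned
        rows.append(
            {
                "event_idno": event["event_idno"],
                "persona_idno": f"persona-{next(id_counter)}",
                "persona_type": role,
                "name": name,
            }
        )
    return rows


persona_rows = personas_from_event(event, ROLE_COLUMNS, count(start=1))
for row in persona_rows:
    print(row)
```

Because every row keeps the same event_idno, the relationship between two personas can be read off from their persona_type values within that event.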
Personas Data Structure
The personas data structure consists of the following fields:
- event_idno: unique semantically meaningful identifier for the event
- persona_idno: unique semantically meaningful identifier for the persona
- persona_type: role of the persona in the event (e.g., baptized, father, mother, witness)
- name: first name of the persona
- last_name: last name of the persona (the column appears as lastname in the describe() output further down)
- birth_date: birth date of the persona
- birth_place: birth place of the persona
- resident_in: persona residence at the time of the event
- gender: inferred gender of the persona
- social_condition: harmonized social condition of the persona
- legitimacy_status: harmonized legitimacy status of the persona
- marital_status: harmonized marital status of the persona
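One way to make the field list above checkable is to express it as a small typed record. This is a minimal sketch, not the project's own class: PersonaRecord and missing_fields are hypothetical names, every value is assumed to be a string, and treating only the identifiers and the role as required is also an assumption.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PersonaRecord:
    """Hypothetical typed view of one persona row (not the project's class)."""

    # Assumed to be always present.
    event_idno: str
    persona_idno: str
    persona_type: str
    # Assumed optional, since many source records omit them.
    name: Optional[str] = None
    last_name: Optional[str] = None
    birth_date: Optional[str] = None
    birth_place: Optional[str] = None
    resident_in: Optional[str] = None
    gender: Optional[str] = None
    social_condition: Optional[str] = None
    legitimacy_status: Optional[str] = None
    marital_status: Optional[str] = None


# Field names in the order documented above.
PERSONA_FIELDS = [f.name for f in fields(PersonaRecord)]


def missing_fields(columns):
    """Return the documented fields that are absent from a list of column names."""
    return [name for name in PERSONA_FIELDS if name not in columns]


example = PersonaRecord(
    event_idno="bautismo-1",
    persona_idno="persona-1",
    persona_type="baptized",
    name="maria",
)
print(missing_fields(["event_idno", "persona_idno", "persona_type", "name"]))
```

Passing list(personas.columns) to missing_fields would flag any documented field that the extracted dataframe does not contain.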
Personas are extracted by passing one dataframe, or a list of dataframes, of clean data to the PersonaExtractor class (accessed as Persona.PersonaExtractor). Results are stored in data/interim/personas_extracted.csv for testing, and in data/clean/personas.csv for production.
bautismos = pd.read_csv("../data/clean/bautismos_clean.csv")
entierros = pd.read_csv("../data/clean/entierros_clean.csv")
matrimonios = pd.read_csv("../data/clean/matrimonios_clean.csv")
extractor = Persona.PersonaExtractor([bautismos, matrimonios, entierros])
personas = extractor.extract_personas()
personas.describe(include='all')

| | event_idno | original_identifier | persona_type | name | birth_place | birth_date | legitimacy_status | lastname | persona_idno | social_condition | marital_status | resident_in | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 47071 | 47071 | 47071 | 46999 | 2330 | 8595 | 11865 | 46480 | 47071 | 9643 | 4275 | 399 | 47071 |
| unique | 10180 | 10179 | 16 | 4286 | 85 | 7000 | 2 | 2615 | 47071 | 7 | 3 | 18 | 6 |
| top | matrimonio-490 | APAucará-LM-L001_M490 | mother | mariano | Pampamarca | 1901-09-04 | legitimo | quispe | persona-1 | indio | soltero | Pampamarca | male |
| freq | 12 | 12 | 7614 | 1556 | 1902 | 8 | 9104 | 2705 | 1 | 5654 | 2779 | 292 | 20150 |
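The count row indicates how sparsely most attributes are recorded. As a small sketch, the share of personas with a value in each column can be computed from those counts. The numbers below are copied from the table, and the variable names are illustrative. On the actual dataframe, personas.notna().mean() should give the same proportions.

```python
# Non-null counts copied from the "count" row of the describe() output.
TOTAL_PERSONAS = 47071
non_null_counts = {
    "gender": 47071,
    "name": 46999,
    "lastname": 46480,
    "legitimacy_status": 11865,
    "social_condition": 9643,
    "birth_date": 8595,
    "marital_status": 4275,
    "birth_place": 2330,
    "resident_in": 399,
}

# Fraction of personas that have a value in each column.
completeness = {
    column: n / TOTAL_PERSONAS for column, n in non_null_counts.items()
}

# Print columns from most to least complete.
for column, share in sorted(
    completeness.items(), key=lambda item: item[1], reverse=True
):
    print(f"{column:<18} {share:6.1%}")
```

Columns with low completeness, such as resident_in and birth_place, carry a value for only a small share of personas.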
personas.to_csv("../data/clean/personas.csv", index=False)