Welcome

Sondondo Parish Records Project

Project Overview

This project employs probabilistic record linkage methods to identify unique individuals within parish records from the Sondondo Valley, Peru, spanning from 1760 to 1921. The collection encompasses 10,180 historical records across three vital event types: baptisms (6,340 records), marriages (1,719 records), and burials (2,121 records).

Historical parish records provide valuable insights into demographic, social, and familial patterns, but they often lack explicit unique identifiers, making it challenging to track individuals across multiple life events. By leveraging contextual data—such as names, familial relationships, geographic locations, and event dates—this project aims to reconstruct individual life histories and reveal the social networks of this historical community.

Research Objectives

  • Entity Resolution: Apply probabilistic record linkage to identify unique individuals across multiple event records
  • Data Standardization: Clean, normalize, and harmonize inconsistent historical data from manuscript sources
  • Network Analysis: Uncover familial and social connections within the community through relationship mapping
  • Methodological Contribution: Develop replicable workflows for processing historical datasets with similar challenges

Documentation Structure

This documentation site presents the complete data processing pipeline through a series of interconnected notebooks and reference materials:

Notebooks

The Notebooks section contains detailed computational workflows documenting each phase of data processing:

  1. Data Cleaning: Comprehensive data cleaning including column harmonization, date normalization, name standardization, and quality validation
  2. Place Mapping: Geographic entity extraction using Named Entity Recognition (NER) and standardization through external gazetteers
  3. Textual Variation Analysis: Statistical analysis of social condition terminology and controlled vocabulary development
  4. Personas Creation: Extraction and consolidation of individual person mentions from event records into a unified person-centric dataset
  5. Visualizations: Exploratory data analysis and visual summaries of the processed datasets

Documentation

The Documentation section provides reference materials for understanding and working with the datasets:

  • Metadata Dictionary: Complete field-level documentation for all cleaned datasets, including data types, descriptions, and controlled vocabularies

Current Status

Phase: Documentation & Visualization (v0.3.0)

The project has successfully completed:

  • ✅ Phase 1: Data Cleaning & Standardization
  • ✅ Phase 2: Personas Dataset Creation
  • ✅ Phase 3: Documentation & Visualization

Current work focuses on implementing probabilistic record linkage algorithms to identify unique individuals across the corpus.
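To give a sense of what probabilistic record linkage involves, the sketch below scores a pair of person mentions with Fellegi-Sunter style agreement weights. It is an illustration only, not the project's implementation: the field names (name, birth_year, place), the m- and u-probabilities, the 0.85 name-similarity threshold, and the sample records are all assumptions made up for this example.

```python
import math
from difflib import SequenceMatcher

# Assumed per-field probabilities (illustrative, not estimated from project data):
#   m = P(field agrees | records refer to the same person)
#   u = P(field agrees | records refer to different people)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "place": (0.85, 0.10),
}

NAME_SIMILARITY_THRESHOLD = 0.85  # assumed cut-off for "names agree"


def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        a, b = rec_a.get(field, ""), rec_b.get(field, "")
        if not a or not b:
            continue  # a missing value contributes no evidence either way
        if field == "name":
            # Tolerate spelling variation by comparing string similarity.
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            agree = ratio >= NAME_SIMILARITY_THRESHOLD
        else:
            agree = a == b
        total += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return total


# Invented sample mentions, for illustration only.
baptism_mention = {"name": "Maria Quispe", "birth_year": "1802", "place": "Aucara"}
burial_mention = {"name": "Maria Quispi", "birth_year": "1802", "place": "Aucara"}
other_mention = {"name": "Juan Ccoyllo", "birth_year": "1840", "place": "Cabana"}

likely_same = match_weight(baptism_mention, burial_mention)
likely_different = match_weight(baptism_mention, other_mention)
```

A positive total suggests the two mentions refer to the same person and a negative total suggests they do not. In practice the probabilities would be estimated from the data and the decision thresholds tuned against manually reviewed pairs.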

Datasets

Raw Data

The original transcribed data is stored in data/raw/:

  • bautismos.csv: Baptism records (6,340 entries, 36 columns)
  • matrimonios.csv: Marriage records (1,719 entries, 66 columns)
  • entierros.csv: Burial records (2,121 entries, 37 columns)

Data was collected through manual transcription from digitized parish registers, using structured Google Sheets templates followed by manual quality review.

Cleaned Data

Processed datasets in data/clean/:

  • bautismos_clean.csv: Cleaned baptism records with standardized columns and normalized data
  • matrimonios_clean.csv: Cleaned marriage records with harmonized fields
  • entierros_clean.csv: Cleaned burial records with processed names and places
  • unique_places.csv: Geographic locations extracted and standardized through gazetteers

Processing achievements:

  • Column harmonization across all datasets
  • Date normalization to ISO 8601 format (YYYY-MM-DD)
  • Name standardization using custom algorithms
  • Geographic entity extraction using Named Entity Recognition (NER)
  • Age inference and data validation
  • Comprehensive quality audit and error reporting
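As an illustration of the date normalization step, the following sketch converts a few common numeric date layouts to ISO 8601. The accepted input formats (day-first dates and year-first dates with "/", "-" or "." separators) and the function name normalize_date are assumptions for this example. The project's cleaning notebooks may handle other formats and edge cases.

```python
import re
from datetime import date
from typing import Optional

# Assumed input layouts: D/M/YYYY (day first) and YYYY-M-D.
DAY_FIRST = re.compile(r"(\d{1,2})[/.\-](\d{1,2})[/.\-](\d{4})")
YEAR_FIRST = re.compile(r"(\d{4})[/.\-](\d{1,2})[/.\-](\d{1,2})")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed."""
    text = raw.strip()
    match = DAY_FIRST.fullmatch(text)
    if match:
        day, month, year = (int(part) for part in match.groups())
    else:
        match = YEAR_FIRST.fullmatch(text)
        if not match:
            return None
        year, month, day = (int(part) for part in match.groups())
    try:
        # date() rejects impossible calendar dates such as 31 February.
        return date(year, month, day).isoformat()
    except ValueError:
        return None
```

Returning None rather than raising lets unparseable or impossible dates be collected for a quality audit instead of stopping the pipeline.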

Personas Dataset

personas.csv consolidates all individuals mentioned across event records:

  • Unified schema across all record types (baptisms, marriages, burials)
  • Role-based extraction (baptized, parent, godparent, spouse, deceased, witness, etc.)
  • Preserved source record metadata for traceability
  • Standardized name and place fields ready for probabilistic matching
  • Each row represents a unique person-event mention with demographic attributes

This dataset serves as the foundation for probabilistic record linkage to reconstruct individual life histories.
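The role-based extraction described above can be sketched as follows: each event record is expanded into one row per person mentioned, tagged with that person's role and the source record. The column names (baptized_name, father_name, and so on), the event_id field, and the sample record are hypothetical and do not reflect the actual schema of the cleaned datasets.

```python
from typing import Dict, Iterator

# Hypothetical mapping from role to the column holding that person's name.
BAPTISM_ROLE_COLUMNS = {
    "baptized": "baptized_name",
    "father": "father_name",
    "mother": "mother_name",
    "godparent": "godparent_name",
}


def extract_mentions(
    record: Dict[str, str], role_columns: Dict[str, str], event_type: str
) -> Iterator[Dict[str, str]]:
    """Yield one person-event mention per non-empty name column."""
    for role, column in role_columns.items():
        name = (record.get(column) or "").strip()
        if not name:
            continue  # nobody recorded in this role
        yield {
            "name": name,
            "role": role,
            "event_type": event_type,
            # Keep source metadata so each mention traces back to its record.
            "event_id": record["event_id"],
            "event_date": record.get("event_date", ""),
        }


# Invented sample record, for illustration only.
baptism = {
    "event_id": "B-0001",
    "event_date": "1795-07-03",
    "baptized_name": "Maria Quispe",
    "father_name": "Pedro Quispe",
    "mother_name": "Juana Huaman",
    "godparent_name": "",
}

mentions = list(extract_mentions(baptism, BAPTISM_ROLE_COLUMNS, "baptism"))
```

Here the single baptism record yields three mentions (baptized, father, mother), since the godparent field is empty. Applying the same function with role mappings for marriages and burials would produce rows that share one schema.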

Technical Contributions

GeoResolver Library

As part of this research, we developed and published GeoResolver, a Python library for geographic entity resolution and coordinate lookup. This library emerged from our work on place name standardization within the parish records.

Key features:

  • Geographic entity recognition and normalization
  • Coordinate lookup and validation
  • Place name standardization for historical data
  • Caching mechanisms for improved performance

Installation:

pip install georesolver

The GeoResolver library is actively used in this project for processing geographic descriptors and place names found in the historical records.

Data Sources

Data was collected through manual transcription from digitized parish registers of the Sondondo Valley, Peru. All transcriptions were performed directly from document images using structured templates in Google Sheets, followed by manual review for quality assurance.


For technical details, source code, and raw data access, visit the project repository.