Surname Study and AI Part 5: Adding City Directories


In this series of posts about a surname study, Part 1 described the study, Part 2 included how census data was collected and formatted for use and Part 3 described how to combine and analyze the census data. Part 4 showed how to create a project as part of a surname study (or any task you are doing). In this part, collecting and adding data from city directories to the surname study will be discussed.

When I was approaching this project without AI, I had gathered many records. I had compiled them in spreadsheets, planning to do analysis. Since I am a visual person, the data I collected was compiled in PowerPoint slides with graphics. That meant that I needed the collected data in a spreadsheet.

Capturing the city directory data for the four cities with directories that listed Gilroys from 1850-1900 took some thought. By their nature, the tables in the spreadsheet were not completely populated; I only had data for years for which I could find directories, and I could only fill in the table when a person was listed in the directory. There had been movement of Gilroys between cities, too.

Since the data collection had been a while ago, I had a chance to revisit Ancestry.com to see if there had been additional city directories added to the database. Spoiler alert: there had. I also used ChatGPT’s services to collect sources for the actual city directories that appear online. This helped, and in some cases I used data from House Directory and Family Address Books.

While ChatGPT and I worked on defining and refining the product, it became important for me to redesign the spreadsheet of city directory data in a different and more uniform pattern. I was careful to separate out people with the same name, treating them as different individuals until there was confirmation that their records could be combined as belonging to the same person. I experimented with filling in the blank cells of the spreadsheet with a dash, but in the end, leaving the empty cells blank worked better.

The final spreadsheet was all in one worksheet, but had four separate sections for the cities in Rhode Island where I had located Gilroys: Newport, Providence, Westerly and Bristol. There were pairs of columns for each entry, with occupation and address. Above those headers, I merged the cells to enter the year. Above the year was a brief title for the source, with the page numbers. As you can imagine, the spreadsheet had many columns.

For me, it was important to know that ChatGPT was using all the data in the spreadsheet, so I had it create a listing based on each entry (name, year, occupation, address). I then verified that all the individuals that I had input were being seen by ChatGPT. A complication arose. Even though I uploaded the spreadsheet with the revised city directory entries into the Project, ChatGPT told me that it could not access the spreadsheet. It suggested that I post a screenshot, or the cells, to accompany my questions. So I did.

At the end of ironing out the directions for this task, based on the output, I asked ChatGPT to provide a prompt that would have created a list with all the collected data for me. In the end, I added two columns to this integrity checking spreadsheet: a number to correspond with the individual in the row of the spreadsheet, and a number referencing the source of the data. We also decided it was best for ChatGPT to take in one city at a time, then have me verify that it had the entries before doing the analysis.
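For readers who like to script their own integrity check, here is a minimal Python sketch of the same idea: turning one city's wide table (a pair of occupation and address cells per year) into one line per populated entry, with a person number and a source number attached. This is not what ChatGPT produced for me. The function name, the field names and the way the rows are represented are all assumptions made for illustration, and the sample values in the usage note are invented placeholders.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    person_no: int   # number of the individual's row in the city section
    source_no: int   # number referencing the directory the entry came from
    city: str
    name: str
    year: int
    occupation: str
    address: str


def flatten_city(city, years, sources, rows):
    """Turn one city's wide table into one Entry per populated year.

    years   -- directory years in left-to-right column order
    sources -- dict mapping each year to a source number
    rows    -- list of (name, cells); cells holds one
               (occupation, address) pair per year, "" when blank
    """
    entries = []
    # Row order is preserved so the list can be checked against the sheet.
    for person_no, (name, cells) in enumerate(rows, start=1):
        for year, (occupation, address) in zip(years, cells):
            if not occupation and not address:
                continue  # blank pair: the person was not listed that year
            entries.append(
                Entry(person_no, sources[year], city, name, year,
                      occupation, address)
            )
    return entries
```

Called with placeholder data such as `flatten_city("Newport", [1856, 1858], {1856: 1, 1858: 2}, [("Person A", [("laborer", "1 Example St"), ("", "")])])`, it returns a single entry for 1856, which can then be compared line by line with the spreadsheet.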

ChatGPT created a very detailed reusable prompt with sections describing the following subtasks:

  • Work by city section only
  • Preserve spreadsheet row order exactly
  • Extract strictly left to right
  • Use exact visible cell text only
  • Output format
  • Add source list below each city table
  • Include abbreviation notes
  • Mandatory verification stop

A report correlating the city directories with the census data was generated. Four families were identified as units, and the backbone of the migration of Gilroys was hypothesized. I then asked for additional insights. I reminded ChatGPT about the single women who came before the families, and they were also discussed. I also had lists of all the people found in the city directories printed by city to include in an appendix.

A lot of the push-pull between ChatGPT and me during the creation of a report came from the fact that it seemed to want to talk about the report more than create it. I had to guide it to create a product more through this task than through the previous one. Honestly, the effort to manually reformat and check the data did help me get immersed in the data in a way that telling someone, or an AI, to collect and analyze data would never do. (Being hands-on also helped me to combine data during the next phase when I was collecting vital records data.)

In its crosswalk through the city directory data and census data, ChatGPT now had five strong lines. It also saw the connections between parts of the family and addresses. There were clusters on streets: the Burn’s Court / Byrnes Court cluster in Newport, and Manton Avenue in Providence. The census in which Timothy and Eliza were in Providence was correlated with the directory entry for him on Manton Avenue. That address was also one where my great-grandfather would live after his parents’ deaths. (More on that story after vital records.) This work may be laying groundwork for chain migration!
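If you want to look for street clusters yourself, the idea can be sketched in a few lines of Python. This is only an illustration and not how ChatGPT did it: the function names are hypothetical, the alias table that folds Burn's Court and Byrnes Court into one label is my assumption about how to treat spelling variants, and the rule for stripping a house number is deliberately simple.

```python
import re
from collections import defaultdict

# Assumed spelling variants mapped to one label; extend for your own data.
STREET_ALIASES = {
    "burn's court": "Burns Court",
    "burns court": "Burns Court",
    "byrnes court": "Burns Court",
}


def street_of(address):
    """Drop a leading house number and normalise known spelling variants."""
    street = re.sub(r"^\s*\d+\S*\s+", "", address).strip()
    key = street.lower().replace("\u2019", "'")  # curly to straight apostrophe
    return STREET_ALIASES.get(key, street)


def clusters(entries, min_people=2):
    """Group (name, year, address) entries by street.

    Returns only the streets shared by at least min_people distinct names.
    """
    by_street = defaultdict(set)
    for name, _year, address in entries:
        by_street[street_of(address)].add(name)
    return {
        street: sorted(names)
        for street, names in by_street.items()
        if len(names) >= min_people
    }
```

A street that shows up for several different people, or for the same family across years, is a prompt to look closer; it is not proof of a relationship.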

Although I know more about the family at this point than ChatGPT (until I add vital records data), I became excited because the married women were speaking through their husbands’ entries! The power of combining different types of records by compiling them was becoming more obvious.

I knew Timothy and Eliza were married in Newport and died there. However, from the census work, I knew that there was a RI Census showing them living in Providence. After looking at the addresses in Providence, I opened up a new chat within the project and asked about the street where members of the Timothy and Eliza family lived before and after the immigrant couple’s deaths. This opened up understanding of the historical context of Manton Avenue in Providence, RI.

Other Gilroy families also lived on streets around this industrial area that was home to several mills. ChatGPT shared that this was a likely destination for internal migration within Rhode Island. After Timothy and Eliza’s deaths, my great-grandfather lived and worked in Providence. He lived at the same address as William Patrick Rafferty, who had married Katie Josephine Gilroy (Timothy and Eliza’s daughter) in Newport in 1889. They would later move to Long Island, NY. The Manton Avenue connection became more intriguing. In a separate conversation, I asked ChatGPT:

Tell me about Providence, RI Manton Avenue in the 1850-1900 timeframe

This conversation was illuminating, as part of the answer was: “By 1850–1900, it had become one of the city’s major mill and worker-residential corridors.” Mills in the area were named as were streets in the area.

Since ChatGPT had been trained on Gilroy data, unsurprisingly ChatGPT asked:

“Would you like me to help analyze whether any of your Providence Gilroy directory entries fall near Manton Avenue or the Olneyville mill corridor?”

Yes, look especially at city directory entries for these addresses and occupations

ChatGPT gave a listing of the addresses, and explained which streets in those addresses were all within walking distance of each other. It gave me its insight that family members living within a few blocks of each other might be indicators of chain migration, sibling households, and a kin boarding network.

At the end of this step:

I had created a table with the city directory entries that had been located, by year, for people named Gilroy in the Rhode Island cities of Newport, Providence, Westerly and Bristol. The spreadsheet contained the occupation and address for each person, by year as well as the name of the reference.

ChatGPT had generated a report: Gilroy City Directories Analysis Report With Timelines And Census Correlation, Rhode Island (Newport, Bristol, Providence, Westerly), 1850–1900. There were also separate sections to add to an appendix with all listings of people with the name Gilroy from the cities Newport, Providence, Westerly and Bristol.

More than that, I had a better understanding of a neighborhood in Providence that may be a nexus for my immigrant family.

The way to tie these individuals together would be vital records, so I was eager to move forward to them. I just had to review the data I had, capture data if needed, organize it logically and load it into ChatGPT.

Surname Study and AI Part 4: Making A ChatGPT Project


In this series of posts about a surname study, Part 1 described the study, Part 2 included how census data was collected and formatted for use and Part 3 described how to combine and analyze the census data. This blog post will show how to create a project in ChatGPT. Even though the example shows creating a project as part of a surname study, the steps can be used for any task you are doing.

In the work done during the previous part of my project, I asked ChatGPT:

Would it be good to have this chat in a project?

ChatGPT suggested that a Project is good for a long, multi-stage surname study. It explained the benefits of having the related chats and files grouped together for organization. It also recommended creating separate threads as I continued my effort, with tips for naming the chats. ChatGPT went on to suggest which threads and files to include.

NOTE: Although I kept detailed notes about the steps of this study, I had not written the full blog posts as I performed the steps. As a result, some details of the interface had changed, so please keep in mind that ChatGPT is always evolving!

In the menu section Projects I clicked on the + sign next to New Project.

New Project on the ChatGPT Menu

Then a dialog box opened so that I could enter the Project name.

Create Project option on ChatGPT

And the Gilroy Surname Study (RI, 1850-1900) project was created in ChatGPT. It appears in the menu, above the other chats.

New Project created on ChatGPT

Since I already had chats to add to the project, I clicked on the three dots (ellipses, sideways snowman) next to the name of the existing chat I wanted to add (Surname study assistance). Choosing Move to project gave me two options: create a New project, or pick the name of the already existing one.

Move existing chat to the Project

Next, I wanted to add the data files to the project. I clicked on the Project in the menu, where Chats was already selected.

Project Chats and Sources

I selected Sources, then + Add sources

Project select Sources, then Add sources

The dialog box opened to allow me to add sources. In this case, the sources were my files, and I dragged and dropped them.

Add sources

ChatGPT had offered suggestions about what products to add to the Project, such as checklists it had generated.

NOTE: At this point (ChatGPT 5.2), the names of sources in the project cannot be edited. The types of files that can be added to a project have been expanded, and are: .docx, .pdf, .txt, .md, .xlsx, .csv, .jpg, .jpeg, .png, .tiff, .json, .xml, .pptx, .mp3, .wav, .mp4, .html, .mhtml
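If you have a folder full of research files, a quick check against that list can save some trial and error before uploading. The following is a minimal Python sketch; the function name is hypothetical, and the set of extensions is simply copied from the note above, so it will go out of date whenever ChatGPT changes what it accepts.

```python
from pathlib import Path

# Extensions copied from the list in the note above (as of that writing).
ALLOWED = {
    ".docx", ".pdf", ".txt", ".md", ".xlsx", ".csv", ".jpg", ".jpeg",
    ".png", ".tiff", ".json", ".xml", ".pptx", ".mp3", ".wav", ".mp4",
    ".html", ".mhtml",
}


def split_by_support(filenames):
    """Return (supported, unsupported) lists based on the file extension."""
    supported, unsupported = [], []
    for name in filenames:
        # Compare in lower case so "CENSUS.XLSX" is treated like ".xlsx".
        if Path(name).suffix.lower() in ALLOWED:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported
```

Anything that lands in the second list would need to be converted or exported to one of the listed formats first.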

At the end of this step:

The chats and source files had been grouped together into a project.

Next step: I decided to look at the data from city directories.

Google NotebookLM Tutorial


This is it! You have decided to give Google’s NotebookLM a try!

Maybe you want step-by-step instructions, or just want to look over the process before diving in. Either way, this tutorial stands ready to help.

What will you do in this Notebook? One suggestion is to upload a group of documents related to a subject or ancestor. These are documents that you want to understand better or analyze. Don’t overthink it. You just need to have an idea of your subject, because once you begin to use the Notebook more ideas will probably come to you.

In this tutorial, we’ll get started with a brand new NotebookLM, add documents to it, then based on those documents generate an Audio Overview, an Infographic, a Slide Deck and a Video Overview.

NOTE: For this tutorial, keep in mind that Google may change how it looks or add/remove specific functionality and labels at any time, but the basic ideas will remain.

When you have decided the topic for your Notebook, it’s time to get going and create it.

In my example I will add only a few documents: the homestead patents and pages from the tract books for Charles F. Gilroy.         

Here’s the link:

https://sites.google.com/view/notebook-lm/login

NotebookLM Login Page

Login to your Google account here. If you are already logged into Google in the same browser, you may go directly to this page:

NotebookLM Welcome Page

You’re in!

Select Create new notebook to start.

After you have created a new notebook, a window pops up asking you to add media. (This is the same window that will open when you select + Add sources)

As of this writing the Notebook supports: Google Docs, Slides, PDFs, text files, web URLs, YouTube transcripts, and audio files. When you enter a link to a YouTube video, only the transcript will be used and the video has to be public.

For best results, enter documents with text in them. There is no guarantee that images will be transcribed properly.

From this window you can drag and drop the files you want to add to your Notebook.

NotebookLM adding sources

When adding to this Notebook, I have to admit that I did not follow the text-is-best rule. That means I will need to verify the transcription that the Notebook is using was done correctly. I added Land Patents and Tract Book images. (The Tract Book images had been located by FamilySearch Full-Text Search!)

On the left, I selected one of the sources, and viewed a description of the document containing key information from it that had been extracted. 

NotebookLM Source Guide

The workspace that opens is called the Notebook, and it has three windows labeled: Sources, Chat, and Studio. The first two are self-explanatory.

The third window is the Studio Window, which is also called the Studio Panel.

There are two sections within the Studio Panel. One section is home to the buttons, called Action Tiles, where you ask the Notebook to generate complicated multimedia products. By selecting an Action Tile, you ask the Notebook to generate audio or visual presentations, infographics, slide decks, reports, mind maps and more. At this point, several Tiles are labeled “Beta,” which means they are almost ready to be full-fledged features but are still being evaluated. Do not let that dissuade you from trying them! Test them out for yourself.

The second section is the Generated Resource List. When you request a product, you will see it added to that list. The list is empty for a new Notebook. As you choose products, the list is populated with the generated media. Next to each resource that appears in the list there is a 3 dot menu (snowman) where you can Rename, Download, Share or Delete a resource. When you rename a resource, that changes only the name and does not change any of the media’s content.

NotebookLM three windows

After uploading the documents, a name for the notebook was automatically generated.

NotebookLM Sources Window

I renamed the Notebook.

NotebookLM after updating Notebook name

Audio Overview

First, I tried an Audio Overview based on the few documents I had uploaded. This action offers to “Generate an AI podcast based on your sources.”

NotebookLM Audio Overview Tile Detail

Documentation for the Notebook had explained that it may take some time for the Audio Overview to be generated.

NotebookLM Studio Panel Audio being generated on Generation Resource List

Within minutes, I was listening to audio in a podcast format of two people explaining and discussing the documents and their context in a pleasant conversation presentation. It was 19 minutes, 12 seconds in length.

NotebookLM Studio Panel Audio on Generation Resource List

A clip from this audio is here:

Infographic

Next, I decided to generate an Infographic based on the documents.

NotebookLM Infographic Tile Detail

In the Generated Resource List at the bottom of the Studio Panel, there was a spinning circle to indicate that the infographic was being generated. When it was done, I could select it from the list.

I clicked on the Infographic in the list in the Studio window

NotebookLM Studio Panel Infographic on Generation Resource List

and a Viewer opened up. I had options to share, download, collapse the Viewer and close the Viewer in the upper right hand corner.  

NotebookLM Infographic Window

After I closed the Viewer, I could click on the snowman (3 dot menu) to be presented with options: Rename, Download, Share, Delete.

This is one of the features that is in BETA, but the infographic that was generated was interesting.

Slide Deck

An option is to generate a Slide Deck. At this time, this feature is in BETA.

NotebookLM Slide Deck Tile Detail

I selected Slide Deck and waited while it was generated

NotebookLM Studio Panel Generation Resource List Overview

When I clicked on the Slide Deck in the Resource List, a Viewer opened up where I could look at the slides, and interact with them.

NotebookLM Slide Deck overview window

I particularly liked this slide

NotebookLM Generated Slide

NotebookLM Generated Slide

I also liked the option to download the slide deck as a PDF or a PowerPoint document.

download the slide deck as a PDF or a PowerPoint document

Selecting “Revise” gives you the chance to interact and make changes to the slide. The pending changes will be generated in a few minutes (or longer).

Video Overview

I selected the Video Overview Tile

and accepted the default selections, which included the longer Explainer format.

NotebookLM Customize Video Overview Window – Explainer Format

Generating that video took a long time. When I asked Gemini whether I could find out how long it took to generate a product, I was told no, but that this task usually took from 5 to 30+ minutes.

NotebookLM Generated Resource List

At the end of that response, Gemini asked me if generating was taking a long time, and when I said yes, Gemini recommended that I refresh the webpage because the user interface had not updated. When I followed this recommendation, it appeared that the Video Overview generation had failed.

NotebookLM Generated Resource List – Video Overview failed

I deleted the Video Overview entry on the Generated Resource List, and tried again. This time I selected the option for a Brief Format.

NotebookLM Customize Video Overview Window – Brief Format

The brief format video was generated within minutes, providing me with a video 1 minute and 50 seconds long.

NotebookLM Generated Resource List – Video Overview

When I clicked on the Video Overview in the Generated Resource List it opened a window within the Studio Panel. The video gave the context of the Homestead Act then dove into presenting data about the two homesteads and their patents.

An excerpt from the video:

An Experiment in the Chat Window

I have engineering experience in testing, which matches my style of pressing the buttons and trying the features. That made me want to see if I could get some general information in a Chat within the Notebook.

I asked in the Chat window of the Notebook: If I upload a Word document with newspaper clippings can you transcribe all of them?

This was answered literally, using only the data within the Notebook. (At that point, there was no Word document in the sources containing newspaper clippings.) So if you have a general question that is not based on the information loaded into the Notebook, or have a question about how NotebookLM works it would be better to ask it in Google so that Gemini can answer it.

Gemini told me that “…if the clippings are embedded as images (e.g., photos or scans of newspaper pages), NotebookLM may not automatically transcribe that visual information into searchable, readable text” reminding me that “NotebookLM is designed to work with machine-readable text. If your Word document contains photos of newspaper clippings, the AI may be unable to “read” or transcribe the text inside those images.”

Getting back to my Notebook

When you need to revisit your Notebook, or login on a different computer, you can choose it from your list of Recent notebooks.

NotebookLM Recent Notebooks

Current Limitations

According to Gemini, currently free accounts have limits of generating approximately 3 Audio/Video Overviews per day, and can only send 50 chat queries per day. The Free accounts are limited to 50 sources per notebook, and are limited to 100 notebooks. (Workaround for large projects: Try combining multiple, smaller documents into a single PDF or Google Doc before uploading.)

Google has a tutorial that provides good information in an overview, and it can be found at: https://sites.google.com/view/notebook-lm/tutorial

Give this a try and explore the Tiles and Chat. Let me know how you do.

Have You Tried Google’s NotebookLM Yet?


Trying out NotebookLM has been on my to-do list for months. I just did, and I was blown away by it. The accessibility of technologies that I knew existed but had not seen so well integrated was impressive. You can chat with the AI about what has been added to the Notebook, and you can generate products based on the uploaded documents. The AI-generated media and responses in the Notebook are all based on the documents that you upload to it, which should reduce the opportunity for AI hallucinations. Keep in mind that the best idea is to enter documents with text; there is no guarantee that images will be transcribed properly.

I had already identified a couple of ancestors as test cases. One is an all-time family favorite who was born and raised in Newport, Rhode Island, served in the Army during the Spanish-American War, then settled on a homestead out in Oregon. He was a poet and a raconteur who loved to travel and was always involved in social movements.

Another ancestor is one of my brick walls. He is the only German immigrant in my tree (so far), and while I have clues about his origins in Germany, I cannot pin down his arrival to the United States or from whence he came. What I have learned about him is in the U.S., and begins when he was married to an Irish woman, after he had anglicized his name. From the time of his marriage, he never lived near other German immigrants. Very knowledgeable and generous researchers in Brooklyn, New York, and in Germany have helped me follow up on the very limited clues I have developed. The ability to pull together the material and look at it from different perspectives has the potential to help with this brick wall.

If you have not had a chance to try out NotebookLM, here is the link:

https://sites.google.com/view/notebook-lm/login

NotebookLM Welcome Page

If you are interested, I have put together a step-by-step tutorial that will get you started here: Google NotebookLM Tutorial.

Surname Study and AI Part 2: Collecting Census Data


In the Surname Study and AI Part 1 post, I described the reasons that motivated me to undertake a surname study in Rhode Island, US, and the approach I took. The use of AI tools to help with formatting, visualizing and analyzing data is a goal in this latest iteration of the project.

Both US Population and Rhode Island State Census data were used as a backbone for the study.

My next step was to use AI to capture the transcriptions of key record information from the censuses, and work to normalize it. For this first step, I decided to limit my search to census databases, for exact and similar spelling of the surname, using the exact location of Rhode Island, USA. Even though I collected the images of the census, I collected the data presented on the Record Page to populate the columns of the spreadsheet.

My search settings were:

Last name: Gilroy; Slider: Exact and similar

Lived in: Rhode Island, USA ; Slider: Exact

Focus: United States [this setting was not necessary because I searched for records specific to the United States and Rhode Island]

On the search results page, I used filters to narrow down to one census at a time so that I could collect the data.

Thanks to a great idea I learned from Jon Smith of the North Carolina Genealogical Society, I decided to use Ancestry.com in a Chrome browser with Gemini AI enabled to capture the Record pages.

Gemini in top of Chrome Browser

If you do not see Gemini on the top of Chrome:

First, be sure that you are logged into your Google account. You can do this by logging into your Gmail account in the browser.

Then, try this to enable Gemini in Chrome:

Click the three dots (More), and select Settings from the menu

In Settings, click AI innovations in the left menu, then select Gemini in Chrome.

Chrome Settings to use Gemini
Chrome Preference to open Gemini

To collect the data in the US Census, I signed into HeritageQuest in the Chrome browser. Always check your county library, as HeritageQuest may be free to access from home.

I searched for all the occurrences of the surname in Rhode Island, one census at a time for the 1850, 1860, 1870, 1880 and 1900 US Censuses. My plan was to collect one line of data for each name that appeared in the search results.

These are example results for the search for exact and similar surnames to Gilroy.

Example HeritageQuest Search Results Page

Example Search Results Page (courtesy HeritageQuest.com)

From the 1860 US Census Search Results Page, I right clicked on the View button to open each Record in a new tab.

Example HeritageQuest Record Page

Example Record Page (courtesy HeritageQuest.com)

Gemini in top of Chrome Browser

Some of the issues and limitations that I found may be due to the fact that I use a free version of Gemini. I had to work on my prompt to have the data captured in a Comma Separated Values (csv) format, so that I could use the data from the transcription of the record in my Excel spreadsheet. I tried to have Gemini decide what to label the columns, but it worked out better when I told it the names of the columns in the prompt.

NOTE: Later on, Gemini and I decided to format the collected data in Markdown tables. This simplified the process, because the data could be pasted directly into the Excel worksheet.

In the interest of time, I copied all the data from one Record page and asked ChatGPT to extract the data tags, using the prompt:

keep only the data tags such as Name, Age, etc and show them in a comma separated sentence on one line.

That provided me with column names which could then be used in the Gemini prompt. (This was done once for each census.) That way the line for each enumerated person in a worksheet would have the same data in the same columns.

In my type of account (free), Gemini would only look at ten open tabs in the Chrome browser as input to a prompt, so I knew that I would have to collect the data in steps. Gemini wanted to jump right in and give me analysis based on the data in those tabs, and it took some coaxing through prompt refinement to get the data in a form to put into a spreadsheet.

I added tabs using the plus sign until I had selected the Current tab and 9 others to share with Gemini. (When you select more than 10 tabs a warning appears: “Only 10 tabs can be shared.”)

Select Multiple Tabs as Input to the Gemini Prompt in Chrome Browser


Prompts may need refinement, and in this case Gemini and I chatted back and forth to get the results that I wanted. Gemini warned me that it could not directly create or download an Excel (.xlsx) file for me, but that it could format the data into a standard CSV (Comma Separated Values) format.

For the 1860 US Census, this is a prompt that I used in Gemini in the Chrome browser. This was the result of refinement, and needed to be changed slightly for each census.

For all open census records, extract the data and generate the full CSV text. For each record, transcribe it into a new row of the CSV . Put the CSV text in a canvas so that I can copy it from the prompt. Structure the output so that each record (the main person detailed on the page) is a single row, and list all their household members’ names in a single column titled ‘Other Household Members (Names)’. **Only transcribe data explicitly visible in the current tab’s detail and household sections.**

Here are only the data tags, formatted as a single comma-separated line:

Name, Age, Birth Year, Gender, Race, Birth Place, Home in 1860, Post Office, Dwelling Number, Family Number, Occupation, Real Estate Value, Inferred Spouse, Household Members (Name)

**For any column field where data is not transcribed, insert a blank space to ensure all records have identical column structures.**

The response included this CSV text.

CSV from the Gemini Canvas

I used the copy icon at the top right to capture the CSV text, and pasted it into an open Notepad file. The Notepad file was saved as type “All files” and I created a file name ending with the extension “.csv” (CSV = comma separated values)

Save Notepad file as CSV

Then I opened the CSV file in Excel, and copied and pasted the lines into the Excel worksheet.
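Before pasting into Excel, it can be worth checking that every line of the CSV text really has the same columns. Here is a minimal Python sketch of such a check using the standard `csv` module. It is an optional extra and not part of the steps I followed; the function name and the wording of the messages are my own inventions for illustration.

```python
import csv
import io


def check_csv(csv_text, expected_header):
    """Parse CSV text and report rows whose field count is off.

    Returns (records, problems): the data rows after the header, and a
    list of human-readable messages (empty when everything lines up).
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    if not rows:
        return [], ["no rows found"]
    header, records = rows[0], rows[1:]
    problems = []
    if [h.strip() for h in header] != expected_header:
        problems.append("header differs from the column list in the prompt")
    # Line numbers start at 2 because line 1 is the header.
    for line_no, record in enumerate(records, start=2):
        if len(record) != len(expected_header):
            problems.append(
                f"line {line_no}: {len(record)} fields, "
                f"expected {len(expected_header)}"
            )
    return records, problems
```

The `csv` module handles quoted values that contain commas, such as a list of household members inside one column, which is exactly where a hand count of commas goes wrong.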

It seemed that when Gemini was used in the browser, it did not have a large memory, so I would have to reload the prompt during my next session. (Always save your prompts!) Sometimes Gemini wanted to use older data for the task I was giving, so I needed to modify the prompt to remind it to only work on the set of selected tabs.

Since this version of Gemini-enabled browser only allowed me to work on 10 tabs at a time, I stepped carefully through the results to be sure that each person with a name that was Gilroy or similar was included.
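The bookkeeping for working ten tabs at a time can be sketched in Python as well. This is just an illustration of the stepping-through idea; the function names are hypothetical and the limit of 10 is the one I ran into with my free account.

```python
def batches(items, size=10):
    """Yield consecutive groups of at most `size` items, in order."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def not_yet_collected(search_results, collected):
    """Return search results that have no collected row yet, in order."""
    done = set(collected)
    return [name for name in search_results if name not in done]
```

For 23 search results, `batches` gives groups of 10, 10 and 3, and `not_yet_collected` shows which names still need a pass after each session.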

In an Excel spreadsheet, I pasted the data from the 1860 census in a worksheet, and labeled its tab with the year and the type of census: “1860 US Census.”

I repeated these steps for each US Population Census.

The Rhode Island state censuses are available on Ancestry.com, and I repeated the same process for each one.

Engineers do enjoy visualizing data, so using Excel, I created a graph of the number of individuals with the exact surname Gilroy or a similar surname for each type of census. Then I combined the number of individuals from both types of censuses, for all available years. Note: the US Census for 1890 and the RI State Census are unavailable.

graph US Census Results for Gilroys in Rhode Island by Year
graph RI Census Results for Gilroys in Rhode Island by Year
graph Census Results for RI Gilroys by Year Combined
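I made my graphs in Excel, but the counting behind them is simple enough to sketch in Python. The function names here are hypothetical, and the input is assumed to be a list of (census year, name) pairs, one per worksheet line.

```python
from collections import Counter


def count_by_year(rows):
    """Count individuals per census year from (year, name) pairs."""
    return Counter(year for year, _name in rows)


def combine(us_counts, ri_counts):
    """Merge federal and state counts into one dict sorted by year."""
    combined = Counter(us_counts)
    combined.update(ri_counts)  # adds counts if a year appears in both
    return dict(sorted(combined.items()))
```

Because the federal and state censuses were taken in different years, the combined result simply interleaves the two series, giving a data point roughly every five years wherever both survive.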

The story that I know from my hands-on analysis involves people with the Gilroy name arriving and departing Rhode Island through immigration or moving from or to another state in the US. The number of individuals with the same surname varied by marriage, birth and death. Women would either gain the surname through marriage, or lose it when enumerated using their husband’s surname.

Even though I did collect the citations from Ancestry.com, they are not sufficient for publication and I would have to do some more work to create any citations. There are limits to the approach I used. The enumerators may not have visited all the people who shared that surname, and different transcription efforts may result in different spellings of the surname.

At the end of this step: I had an Excel spreadsheet, with a worksheet for each census. Each worksheet contained a line for each person who was enumerated in the census as having the exact surname Gilroy or a similar surname that was present in the online databases. Each column in a census worksheet has the same type of data, or was blank, for ease of analysis.

Excel spreadsheet, with a worksheet for each census.
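
The idea of keeping every column the same type of data, and leaving a cell blank when a census did not record it, could be sketched in Python as follows. This is only an illustration: the column names and the sample record are hypothetical and are not the actual headers or data in my spreadsheet.

```python
# Hypothetical shared header; the study's real column names may differ.
COLUMNS = ["surname", "given_name", "age", "birthplace", "occupation"]

def align_row(record, columns=COLUMNS):
    """Return the record's values in the shared column order.

    Any column the census did not capture is left as an empty string,
    so every worksheet keeps the same layout.
    """
    return [str(record.get(col, "")) for col in columns]

# Made-up example record with no birthplace or occupation.
record_1860 = {"surname": "Gilroy", "given_name": "Mary", "age": 30}
print(align_row(record_1860))
```

Keeping the blanks as empty strings, rather than a dash or other filler, means a later step can test for missing data without special cases.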

Next, I can use an AI tool to analyze the data in each census, and across censuses. My goal is to identify family groups as well as individuals and track their changes through the years of interest.

Surname Study and AI Part 1: The Approach

blog banner - Blog Post Surname and AI 1

This blog post begins a series of posts exploring an ongoing surname study and my recent use of artificial intelligence (AI) in it. In this post, I will describe the history of getting to this point in my efforts.

Over the course of several years, I have been working on a surname study. My goal was to find out if and how families who lived in Rhode Island from 1850-1900 were connected. Chain migration to the United States from Ireland was entirely likely, and by connecting these family units I could potentially research collateral relatives to learn more about the family unit(s) back in Ireland.

Using what I had learned from researching my direct ancestors, these were the parameters:

  • Surname: Gilroy
  • Place: Rhode Island, US
  • Timeframe: 1850-1900

For this project, I collected both federal and state census data to use as the backbone of the research. Then I built upon the intermediate years using vital records. I faced some challenges when collecting the data. At that time, Rhode Island censuses and vital records were obtained by mailing requests to an incredibly helpful and knowledgeable staff at the Rhode Island State Archives. Copies of the records were available for modest fees, but you did need to supply data about the record you sought. (Contrast that with the ability to search for everyone with the same or similar name in a record set through a digital database.) At the time, that meant that some of the names came from index-only databases as placeholders until copies of the original records could be found. An index of vital records for the state was available on Ancestry, as was a composite of indexed city directories which formed an 1890 US Census substitute.

Another challenge was correlating dissimilar data. Just as every federal census asks different questions, so does every state census. Vital records change what data is recorded over time, too. The data found in city directories is also different from the other records, containing addresses and occupations but lacking explicit family connections.

My main product was an Excel spreadsheet with tabs for the data collected from each record type by year. I worked to reconcile the different data collected from similar record types. From that spreadsheet, I extracted family units, capturing them in PowerPoint to visually show how the family units changed over time. This gave me some insights but was labor intensive. I contemplated my next steps, knowing that analyses of ages, appearances of people with the same surnames in Rhode Island, and child naming patterns, as well as mapping the neighborhoods were among them.

Fast-forward to now, when more records are available online. For example, in addition to the vital record indexes, images of the RI vital record ledgers are now online. The Rhode Island state censuses are also online. And then there is AI to help with formatting, visualizing and analyzing data.  

Some challenges still exist. There are gaps in census coverage, because the 1890 US Population Census and the 1895 Rhode Island Census are no longer available. The use of other record types will help to fill in the census gaps. A state-specific challenge is that the 1885 Rhode Island Census is available only as an alphabetized index of names, requiring family units to be reconnected using data in the “Family Number” column.
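
To show what reconnecting family units from an alphabetized index might look like, here is a small Python sketch. The names and family numbers are invented for illustration, and the field names are my own. It also assumes a family number is unique across the whole index. If the numbers restart in each town, the grouping key would need to include the town as well.

```python
from collections import defaultdict

# Made-up index rows, in alphabetical order as an index would list them.
index_rows = [
    {"name": "Gilroy, Ann", "family_number": "112"},
    {"name": "Gilroy, James", "family_number": "47"},
    {"name": "Gilroy, John", "family_number": "112"},
    {"name": "Gilroy, Mary", "family_number": "47"},
]

def group_by_family(rows):
    """Collect names under their shared family number.

    Assumption: a family number identifies one household across the index.
    """
    families = defaultdict(list)
    for row in rows:
        families[row["family_number"]].append(row["name"])
    return dict(families)

families = group_by_family(index_rows)
for number, members in families.items():
    print(number, members)
```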

The state of AI is constantly changing, but I decided to investigate how AI could help with the collection and analysis of data.

I did try an analysis of the whole spreadsheet in ChatGPT, and I had been able to create family groups and use them to discriminate between some people who had the same name. However, the data was not combined in an efficient manner, and rather than have one large spreadsheet, I decided it would be more understandable to break the data into more manageable pieces, based on the record types. The composite spreadsheet was broken down into different spreadsheets: (1) censuses, (2) births, marriages, and deaths and (3) city directories. I also decided to use AI to help with the data collection process, the analysis and different ways to visualize the data.

At the end of this step: I had a basic plan to redo the data collection and collect additional data that had become available online, and I had developed ideas on how AI could support this study. The next step will be to use only census data and have AI create the backbone of a timeline for the individuals and families.