University Libraries

On the Books: Applying machine learning to Jim Crow laws

A scholarly project on century-old legislative history got a boost from cutting-edge librarianship when University of South Carolina Libraries partnered with the USC Center for Civil Rights History and Research to create On the Books in South Carolina, a website that aggregates Jim Crow laws passed in the Palmetto State between 1868 and 1968 and makes them readily searchable, allowing researchers to explore them by year, location, topic, and more.

Funded by a grant from the Andrew W. Mellon Foundation, the On the Books project originated at the University of North Carolina at Chapel Hill, with subawards to USC and the University of Virginia. The goal was to explore all laws passed in Virginia and the Carolinas in the period between Reconstruction and the Civil Rights Act to determine the scope and extent of Jim Crow language that restricted the rights and freedoms of African Americans. That wasn’t as easy as it seems, both because of the sheer volume of laws passed in that hundred-year timeframe and because many of the laws were written using language that disguised their intent and impact.

The large scale of the project made it an ideal candidate for machine learning, a kind of artificial intelligence that identifies patterns in large datasets. But the obfuscatory language in which the laws were written posed challenges for the AI model’s capacity to reliably differentiate between Jim Crow and non-Jim Crow laws.

In that respect, the project made manifest both the potential and the limitations of AI – and showcased the new and emerging ways in which librarians like Kate Boyd, University Libraries’ Director of Digital Research Services & Collections, can help faculty and students manage, and make the most of, their data.

Despite Boyd’s significant experience with data management, the On the Books project involved a larger dataset than she’d worked with before. That, she says, made it a great learning experience. “It helped me understand the importance of organizing and cleaning your data,” she notes, “and how long that can take before you even start running your code. So I feel much better equipped now to provide support for others around campus who are doing similar projects.”

Clean, organized data increases the efficacy of a large language model (LLM), but in this case the AI tool still required significant human assistance to distinguish between Jim Crow and non-Jim Crow laws with a reasonable degree of accuracy. After an initial 15,000 sentences were fed through the LLM and it made its determinations about which ones should be flagged as Jim Crow, two fellows from the Joseph F. Rice School of Law’s Constitutional Law Center reviewed the sentences to assess how well the tool had done. Based on their recommendations, the team then used those same sentences, now labeled Jim Crow or non-Jim Crow by the legal scholars, to train the LLM to become smarter in its own assessments of what constituted a Jim Crow sentence. Only after that training was complete was the entire corpus of 300,000 sentences run through the AI tool.

At that point, the results were reviewed by another set of experts – a team of historians from the Civil Rights Center, who fine-tuned the data once again, identifying some 500 false positives that the machine had labeled as Jim Crow but the historians determined were not, and 875 sentences that the machine couldn’t decide how to label and that the historians identified as Jim Crow.
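One common way to end up with an "undecided" group like the one the historians reviewed is to set cutoffs on the model's confidence score and send everything in between to a human. The sketch below is illustrative only: the article does not say how the team's tool produced its undecided cases, and the function name `triage` and the 0.4 and 0.6 cutoffs are assumptions made for the example.

```python
# Hypothetical cutoffs; the article does not report the values the team used.
LOW_CUTOFF = 0.4
HIGH_CUTOFF = 0.6


def triage(score: float, low: float = LOW_CUTOFF, high: float = HIGH_CUTOFF) -> str:
    """Sort a sentence into a bucket from its model score.

    `score` is assumed to be the model's estimated probability (0 to 1)
    that the sentence is Jim Crow language.
    """
    if score >= high:
        return "jim_crow"
    if score <= low:
        return "not_jim_crow"
    # Scores between the cutoffs are left for human experts to decide.
    return "needs_review"


# Made-up scores, purely to show the three outcomes.
for example_score in (0.92, 0.51, 0.07):
    print(example_score, "->", triage(example_score))
```

Widening the gap between the two cutoffs sends more sentences to human reviewers and fewer to automatic labeling, which is the trade-off a team would tune against the reviewers' available time.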

That error count of roughly 1,375 sentences (some 500 false positives plus 875 undecided) out of 300,000 means the machine made correct assessments about 99.5% of the time. Still, says Boyd, “we could have increased its accuracy if we’d run more labeled data through it beforehand.”
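The 99.5% figure can be checked from the numbers reported above. This is a back-of-the-envelope calculation, assuming every sentence the historians corrected counts as a machine error and treating the "some 500" false positives as exactly 500; the variable names are chosen for the example.

```python
# Figures as reported in the article (the 500 is described as approximate).
false_positives = 500    # labeled Jim Crow by the machine, rejected by historians
undecided = 875          # left unlabeled by the machine, judged Jim Crow by historians
corpus_size = 300_000    # sentences in the full corpus

needing_correction = false_positives + undecided
accuracy = 1 - needing_correction / corpus_size

print(f"{needing_correction} of {corpus_size} sentences needed correction")
print(f"accuracy: {accuracy:.1%}")
```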

Notable findings from the project included that the period of “maximum ‘Jim Crow’ language” in SC legislation was between 1892 and 1897 and that the most frequent specific topics of Jim Crow laws were school districts and elections. Those findings and more are available on the On the Books website, which was also created by University Libraries and which is searchable by year, county, volume, or topic. A data visualization dashboard facilitates comparative interpretation of the data as well.

Just as optimization of the LLM required contributions from team members with various kinds of expertise, the project as a whole benefited from the support not just of librarians but also of computer scientists and web developers, including four undergraduates.

While coordinating between faculty, staff and students across multiple divisions created logistical challenges, the results, says Boyd, are more than worthwhile. “For the Civil Rights Center, the website is a very useful tool for teaching and research,” she notes. “And for the Libraries, this project was a terrific learning experience. It was our first attempt at using machine learning on a large body of text, and now we’re better equipped to help others around campus with similar projects.”

In fact, Boyd says, the ability to wrangle and extract meaning from large data sets has the potential to expand possibilities for researchers all across campus: “This project leads to the question, what else can we do with this data? Once you have well-curated data, you can use it in so many ways. And the AI tools keep improving too, even since we started the project in 2022. So that’s really exciting.”

 

About On the Books

On the Books in South Carolina: Mining for Jim Crow Laws is a “collections as data” and machine learning text analysis project by the University of South Carolina (USC) Libraries. Following UNC's lead, the USC team of legal scholars, historians, and computer programmers created a text corpus of South Carolina legislative session Acts and utilized machine learning techniques to discover Jim Crow language and effect in laws passed in the period between Reconstruction and the Civil Rights Movement (1868-1968).

The corpus is made up of the digitized volumes of the Acts and Joint Resolutions of the General Assembly of the State of South Carolina from 1866 to 1968.

 

The On the Books Team:

Co-Investigators

Kate Boyd, Lead-PI, is Director of Digital Research Services and Collections for the University of South Carolina Libraries. 

Dr. Bobby Donaldson, co-PI, is Executive Director of the Center for Civil Rights History and Research and Associate Professor of History at the University of South Carolina.

Lance Dupre, co-PI, is Digital Repository Librarian for the University of South Carolina Libraries.

 

Staff

Legal Scholars

Axton Crolley, a legal scholar for the grant, was a Constitutional Law Fellow at the University of South Carolina Constitutional Law Center (CLC).

Taylor Pipkin Callahan, legal scholar for the grant, Assistant Director of the Constitutional Law Center

 

Historians

Christopher Frear, Ph.D., Center for Civil Rights History and Research

Rebekah Turnmire, Ph.D. Candidate, Center for Civil Rights History and Research

Jill Found, Ph.D., Center for Civil Rights History and Research

 

Computer Scientists

Vandana Srivastava, technical lead on the grant, Ph.D. Candidate in Computer Science at the University of South Carolina

Nitin Gupta, technical team, undergraduate majoring in Physics and Computer Science

Hannah Gardner, technical team, USC Computer Science 2022 graduate

Briannah Carrol, technical team, USC Math and Computer Science 2024 graduate

 

Web Developers

Kristin Harrell, Ph.D. in English, CTE Instructor

Jalen Freeman, undergraduate, majoring in Computer Engineering, web developer

Kristin designed the website and created the Tableau graphs and dashboard. Jalen added all of the Acts and metadata to the website.

 

Further Support

Derek Black is professor of law and the Ernest F. Hollings Chair in Constitutional Law at the University of South Carolina School of Law. He directs the law school’s Constitutional Law Center (CLC).

Rebekah Maxwell, Associate Director for Library Operations at the University of South Carolina School of Law library. 

