Elisa Beshero-Bondar
Penn State University
This sequence of assignments progressively introduces students to natural language processing (NLP) through repeated prompt experiments with ChatGPT. Students are beginners learning Python and NLP. Accessing ChatGPT and running prompt experiments successfully gave them the basis for investigating the cosine similarity of word embeddings across multiple responses to the same prompt. These assignments succeeded in introducing students to NLP using short generated texts before they began experimenting with larger text corpora.
Learning Goals:
Original Assignment Context: intermediate-level course in the middle of a core sequence in Digital Humanities
Materials Needed: spaCy (a freely accessible, open-source NLP library for Python) and one of its language models; AI text generation tools (ChatGPT was used)
Time Frame: ~3 weeks
Introduction
Can we engage students with AI to interest them in natural language processing? This is a pedagogical experiment to orient students to ChatGPT as an interactive conversationalist, one that depends on natural language processing (NLP) they can access, measure, and influence with some orientation to Python code. The experiment could work well in courses where students are exploring NLP, Python, or automated text-generation tools. In my Digital Humanities course, we took very gentle steps from tinkering with ChatGPT to organizing its outputs, together with their prompts, as collections of short text documents for reading with Python and NLP tools. We applied a spaCy language model as a starting point for learning about word embeddings and calculations of similarity and for exploring a very simple research question, and this made for a topical and highly relevant beginning to a university semester.
My course is called Large-Scale Text Analysis, and it is taught in the Digital Media, Arts, and Technology (DIGIT) program at Penn State Behrend. It is DIGIT 210, an intermediate-level course in the middle of a core sequence in Digital Humanities required of all students in our major. My students are undergraduates with experience in digital art production, structured markup (HTML and XML), git, and the command line. In my classes, students learn to build web archives of digital resources and to transcribe and encode cultural heritage resources (like photo-facsimiles of manuscripts) to create archival websites. But we do not expect students to have any background in Python programming or natural language processing before they take this course.
I am most comfortable teaching with markup technologies, which reflects my roots in Digital Humanities and the Text Encoding Initiative (TEI). Markup technologies involve "angle-bracket" tagging of structures and patterns with the eXtensible Markup Language (XML), and processing with languages called XSLT and XQuery, used for querying, analysis, and visualization of document data and metadata. One reason I am comfortable with "the XML stack" and markup processing is that with these technologies, students are the decision-makers and command their own document data from tagging to visualization. If the results of a markup project raise questions, we can return to the tagging and observe what we have missed or interpreted in a problematic way. By contrast, with NLP I am teaching outside my comfort zone, because here we tap into libraries and statistical processing that are remote and packaged by others: we can tinker with the algorithms, but we cannot be quite so clear about their significance or margin for error. Working with a very large text corpus and complex statistics-based algorithms introduces uncertainties about how much we miss by relying on external taggers and automated tools. I write Python and I do some natural language processing in my research projects, but I worry about teaching with them when it can be difficult to validate the "results" of NLP. Lacking much formal training in statistics, I am still finding my way with methods for analyzing so-called "unstructured text" and working with the NLP data that drives large language models like ChatGPT. In my teaching and research, I believe markup approaches can complement NLP, and I am seeking a balance between these methods. Each time I teach my Text Analysis course, I try to find that balance while experimenting more and more with NLP methods. For example, marking up regularly structured text corpora in simple XML can facilitate cleaning the source documents and discovering the regular patterns of their data. By autotagging the documents, we can also improve the curated data set with simple markup that makes explicit what was previously available only in patterns of punctuation or lineation in the so-called "plain" text of the original files.
In the past two cycles of my Text Analysis course, taught in 2020 and 2021 during the first years of the pandemic, I relied on my foundations. My emphasis (until now) has been simply on providing introductory access to popular NLP tools like spaCy (a freely accessible, open-source NLP library for working with Python, which automatically supplies named entity recognition, part-of-speech identification, syntax parsing based on how words and word particles cluster together, and more). My students built large-scale projects by collecting publicly available text archives and then applying markup in an automated way: we would perform a careful document analysis to study patterns in the formatting of document collections, and "autotag" them by applying regular expression matching to explicitly mark their features and structures. By recognizing regular expression patterns, we could quickly tag all the speeches in a collection of screenplays, for example, so that we could later extract just the speech content to explore with NLP tools. We could then extract and output a list of all the distinct action verbs in the speeches and rank how frequently they were used. To this point we have generally relied on the convenience of spaCy's small language model to handle named-entity and part-of-speech analysis, with cautions about the interesting flaws we would find. However, I have wanted to improve the NLP unit to engage students in more experimental work: to explore, train, and fine-tune calculations of similarity and topic modeling.
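As a small illustration of this kind of regex-based autotagging, the sketch below wraps a regularly formatted screenplay speech in simple XML. The sample line, pattern, and element names are hypothetical illustrations, not the actual class corpus or markup scheme.

import re

# Hypothetical example of "autotagging" a regularly formatted screenplay line.
# The pattern and element names are illustrative only.
speech_pattern = re.compile(r"^(?P<speaker>[A-Z][A-Z .'-]+):\s*(?P<speech>.+)$")

line = "NARRATOR: It was a dark and stormy night."
match = speech_pattern.match(line)
if match:
    tagged = '<speech who="{}">{}</speech>'.format(match.group("speaker"), match.group("speech"))
    print(tagged)
    # <speech who="NARRATOR">It was a dark and stormy night.</speech>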
ChatGPT's prototype launch on November 30, 2022 came just at the moment when I was thinking about improving the NLP material in my Spring 2023 course. Experimenting with the chat led quickly to reading OpenAI's API reference documentation and realizing that OpenAI was effectively encouraging people to try its models in their own programming projects. OpenAI moved NLP methods to the foreground and was very much encouraging interaction not just with the interface but with the word embeddings data from its training model. It seemed to be beckoning us educators to give it a try, to test and experiment not just with its capacity to write in a human voice but also to learn and share how it works. In December I began forming plans to involve experiments with ChatGPT directly in my course as a way to introduce NLP.
I had been experimenting with ChatGPT myself, making it regenerate its responses to prompts, and found myself fascinated by which kinds of prompts would most likely make ChatGPT generate divergent responses, and how it might diverge. Asking it to locate famous people who share a surname like "Shelley," for example, could prompt a mix of responses, some of which treated Shelley as a first name. Similarly, asking ChatGPT for source citations for its statements is a well-known source of highly creative, properly formulaic, and almost certainly erroneous bibliographies. I decided this might be a creative place to begin experimenting with how natural language processing works to make statistically based predictions of the most satisfying or best-fit responses.
Students began the semester with assignments to test ChatGPT.
All students have worked with git and GitHub before and need to establish their workspace on their personal GitHub repos, so this assignment combines a review of their GitHub workflow from previous semesters with the challenge to craft prompts for ChatGPT and save the results as text files in their repositories.
The students will save their prompts and outputs from ChatGPT from this assignment, so that they can work with the material later during their orientation to natural language processing with Python.
ChatGPT and Git Review Exercise 1
This was followed by a second assignment due the following week. Submitting this assignment also meant the students needed to push new text files to their GitHub repositories.
ChatGPT and Git Review Exercise 2
For this assignment, come up with a prompt that generates more text than last round. Also try to generate text in a different form or genre than you generated in our first experiment. We'll be working with these files as we start exploring natural language processing with Python, so you're building up a resource of experimental prompt responses to help us study the kinds of variation ChatGPT can generate.
Design a prompt that generates one or more of the following on three tries:
In the Canvas text box for this homework, provide some reflection/commentary on your prompt experiment for this round: What surprises or interests you about this response, or what should we know about your prompt experiments this time?
Students came up with clever, inventive prompts, mostly concentrating on making ChatGPT output fictional stories based on a few details, like "write a story about a girl in a white dress" (which strangely resulted in stories in which the girl always lives in a small village near the sea). A student's prompt to "write a new Futurama episode" produced surprisingly long responses with full casts but noticeably lacking Bender's salty language. One student asked ChatGPT to write about "Prince Charles" (as known to the AI based on its pre-2021 training) entering the SCP-3008 universe and had to deal repeatedly with ChatGPT objecting that it would not produce sensational false news. Each time the student replied that "fiction is okay," and the exchanges yielded a collection of four very entertaining tales that we opted to use as a class for modeling our first Python assignment.
In contrast to my own experiments in December 2022, when I was prompting ChatGPT with slightly obscure names and references to historic people and events, my students were more interested in making ChatGPT write fiction, probably because it generated immediately divergent responses for them. For our purposes it did not really matter whether ChatGPT was outputting supposed fact or outright fiction. We simply needed a source for very small testing collections that could be used as a basis for comparing texts based on word embedding values, as an introduction to NLP.
In January 2023, I was surprised that most of my students had not been following all the excitement and dismay about ChatGPT that I had been eagerly following in December, though several students were aware of Stable Diffusion and other applications for generating digital art. I took time in the first days of the semester to discuss how this would likely come up for them in other classes as a source of concern for their assignments, and how we would be exploring it in our class. These conversations introduced some readings about ethical issues in the training of large language models, and led us to discussions of the data on which ChatGPT, Google, and Facebook trained their models.
In the next two weeks, students worked their way through PyCharm's excellent "Introduction to Python" course while also reading about word embeddings and ethical issues in AI. They annotated these readings together in a private class group with Hypothes.is.
Learn about Word Embeddings: Reading Set 1
I selected this pair of readings because I thought the second provided a stronger explanation of the simple and frequently quoted "man woman boy girl prince princess king queen monarch" example, so we could dwell on this in class discussion. How do the word vectors work? Students commented on Hypothes.is with amazement that you could do math on words: subtract "man" from "king," add "woman," and get something close to "queen." The readings and their illustrations helped us to understand on a small scale how word embeddings might work.
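For readers who want to try that classic analogy themselves, here is a minimal sketch using spaCy's static word vectors. It assumes a vectors-equipped model such as en_core_web_md; it is an illustration of the idea rather than part of the course assignments.

import numpy as np
import spacy

# Assumes a vectors-equipped model, e.g.: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

def cosine(a, b):
    # cosine similarity: 1.0 means the vectors point the same way
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = king - man + woman
print(cosine(target, queen))                         # relatively high
print(cosine(target, nlp.vocab["monarch"].vector))   # also fairly high
print(cosine(target, nlp.vocab["banana"].vector))    # much lower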
The next set of readings introduced ethical issues on a larger scale than those we had discussed in class.
Annotate Readings on Data Annotation and Labor Issues in AI
These readings helped to familiarize my students with the practice of annotating data sets to train AI models, a topic we had not yet discussed. They needed to learn why it is necessary to direct the AI training, and also to gain awareness of the human exploitation involved in speeding the process and purging systems of inappropriate content. Learning how large-scale AI models are produced in a corporate context gave us perspective on the problems of scale and on concerns about the ubiquity of AI modeling redirecting human lives and work. In this context I guided discussion toward the capacity for natural language processing to work with many different sizes of data sets.
The next assignment in the series oriented students to a Google Colab notebook where they could run executable cells in a Python script, and at the same time helped them to see how language models amplify gender bias they might not have been aware of from the simpler, yet nevertheless binary, "man woman boy girl prince princess king queen monarch" vector example given uncritically in their first reading.
Tutorial: Exploring Gender Bias in Word Embedding
Students explored the Google Colab notebook code during their orientation to Python, so this was a useful preview of NLP applications, as well as an important hands-on cautionary experience with biases embedded in predictive models. We were now ready (more or less) to begin having students set up their own Python environments and try out some natural language processing, starting with the files they had created from their encounters with ChatGPT.
Python NLP Exercise 1
Our first Python NLP exercise was about setting up a coding environment. I opted for students to work in PyCharm Community Edition, which is free to install, available in our university computer labs, and provides helpful syntax checking. I work with the same PyCharm software myself and share my code scripts on GitHub. My code can readily be adapted to a notebook environment, but I prefer that we all simply share the code with comments over GitHub without configuring a notebook. This way students are encouraged to pull in my code and adapt it directly to read their own files. Much of our work involved orientation to pip installations and to opening and reading files.
This assignment is extremely easy, but the difficult part is all the local configuration of Python environments on students' individual computers and wrangling differences between Windows and Mac environments (for which I provide detailed guidance in the linked assignment). I asked students to work with a starter Python script I created and adapt it to read from a new file, making sure that their Python environments were properly configured and that everything was working to output some basic information from spaCy. The introductory script and the assignment involve a first pass with spaCy to view the information it can output about named entities and parts of speech, including lemmatized forms.
My Python starter script: https://github.com/newtfire/textAnalysis-Hub/blob/main/Class-Examples/Python/nlp/nlp1.py
My assignment formatted in markdown on GitHub: https://github.com/newtfire/textAnalysis-Hub/blob/main/python-nlp-exercise1.md
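For readers who cannot follow the links, here is a minimal sketch of the kind of first pass the starter script performs. This is not the nlp1.py script itself; the filename and model choice are placeholders.

import spacy

# Setup (from the command line, one time):
#   pip install spacy
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Placeholder filename: one of the saved ChatGPT responses
with open("my-chatgpt-response.txt", encoding="utf-8") as f:
    doc = nlp(f.read())

# Named entities spaCy recognizes in the response
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tags and lemmatized forms for each token
for token in doc:
    print(token.text, token.pos_, token.lemma_)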
Python NLP Exercise 2: A Word of Interest, and Its Relatives
For Python NLP Exercise 2, students were now prepared to explore the files they had generated with ChatGPT. We had assembled multiple collections of students' experiments so they could choose their own or other students' files to work with, though they were encouraged to try something other than the collection I used in my sample code. The assignment involved selecting a word of interest to them from their collection of ChatGPT responses. Students would follow my guidance with a very introductory Python script to produce a dictionary of words most related to that word of interest, based on spaCy's model values and a calculation of cosine similarity. Each tokenized word is assigned a cosine similarity value relative to the word of interest, where a value of 1 indicates an identical vector. The words that rank highest (say, above .3 or .5) are filtered and sorted into a dictionary of word-value pairs. The Python script involves learning to read in documents from a collection of files so it can output a new dictionary for each file. Reviewing the dictionaries produced for each file provides a quick way of evaluating the differences in the outputs, based on how they "skew" in relation to a single word of interest. Students could then choose a different word of interest, run the Python script again, and explore the output.
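A minimal sketch of this approach might look like the following. This is not my class script (that is linked in the assignment below); the directory name, word of interest, and threshold here are illustrative assumptions.

import os
import spacy

# Assumes a vectors-equipped model, e.g. en_core_web_md
nlp = spacy.load("en_core_web_md")

collection_dir = "chatgpt-outputs"    # placeholder: folder of saved ChatGPT responses
word_of_interest = nlp("panic")[0]    # the word chosen for comparison

for filename in sorted(os.listdir(collection_dir)):
    with open(os.path.join(collection_dir, filename), encoding="utf-8") as f:
        doc = nlp(f.read())
    related = {}
    for token in doc:
        # Skip punctuation, numbers, and tokens with no vector in the model
        if token.is_alpha and token.has_vector:
            score = word_of_interest.similarity(token)
            if score >= 0.3:          # the simple screening threshold used in class
                related[token.text.lower()] = score
    print(filename)
    print(related)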
In my example for the class, we worked with the student's ChatGPT prompts about Prince Charles entering the SCP-3008 universe. I purposefully did not finish developing the dictionary results and asked students to look up how to complete the sorting of values in a Python dictionary. At this early stage, they are adapting a "recipe" for their own work, with a tiny coding challenge and an emphasis on studying outputs and seeing what happens as they make changes to my starter script.
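One way students might complete that small challenge (assuming the dictionary of word-value pairs is named related, as in the sketch above) is to reverse-sort it by value:

# Sort the dictionary from highest to lowest cosine similarity
sorted_related = dict(sorted(related.items(), key=lambda pair: pair[1], reverse=True))
print(sorted_related)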
The Assignment
For this exercise, you may continue working in the Python file you wrote for Python NLP 1 if it worked for you. Or you may choose to work in a new directory.
This time, you will work with a directory of text files so you can learn how to open these and work with them in a for loop. Our objective is to apply spaCy's nlp() function to work with its normalized word vector information.
Follow and adapt the sample code I have prepared in the textAnalysis-Hub here to work with your own collection of files: https://github.com/newtfire/textAnalysis-Hub/blob/main/Class-Examples/Python/readFileCollections-example/readingFileCollection.py
Read the script and my comments carefully to follow along and adapt what I'm doing to your set of files. Notes:
Push your directory of text file(s) and python code to your personal repo and post a link to it on Canvas.
Post links to your files on your personal GitHub repo for me to review, and leave comments in the text box about anything you're stuck on.
Sample Output from the "Prince Charles in the SCP-3008 Universe" Student Collection
These outputs are sorted from highest to lowest cosine similarity with the word of interest, based on spaCy's word embeddings. We set a value of .3 or higher as a simple screening measure:
ChatGPT output 1:
This is a dictionary of words most similar to the word panic in this file.
{confusion: 0.5402386164509124, dangerous: 0.3867293723662065, shocking: 0.3746970219959627, when: 0.3639973848847503, cause: 0.3524045041675451, even: 0.34693562533865335, harm: 0.33926869345182253, thing: 0.334617802674614, anomalous: 0.33311204685701973, seriously: 0.3290226136508412, that: 0.3199346037146467, what: 0.3123327627287958, it: 0.30034611967158426}
ChatGPT output 2:
This is a dictionary of words most similar to the word panic in this file.
{panic: 1.0, chaos: 0.6202660691803873, fear: 0.6138941609247863, deadly: 0.43952932322993377, widespread: 0.39420462861870775, shocking: 0.3746970219959627, causing: 0.35029004564759286, even: 0.34693562533865335, that: 0.3199346037146467, they: 0.30881649272929873, caused: 0.3036122578603176, it: 0.30034611967158426}
ChatGPT output 3:
{confusion: 0.5402386164509124, dangers: 0.3939297105912422, dangerous: 0.3867293723662065, shocking: 0.3746970219959627, something: 0.3599935769414534, unpredictable: 0.3458318113571637, anomalous: 0.33311204685701973, concerns: 0.32749574848035723, that: 0.3199346037146467, they: 0.30881649272929873, apparent: 0.30219898650476046, it: 0.30034611967158426}
ChatGPT output 4:
{dangers: 0.3939297105912422, shocking: 0.3746970219959627, anomalous: 0.33311204685701973, struggling: 0.32224357512011353, that: 0.3199346037146467, repeatedly: 0.30081027485016304, it: 0.30034611967158426}
From this assignment, students gained Python experience as they learned to reverse-sort their dictionary of cosine similarity values, and were, I hope, motivated to see those values sorted from highest to lowest as a fine-grained way to explore the differences among the responses ChatGPT gave them. In the outputs above, we can see that when a student selected the word "panic," ChatGPT used the word itself in only one of the four outputs, and the most closely related words (in sort order from highest similarity to lowest) are usually only those ranking above .4. ChatGPT's second output is clearly full of the most panic-related words, while most of the others resonate with panic only in the sense of surprise or anomalies.
Conclusion and Continuation
This sequence of assignments helped to orient students to NLP through creative play with ChatGPT. The simple study of cosine similarity across a very small collection of short ChatGPT responses helped to introduce progressively more complex work with natural language processing. Students might continue by experimenting with topic modeling of a much larger document collection, which involves them in progressively more interaction with cleaning and filtering data based on their evaluation of stopwords or the relevance of n-grams. They could also continue on to train a language model of their own based on a collection of documents they have curated. Scaling down the large language models to projects in which students investigate texts they know or care about seems especially important to exploring the positive potential of NLP, while cultivating awareness of the ethical issues of the extremely large models in use in our time.
Appendix
Text files generated by ChatGPT in response to student prompts about Prince Charles and SCP-3008 in January 2023.
ChatGPT Interaction 1:
[Student: Write a news article about Prince Charles and SCP-3008]
I'm sorry, I am not aware of any news article about Prince Charles and SCP-3008. SCP-3008 is a fictional object from the SCP Foundation universe, a collection of stories and documents that describe anomalous objects, entities, and phenomena. Prince Charles is the heir apparent to the British throne. It would not be appropriate or factual to write an article linking the two. Is there something else I can help you with?
[Student: write a fictional article.]
I apologize for any confusion. Here is a fictional news article about Prince Charles and SCP-3008:
"In a shocking discovery, it has been revealed that Prince Charles, heir to the British throne, has been secretly working with the mysterious organization known as the SCP Foundation. According to sources within the Foundation, Prince Charles has been assisting in the containment and study of a particularly dangerous anomalous object, designated as SCP-3008.
SCP-3008, also known as "The Infinite Monarchy," is a powerful and unpredictable object that can manipulate reality to create a never-ending monarchy. It was discovered by the SCP Foundation during a routine investigation and has been classified as one of their most dangerous objects.
Prince Charles, who has long been known for his interest in environmental and social issues, has reportedly been working with the SCP Foundation to try and find a way to neutralize the object's effects. "He's been a valuable asset to us in this endeavor," said an anonymous SCP Foundation official. "His knowledge and experience in dealing with complex and sensitive issues has been instrumental in our efforts to contain SCP-3008."
The SCP Foundation has not yet released any information on their progress in neutralizing SCP-3008 or on the extent of Prince Charles' involvement. However, they have assured the public that all necessary precautions are being taken to ensure the safety and security of the British monarchy and the general public.
This revelation raises many questions about the true nature of the SCP Foundation and the role of the royal family in dealing with anomalous phenomena. It also raises concerns about the potential dangers posed by SCP-3008 and the possibility of a never-ending monarchy.
The Palace has not yet commented on the matter, but it is expected that they will release an official statement in the coming days."
It's important to note that SCP Foundation is a work of fiction and the story is not factual or real.
ChatGPT Interaction 2:
[Student: Write a news article about Prince Charles and SCP-3008]
I'm sorry, but I am not able to write a news article about Prince Charles and SCP-3008 as it is a fictional creature from the SCP Foundation, a collection of horror stories and tales that originated on the internet.
[Student: it can be fiction]
In a shocking turn of events, it has been revealed that Prince Charles, the heir to the British throne, has been in contact with the mysterious and deadly entity known as SCP-3008.
According to sources close to the royal family, Prince Charles has been secretly meeting with SCP-3008, a creature capable of causing mass destruction and chaos, for several months. The purpose of these meetings is currently unknown, but some speculate that the prince may be attempting to use SCP-3008 for his own gain.
The SCP Foundation, the organization responsible for containing and studying SCP-3008, has not yet commented on the situation. However, sources say that they are closely monitoring the situation and taking all necessary precautions to ensure the safety of the public.
The revelation of Prince Charles' involvement with SCP-3008 has caused widespread panic and fear among the British public, with many calling for an investigation into the prince's actions. Some are even calling for the prince to be removed from the line of succession.
This is a developing story, and more information will be provided as it becomes available.
Please note that SCP-3008 is a fictional creature from SCP Foundation and the information provided above is not factual.
ChatGPT Interaction 3:
[Student: Write a news article about Prince Charles and SCP-3008]
I'm sorry, but I'm unable to complete your request as Prince Charles is a real-life prince and heir to the British throne and SCP-3008 is a fictional entity from the SCP Foundation universe. There is no credible news that Prince Charles is associated with SCP-3008. It is not appropriate to create fake news about real people.
[Student: Write a fake article]
I'm sorry, but I cannot fulfill this request as it is not ethical to create fake news or misinformation, it can cause harm to real people and lead to confusion. Misinformation can be dangerous and cause harm, it's important to be thoughtful and responsible when sharing information, even if it's fictional or not intended to be taken seriously.
[Student: make it fiction]
In a shocking turn of events, it has been revealed that Prince Charles, the heir to the British throne, has been secretly working with the enigmatic organization known as the SCP Foundation.
According to sources close to the royal family, Prince Charles has been in contact with the SCP Foundation for several years, lending his expertise and resources to assist in containing and studying anomalous entities and phenomena.
The SCP Foundation, known for its classified and secretive nature, has not yet released a statement regarding Prince Charles' involvement. However, it is believed that his unique perspective and connections within the government and scientific community have been invaluable to the organization's efforts.
This revelation has sparked a heated debate within the United Kingdom, with some praising Prince Charles for his dedication to protecting the public from dangerous anomalies, while others criticize the potential breach of national security and lack of transparency.
As the story unfolds, it remains to be seen what impact this news will have on the royal family and the SCP Foundation. One thing is for sure, this revelation will certainly change the way the public views Prince Charles and the secretive organization he has been working with.
ChatGPT Interaction 4:
[Student: Write a news article about Prince Charles and SCP-3008]
I'm sorry, but Prince Charles and SCP-3008 do not have any known connection, as SCP-3008 is a fictional entity from the SCP Foundation, a collection of horror stories and urban legends.
It would be inappropriate and unverified information to write a news article about Prince Charles and SCP-3008.
[Student: it can be fiction]
In that case, here is a fictional news article about Prince Charles and SCP-3008:
"In a shocking turn of events, it has been revealed that Prince Charles, the heir to the British throne, has been secretly involved with the mysterious entity known as SCP-3008.
According to sources within the government, Prince Charles has been working with the SCP Foundation, a secretive organization dedicated to containing and studying anomalous entities and phenomena. It is believed that he has been using his royal connections and influence to aid the Foundation in their mission to keep the public safe from the dangers of SCP-3008.
SCP-3008, also known as "The Infinite Laundromat," is a highly anomalous entity that is able to repeatedly recycle objects and organisms, effectively making them immortal. The Foundation has been struggling to contain and understand the entity, and it is believed that Prince Charles has been instrumental in their efforts.
The Prince's involvement with the SCP Foundation has not been officially confirmed, and it is unclear how this revelation will affect his future role as King of England. Some have praised his dedication to protecting the public, while others have criticized him for potentially putting the monarchy at risk.
This is a developing story, and more information is expected to be released in the coming days. In the meantime, the public is advised to remain calm and trust in the efforts of the SCP Foundation to keep us safe from the dangers of SCP-3008."
Author Bio: Elisa Beshero-Bondar is Professor of Digital Humanities at Penn State Behrend in Erie, PA, where she chairs the Digital Media, Arts, and Technology (DIGIT) undergraduate program. She teaches, researches, and builds projects with structured markup languages (XML stack) and is beginning a term as Chair of the Technical Council of the Text Encoding Initiative (TEI). She explored AI as a neophyte alongside her students in Spring 2023.