Learning about AI Token Use through Essays and Prompt Responses

Joshua J. Wells
Indiana University South Bend

The behaviors of textual AI tools are opaque to most college students, and misunderstandings about how such tools function are rampant. The term “artificial intelligence” can mislead students into believing that such tools contain actual knowledge, when instead those tools create statistical models of textual associations. This exercise introduces students to the concept of textual tokens in AI training data, which form the base data from which statistical associations are modeled. This helps students to visualize how AI processes function to build textual answers to prompts. 


Learning Goals

  • Enhance students’ practical understanding of tokenization in GPT text tools 
  • Identify processes of text generation that otherwise happen in an opaque or purposefully obfuscated way
  • Reflect on the purposes and potentials of AI in professional settings involving problem solving and textual analysis

Original Assignment Context: This assignment has been conducted three times (Spring 2023, Fall 2023, Spring 2024) in a sophomore/junior course in social informatics, an interdisciplinary course that examines the reciprocal relationship between information and communication technology and sociocultural dynamics, with an emphasis on critical examination of use contexts. The particular essay used by students in my class was a 1000-word examination of the Mastodon social network, completed about one month earlier, that involved an extensive review of the research literature.

Materials Needed

  • Classroom with numerous computers or web devices on the model(s) of flipped classrooms or technology-enabled active learning classrooms
  • Completed essay which each student can access during class, either their original text file or a printed copy in hand
  • Shared workspace for text editing such as Google Docs or a Microsoft Teams Document
  • GPT chatbot access; this exercise uses Perplexity.AI, which is free, has low barriers to use, and has a policy of not sharing personal data
  • Reference set of articles which can be accessed online through URLs or from a stored file location depending on local library access subscriptions

Time Frame: This assignment has always been conducted within a single 75-minute class. It takes about an hour to introduce and complete, leaving 15 minutes for reflective conversation among the entire class. Note that completion of the original essay assignment is not included in this time frame.

Overview

Students learn about tokens in generative pre-trained transformer (GPT) text tools by carrying out a human-centered tokenization process and comparing it with the output of a GPT. In GPT text tools, text is broken down into tokens: chunks that range from single characters to parts of words, whole words, or multiple words. GPTs use training data to develop large reference sets of tokens and to model the statistical associations among them. When a user gives a prompt to a GPT chatbot (e.g., ChatGPT, Microsoft Copilot, Perplexity), the tool recognizes the tokens in the prompt and draws on the information available to it (training data, databases, web searches, etc.) to find the tokens that are statistically most likely to answer the prompt. Those tokens are returned to the user as a response that addresses the prompt within the scope of the GPT tool’s training.

Here, students work in groups to “tokenize” the source data of their combined essays in a way that answers the provided prompt. Next, they supply the prompt to a GPT chatbot and critically examine their human tokenization alongside the GPT output. Finally, they reflect on their experience in the context of a reference on GPT/AI issues. Students consider how “human tokenization” identified meaningful sentences in their essays, and how their groups decided to order those sentences coherently to answer the prompt. Students should be encouraged to view the totality of their essays as a very limited set of tokens, which follow recognizable language patterns and which can answer the prompt only in limited ways. All work is conducted collaboratively and synchronously.
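
For instructors who want a concrete illustration, the idea of tokens and their statistical associations can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how any particular GPT tool works: it splits text into whole words and punctuation marks and simply counts which token follows which, whereas real GPT tools typically use subword tokenizers and neural networks trained on far more text. The function names and the three example sentences are invented for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word and punctuation tokens (a toy scheme)."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


def build_bigram_counts(sentences):
    """Count how often each token is directly followed by each other token."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def most_likely_next(follows, token):
    """Return the token seen most often after `token`, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


# Hypothetical "training data": invented sentences standing in for essays.
essays = [
    "Mastodon is a social network.",
    "Mastodon is a federated network.",
    "Mastodon is open source.",
]

follows = build_bigram_counts(essays)
print(tokenize(essays[0]))
print(most_likely_next(follows, "is"))
```

With this tiny data set, the token "is" is followed by "a" twice and by "open" once, so the sketch picks "a" as the most likely next token. That mirrors the point of the exercise: the output depends entirely on the patterns present in a limited body of text.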


Assignment

PART 1: TOKENIZE A SET OF PAST ESSAYS WITH YOUR HUMAN CLASSMATES

Prompt: [the instructor should craft a prompt which could instruct an AI chatbot to create output which informs a reader on a complex topic about which the class has previously written individual essays].

Instructions to Students: With your classmates, copy and paste entire sentences from your essays into the numbered list below in an order which you feel answers the above prompt. As you do this …

A. Each person should identify sentences in their finished essay as kinds of “tokens” which help to answer the prompt.

B. Work with your group to paste your individual sentences into the numbered list in an order that coherently answers the prompt (e.g., ask your team, “Should I paste my sentence before or after sentence #4 in the list?”).

C. Don’t worry about editing anything. Use your tokens as they exist in your group’s training data of finished essays.

D. Use at least one sentence in which you have quoted or paraphrased and cited reference materials in your essay.

E. You have ten slots to start, but expand the numbered list to be as long as necessary. Be expansive.

  1.  
  2.  
  3.  
  4.  
  5.  
  6.  
  7.  
  8.  
  9.  
  10.  

PART 2: ASK THE PROMPT OF PERPLEXITY GENERATIVE TEXT AI

Instructions: Go to https://www.perplexity.ai/search and use Perplexity without a login. Copy and paste the above prompt into Perplexity, then copy and paste the answer each of you gets below. Compare Perplexity’s output with your group’s human-tokenized output. Use the comment features to highlight important points in the Perplexity response and to add notes and/or quotes from your own text for comparison. Compare what appear to be processes of token recognition in Perplexity with your group’s work. What are the strengths and weaknesses of each approach?

PART 3: MORE ABOUT GPT

Instructions: What is something your group can learn from an item in this reference list about what you’ve done today? Write your response(s) as annotation(s) under your chosen reference(s) in red text and explain how you can connect the themes of the reference to our exercise today.

How it Works

Technical Criticisms

Deeper Criticisms