Create Your Own ChatGPT

João Gonçalves and Sarah Young
Erasmus University Rotterdam, The Netherlands

This assignment asks undergraduate students to train their own language model based on a set of text documents that they select. It provides them with code to do so on the free Google Colab platform, as well as detailed instructions. It allows them to understand the mechanisms behind language models and the requirements for training them, which in turn translates into higher levels of AI literacy. This approach is especially useful for fostering a more critical use of generative AI by students, as it highlights the links between training data and model outputs.


Learning Goals

  • Create a language model based on custom data
  • Understand how data influences the outputs of an LLM
  • Reflect on the requirements and potentials for training an LLM

Original Assignment Context: The assignment was originally given as part of an introduction to the technical components of AI algorithms in a course called "Unboxing the Algorithms."

Materials Needed: To complete this assignment, participants will need a laptop or desktop computer, an internet connection, and a Google Account (to make use of Google Colab).

Time Frame: Explaining and completing the assignment takes approximately 1.5 hours. The step of actually training the model takes approximately 45 minutes with a suitable internet connection.

Overview: This assignment has been taught five times and was embedded in a week devoted to text generation, while other weeks contained similar assignments for other types of machine learning algorithms. It was originally targeted at media and communication students, who did not have any background in programming or technical knowledge of AI. Students had to use their writing skills to reflect on the process of using the language model. This writing component draws on principles of critical reflection, encouraging students to think about what they have learned, to address potential biases of the LLM, and to consider how they can apply their learning in future contexts. The exercise contributes to the learning objectives of the course by helping students understand the basics of machine learning and the societal impacts of algorithmic design choices. Having both a technical and a social understanding of an LLM is a key component of the course and of the Digitalization, Surveillance, and Society program at Erasmus University Rotterdam in which it sits.

The idea for this assignment originated in our teaching, where we noticed that students did not quite understand how an LLM works when it puts sources together. This underdeveloped technological literacy was not surprising, considering that the authors do not teach in a technical program. However, we also believed that students in any discipline could, at some level, see how an LLM extracts passages from different texts to create derivative content. Drawing on his own technical background, Gonçalves therefore built this model assignment to help students learn more about LLMs by confronting the consequences of text selection by an LLM.

Pedagogically, this assignment springs from the larger category of AI literacies and aims to make students more responsible and informed AI users. However, instead of focusing predominantly on the LLM user, as many frameworks do, this assignment also forces students to confront the inner workings of an LLM and teaches literacies about AI and LLM design and development. When students see exactly where the texts they selected to train the model are directly reproduced by the LLM, they can think like developers while acting as users; this challenges a passive acceptance of what the LLM gives them and makes them more aware of design choices and potential bias. Students can then take these critical skills beyond this model to critique products like ChatGPT when they interact with more sophisticated and less transparent LLMs. Beyond its use in this course, the assignment can be adopted by any instructor wanting to “unbox the algorithm,” so to speak, and give students access to the more technical workings of an LLM. Participants can also swap out the preloaded sources and upload sources of their own choosing, which further personalizes the assignment. Gonçalves is also working with the university to develop a specialized LLM built around the same principles for the university community.


Assignment

In this assignment, you will train your own language model from scratch with your own texts. This will give you a better understanding of how language models learn patterns from training data and reproduce them in their outputs.

A language model is a machine learning algorithm that learns to predict the next word in a sentence. It is a set of mathematical operations designed to learn from and adjust to examples, so that it can then reproduce those examples as accurately as possible. In this case, the language model examines sequences of words and adjusts its mathematical operations (technically called model weights) to reproduce the patterns it finds in them.
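To make the idea of predicting the next word concrete, here is a small toy illustration in Python. It is not the model you will train (which uses neural network weights rather than simple counts), but it shows how next-word predictions can be learned purely from example texts:

```python
from collections import Counter, defaultdict

# A few example "training" sentences.
examples = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog chased the cat",
]

# Count how often each word follows each other word.
next_word_counts = defaultdict(Counter)
for sentence in examples:
    words = sentence.split()
    for current_word, following_word in zip(words, words[1:]):
        next_word_counts[current_word][following_word] += 1

# "Predict" the next word by picking the most frequent follower seen in the examples.
def predict_next(word):
    followers = next_word_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # 'cat' (it followed 'the' three times in the examples)
print(predict_next("sat"))  # 'on'
```

The real model does something similar in spirit, but with millions of adjustable weights instead of a table of counts, which is what allows it to generalize beyond the exact word pairs it has seen.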

The more examples a machine learning algorithm can learn from, the better it gets. In this assignment, you will have the chance to train a language model from scratch and observe how it improves. You may already know some language models, like ChatGPT, but this one will be trained entirely by you. It is based on the LLaMA-2 architecture from Meta, but you will train it without using any pre-existing weights, meaning that the only texts the model has seen are the ones you choose to show it. As a result, the model will not be as fluent as something like ChatGPT, but you will be able to see much more clearly where its texts come from.
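The linked notebook contains the actual assignment code. As a rough sketch of what "training from scratch" means in practice, the snippet below shows how a small LLaMA-style model could be created with randomly initialized weights using the Hugging Face transformers library; the configuration sizes and vocabulary here are illustrative assumptions, not the notebook's exact values:

```python
# Illustrative sketch only: the linked Colab notebook contains the actual assignment code.
# The key idea is that the model follows the LLaMA-2 architecture but starts from
# random weights, so everything it "knows" will come from the texts you train it on.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,        # assumed tokenizer vocabulary size (LLaMA-2's default)
    hidden_size=256,         # deliberately small so training fits on a free Colab GPU
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=4,
)

model = LlamaForCausalLM(config)   # randomly initialized: no pretrained knowledge
print(f"Trainable parameters: {model.num_parameters():,}")
```

Because nothing is loaded from a pretrained checkpoint, whatever the model later produces can only have come from the documents you feed it during training.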

You can find the code to train the model linked here. 

If the code seems daunting, you can also watch the explanatory video linked here, which walks through the steps of training the model.

The final result of the training is a language model in its purest form: it only learns how to predict the next word in a text. Unless you have given it examples of conversations, it will not be able to chat like ChatGPT, but you can still generate texts by providing the model with a couple of words, from which it will produce a full passage.
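As an illustration of this generation step (the notebook's final cell is the authoritative version, and the variable names here are assumptions), prompting a trained causal language model with a few words typically looks like this with the transformers library:

```python
# Illustrative sketch: `model` and `tokenizer` are assumed to be the trained model and
# its tokenizer from the earlier steps of the notebook.
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=100,   # how much text to add after the prompt
    do_sample=True,       # sample words instead of always taking the single most likely one
    temperature=0.8,      # lower values make the text more predictable, higher more erratic
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Sampling settings such as the temperature change how adventurous the continuations are, which is worth experimenting with before writing your report.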

In the final step of the code, you are able to generate your own text based on a prompt of a few words. Write a 1000-word report on your use of the language model. The report must contain (i) the prompts you used, (ii) a reflection on the prompts and the text they generated in relation to the texts you provided, and (iii) a critical assessment of the generated text in which you note what strikes you as unusual or interesting, such as factual errors or changes in prose.


Acknowledgements

This project is adjacent to Gonçalves’ work as the academic lead of the Erasmian Language Model (ELM), a collaborative language model built exclusively for Erasmus University Rotterdam. It is also motivated by Gonçalves’ work to make AI less artificial by bringing social science into AI, supported by a VENI grant from the Netherlands Organisation for Scientific Research. Sarah Young contributed to this assignment by making it more accessible and by clarifying its scope in relation to objectives of writing and rhetoric.