Welcome to the third installment of our series that delves into the intriguing realm of technology’s new plaything: ChatGPT. In this edition, we explore one of the most contentious issues surrounding AI models like ChatGPT—the legal implications of the copyrighted material in the datasets used to train them. Can you claim ownership of the code generated by ChatGPT? It’s not a dumb question, and it’s more complicated than you might think.
How Do Large Language Models Work?
Before we dive into the legal aspects, let’s understand how large language models (LLMs) like ChatGPT work. LLMs are trained on massive datasets, commonly referred to as “corpora,” and these corpora have grown rapidly from one model generation to the next. GPT-2, an earlier OpenAI model, was trained on the WebText dataset of roughly 8 million web pages; GPT-3, the model family on which ChatGPT is built, drew on a far larger mix that includes filtered Common Crawl web data, WebText2, books, and Wikipedia. Taken together, these corpora span news articles, websites, books, and other text sources, much of it under copyright.
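At their core, LLMs are trained on one objective: predict the next token given the preceding text. The toy sketch below illustrates that idea with a simple bigram counter over a single sentence; a real LLM learns the same objective with a neural network over billions of documents, which is exactly why the provenance of those documents matters legally.

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count which word follows which in a tiny corpus.
# This is a drastic simplification of what ChatGPT does, used only to show
# the "predict the next token" training objective.
corpus = "the model predicts the next word given the previous word".split()

follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    counts = follow_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("next"))      # "next" was always followed by "word"
print(predict_next("previous"))  # likewise "previous" -> "word"
```

The key point for the legal discussion: the model’s “knowledge” is entirely a statistical distillation of its training text, so whatever is in the corpus shapes what comes out.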
The Murky Waters of Copyright and Fair Use
Companies like OpenAI, which developed ChatGPT, rely on the doctrine of “fair use” under U.S. law. This allows them to use copyrighted material for limited purposes without needing explicit permission from the rights holder. But what constitutes “fair use” is highly contextual. The Supreme Court’s landmark Google v. Oracle decision, which held that Google’s reuse of Java API declarations in Android was fair use, suggests that sufficiently transformative uses of existing material can qualify, and AI companies argue that model training is just such a transformative use.
The Legal Tsunami Has Begun
The waters have been further muddied by recent lawsuits. Microsoft, GitHub, and OpenAI are being sued for allegedly violating copyright law through their code-generating AI system, Copilot. Similarly, companies like Midjourney and Stability AI are being sued for infringing the rights of artists by training their models on web-scraped images. These cases could set legal precedents that influence how AI-generated content is treated under copyright law.
Enterprise Concerns and Corporate IT
Enterprises are increasingly interested in incorporating ChatGPT-style functionalities into their operations. However, IT security teams are cautious. They are unwilling to alter existing contracts dealing with data liability and copyright until the legal landscape becomes clearer. Software vendors are also evaluating the risks and are careful not to violate existing compliance directives.
The Complicated Question of Ownership: A Deeper Dive
The U.S. Copyright Office’s stance is clear: for a work to be eligible for copyright protection, it must be the result of “original and creative authorship by a human author.” This policy essentially rules out the possibility of registering works created solely by AI, like ChatGPT, for copyright protection.
Does AI-Generated Content Become Public Domain?
This brings us to a legal grey area. If a piece of content is generated solely by an AI tool, does it automatically become public domain, free for anyone to use? Or is it considered a derivative work of the materials the AI was trained on?
Example 1: Suppose you use ChatGPT to generate a poem. The poem isn’t eligible for copyright registration under your name because it wasn’t created by a human. At the same time, it’s not clear whether this poem would be considered a derivative work of the training data. Therefore, the legal status of the poem remains ambiguous.
Independent Creation and Liability
Another layer of complexity arises when the same content is generated for different users. In this case, it would be challenging for either party to claim copyright infringement against the other, as both pieces of content would be considered independently created.
Example 2: Imagine that ChatGPT generates the same marketing slogan for two different companies. Neither could likely claim copyright infringement against the other, as the slogan would be considered to have been independently created by the AI for each company.
Who is Liable for Damaging or Infringing Content?
If the generated content turns out to be damaging or infringes on someone else’s copyright, who is held responsible? Is it the user who initiated the query or the developers of the AI model?
Example 3: Let’s say you use ChatGPT to generate a piece of code, which you then publish on GitHub. Later, it turns out that the code snippet closely resembles copyrighted code from a third-party software. In this case, who would be liable for copyright infringement? The developers of ChatGPT or you, the end-user?
Explicit Answer to the Question
So, if you ask ChatGPT to write code for you, copy/paste it into GitHub, and insert a copyright notice with your name, can you safely say you own the code? The current legal landscape suggests that you can’t definitively claim ownership.
Given the ambiguity surrounding AI-generated content, it would be risky to assert copyright ownership over such code. While the code itself won’t be eligible for copyright registration (since it’s generated by AI), claiming it as your own could potentially put you in legal hot water, especially if the generated code closely resembles existing copyrighted material.
Citing AI-Generated Content
When it comes to academic or journalistic integrity, citing AI-generated content as a source is considered appropriate. For example, if you use text generated by ChatGPT in an article, it should be cited as:
“ChatGPT-generated response, accessed [date].”
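The same transparency principle can be carried into source code. There is no standard schema for this yet, so the header below is purely illustrative: the field names and values are hypothetical, and the idea is simply to record provenance alongside AI-assisted code rather than silently claiming sole authorship.

```python
# Illustrative provenance record for AI-assisted code. The field names
# are hypothetical, not a standard; adapt them to your team's conventions.
AI_PROVENANCE = {
    "tool": "ChatGPT",
    "accessed": "2023-09-01",  # date the output was generated (example value)
    "prompt_summary": "asked for an order-preserving deduplication function",
    "human_reviewed": True,    # whether a person reviewed and edited the output
}

def dedupe(items):
    """Remove duplicates while preserving order (AI-assisted, human-reviewed)."""
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

print(dedupe([1, 2, 2, 3, 1]))  # [1, 2, 3]
```

Recording provenance like this doesn’t resolve the ownership question, but it does document good faith and makes later review far easier if the legal landscape shifts.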
The intersection of AI and copyright law is a legal labyrinth with more questions than answers. As AI becomes more sophisticated and integrated into various aspects of life and business, the need for clear guidelines and regulations grows more urgent. Until then, the best course of action is to tread carefully and stay updated on evolving laws and regulations.
Stay tuned for more in-depth articles as we continue to explore the multifaceted world of AI and its implications. Feel free to share your thoughts and questions in the comments below.
This article is for informational purposes only and should not be considered as legal advice.