On May 25, 2023, the United Nations Educational, Scientific, and Cultural Organization (UNESCO) conducted a webinar on the impact of generative AI on intellectual property (IP). The webinar covered themes of data privacy, legal risks of generative AI, the use of AI in preserving traditional knowledge, and AI and artist collaboration. It is the first in a series of webinars discussing AI and the rule of law.
Why it matters:
Generative AI has been garnering attention in the past couple of years not just for what it can accomplish but also for the use of publicly available information in AI training. The webinar discussed how data is the most important ingredient in generative AI — it is used both for training and fine-tuning an AI application. It discussed how most of the data used to train generative AI is born out of human expression (like tweets, news articles, or photos), and that its authors might not have given informed and meaningful consent for the use of said data.
The ethical concerns surrounding the use of this public data are also a subject of regulation today, with the European Union requiring developers of generative AI tools, such as ChatGPT or image generator Midjourney, to disclose any copyrighted material used to train their systems. But the question remains: what falls under the ambit of intellectual property protection?
Links between human output and generative AI
“Generative AI and IP are intrinsically linked, because AI really, all these systems cannot exist without tremendous intellectual effort, having been made by a lot of people all the way. [There are] those who produce the individual pieces of data, that end up in the data set, the big data architects that organize these data sets, machine learning engineers that create the algorithm, user interface designers, and even the users themselves,” explained Marielza Oliveira, Director for the Division for Digital Inclusion and Policies and Digital Transformation at UNESCO and a speaker at the webinar.
Oliveira mentioned that AI developers can buy training data from organizations like Common Crawl Corpus and Google Colossal Clean Crawl Corpus, and that when this training data is scrutinized, the copyright symbol (©) appears 200 million times, which means that these data sets contain a fairly large amount of copyrighted data. They contain all of Wikipedia, registered patents, news media from across the world, and pirated ebooks, to name a few.
This data is then standardized and normalized (deleting duplicates and toxic content) and then it is broken into pieces (tokenized) to create a dictionary for the neural network to read. The AI model is essentially taught the relationship between tokens, not between raw data pieces. This brings up questions about intellectual property because “it’s not taking the original data set necessarily but this tokenized data set,” she said.
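The pipeline described above (normalize, remove duplicates, tokenize, build a dictionary) can be sketched roughly as follows. This is a minimal illustration, not how any particular model is built: the function names and sample sentences are made up for this example, toxic-content filtering is left out, and production systems typically use subword tokenizers rather than a simple word split.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so near-identical documents compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(docs):
    """Drop exact duplicates while keeping the first occurrence of each document."""
    seen = set()
    unique = []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def tokenize(text):
    """Split text into word and punctuation tokens (a simplification of real tokenizers)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(token_lists):
    """Assign each distinct token an integer ID: the 'dictionary' the model reads."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

# Hypothetical raw documents; the second is a duplicate of the first after normalization.
raw_docs = ["This is good.", "this   is good.", "Is this good?"]

clean_docs = deduplicate([normalize(d) for d in raw_docs])
token_lists = [tokenize(d) for d in clean_docs]
vocab = build_vocab(token_lists)

# The model is trained on these ID sequences, not on the raw text.
encoded = [[vocab[tok] for tok in tokens] for tokens in token_lists]
print(encoded)
```

The point of the sketch is the last step: what reaches the neural network is a sequence of token IDs, which is why the question of whether the "original" work is being used becomes harder to answer.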
She mentioned two different types of data that can be generated from these pre-existing data sources:
Data collected from the internet can be used to generate new data sets via augmentation. To do so, batches of images (or text) are fed to the software. In image augmentation, the software would create new data by zooming in to an image, cropping it, filling it, etc., which, in essence, would still be equivalent to using the same image.
But the same isn’t true for text, which is augmented through back translation, replacing words with synonyms, or changing their order. This complicates things because “upside down Picasso is still Picasso but when you shuffle words around, you change the meaning,” Oliveira said. (Quick example: changing “this is good” into “is this good?”) It also raises important questions surrounding authorship and consent.
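To make the contrast concrete, here is a small sketch of both kinds of augmentation. It is illustrative only: the tiny grid of numbers stands in for an image, the synonym table is hypothetical, and real augmentation libraries offer far more transformations than the few shown here.

```python
def flip_vertical(image):
    """Turn an image (a list of pixel rows) upside down."""
    return image[::-1]

def crop(image, top, left, height, width):
    """Cut a rectangular region out of an image."""
    return [row[left:left + width] for row in image[top:top + height]]

# Hypothetical synonym table, used only for this illustration.
SYNONYMS = {"good": "fine"}

def replace_synonyms(sentence, table):
    """Swap each word for its synonym when the table has one."""
    return " ".join(table.get(word, word) for word in sentence.split())

def swap_first_two_words(sentence):
    """Change word order by swapping the first two words."""
    words = sentence.split()
    if len(words) >= 2:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)

# Image augmentation: every output is still made of the same pixels.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
flipped = flip_vertical(image)
cropped = crop(image, top=0, left=1, height=2, width=2)

# Text augmentation: the output can say something different.
text = "this is good"
with_synonym = replace_synonyms(text, SYNONYMS)
reordered = swap_first_two_words(text)

print(flipped, cropped)
print(with_synonym, "|", reordered)
```

The flipped and cropped grids contain nothing that was not in the original, whereas the reordered sentence reads as a question rather than a statement, which is the distinction drawn in the Picasso remark.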
The question of authorship gets even more confusing when looking at synthetic data, which is generated by feeding a subset of available data into an algorithm that then learns patterns. For instance, an algorithm can learn that human faces have two eyes, a nose, and lips, and then generate new faces that display the same pattern. These new faces are created from groups of faces as opposed to a single one so that the AI doesn’t accidentally generate a real face.
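A very rough sketch of that idea, under heavy simplifying assumptions: each "face" is reduced to three made-up measurements, the "pattern" is just the average and spread of each measurement across the group, and new samples are drawn from a normal distribution. Actual face generators use far more complex models; the names and numbers below are hypothetical.

```python
import random
import statistics

def fit_feature_stats(samples):
    """Learn a simple pattern: the mean and standard deviation of each feature across the group."""
    columns = list(zip(*samples))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def generate(stats, n, seed=0):
    """Draw n new samples from the learned per-feature distributions."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in stats] for _ in range(n)]

# Hypothetical measurements per face: eye distance, nose length, mouth width (in mm).
real_faces = [
    [62.0, 34.0, 50.0],
    [64.0, 36.0, 52.0],
    [60.0, 35.0, 48.0],
    [66.0, 33.0, 54.0],
]

stats = fit_feature_stats(real_faces)
synthetic_faces = generate(stats, n=5)

for face in synthetic_faces:
    print([round(value, 1) for value in face])
```

Each synthetic sample is drawn from statistics of the whole group rather than copied from one person, yet every number it is built from still traces back to the real measurements.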
Oliveira’s underlying point is that all of this data (whether it is directly consumed by AI models, augmented, or used to create synthetic data) is coming from humans, and “whether or not we subsequently do something with it, it’s still human expression.”
New IP created by generative AI
Once an AI chatbot has been trained, it can answer user questions and requests, which has led to the creation of new IP. “There’s even an emerging profession called a prompt engineer in which an expert in the techniques to guide a language model towards the desired output exists,” Oliveira said.
Oliveira added that good prompts are valuable: websites around the world sell sets of ready-made prompts that people can use. The final output of an AI can also be monetized, and this has IP implications as well. The terms of service of most generative AI apps request consent to collect these prompt and output pairs, which are then used to refine the model. “So then, again, you get that authorship issue but now mixed with the authorship of the human as well,” she mentioned.
Violations of confidentiality
In May 2023, Samsung banned the use of generative AI at the workplace after some employees uploaded confidential source code and internal meeting notes into ChatGPT for assistance in code writing. What makes the leak of sensitive information problematic is that there is no way to retrieve or delete it from the AI’s dataset.
Oliveira brought up Samsung’s case to highlight the possibility that users can trick generative AI into regurgitating confidential information held in the dataset. “Several other companies have either suspended or banned the use of chatbots because of bad experiences with proprietary software code that was included in the output of a coding prompt.”
Inspiration vs copyright theft
One of the arguments used to defend the use of copyrighted material in training AI is that it is equivalent to an artist seeking inspiration from pre-existing art or cultural products. But, according to copyright activist and member of the Artist Rights Alliance, Neil Turkewitz, “There is a profound gulf between an individual being inspired by the works in the development of her new work compared to the corporate ingestion of existing cultural output of human creators in order to train a commercial product.” This, he says, violates the fundamental right of authors and creators to control the usage of their works which is found both in national law and international human rights conventions.
He explained that some protection exists to combat this use of copyrighted data. For instance, the EU’s 2019 Directive on Copyright in the Digital Single Market says that commercial data mining establishments must ensure that there is a practical way for people to opt out of the inclusion of their work. However, this opt-out feature isn’t in place in any commercial generative AI model. Even if it were, Turkewitz says, opt-out isn’t a feasible solution since it cannot be scaled to match the unauthorized use of user data taking place today.
Despite that, Turkewitz doesn’t think that new rights are needed to address unauthorized ingestion of creative materials; in his view, pre-existing laws and international treaties already establish a framework that prevents such ingestion. Instead, he believes, the focus should be on how author consent (consent to use their IP to train AI models) is going to be manifested. “And that is one area where I think we need to explore the role of collective management to ensure that simply establishing the rights or enforcing the rights doesn’t then fail to achieve the broader social objective which is to give artists a meaningful ability to prevent that right [from being taken away] and to license [their art] on terms that will sustain creativity.”
This post is released under a CC-BY-SA 4.0 license. Please feel free to republish on your site, with attribution and a link. Adaptation and rewriting, though allowed, should be true to the original.