By Tarunima Prabhakar
An article published in MediaNama in late July titled The Gap Between Responsibility and Liability of AI posed the following provocation: “Is open sourcing AI enabling an exponential growth in the problematic use cases?” The article cited the increase in scraping of websites, including MediaNama's own, as one of the problematic use cases of open-source AI.
The conclusion of the article itself is hard to dispute—there is no silver bullet, and regulation is needed to prevent the many user harms from AI. But the article drew a simplified connection between open-source AI and negative effects such as dual use and scraping. This, in my view, made complex issues such as open source in AI, scraping, and copyright more confusing.
The rise of Generative AI is animating and reconfiguring a number of existing tech policy ‘beats’. Issues of privacy, copyright, misinformation, algorithmic fairness, bias, and transparency need to be re-examined in the context of transformer-based deep neural networks that rely on datasets as large as the entire public internet to replicate creative human abilities such as producing visual art, writing literary pieces, and writing software. But each of these issues comes with a deep history. While ‘online harms’ is emerging as a new framework for thinking about tech regulation, specificity about what the harm is, where it takes place, and how these beats relate to each other will contribute to better policymaking. In the spirit of the debate recommended in the original piece, this article attempts to clarify the terms it used and how the concerns it raised relate (or do not relate) to each other.
Contention no. 1: What open source means in the world of ML and AI in and of itself remains unclear
Let’s begin by looking at open source as a theme in AI development. The claim in the article is that open-source AI leads to more applications and a possible increase in negative applications of a dual-use technology. This claim is difficult to analyze because what open source means in the world of machine learning and AI is unclear even to those developing these systems. The confusion can be attributed, to some degree, to corporations branding their AI models as ‘open’. In fact, defining open source in the context of AI is one of the turfs on which big corporations are competing, because it lets them position themselves as leaders in the development of responsible AI (which is consequential when dealing with regulators). OpenAI in particular, by using ‘Open’ in its name, implied that the AI it was building was open. But across different products, OpenAI has used different definitions of openness, and in some cases defended not opening its work. ChatGPT is open access in that anyone can use the API to build applications. But neither the datasets used to train the model nor the codebase are public. Thus, one can’t recreate or modify ChatGPT, which is the essence of open-source software. When first releasing GPT-2, OpenAI declared: “due to our concerns about malicious applications of technology, we are not releasing the trained model.” Meta claimed Llama 2 was ‘open source’. The Open Source Initiative (OSI) clarified that it wasn’t. OSI began its clarification with “even assuming the term can be validly applied to a large language model comprising several resources of different kinds…,” which is to say that even OSI isn’t sure whether the term can be applied to large language models and is still working out what open-source AI means. Across these different definitions of openness, the creators of models try to provide endpoints (API access, hosted services) that enable people to build applications. This makes business sense, since it adds to the value of the models and the companies behind them.
But this isn’t true just of ‘open source’ AI. All AI, including closed-source AI, will be dual use, enabling deepfakes as well as language translation tools. Whether opening models, in whatever form, increases the risks of dual use is far from clear. On the one hand, opening models could enable more applications; on the other, by enabling scrutiny, it could also provide ways to understand, audit, and limit nefarious uses. It is worth remembering that when OpenAI did not open GPT-2, two independent research teams were able to replicate the model. A completely closed AI economy will also have negative uses of AI (even without leaked IP). It will just be more centralized.
Contention no. 2: Scraping cannot be considered a unilateral harm
Another issue mentioned in the article was AI leading to an increase in scraping. Scraping, a practice as old as the internet itself, is entwined in debates over copyright and, to a lesser extent, data privacy. Generative AI models rely on data for training and for adapting models to specific contexts, and the easiest place to get that data is public online sources. Automating data collection from these online sources is called scraping. But, as the original MediaNama article pointed out, just because something is public doesn’t mean it is open for all uses. How much control the original owner or publisher should have over how something they have created is used is the question copyright law has been trying to answer for centuries. Copyright is about preserving the rights of the original owner, but when owners are powerful organizations, copyright transgressions (including ignoring robots.txt and terms of service) are often a protest against the monopolization of knowledge. In the context of Generative AI, scraping doesn’t appear as a tool of protest but rather as an extractive tool for gaining a business advantage. Unsurprisingly, and rightfully, people are upset about this. But this shouldn’t lead us to uncritically treat scraping (or even scrapers ignoring robots.txt) as a unilateral harm. Online spaces have always been difficult to study, and platforms haven’t always been forthcoming with data. As platforms restrict access to APIs that are critical for research, scraping is one of the few techniques available at the moment to collect data to understand online spaces.
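For readers unfamiliar with the mechanics, the sketch below shows what ‘honoring robots.txt’ looks like in practice. It is a minimal, illustrative Python example, not a description of any particular crawler; the domain, URL, and user-agent string are placeholders.

```python
import urllib.robotparser
import urllib.request

# Placeholder identifiers, for illustration only.
USER_AGENT = "example-research-bot"
TARGET_URL = "https://www.example.com/articles/some-page"

# Fetch and parse the site's robots.txt, which lists the paths
# the publisher allows or disallows for automated clients.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()

if robots.can_fetch(USER_AGENT, TARGET_URL):
    # The site permits this user agent to fetch the page.
    request = urllib.request.Request(TARGET_URL, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        page = response.read()
    print(f"Fetched {len(page)} bytes")
else:
    # A well-behaved scraper stops here; an extractive one may not.
    print("robots.txt disallows fetching this URL; skipping.")
```

The point of the sketch is simply that robots.txt is a request, not an enforcement mechanism: nothing technically stops a scraper from skipping the check, which is why the debate is about norms and law rather than technology.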
The article speaks of copyright violations through an increase in scraping. Before we discuss scraping as a harm, it is important to distinguish it from the dual-use harms of AI. Dual use happens after the AI model is in place; scraping is the acquisition of the raw material, or input, for producing AI. These harms occur at different points in the AI value chain and need to be considered separately.
In the current moment, it would be hard to dispute that developments in AI have led to an increase in demand for data, and therefore in scraping. But two decades ago we might have drawn the cause-and-effect relationship in the opposite direction. Without scraping, we wouldn’t have resources like Common Crawl, which has existed since 2008, long before neural networks were established as the de facto approach in artificial intelligence development. Scraping has historically been used to create datasets for research, and machine learning is one of the many directions of research that scraped datasets enabled. Now that this research has become industrialized, we should concern ourselves with malpractice in accessing an input resource (data). What constitutes malpractice in the acquisition of online public data is a broader question than whether bots honor robots.txt. But as we answer it, we should remember that the practice of scraping enabled, and continues to enable, innovation, creativity, and accountability. Promoting the necessary and legitimate forms of each while limiting the extractive uses of automated data collection is going to be an important challenge in the coming years.
That AI developments are leading to scraping is clear. But is open-source AI more responsible than the rest for an increase in scraping? Hard to say. And why do we need to establish this relationship at all? While it makes sense to discuss open sourcing in AI as a risk management strategy for a dual-use technology, tying open-source models to scraping doesn’t provide a productive way of thinking about copyright or data privacy, which are primarily principle-based debates.
The need to privilege clarity over a grand framework
To sum up, the question posed by the original piece and quoted in the first paragraph here is an odd starting point for discussing the responsibility and liability of AI, because it places a poorly understood term, open-source AI, at the fulcrum of the discussion. Furthermore, by clubbing upstream negative effects, such as copyright violations from an increase in scraping, with downstream effects, such as the nefarious use of developed models, it stretches the canvas of user harms too wide and arguably too thin. As we think about regulation, we might be better served by thinking through each of these policy issues on its own terms, and not sacrificing clarity in search of a grand unified framework.
Tarunima is the research lead and co-founder of Tattle, an open source, civic tech project building solutions to respond to inaccurate and harmful content in India.