Fuelling AI’s Brain: Cracking Open Copyright for Training AI
Pınar Bakırtaş, LL.M. (Max Planck Institute for Innovation and Competition & WIPO- Ankara University)
1. Introduction
The future arrived sooner than expected. The Jetsons may still be ahead on flying cars, but they were spot on when it comes to robot-powered living. Even though Rosey the robotic maid is still missing from today’s homes, Artificial Intelligence (AI) tools have found their place in most households. The global use of AI has accelerated, particularly with the spread of generative AI tools that produce text, images, sound and more in a remarkably human-like manner. Innovative uses of AI emerge constantly, often in unexpected ways. Among the most prominent is The Next Rembrandt project, in which AI algorithms analysed Rembrandt’s paintings to generate a new artwork in the artist’s style. More recently, with the release of OpenAI’s new image generator, it has become a trend to produce images in the style of Studio Ghibli’s anime productions. An image generated with Midjourney titled “Théâtre D’opéra Spatial” won first place in a fine arts competition. The AI-generated song “Heart on My Sleeve” mimicked the artists Drake and The Weeknd before being taken down; meanwhile, Drake himself used AI in one of his own songs to imitate the voice of the deceased rapper Tupac.
While the examples of Generative AI creations get more interesting by the day, the conflicts they bring to the copyright field are not negligible. The robotics-enabled future seemed unproblematic in the Jetsons; today’s reality, however, comes with fine print. At the heart of the legal debate over Generative AI are two key actions performed by the technology: learning from existing data (training) and creating new outputs. This paper focuses on the copyright conflicts triggered by AI’s use of copyright-protected material at the training stage.
Accordingly, Part 2 explains how AI works, and Part 3 sets out copyright protection and its exceptions in the context of AI. The following parts explain how different jurisdictions approach copyright exceptions for AI training on copyrighted material: Parts 4-7 cover the US, EU, UK and Japanese approaches respectively. Lastly, Part 8 concludes with final remarks.
2. The Functioning of AI
AI refers to technology that enables machines to perform tasks that would normally require human intelligence, such as making decisions and understanding natural language. The main technique through which AI systems learn to do these things is machine learning. In traditional programming, programmers write step-by-step instructions and the computer simply follows them: when data is provided, the computer applies the given instructions and produces an outcome. In machine learning the method is reversed. A machine learning system is not told exactly what to do. Instead, it is fed large amounts of data, and a learning algorithm is used to learn patterns from that data and make predictions. One common use of machine learning, for example, is the recommendation systems found on social media or streaming platforms.[1]
Machine learning can rely on different learning algorithms, such as linear regression, decision trees, random forests and neural networks. The technical details of these methods are beyond the scope and purpose of this article. Neural networks nevertheless deserve mention, as they are inspired by how the human brain works. Deep learning is an enhanced method of machine learning that uses neural networks more intensively, typically by stacking many layers.[2] As a result, deep learning can identify very complex patterns and make more precise predictions, and the more data it processes, the better it generally gets at predicting. Deep learning is used in tasks such as translating languages, understanding natural language and generating art. It is the core technology behind many Generative AI systems, which are designed to create new content. These systems require large datasets to learn the underlying patterns; this process is called training. In Generative AI, training requires a massive amount of existing content such as text, images or music. After training, deep learning systems can generate new content.
Both the training process and the generation of content may raise copyright infringement issues. Although generating new content is unique to Generative AI, training on copyrighted data (such as texts and images) is also common to the broader set of AI systems that learn from data without generating content.
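The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustrative toy, not how any real recommendation system works: the viewing data, the 30-minute rule and the function names are all assumptions made up for the example.

```python
# Traditional programming: a human writes the rule explicitly.
def recommend_by_rule(minutes_watched: float) -> bool:
    # Hypothetical hand-picked threshold of 30 minutes.
    return minutes_watched >= 30.0


# Machine learning: the rule (here, a threshold) is inferred from examples.
def learn_threshold(examples: list[tuple[float, bool]]) -> float:
    """Return the candidate threshold that misclassifies the fewest examples."""
    candidates = sorted(minutes for minutes, _ in examples)

    def errors(threshold: float) -> int:
        # Count examples where the prediction disagrees with the label.
        return sum((minutes >= threshold) != liked for minutes, liked in examples)

    return min(candidates, key=errors)


# Made-up training data: (minutes watched, whether the user liked the show).
history = [(5.0, False), (12.0, False), (18.0, True), (25.0, True), (40.0, True)]
threshold = learn_threshold(history)


def recommend_learned(minutes_watched: float) -> bool:
    return minutes_watched >= threshold
```

In the first function the programmer decides the rule; in the second, the rule is derived from the data, so different training data would yield a different rule without any change to the code.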
The next part explains how copyright and its exceptions relate to the training stage of AI systems.
3. Copyright and Exceptions in AI Training
Copyright grants creators exclusive rights over their original works, which can include literary, artistic, scientific or other creations. To qualify for protection, a work must be both original and independently created (not copied from another source). The exclusive rights of a copyright owner typically include the right to reproduce the work, communicate it to the public, distribute it and create adaptations, although the rights and their scope vary between jurisdictions. In the context of AI, the training process often involves making copies of copyrighted works to feed into machine learning systems. This act of copying directly engages the copyright owner’s right of reproduction, making it a central point of legal discussion around AI training practices.
Different copyright exceptions may apply to the training of AI systems. A copyright exception allows the use of copyrighted material without the permission of the right holder and, in some cases, without any compensation. An exception may be drafted narrowly and with a clearly defined scope. One of the most prominent exceptions of this kind relevant to AI training is the so-called text and data mining (TDM) exception. Even though the term is not synonymous with AI training, it is argued to include it. The TDM exception appears in several jurisdictions, such as the EU, the UK and Japan, though its scope and application vary across them. In the US, by contrast, AI training falls to be considered under the general fair use doctrine, a flexible legal principle applied on a case-by-case basis.
4. US Fair Use: Four Factors at the Heart of the Legal Storm
The US is one of the leading hubs of AI research and investment, home to major players such as Google, OpenAI, NVIDIA, Microsoft and Meta. More than 40 copyright lawsuits against AI powerhouses have been filed and are pending before US courts, significantly more than in any other jurisdiction. Given this volume of litigation, global attention is on the US as a key battleground for the unfolding conflict between AI and copyright.
In the US, limitations on copyright are set out in 17 U.S. Code §§ 107-122. While § 108 onward provide limitations for particular circumstances, § 107 introduces the fair use doctrine as a catch-all clause that applies a factor test on a case-by-case basis. Under this section, a use that is determined to be fair use is not an infringement of copyright. The fair use analysis requires consideration of four factors:
“(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.”[3]
The flexibility of the fair use analysis allows new forms of use to be assessed dynamically, weighing the technical developments of the day against the interests of the copyright holder long after US copyright law was drafted. The courts weigh each of the above factors to determine whether it favours one party or the other, and then make a final decision based on all the factors together. It is important to note that the factors are not equally significant for the decision: US courts usually give factors one and four more weight than the rest.
The first factor has a strong bearing on the decision.[4] The most important issue under factor one is whether the use is transformative. A use can be transformative if it alters the work so as to parody it, comment on it or otherwise add new meaning. It can also be transformative if the work is used for a different purpose and in a different market than those for which it was created.[5] Views differ on how this criterion applies to AI training. Some suggest that the training use is transformative because it converts the copyrighted material into numeric data for the training process, which serves a functional purpose.[6] In support, they cite earlier judgments finding fair use where non-expressive elements were copied, such as disassembling a video game’s object code into readable source code[7] or converting photographs into thumbnail images for a search engine.[8][9] It is claimed that AI training builds a technology that lets users create new expression, whereas the original copyright-protected content was created for aesthetic and entertainment purposes. It is also claimed that AI training can be characterised as non-expressive use, meaning that it does not take expressive elements from the work.[10]
On the other side, it is argued that this training process is no different from how a human learns.[11] On this view, the original works used as training data are created not just for enjoyment or aesthetics but also to inform and enlighten. Humans use created works “to recognize patterns, understand context, or make associations between words, sentences, and paragraphs” just as AI systems do[12], hence the purpose is no different. In its most recent decision on the issue, the Supreme Court held that the transformative use test does not turn simply on adding new meaning or purpose but also on whether the use will compete in the market of the original work.[13] Here, it is claimed that training ultimately serves the purpose of creating outputs that will compete with the copyrighted training data.[14]
Factor two considers the nature of the copyrighted work. The analysis depends on how creative the original work is: the lower the creativity, the weaker the protection.[15] Given the vast volume of material used to train AI models, the datasets inevitably span a wide spectrum and include highly creative works among the music, text and images used. On the other hand, it is pointed out that copying from published works is more likely to be considered fair use.[16] Some argue that, for AI, the material fed in is merely data points that all look alike, so this factor is neutral.[17]
The third factor concerns the amount of the work copied, assessed both qualitatively and quantitatively. Even a small amount taken may weigh against fair use if it constitutes the “heart of the work”. The opposite holds true as well: it has been noted that even taking the whole work might be reasonable where the use is “valid and transformative”.[18]
Lastly, the effect of the use on the potential market is regarded as a key factor in determining fair use. Market harm is usually assessed by asking whether the use creates a substitute for the work.[19] In the first US decision on AI training, a summary judgment, the District Court of Delaware found that the training activity was not transformative because the resulting product would compete with the material it was trained on.[20] The District Court has since granted the request for an interlocutory appeal of its decision to the Third Circuit and stayed proceedings pending the appeal, while stating that it remains confident in its judgment. The commercial nature of AI companies’ activities also weighs against fair use under factor one, although it has been held that the commercial nature of a use is less important when the use is transformative.[21]
Given the many ongoing cases, it is unclear how the US formula for AI training will unfold and how the upcoming decisions will affect one another. It is clear, however, that the analysis of whether AI training is transformative will play a decisive role, not least because it also affects the market substitution analysis.
5. The EU TDM exception: A Default Go-Ahead with a Veto Button
The TDM exceptions are established under Arts 3 and 4 of the Digital Single Market (DSM) Directive of 2019. They are regarded as the exceptions that could apply to the training of AI systems on copyright-protected material. They have been incorporated into the DSM Directive as mandatory exceptions that all Member States must transpose into their national laws.
Art 2(2) of the DSM Directive defines TDM as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.” As the definition reveals, TDM is not synonymous with AI training. It covers a broader set of acts than AI training alone, but it is generally acknowledged that TDM can cover the web scraping, pre-training and training stages of AI under EU law.[22] This view has been strengthened by the references to the TDM exception of the DSM Directive in Recital 105 and Art 53(1)(c) of the AI Act.[23] However, the TDM exception is not a comprehensive safeguard for all AI training activities; it is considered to cover various aspects of training but not necessarily the whole training process.[24] Post-training acts, including AI-generated output, are considered to fall outside the TDM exception.[25] Nor do the TDM exceptions cover subsequent acts of communication to the public.[26]
Under EU law, TDM is provided as an exception to the copyright holder’s right of reproduction. Art 3 of the DSM Directive provides a TDM exception for the purpose of scientific research, whereas Art 4 introduces a general exception covering all TDM activities, including those for commercial purposes. Specifically, Art 3 exempts reproductions and extractions of lawfully accessed works made for the purpose of scientific research by research organisations and cultural heritage institutions. Art 4, on the other hand, exempts reproductions and extractions of lawfully accessible works for any TDM activity. Copies made for TDM in scientific research may be stored, including for verification purposes[27], whereas copies made under the general exception may be retained only for as long as is necessary for the purposes of TDM.[28]
The TDM exceptions apply automatically when the conditions mentioned above are met. The crucial distinction between the two exceptions is the opportunity for right holders to reserve their rights (commonly referred to as an opt-out) under Art 4(3) of the DSM Directive. There is no option to reserve rights against TDM carried out for the purpose of scientific research. For works that are publicly available online, right holders must opt out of TDM activities in an appropriate manner, such as by “machine-readable means”. This concept is explained further in the recitals, which serve as interpretive tools although they are not legally binding. They give “metadata and terms and conditions of a website or a service” as examples of machine-readable means. The recent German decision Kneschke v LAION[29] includes an obiter dictum that a reservation expressed in natural language could be considered machine-readable. It adds that such a reservation would be assessed in light of the technical capabilities available at the time of the use.[30]
Very importantly, the opt-out provision also opens the door to remuneration of right holders for the use of their works in TDM. Because right holders are allowed to opt out, they can either simply reserve their rights or reserve them unless they are remunerated for such uses.[31] Art 53(1)(d) of the AI Act obliges providers of general-purpose AI models (which in practice include Generative AI models) to draw up and make publicly available a sufficiently detailed summary of the content used for training, which facilitates the opt-out mechanism of Art 4(3) of the DSM Directive. This obligation helps to address the difficulty right holders face in knowing which AI models were trained on their works.
The method of such remuneration is highly disputed. Suggestions include the involvement of collective management organisations[32], agreements with AI developers for royalty payments through set fees or revenue sharing[33], and compensation based on those outputs of Generative AI that could substitute for the original works[34]. On remuneration, some argue against an opt-out system and in favour of a statutory licence.[35] A statutory licence would remove the option of simply opting out and instead permit all TDM of works in exchange for remuneration of right holders. YouTube’s Content ID offers a comparison on right holders’ motivation to seek compensation: over 90% of copyright holders there opt for monetisation rather than blocking the content or tracking it (collecting viewing statistics without blocking or monetising).[36] Generative AI systems, however, introduce a distinct dynamic. Having been trained on original works, these models can generate outputs that may directly compete with or replace the work of right holders. This may drive some right holders simply to opt out.
The first ever decision on the TDM exception in the EU was handed down on 27 September 2024 by the Regional Court of Hamburg (Landgericht Hamburg) in Kneschke v LAION. The dispute concerned a photograph by Robert Kneschke that was used in a dataset of LAION, a non-profit organisation. The dataset was created by reproducing publicly available images together with their descriptions and analysing them with software to check whether image and description matched. The resulting dataset consists of text-image pairs that can be used to train Generative AI models. The plaintiff argued that he had opted out and that the reproduction of his image therefore constituted copyright infringement. The court decided that the reproduction of the plaintiff’s image fell under the TDM exception for scientific research.[37] The claimed opt-out therefore had no effect under this exception. In applying the exception, the German court reasoned that LAION was a non-profit organisation, that the dataset was not created for commercial purposes and that it was made publicly available for free.[38] The court found it irrelevant that the dataset would be used for commercial purposes by AI systems.[39]
The court also considered the defence of temporary copying, which some also invoke in relation to AI training. The temporary copying exception in Art 5 of the Information Society Directive requires that the act of reproduction a) be transient or incidental, b) constitute an integral and essential part of a technological process, c) enable a lawful use and d) have no independent economic significance. The German court dismissed this defence. It explained that, for a copy to be considered transient, it must be deleted automatically without any human intervention and must be stored only for as long as is necessary for the technical process. The court reasoned that the duration of the storage was determined by the defendant and that deletion occurred only because the process had been deliberately programmed to do so. The copy was also deemed not to be incidental, because downloading the image was a conscious and active part of the AI analysis.[40]
While the TDM exceptions have been established and transposed into national laws in the EU, decisions by national courts and the recently published third draft of the General-Purpose AI Code of Practice seem to provide further interpretive guidance for the moment.
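To illustrate what a reservation by “machine-readable means” in metadata could look like in practice, the following minimal Python sketch checks a web page for a rights reservation before it is mined. It rests on loud assumptions: the metadata name `tdm-reservation`, the value `1` and the function names are illustrative choices for this example only, and nothing in the Directive prescribes this or any other particular format.

```python
from html.parser import HTMLParser


class ReservationFinder(HTMLParser):
    """Looks for a hypothetical <meta name="tdm-reservation" content="1"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        # Assumed convention: content "1" means the right holder opts out.
        if name == "tdm-reservation" and attributes.get("content") == "1":
            self.reserved = True


def rights_reserved(html: str) -> bool:
    """Return True if the page carries the assumed opt-out metadata."""
    finder = ReservationFinder()
    finder.feed(html)
    return finder.reserved


def may_mine(html: str) -> bool:
    # Under this sketch, a crawler skips any page that reserves rights.
    return not rights_reserved(html)
```

A crawler relying on the general exception would run such a check on each page and skip those for which `may_mine` returns False. A reservation written only in natural-language terms and conditions would not be caught by this kind of check, which is why the obiter dictum in Kneschke v LAION matters in practice.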
6. The UK TDM exception: Renewed Debate over AI and Copyright
The UK adopted its TDM exception in 2014, even before the EU, in Section 29A of the Copyright, Designs and Patents Act.[41] Unlike the EU’s two-fold approach, the UK exception to the right of reproduction covers lawfully accessed works only for non-commercial purposes. In 2021, the UK Intellectual Property Office (UKIPO) initiated a consultation on how AI should be addressed within IP laws. One of the topics was the broadening of the TDM exception. The outcome of the consultation was a proposal to introduce a TDM exception for any purpose, including commercial purposes, without an option to opt out.[42] However, after major backlash from the creative industries, the UK government decided not to proceed with the proposal.[43]
In December 2024, the UKIPO launched another consultation, this time specific to AI and copyright. The stated objectives of the consultation are to support right holders’ control over their works, to support the development of leading AI models in the UK and, lastly, to promote greater trust and transparency. On the TDM exception, the UKIPO presented a proposed approach similar to the EU’s: a TDM exception for any purpose, including commercial purposes, covering lawfully accessed works, with an option for right holders to reserve their rights. The consultation also notes that the EU opt-out is not very precise and that the UK would like to be clearer on the method of reserving rights. It adds that there is a need for transparency regarding the training inputs used by AI developers.[44] The consultation closed on 25 February 2025 and its outcome is pending. In the meantime, the submitted responses include concerns raised by artistic communities, highlighting in particular the practical hardships of an opt-out approach and the need for fair remuneration of right holders.[45]
7. The Japanese TDM exception: AI in Safe Haven
The Japanese Copyright Act was amended in 2018 to include Art 30-4, a copyright exception covering TDM. The provision is very broad and could be considered a dreamland for AI companies. The exception concerns the use of a copyright-protected work for non-enjoyment purposes. Among the listed non-enjoyment uses, TDM is captured as data analysis, meaning the extraction, comparison, classification or other statistical analysis of data.[46]
In the context of Japanese copyright law, enjoyment refers to appreciating the expression of a work through the human senses. Where a use does not aim at such enjoyment, as in the case of TDM, it is considered to fall outside the scope of copyright protection.[47] As mentioned above, a similar discussion exists in US copyright law as well.
The Japanese exception permits all TDM activities regardless of whether they are carried out for commercial purposes.[48] There is also no opt-out option for right holders, and it has been suggested that any contractual reservation of rights would be invalid.[49] In addition, there are no restrictions on the type of use made in the course of TDM.[50] The only restrictive condition seems to be that the use must not cause unreasonable prejudice to the interests of the copyright holder. Accordingly, it has been commented that the exception would not extend to works accessed unlawfully.[51] Others take the contrary view, reading the law’s silence on the lawfulness of access as meaning that no such limitation was intended.[52]
Although copyright is territorial, international agreements such as the Berne Convention, the TRIPS Agreement and the WIPO Copyright Treaty provide global standards and principles for copyright protection. For example, all of these agreements contain the three-step test for limitations on copyright, under which any exception must i) be confined to certain special cases, ii) not conflict with the normal exploitation of the work and iii) not unreasonably prejudice the legitimate interests of the right holder. The very broad Japanese TDM exception might be at odds with the last two steps. Non-compliance with the three-step test under the TRIPS Agreement, in particular, could be brought before a World Trade Organization dispute settlement panel.
8. Conclusion
As AI, and Generative AI in particular, continues to evolve, spread and become integrated into society, the legal, policy and ethical questions surrounding it grow more heated. In terms of copyright, the conflict mostly relates to the reproduction of works in the many activities that make up the input and output stages of Generative AI. The copyright issues at the input stage are heavily debated in light of the ongoing cases and the differing approaches emerging across jurisdictions. This cutting-edge technology is once again testing the limits of traditional copyright law, while copyright law itself is trying to shape the development of this breakthrough innovation. With both sides skating on thin ice, it is crucial that the balance between innovation and protection stays at the heart of the discussion.
[1] Karen Hao, ‘What Is Machine Learning?’ (MIT Technology Review, 2018).
[2] ibid.
[3] 17 U.S. Code § 107.
[4] Van Lindberg, ‘BUILDING AND USING GENERATIVE MODELS UNDER US COPYRIGHT LAW’ (2023) 18 Rutgers Business Law Review 1 51.
[5] ‘Kadrey v Meta, Amicus Brief of Copyright Law Professors, Document 525, 11 April 2025’.
[6] Michael D Murray, ‘GENERATIVE AI ART : COPYRIGHT INFRINGEMENT AND FAIR USE’ (2023) 26 SMU Science and Technology Law Review.
[7] Sega Enterprises Ltd v Accolade Inc, 977 F 2d 1510 (9th Cir 1992)
[8] Kelly v Arriba Soft Corp, 336 F 3d 811 (9th Cir 2003); Perfect 10 Inc v Amazon.com Inc, 508 F 3d 1146 (9th Cir 2007).
[9] Murray (n 6) 280.
[10] ibid.
[11] Robert Brauneis, Copyright and the Training of Human Authors and Generative Machines, vol 1 (2024).
[12] ‘Kadrey v Meta, Amicus Brief of Copyright Law Professors, Document 525, 11 April 2025’ (n 5).
[13] Andy Warhol Foundation v. Goldsmith, 598 U.S. 508 (2023).
[14] ‘Kadrey v Meta, Amicus Brief of Copyright Law Professors, Document 525, 11 April 2025’ (n 5).
[15] Saliltorn Thongmeensuk, ‘Rethinking Copyright Exceptions in the Era of Generative AI: Balancing Innovation and Intellectual Property Protection’ (2024) 27 The Journal of World Intellectual Property 285.
[16] ‘Kadrey v Meta, Amicus Brief of Intellectual Property Law Professors, Document 509-2, 31 March 2025’ (2025) 7.
[17] Lindberg (n 4) 52.
[18] ibid 53.
[19] ibid.
[20] Thomson Reuters Enterprise Centre GmbH v Ross Intelligence Inc (US District Court, District of Delaware, No 1:2020cv00613, 11 February 2025).
[21] ‘Kadrey v Meta, Amicus Brief of Copyright Law Professors, Document 525, 11 April 2025’ (n 5) 8; Campbell v Acuff-Rose Music, Inc. 510 US 569 (1994).
[22] Pedro Quintais, ‘Generative AI , Copyright and the AI Act’ (2025) 56 Computer Law & Security Review 2.
[23] ibid.
[24] European Copyright Society, ‘Copyright and Generative AI : Opinion of the European Copyright Society’ 5.
[25] ibid 6.
[26] Eleonora Rosati, ‘Is Text and Data Mining Synonymous with AI Training?’ (2024) 19 Journal of Intellectual Property Law and Practice 851 <https://doi.org/10.1093/jiplp/jpae092> 851-852.
[27] DSM Directive Art 3(3).
[28] DSM Directive Art 4(2).
[29] Regional Court of Hamburg, Kneschke v LAION, 27 September 2024, 310 O 227/23.
[30] Kneschke v LAION paras 100-102.
[31] Martin Senftleben, ‘Generative AI and Author Remuneration’ (2023) 54 IIC - International Review of Intellectual Property and Competition Law 1535 <https://doi.org/10.1007/s40319-023-01399-4>.
[32] ibid.
[33] Nicola Lucchi, ‘ChatGPT : A Case Study on Copyright Challenges for Generative Artificial Intelligence Systems’ [2023] European Journal of Risk Regulation 1.
[34] Senftleben (n 31).
[35] Christophe Geiger and Vincenzo Iaia, ‘The Forgotten Creator : Towards a Statutory Remuneration Right for Machine Learning of Generative AI’ [2024] Computer Law & Security Review.
[36] YouTube Copyright Transparency Report, Jun 2023- Dec 2023.
[37] Kneschke v LAION paras 109-112.
[38] Kneschke v LAION paras 117-119.
[39] Kneschke v LAION para 114.
[40] Kneschke v LAION paras 58-66.
[41] Eleonora Rosati, ‘No Step-Free Copyright Exceptions : The Role of the Three-Step in Defining Permitted Uses of Protected Content ( Including TDM for AI-Training Purposes )’ Stockholm Faculty of Law Research Paper Series 15.
[42] UKIPO, ‘Consultation Outcome Artificial Intelligence and Intellectual Property: Copyright and Patents: Government Response to Consultation’ (2022) <https://www.gov.uk/government/consultations/artificial-intelligence-and-ip-copyright-and-patents/outcome/artificial-intelligence-and-intellectual-property-copyright-and-patents-government-response-to-consultation>.
[43] Minister of State, Department for Digital, Culture, Media and Sport, ‘Artificial Intelligence: Intellectual Property Rights’ (UK Parliament, 2023) <https://hansard.parliament.uk/commons/2023-02-01/debates/7CD1D4F9-7805-4CF0-9698-E28ECEFB7177/ArtificialIntelligenceIntellectualPropertyRights>.
[44] UKIPO, ‘Copyright and Artificial Intelligence’ (2024) <https://www.gov.uk/government/consultations/copyright-and-artificial-intelligence/copyright-and-artificial-intelligence> The options considered within the consultation also included leaving UK copyright laws as they are, requiring licensing in all cases and a broad TDM exception with no or few restrictions.
[45] Martin Kretschmer and others, ‘Response by the CREATe Centre to the UK Government’s Consultation’ (2025); Gaetano Dimita, Michaela MacDonald and Uma Suthersanen, ‘Queen Mary Law Research Paper No. 443/2025-Response to the Copyright and AI Consultati’ (2025).
[46] Japanese Copyright Act, Art 30-4.
[47] Artha Dermawan, ‘Text and Data Mining Exceptions in the Development of Generative AI Models : What the EU Member States Could Learn from the Japanese “ Nonenjoyment ” Purposes ?’ 54.
[48] Thongmeensuk (n 15).
[49] Eleonora Rosati, ‘No Step-Free Copyright Exceptions : The Role of the Three-Step in Defining Permitted Uses of Protected Content ( Including TDM for AI-Training Purposes )’ Stockholm Faculty of Law Research Paper Series 14.
[50] Artha Dermawan, ‘Text and Data Mining Exceptions in the Development of Generative AI Models : What the EU Member States Could Learn from the Japanese “ Nonenjoyment ” Purposes ?’ 54.
[51] Rosati (n 41) 17.
[52] Dermawan (n 47) 54.