University of Waterloo response to the Consultation on Copyright in the Age of Generative AI

Tuesday, January 16, 2024

On October 12, 2023, the Government of Canada, through Innovation, Science and Economic Development Canada, asked Canadians to provide their thoughts through its Consultation on Copyright in the Age of Generative Artificial Intelligence, which closed on January 15, 2024. Responses were collected in a survey format that asked for long-form answers to prompts on a number of thematic areas. The Copyright Advisory Committee and Government Relations led a consultation open to all members of the UWaterloo community in November; individuals were also encouraged to submit their own responses. The University's response to the survey is copied here:

Technical Evidence

The University of Waterloo Context

We appreciate this opportunity to share our experiences and views on the current challenges and opportunities presented by copyright and generative artificial intelligence. As a premier comprehensive research institution in Canada, the University of Waterloo is committed to supporting inquiry through research and learning. The University enrolls more than 35,000 students across its six faculties and is home to the world’s largest co-operative education system of its kind.

The University of Waterloo has consistently been the top comprehensive research university and has also been recognized as Canada’s most innovative university for over a quarter of a century. The University’s uniquely entrepreneurial culture encourages experimentation and innovation. This culture is supported by the University mission to advance learning and knowledge through teaching, research, and scholarship, nationally and internationally, in an environment of free expression and inquiry.

Balanced copyright law is necessary for us to achieve our mission. The University of Waterloo has demonstrated a balanced approach through its IP policy, under which, in many cases, the author is the owner of copyright in works created through research and teaching.

We urge you to use this opportunity to reinforce the foundation of the Copyright Act to facilitate the increase of access to information, the advancement of knowledge, and the continued technological growth of Canadian society.

Development of AI systems

The way AI systems are used and designed is heavily context-dependent. At the University of Waterloo, researchers approach the development of AI systems through a wide variety of lenses, using approaches based on their discipline, research question, and the availability of data. Research on AI takes place across our campus, from Computer Science to Economics, from Engineering to English Language and Literature.

Comfort with copyright risk and the ability to purchase permission for use heavily influence the kinds of systems that can be developed and by whom. Researchers might choose a method of sourcing content based on the kind or focus of the system being designed, the amount of funding dedicated to the project, or the kind of hardware or software they have available to them. Given copyright limitations and variance between national copyright laws, some researchers will even limit use of works based on the countries in which their collaborators reside.

When accessing information for training datasets, some researchers rely on web scraping to gather publicly available information. Others rely solely on information that is openly licensed (for example, through Creative Commons) or in the public domain. Still others are able to rely on information licensed by the institution for text and data mining purposes. Those who rely on open, public domain, or institutionally licensed content necessarily have smaller training datasets, which limit the capabilities and outputs of the final system. For example, a system built on only public domain data would not be able to surface information about COVID-19. Those who rely on institutionally licensed data are limited by the licence conditions, which may allow training of a system, but not allow generative output of more than a few hundred words.

How this data is used will differ depending on the kind of system that is being offered. For example, a researcher designing a generative system will use the underlying data to train the system such that it can find patterns and create output, like new images, text, or code. A researcher designing a recommender or classifier system would use the underlying data to train the system such that it could make a recommendation or provide a classification for a user. The generative system faces copyright challenges in the input and output stages, whereas the recommender or classifier system faces copyright issues mainly with input.

Changes to the Copyright Act should be made with an understanding that there is a wide range of users and creators of AI systems with a diverse range of use cases.

Use of AI generated content

In day-to-day operations, the University is exploring how generative AI can be used to improve service and processes. For example, staff are testing AI tools' capacities for assisting with basic writing and evaluation tasks and for deploying chatbots to provide frontline user support.

The University is only beginning to understand how use of generative AI will impact the way instructors teach and students learn. As with many other institutions, the University is taking a varied approach to the use of AI by students, encouraging instructors to be clear with students about their class policy. There are several instructors who are actively engaging with AI services in the classroom, even incorporating them into assessments. Many support staff have spent a great deal of time working to clarify what AI systems can and cannot do, when they can be used in a pedagogically sound way, and how instructors can maintain the integrity of their courses. The University has also provided clarity around copyright-related risk for the use of AI-generated content in teaching, guiding instructors to make risk-informed use of content when connected to learning outcomes.

Concerns about the nature of the tools available and their capacity for infringement have limited uptake in some areas. Changes to the Copyright Act that address AI systems' potential for infringement and user liability would increase confidence in use.

Text and Data Mining

TDM Activities and factors

Informal observation suggests that research involving TDM is being conducted across many Canadian universities, including the University of Waterloo. It appears that researchers take several approaches to do so, including relying on the fair dealing exception, working under licensing agreements, or web-scraping information from the publicly accessible Web. How researchers engage with these projects seems to be influenced by what is understood as the common practice in their field; for example, researchers may look to web-scraping practices widely used in the US and EU. Note that in each of these cases, researchers must engage in a careful risk analysis regarding their planned use and any potential conflict with contracts from various providers. In many cases, the researcher could make a good case for TDM under the Copyright Act, but many institutional licenses or website terms of use would prohibit them from further engaging in their research. In addition, given the scale of data collection needed, analyzing the terms of use for each website to determine whether each site could be used would not be feasible.

To provide clarity for AI developers and users and enable Canada to be competitive in AI research and development, we recommend that the fair dealing exception be made illustrative (i.e. through the use of “such as” language) and extended to include informational analysis. Fair dealing is currently limited to the eight enumerated purposes in the Act. Although the Supreme Court has encouraged a large and liberal interpretation of these purposes, adding illustrative language and the specific purpose of informational analysis would give researchers developing AI systems greater confidence. Explicit inclusion of informational analysis would provide clarity and would reduce complications in using fair dealing for TDM research; rather than guessing that TDM might fit under a research purpose, researchers would be confident their work would fit under informational analysis. Illustrative language would have the effect of legislating the Supreme Court’s guidance on large and liberal interpretation of the purposes. The use of the works for this purpose would have all the safeguards of fair dealing, requiring the use to be fair and tested against the six-factor test outlined by the Supreme Court in CCH Canadian Ltd. v. Law Society of Upper Canada, 2004 SCC 13 (CanLII). While commercial uses would not necessarily be precluded, the six-factor test would act as a safeguard against unchecked commercial usage. A precondition of using fair dealing is that the content be accessed legally, which is another safeguard. This expansion of the fair dealing exception would be most effective if accompanied by the addition of a contract override provision and a revision to allow circumvention of technological protection measures (TPMs) for non-infringing purposes.

Many uses of content are limited by contracts of various kinds – whether that be the terms of use of a website or a streaming platform, or an institutional license (e.g., the contracts libraries may sign to gain access to databases). Contract override ensures that users are not prevented from exercising their user rights under the Act by signing away those rights via contract. Note that contract override does not mean that users do not have to pay for access; rather, it means that once they have access, all users have the same rights to reuse. In this way, contract override works to ensure a more democratic ability to use information and respects the technological neutrality goal of the Act. This would mean that the ability to exercise one’s user rights under the Copyright Act would not be connected to negotiating power. Currently, large companies or user groups with more resources are more likely to be able to negotiate a licence with favourable terms than a single person, who might be saddled with the standard terms and conditions of a website with no option to negotiate. With contract override, this would not be an issue. We would encourage the Government to explore the contract override provision in Ireland’s Copyright and Related Rights Act, 2000, section 2(10).

The current TPM language in the Act does not allow users to make non-infringing uses of a work (aside from limited carve-outs, e.g., accessibility uses) when a TPM is in place. This means that users are denied the ability to exercise their user rights for content that is digitally locked down. This violates principles of technological neutrality; for example, a user may be able to copy a short excerpt from a print book under fair dealing, but not from the same title in eBook form. In terms of AI usage, if a user wanted to incorporate video content stored on DVDs in a training dataset, digital rights management on the DVD might prevent them from copying it off the original medium. Current language would not allow a user to break the digital lock to use the content for fair dealing purposes. We recommend that the TPMs section in the Act be revised to allow users to exercise user rights (e.g., fair dealing) regardless of TPMs.

A multifaceted approach would enable development of AI while providing opportunities for creators to be compensated where appropriate.

Licenses available

The University of Waterloo Library licenses information from a wide variety of publishers. The bulk of our licenses are with foreign information providers, which influences both the copyright law referenced in our contracts and our ability to rely on Copyright Act exceptions when using this content. Only 14% of our licenses permit any kind of TDM. Of those that do allow TDM, most come with heavy restrictions on reuse of the data that would make incorporation into AI systems impractical, such as restrictions on the number of words that can be included in any published extract. Some of the publishers that do not allow TDM as part of their standard license will offer it as an added service, but often at prices researchers consider prohibitive. Traditionally, TDM allowances in Library licenses have been directed at enabling research that looks for patterns within a corpus to answer a specific research question; these patterns may have been found programmatically, but not through the building of an AI system or service. In recent months, publishers have started to introduce new clauses in our contracts that aim to forestall use of content as training material for AI systems, further clarifying their position that the TDM clauses in these licenses were not designed with AI in mind.

Obligations for AI systems

The University understands AI development from the perspective of researchers and learners as developers of AI systems. To try to strike a balance between the requirement for attribution to respect the moral rights of the creators and the complexity of AI systems, the government could use language similar to the non-commercial user-generated content exception (s. 29.21). As provided in paragraph 29.21(1)(b), the user of a work is required to mention the source where it is reasonable to do so. Using similar language when developing provisions for AI systems would encourage attribution while being flexible. That said, there are two sets of issues with AI systems related to copyright and attribution: input and output. Regarding input, a compromise between traditional respect of moral rights (attribution) and feasibility in an AI development environment would be for developers to maintain records of what content was used to train their systems. That is, developers should be able to tell users where content was sourced, with the understanding that it is extremely difficult, if not impossible in many cases, for developers to identify the copyright owner and to provide a list of copyright-protected content used to train a system. While a great deal of content is used and maintaining a list of sources is a time-consuming task, this requirement could help creators understand where their works are being used. Regarding output, at this early stage in the development of AI systems, it seems likely that generated outputs contain only insubstantial amounts of the training material, amounts so small that the Copyright Act does not require attribution or permission.

Authorship and Ownership of Works Generated by AI

We recommend that the status quo be maintained. The current requirements for copyright protection, namely originality, fixation, and the exercise of skill and judgement, work well. We understand these requirements to mean that content generated by AI is not protected unless an original work is created with the sufficient addition of human skill and judgement. We support the approach taken by the United States Copyright Office in their document, Copyright registration guidance: Works containing material generated by artificial intelligence, and recommend the work of Dr. Carys Craig, The AI-Copyright challenge: Tech-neutrality, authorship, and the public domain, as two slightly different but interconnected ways to approach this issue.

Infringement and Liability regarding AI

At present, the way that many AI systems are trained relies on encoding content, changing bits of text into numbers that relate to each other (the Stephen Wolfram Writings article What is ChatGPT doing and why does it work may be helpful for understanding this). When a user enters a prompt, the system draws on patterns found among those small bits of the training data to generate new content. Currently, those small bits of information are not encoded with attribution information, and so the resulting system is not able to provide the end-user with information about the source material. This has implications for testing infringement and for considering liability. Regarding infringement, the end-user does not know what sources are used to generate the content, and so is limited in their ability to understand whether permission is needed. They could compare the result to existing content if they had started with a reference, but lacking that, they may have no way of knowing whether something is infringing, especially given the breadth and depth of the training models. Regarding liability, we again face the issue of dual copyright concerns related to input and output. When it comes to outputs, things are more complicated, and depend on the promises made by developers to users. If a developer promises that their content is copyright-cleared, the user would have a reasonable expectation to be able to use the service without issue. Both infringement and liability would be better addressed by reducing the risk of infringement through the expansion of the fair dealing exception and the addition of a contract override provision.
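As a rough illustration of the encoding step described above, the following Python sketch turns text into numbers using a simple word-level lookup table. It is a toy under stated assumptions, not a description of any particular AI system: the two source texts and the names `sources`, `build_vocabulary`, and `encode` are invented for this example, and real systems typically use subword tokenizers and learned numeric representations instead.

```python
# Toy illustration only: real systems use subword tokenizers and learned
# numeric representations, not this word-level lookup table.

# Hypothetical source texts (invented for this example).
sources = {
    "source_a": "the cat sat on the mat",
    "source_b": "the dog sat on the log",
}


def build_vocabulary(texts):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocabulary = {}
    for text in texts:
        for word in text.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(text, vocabulary):
    """Replace each word with its integer ID."""
    return [vocabulary[word] for word in text.split()]


vocabulary = build_vocabulary(sources.values())
encoded = {name: encode(text, vocabulary) for name, text in sources.items()}

# Words that appear in both sources receive the same IDs whichever source
# they came from, so the numbers alone cannot say which work contributed them.
shared_ids = set(encoded["source_a"]) & set(encoded["source_b"])
```

In this sketch the encoded lists hold only integers. Any link back to a source would have to be kept in a separate record, which is consistent with the record-keeping compromise discussed earlier in this response.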

Comments and Suggestions

Copyright is just one tool in the toolbox for providing appropriate controls on use of content in AI systems. A multifaceted, nuanced approach will be necessary to enable AI to be an opportunity rather than a hindrance to society. Changes to law, policy, and regulations should be made with an eye to equitable access and use of this new technology across our society.