Trouble in the cockpit? Popular AI tool faces class action lawsuit
A class action lawsuit has been filed against GitHub Copilot, alleging violations of the rights, including copyright, of authors who have created or contributed to codebases stored as public repositories on GitHub.
GitHub Copilot was released to the public earlier this year. The tool operates as a plugin for Visual Studio and similar IDEs and offers software developers 'auto-complete' functionality. As the developer starts to type (e.g. a function signature and its parameters), Copilot suggests lines of code in real time; the developer then simply checks that the suggested code does what they intended.
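By way of illustration, the exchange might look something like the sketch below. This is a hypothetical example of our own: the function name and the suggested body are invented, and merely mimic the kind of completion Copilot offers.

```typescript
// The developer types only a comment and a function signature...
// Convert a temperature in Celsius to Fahrenheit.
function celsiusToFahrenheit(celsius: number): number {
  // ...and Copilot proposes a body as inline 'ghost text', which the
  // developer can accept with a single keystroke:
  return celsius * (9 / 5) + 32;
}

console.log(celsiusToFahrenheit(100)); // 212
```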
We considered the possible IP implications at the time of the public launch, raising the possibility that training the algorithm may have infringed copyright in developers' code and that code written with Copilot's assistance could be vulnerable to third-party infringement claims.
Now, a lawsuit against the companies behind the tool has been filed in California. The complaint alleges that GitHub, Microsoft (which owns GitHub) and OpenAI (the AI lab responsible for ChatGPT, DALL-E 2 and GPT-3) have violated the rights of individuals who have posted their code to public GitHub repos ('repo' being shorthand for 'repository', a place where code is stored and changes can be tracked). The suit, seeking class-action status, targets both Copilot and OpenAI's Codex tool, which provides the technology underlying Copilot.
Purported evidence of utilising GitHub repo content
The class action's factual arguments rely on an observed pattern that Copilot appears to output snippets of code from certain repos verbatim. The plaintiffs observe, for example, that an extract from a popular JavaScript self-learning programme is blindly reproduced by Copilot, including comments and test code exercising the function (which are extraneous to the function itself). This particular learning programme is typically forked (copied verbatim as a starting point for a new codebase) by those undertaking the module.
As the class action complaint sets out, this means there are an unusually high number of instances of the tutorial code extract on GitHub and, in turn, the training data for the AI. This allegedly causes Copilot to autocomplete functions containing the entire tutorial extract, complete with tests for the functions and, apparently, question marks where the user of the tutorial had been expected to provide their own answer.
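To illustrate the pattern the plaintiffs describe, a hypothetical reconstruction might look like the following. This is not the actual extract cited in the complaint; the `isEven` function and its tests are our own invention, modelled on the behaviour the complaint alleges.

```typescript
// Hypothetical reconstruction of the kind of tutorial extract described in
// the complaint (not the actual code at issue). The developer begins typing
// the function, and Copilot allegedly completes the entire extract verbatim,
// tutorial tests and all.
function isEven(n: number): boolean {
  return n % 2 === 0;
}

// Tests from the tutorial, extraneous to the function itself:
console.log(isEven(2));  // true
console.log(isEven(7));  // false
console.log(isEven(50)); // ? <- the placeholder a learner was expected to fill in
```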
The plaintiffs deploy this observation like a 'Mountweazel trap' – a deliberately fictitious or distinctive entry inserted into a work to expose copying – to suggest that Copilot relies on an index of complete code snippets harvested from repos, rather than intelligently generating code in real time.
Legal arguments
The complaint makes three primary accusations:
- DMCA violations – The class action alleges that Copilot violates provisions of the Digital Millennium Copyright Act by distributing copyrighted works stripped of their copyright management information, i.e. reproducing snippets without the accompanying licence and author details.
- Breach of contract – By failing to reproduce open-source licence notices when code snippets are 'suggested', Copilot breaches the conditions of the licences under which the original code was made available to Copilot/Codex.
- Unfair competition – Copilot passes off code as an original creation, and GitHub, Microsoft and OpenAI have been unjustly enriched through Copilot's subscription charge.
Defence
At the time of writing, no Defence has been published, and we won't speculate in this article about potential defensive arguments under US law. However, we note that the following issues would likely arise in many jurisdictions:
- Standing – Copyright infringement claims typically need to be brought by the owner or exclusive licensee of the specific, identified copyright works alleged to have been infringed. Similarly, claims for breach of contract need to be brought by a party to the contract alleged to have been breached. It is common for claimants in actions arising from open source software to fail to demonstrate that they have sufficient standing to sue on these bases.
- Originality of source works – The code snippets that the plaintiffs rely upon are all utility functions – they were written to perform standard operations such as determining whether a number is even. Even if the first example of such a utility function was an original work, for commonplace functions it may be very difficult to identify the original author. The author of a textbook instructing learners how to print "Hello World!" will likely have copied the code from elsewhere, meaning that the sample code in their textbook is not itself an original copyright work capable of being infringed.
- 'Fair use'/'fair dealing' defences – The use of publicly available information to train a machine learning model may be considered fair use in some countries, and outputs based on training data may not be considered derivative works. In addition to existing 'fair use' defences, legislatures in many territories have enacted specific exemptions from copyright infringement liability for text and data mining, so as to encourage innovation and the development of new products and services based on AI.
A brave new world?
There is rather dramatic commentary in the complaint to the effect that the use of code stored on GitHub in this way runs against the open-source ideals of GitHub's original users. The complaint's authors consider that Copilot paves the way for a "brave new world of software piracy" in which AI "ignores, violates, and removes the Licenses offered by thousands—possibly millions—of software developers … on an unprecedented scale".
Existing intellectual property laws have struggled to keep pace with the age of machine learning, and policymakers are nervous about making changes that may stifle innovation. Other AI tools in use today were conceived and trained in similar ways to Copilot. In years to come, further litigation seems inevitable over whether AI technologies profit unfairly from the efforts of individual creators – whether of software code, written articles, photos, videos, songs or artworks – and over where the correct balance should be struck between protecting creators and encouraging the deployment of AI.
What happens next?
We expect that the open source community will keep a close eye on any future proceedings. An initial hearing for the case is yet to be fixed, and it may never make it to court. What is certain, however, is that the case has already generated a huge amount of interest. The open source community, developers of new AI-based tools, and the wider population of programmers who incorporate the outputs of AI tools into their code would all benefit from clarity, both from legislators and from the courts.