A written work is the words and symbols in it, and the order in which they appear. Alaska for Looking [1] is clearly a derivative work of Looking for Alaska; so only one of these (the words or their order) is necessary to preserve that identity.

1: nerdfighteria.info/v/Exiizp4Kh

A single chapter of a book is still a derivative work of that book; a book with every other word deleted is still a derivative work of the original. Clearly, complete reproduction is unnecessary to maintain derivative identity.

A statistical model capable of reproducing the structure of a work well enough to write valid code, then, must encode at least some of at least one of the contents or the ordering of symbols, and is thus a derivative work.
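To make the "encodes the ordering of symbols" point concrete, here is a toy sketch. It is not how Codex or Copilot works: it is a tiny word-level Markov model, and the names `train` and `generate` are made up for illustration. It assumes a single training text and greedy generation. The model never stores the text as one string, only which word followed each pair of words, and that is already enough to emit the original verbatim.

```python
from collections import defaultdict

def train(words, order=2):
    """Map each run of `order` words to the words seen right after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context].append(words[i + order])
    return model

def generate(model, seed, max_words=50):
    """Extend `seed` greedily, always taking the first continuation seen."""
    out = list(seed)
    order = len(seed)
    while len(out) < max_words:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break  # no known continuation: the training text ended here
        out.append(followers[0])
    return out

# Hypothetical one-sentence "corpus", lowercased and stripped of punctuation.
text = "a written work is the words and symbols in it and the order in which they appear"
words = text.split()

model = train(words)
reproduced = " ".join(generate(model, words[:2]))

print(reproduced == text)
```

With a corpus this small every two-word context is unique, so the statistics pin down the whole ordering. Real models are trained on far more data and blur things together, but the sketch shows why "it's only statistics" does not by itself mean "it contains none of the work".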


In other words: GitHub, stop laundering copyright. If you trained Copilot on GPL code, you are obliged to release that derived product as GPL. If you trained it on MPL code, you are in violation of that license.

Even if you believe that the OpenAI Codex isn't a derivative source work, it is _definitely_ object code under the definition given in the GPL, which means they are still required to open source...

... "all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities."

@tindall don't worry, they tested it and it only gives back verbatim code that someone else wrote in 1 in every 1000 autocompletes

which is apparently a rare thing according to Microsoft

and also they had to specifically force it to not spit out the GPL if you autocompleted in an empty file because it had consumed so much GPL'd code


@tindall i'm waiting for someone to patiently explain to me why this isn't true in a way that sounds credible, but i think the answer boils down to: because copyright exists for the benefit of the megacorps, not in order to meaningfully restrain them.

@brennen @tindall Even better: it gets into the differentiation between algorithms and code expression of those algorithms, and I shudder to think that we're once again going to open this box.

courtroom, fantasy; (thread missing CW) software licensing, corporations 

@tindall .hg
prosecutor: What is the project's licensing model?
Google employee: It's open-source.
prosecutor: Where is the source code located?
Google employee: The code is in a git repository at android.googlesource.com. There are instructions for building it on the website at source.android.com.
prosecutor: Very well. Please download the source code onto this laptop and build the project.
Google employee:
Google employee: *sweating bullets*

@tindall no they are not. Whether they need to comply depends on them creating a work that is considered derived from a work to which they only had a license under the GPL, and on them having no other right to use the software for this purpose. But I don't see where training your model is different from, e.g., running a line count program and distributing your findings. Also, it is questionable whether the short snippets "reproduced" even constitute protectable work.

@tindall This is based on German copyright law, but should be similar enough to transfer to US law because copyright law is very similar around the world thanks to the revised Berne Convention on copyright. I study law and was often surprised how differently lawyers see the world. Computer people (hackers) tend to apply their technical knowledge to law problems, and it's very often very wrong.

@aurorus if you don't see how training a model that can reproduce code snippets or even whole files a nonnegligible percentage of the time is different from running a word count we are living in different realities.

@tindall I think they're bypassing that by the fact that they're not distributing the model itself, but instead running it on their servers with access through an api.

Now, of course, the training data probably does include *AGPL* code too.

@kepstin @tindall They're obviously transmitting the derived code to your text editor though, so I don't think they can hide behind that.

@ari and, nonetheless, there are AGPL sources in that corpus

@ari @tindall for sure - to get to the point where the suggestions this thing generates are legally usable, it really needs to tell you (or let you filter suggestions by) the license - and give you all attributions required by the license of the derived snippet.

And given that most models of this sort are "throw in a bunch of training data and see what comes out", getting that sort of structure isn't really possible as far as I know?

@kepstin @tindall Nope. As far as I know this is an area of anti-research, because if you start keeping paper trails of which input data caused you to decide what, you ruin the entire point of AI, which is to claim you're absolved of liability because the computer totally did it by itself.

@tindall my fear at the outset of Copilot is the result of the Oracle vs Google case.

I believe Microsoft will point to that, draw enough gray areas to confuse the courts and say something like "the code we trained our models on was APIs. the resultant generated code conforms to a similar API, but it is not the same API. technically, we're just doing what Google did to Oracle and you just said that was okay."

and if that's "not okay" then it means they could use AI to generate any conceivable API that doesn't exist and then copyright it?

idk. that's just my fear on how this shakes out, despite it being clearly derivative of the open-source works it was trained on.

@tychi @tindall upside: that also means training on a set of proprietary licensed source code can also be used as a path to generate free software, if true

@tindall @cwebber the entire area of AI-created anything is fuzzy. Usually, the best guess is that the person who programs/trains the AI has ownership. I know there's some effort in the EU going on to sort out these legalities, but since they'll have an impact on international copyright law as well, I expect this to not resolve very quickly.

@tychi yeah, i mean, it's possible. i don't think most courts are quite that easy to deceive, even for microsoft.
