oh my gods. they literally have no shame about this.

GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code for Codex/Copilot, regardless of license.


it's official, obeying copyright is only for the plebs and proles, rich people and big companies can do whatever they want

@tindall Did anyone ask for the GitHub source code yet? It should be GPL itself now, right?

@irimi1 unfortunately I don't think that really follows. Codex/Copilot should be, though.

@tindall I would expect someone to challenge that, I think it’s an interesting question.

Regardless of that I’d love to see consequences for a big company like Github that just takes all their public code, including projects with restrictive licenses, and makes it into a project of their own. Earning money on other people’s (or even their customers’) achievements.

@irimi1 there will be zero consequences. they are too rich and too entrenched in the ecosystem.

@irimi1 @tindall

I dunno. They also have very deep pockets. I could definitely see some enterprising law firm launching a class-action lawsuit on behalf of the owners of one or more source-under-glass repos.

@suetanvil @irimi1 @tindall i hope you're right

but has anyone ever sued on a GPL violation claim?

@tindall @irimi1

The knowledge base should be GPL, because it’s derived from GPL’d code.

@tindall @irimi1 I heard somewhere that GitHub employees used GitHub Copilot in their everyday work. So GitHub's own code may include parts of code under the GNU GPL. In that case I think GitHub should be open source, right?

@tindall @irimi1
Here's where I read about this: drewdevault.com/2021/07/04/Is-

Here's a quote from docs.github.com/en/github/copi
"During GitHub Copilot’s early development, nearly 300 employees used it in their daily work as part of an internal trial."

@tindall @irimi1 Is the issue that Github have trained their AI on publicly available text? I'm not sure that would be a copyright violation - after all the original work has merely been read, temporarily copied; it hasn't been redistributed under a false licence.

If mere reading and temporary processing can constitute copyvio, then disability readers, translation, search engines would be in trouble.

@jim @irimi1 The model is known to reproduce some code, including GPL-licensed code, verbatim; therefore, it must contain verbatim copies of that code, however it is encoded.

@tindall @irimi1 I see, interesting. It could boil down to how “generic” those snippets are then, whether the snippets contain an element of originality, or if they are approximately the most generic solution available, perhaps.

@jim @irimi1 the snippet in question is clearly, deeply original. it is a cursed coding crime that contains several "magic constants" with high entropy.

@tindall @jim @irimi1 Do you have a source for this BTW? I'd really like to learn more.

class action
class action
class action

we can only hope

@tindall @ultranova consider

feeding the leaked Windows source code into a Markov chain, then using that to write something useless, and let's see if Microsoft argues that this violates their copyright

@uint8_t @ultranova I really want to do this but I really don't want Microsoft to sue me

@tindall it isn't something I would personally do, but

@tindall @uint8_t @ultranova Finally, a good excuse to do some opsec LARPing and try to do something online that's untraceable to oneself . . . I mean for someone else to do, I definitely won't do that personally of course (I mean I almost certainly actually won't, I'm not lying, I'm very lazy and have too many nascent or half-finished projects already, but it's tempting . . .)

@uint8_t @tindall I feel you would need them to first win a class action, because then you have legal precedent :)

@ultranova @uint8_t not a class action - just one suit. But even then, getting sued sucks.

@tindall It's funny because I keep saying this and nobody believes me

@tindall I wish I was more flabbergasted by that response.

@tindall @irimi1

It was funny when they said the AI community considers it fair use to trawl publicly available data without a license, but more importantly the legal community considers it fair use too.

When a machine reads copyrighted content there is no engaged reader to enjoy and appreciate the content and that's ultimately what the copyright doctrine is about!

This is an excellent read, which brings up several concrete examples like Google Books and image search:


/via lobste.rs/s/nuve73
/via botsin.space/@lobsters/1065236…

@clacke @tindall @irimi1 As mentioned before, the issue is that Copilot most likely stores third party code in its knowledge base, as they generate snippets verbatim from existing codebases. This would meet the (A)GPL's requirement for derivative works, which would require the entire codebase to be released.

@tindall eh, no. There is a datamining exception that allows this kind of thing:

And it's important and useful, for scientists and investigative journalists.

It also happens to be useful for Microsoft GitHub Copilot here. And I share your frustration about this. The problem is: it's really difficult to make it not useful for the Microsofts of this world without also blocking a lot of scientific research and investigative journalism.

@tindall that is obviously still a conversation worth having, though!

Still, Microsoft Copilot does seem to infringe every now and then, when it quotes verbatim full passages from certain pieces of code:

*That's* where Microsoft needs to get smacked hard for copyright infringement and licensing violations!

@Shamar @rysiek @tindall @chebra I talked with her on Twitter (in German) and she wasn’t even aware that Copilot reproduced Quake’s Inverse Square Root Hack, including the “// What the fuck?” comment. And what she said about how no copyright would be better for copyleft is plain BS: then everybody would be able to distribute only binaries.

@rysiek could an AGPL-like license with a clause about AI models be useful here? I mean, could it make sense? If you train your AI with this, the model would be covered by the same AGPL-like license, and you'd have to release it.

Just wondering: if someone made the equivalent of GitHub Copilot but for music, using YouTube, would the songs made by the AI not be copyright infringement, even though some parts might be verbatim copies of the training material?


@eriol @rysiek The whole argument from the GitHub side is that training an AI on publicly readable works is fair use. All copyleft licenses - indeed, even MIT and other permissives - are based on copyright, so if GitHub is right, _no_ license will protect you.

@tindall @eriol and they are largely right. There was a huge fight around the datamining exception in the Copyright Directive about 2 years ago. The datamining exception stayed in the directive.

@rysiek @eriol The problem isn't the datamining exception itself, which is fine, but rather what we consider datamining. how much do you have to dress up verbatim copying for it to be acceptable?

@rysiek @tindall I agree on the datamining exception of course. So, as both of you say, IIUC the model extracted from the data can't be seen as a derivative work anyway, because it's like when someone makes a new discovery from a pre-existing set of data, right? The model by default is something new, right?

@eriol @tindall at least it should be, yes. The problem with Copilot is that its output is way too close to the original verbatim training data.

@eriol @rysiek That's the question. It's been demonstrated that the original content is in there in a very real way, since it can reproduce parts of it verbatim, but who knows if a court will see that as copying or not.

@tindall @eriol @rysiek If you want a test case, start GPLing your copilot output, particularly if it coincidentally contains quotes from some GPL hating megacorp. It's not hard to find candidates for the latter.

@rysiek @tindall @eriol For best effect, I guess you need to copilot-wash code from a project with a CLA, so that GPLing without copilot would be a challenge.

@eriol @tindall @rysiek i mean, the model itself is something new. However, if you take a painted picture and extract the style, then the style itself is not copyrighted, but if you teach that style to someone and they reproduce the image the style originated from, it should still be a copyright issue, because that's a derived work.

So while copilot is probably not under (A)GPL, its results might be. A legal minefield.

@sheogorath @eriol @tindall all creative work is derivative:

The question is where the line between derivative work and inspiration exactly lies.

@rysiek @eriol @tindall yep, and in the case of copilot, that's something courts should hopefully figure out soon. In the best case, in multiple jurisdictions with different results! 🍿

@rysiek @tindall @eriol This is well settled (see also the Turnitin 'ai') in the academic plagiarism context, maybe the conclusions there can carry over.

@eriol @tindall @rysiek Reminder: Software licenses are based on copyright law, which is based on the concept of creative work. Copyright infringement can happen with the *output* of a creative process, e.g. distributing code written with the help of Copyalot. Training an AI kernel is beside the point.

