
Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

source link: https://lwn.net/Articles/899530/


[Posted June 30, 2022 by jake]

The Software Freedom Conservancy (SFC) has issued a strong call for free software projects to give up GitHub and to move their repositories elsewhere. There are a number of problems that SFC has identified with the GitHub code-hosting service and, in particular, with its Copilot AI-based code-writing tool that was trained on the community's code stored in the company's repositories. Moving away from GitHub will not be easy, SFC said, but it is important to do so lest the free-software community repeat the SourceForge mistake.

Specifically, we at Software Freedom Conservancy have been actively communicating with Microsoft and their GitHub subsidiary about our concerns with "Copilot" since they first launched it almost exactly a year ago. Our initial video chat call (in July 2021) with Microsoft and GitHub representatives resulted in several questions which they said they could not answer at that time, but would "answer soon". [...] Last week, after we reminded GitHub of (a) the pending questions that we'd waited a year for them to answer and (b) of their refusal to join public discussion on the topic, they responded a week later, saying they would not join any public nor private discussion on this matter because "a broader conversation [about the ethics of AI-assisted software] seemed unlikely to alter your [SFC's] stance, which is why we [GitHub] have not responded to your [SFC's] detailed questions". In other words, GitHub's final position on Copilot is: if you disagree with GitHub about policy matters related to Copilot, then you don't deserve a reply from Microsoft or GitHub. They only will bother to reply if they think they can immediately change your policy position to theirs. But, Microsoft and GitHub will leave you hanging for a year before they'll tell you that!


Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 21:21 UTC (Thu) by bluca (subscriber, #118303) [Link]

Github is bad, it's a proprietary platform! Says the SFC on Twitter which is, er, what was the word... https://twitter.com/conservancy/status/1542503499094626304

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 22:20 UTC (Thu) by Trelane (✭ supporter ✭, #56877) [Link]

I'm not sure what you're trying to argue here. Would you mind stating your argument more directly?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 22:51 UTC (Thu) by Karellen (subscriber, #67644) [Link]

I think they're just trying to point out how very intelligent they are.

https://thenib.com/mister-gotcha/

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 23:23 UTC (Thu) by scientes (subscriber, #83068) [Link]

The real issue here is that git was always supposed to work in a decentralized fashion, but while git is quite easy to set up for sharing your repositories, even that is too much for many people.

One issue is NAT: it makes it much harder to share directly from your personal computer, so in practice you need to rent a VPS.
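Sharing a repository yourself is indeed only a couple of commands; a minimal sketch (assuming git-daemon is installed, and that your machine is reachable, i.e. not behind NAT; paths and host names are invented):

```shell
# Self-hosting over the anonymous git:// protocol.
GITROOT=$(mktemp -d)                      # stand-in for e.g. /srv/git
git init -q --bare "$GITROOT/project.git"
# Mark the repository as exportable to anonymous readers:
touch "$GITROOT/project.git/git-daemon-export-ok"
# Then serve everything under the base path (runs in the foreground):
#   git daemon --base-path="$GITROOT" --reuseaddr
# A client on a reachable network would clone with:
#   git clone git://your-host.example/project.git
```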

Also, it is sort of a semantic web thing.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 12:41 UTC (Fri) by tchernobog (subscriber, #73595) [Link]

git != GitHub, though. An ALM solution is not the same thing as a VCS.

Out of the box, git offers no issue tracking, CI pipelines, branch discoverability, or code-review UI: all things that are highly desirable to most software projects out there, open source or not. Having your code hosted in a safe environment that is easy to search across repositories is something many people are willing to pay for (either in money or, as is often the case, in liberty).

SFC's use of Twitter.

Posted Jul 1, 2022 13:23 UTC (Fri) by bkuhn (subscriber, #58642) [Link]

Karellen, I feel you could have made your point with less sarcasm, but raising the question of SFC's organizational use of Twitter is a reasonable thing to ask about. But I would encourage you in future to frame your inquiry with something like: “I find it somewhat hypocritical that SFC has called for folks to give up GitHub, but they aren't calling for folks to give up Twitter — and in fact SFC is using Twitter actively!” That would be a respectful way to raise your inquiry.

Speaking as the person at SFC whose primary job (as Policy Fellow) is to analyze, consider, and recommend policy on how we approach these proprietary-software situations, I'll note that we now live in a difficult and complex world where it has become increasingly difficult (at least in industrialized countries) to engage with communities and pursue the normal functions of life without interacting with proprietary software. Personally (outside of my work at SFC), I refuse to use Twitter as well. It was a difficult decision for SFC to continue using Twitter (which, BTW, I prefer to call Agrawaland — I used to call it Dorseyville (and I guess I'll be calling it Musktown soon?) — all to note that Twitter is not a democratic platform; it is a for-profit company's property under the autocratic control of its CEO).

My colleague Karen Sandler and I gave two keynotes (at FOSDEM 2019 and 2020, respectively) about the challenges FOSS activists face in choosing when to use or refuse to use proprietary software. These are hard issues to decide. In fact, we internally talked quite a bit while planning the Give Up GitHub campaign to determine if GitHub had crossed enough lines that they are substantially worse in their behavior than other proprietary software companies. We believe they are, which is why we launched the campaign, but we understand that you may have a different opinion.

Meanwhile, I'll put it on the agenda for future blog posts that I should write explaining how SFC came to the decision to keep participating in Agrawaland — particularly after the previous regime (Dorsey's) abruptly cut off the federation features (which led to identi.ca's demise). Thanks so much for your inquiry; this will make a useful blog post. I can't promise a timeline, as we have a lot of writing in the pipeline, but I will look into it!

SFC's use of Twitter.

Posted Jul 1, 2022 18:08 UTC (Fri) by Karellen (subscriber, #67644) [Link]

But I would encourage you in future to frame your inquiry with something like: “I find it somewhat hypocritical that SFC has called for folks to give up GitHub, but they aren't calling for folks to give up Twitter — and in fact SFC is using Twitter actively!” That would be a respectful way to raise your inquiry.

Um, the original inquiry was not mine? I'm a bit confused which parts of your response are directed at me, and which at bluca!

SFC's use of Twitter.

Posted Jul 1, 2022 19:24 UTC (Fri) by bkuhn (subscriber, #58642) [Link]

Karellen wrote:
> Um, the original inquiry was not mine? I'm a bit confused which parts of your response are directed at me, and which at bluca!

Yes, I'm so sorry for using your name, Karellen. I grabbed the wrong post. I was replying to bluca when I wrote this:

>> I feel you could have made your point with less sarcasm, but raising the question of SFC's organizational use of Twitter is a reasonable thing to ask about. But I would encourage you in future to frame your inquiry with something like: “I find it somewhat hypocritical that SFC has called for folks to give up GitHub, but they aren't calling for folks to give up Twitter — and in fact SFC is using Twitter actively!” That would be a respectful way to raise your inquiry.

SFC's use of Twitter.

Posted Jul 1, 2022 22:58 UTC (Fri) by Karellen (subscriber, #67644) [Link]

I grabbed the wrong post.

Ha! No problem, I've replied to the wrong person by accident myself plenty of times in the past, on various fora.

Thanks, and keep up the great work.

SFC's use of Twitter.

Posted Jul 2, 2022 7:00 UTC (Sat) by oldtomas (guest, #72579) [Link]

Actually, what looks like a tangent seems to me to be at the very core of the subject. Bluca's somewhat snotty remark (and your answer to it) have somehow struck gold, I think :-)

You write, and very correctly:

> I'll note that we now live in a difficult and complex world where it has become increasingly difficult (at least in industrialized countries) to engage with communities and pursue the normal functions of life without interacting with proprietary software.

(I'd disagree with the "industrialized" part: people in poorer countries are even more dependent on the "pay with your data" model)

Surveillance capitalism has learnt to interpose itself in our communication channels with other people, or with the world in general, be it perception (Google Glass [1]), hospitality (AirBnB), personal relationships (Facebook), small markets (Amazon, eBay), public communications (Dorseyland -- uh -- Twitter), you name it.

They just insert themselves into the channels to strip-mine and monetize all that huge potential which was "going to waste" before. Wild west, claims, land that didn't belong to anyone: all over again, yay! (Of course, we know that society as a whole pays some price. It ain't a zero-sum game. The dead Rohingya people, courtesy of Facebook, are one particularly bitter example.)

Coding is an inherently social endeavour.

As mpldr notes elsewhere in this comment section, what GitHub pulled off (in its pre-Microsoft phase) was to cast a social network over collaborative software development. The parallels to Facebook are chilling. That's what Microsoft shelled out ~$7B for. They are drowning in cash, sure, but this is a significant amount, even for them. They didn't do this out of the goodness of their hearts.

Personally, I'm far more worried by this than by the questions about the license status of software snippets shovelled around by some NLP AI. Although the latter question is also quite important (and thorny), and I'm happy SFC is taking it on.

Keep up the good work!

[1] Some might interject that one's dead. This instance is, but the breed ain't

SFC's use of Twitter.

Posted Jul 3, 2022 22:39 UTC (Sun) by alfille (subscriber, #1631) [Link]

Perhaps I'm being a little tangential here, but giving "ironic" names (Agrawaland, Musktown, etc.) to someone else's product is also disrespectful and doesn't raise the level of discourse.

SFC's use of Twitter.

Posted Jul 4, 2022 4:51 UTC (Mon) by oldtomas (guest, #72579) [Link]

I think in this case it's justified: it reminds us of the problematic fact that we, in societies which call themselves "democratic", are more and more outsourcing vital infrastructure to corporations whose governance is anything but.

SFC's use of Twitter.

Posted Jul 4, 2022 16:03 UTC (Mon) by ttuttle (subscriber, #51118) [Link]

I hate this. Whether or not I care about someone disrespecting the product, it's obnoxious:

In a conversation about the merits of the product, it's lazy -- it's a way to smear the product without giving a proper explanation.

In a conversation about something else, it's rude -- it's a way to push the speaker's opinion about the product even when it's irrelevant or distracting.

SFC's use of Twitter.

Posted Jul 5, 2022 22:34 UTC (Tue) by hummassa (subscriber, #307) [Link]

Is it always, though? I'm not sure. If you consider it a commentary on the autocratic governance of such hypermedia giants, it's neither unexplained (the reason for your annoyance #1), nor irrelevant (annoyance #2.1), nor distracting (#2.2).

Maybe your annoyance with such discourse comes from lacking the usage of the https://en.wikipedia.org/wiki/Principle_of_charity -- a principle that is very useful in a respectful and beneficial dialogue. "Be strict on what you produce and lax on what you consume", like Unix :-) ...

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 22:50 UTC (Thu) by josh (subscriber, #17465) [Link]

Says Conservancy on their blog (hosted on Free Software), and on federated services, which they also mirror to Twitter to reach a wider audience. Which is consistent with their suggestions to use GitHub as a mirror, if you can't switch away entirely.

https://i.kym-cdn.com/photos/images/original/001/259/257/...

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 23:27 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]

> Github is bad, it's a proprietary platform! Says the SFC on Twitter

The primary focus of this blog post (which itself is free software, aggregated on a proprietary platform, which is permissible in their view) is not the proprietary nature of the platform itself but the implications of Copilot. I suspect you don't agree with the criticism of Copilot; however, deflecting from it in this way is unhelpful.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 0:10 UTC (Fri) by bluca (subscriber, #118303) [Link]

'twas just an attempt at a silly joke - I read the post first on Twitter, and found the juxtaposition funny. Evidently it was only funny to me, and that's ok

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 3:46 UTC (Fri) by Trelane (✭ supporter ✭, #56877) [Link]

Ah, ok. I totally did not get the joke. :)

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 6:55 UTC (Fri) by josh (subscriber, #17465) [Link]

Poe's Law in action. It's hard to tell the difference between a joke and a snide comment.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 10:05 UTC (Fri) by bluca (subscriber, #118303) [Link]

Especially when it's a bad one ;-) My future career as a comedian is in tatters!

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 8:58 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

Are you working at microsoft? https://github.com/bluca

It seems like something you should disclose, since you are all over the comment section defending Copilot (with what I think are fundamentally wrong arguments).

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 11:07 UTC (Mon) by bluca (subscriber, #118303) [Link]

"disclose"? Mate this is a subscriber-only news item comment section, not the Supreme Court

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:34 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

> "disclose"? Mate this is a subscriber-only news item comment section, not the Supreme Court

You have 24 comments on this page defending Copilot.

It is not a legal requirement that you disclose that you work for Microsoft, but it would be more honest.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:46 UTC (Mon) by bluca (subscriber, #118303) [Link]

If you remove the tinfoil hat you'll realize the two things have nothing to do with each other

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 15:36 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

Then why not be the bigger person next time and just say so directly yourself?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 16:30 UTC (Mon) by bluca (subscriber, #118303) [Link]

There are all sorts of completely irrelevant things that go unsaid. I don't care about reading your (or anyone else's) autobiography when reading your comments on an LWN article.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 21:36 UTC (Thu) by mpldr (subscriber, #154861) [Link]

We all knew where Microsoft buying GitHub would lead… don't kid yourselves. They tried to make code a social network (and in a way they unfortunately succeeded), and what always comes with social networks run by large corporations? Predatory monetization schemes and non-standard-compliant behaviour (to this day, releases are lightweight tags, presumably to drive users to GitHub to read the changelog).

I would love to see people move away from that platform to something more interested in actually being free (and, just a completely crazy idea: maybe built on an open ecosystem). My projects have only been mirrored to GitHub for discoverability, but I am seriously reconsidering whether I should waste CI seconds on it for no benefit except Microsoft's.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 22:09 UTC (Thu) by nix (subscriber, #2304) [Link]

> to this day releases are lightweight tags, presumably to drive a user to GitHub to read the changelog

I have never heard of any project ever putting its changelog in an annotated git tag. It's just not what anyone does, whether their project uses GitHub or not. Changelogs go in git log | git shortlog, or in files in the repository itself. (And GitHub is perfectly happy to make annotated tags into releases, or was last time I tried -- my objection there is that you have to form your release tag names in a particular, highly stereotyped, frankly unusual way, or you get ridiculously named files in your release tarballs.)

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 22:23 UTC (Thu) by mpldr (subscriber, #154861) [Link]

https://git.sr.ht/~sircmpwn/aerc/refs as an example. It's just some prose plus git shortlog, but it has the advantage that you don't have to look up what the previous tag was; you can instead just look at the specific changelog with git show v0.5.0

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 7:46 UTC (Fri) by Sesse (subscriber, #53779) [Link]

I have my changelogs in NEWS and copy them to the annotated tag.
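A minimal sketch of this NEWS-in-the-tag approach: an annotated tag carries its own message, so the release notes travel with every clone of the repository (the version number and notes below are invented):

```shell
# Store release notes in an annotated tag, then read them back.
cd "$(mktemp -d)" && git init -q .
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "initial"
# Tag the release, pasting the changelog (e.g. the new NEWS entry) as the message:
git -c user.name=demo -c user.email=demo@example.org \
    tag -a v0.5.0 -m "v0.5.0: fix frobnication; drop legacy config format"
# Anyone with a clone can now read the notes for exactly that release:
git show v0.5.0
```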

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 5:29 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

> I would love to see people moving away from that platform to something more interested in actually being free (and just a completely crazy idea: maybe built on an open ecosystem).

Who will bell the cat?

More prosaically: What open platform are you proposing we use instead of GitHub?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 10:36 UTC (Fri) by mpldr (subscriber, #154861) [Link]

I personally like the email-based approach of Sourcehut very much and would recommend everyone who hasn't tried it to do so. Its performance is way better than GitHub's (both the web frontend and uptime-wise) and it has a payment model I find very sensible (i.e. please pay if you can; if you can't, reach out and something can be sorted out).

As always: try before you buy; maybe that approach is not for you, which is completely fine. If it's not for you, check out Codeberg and git.disroot.org

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 12:32 UTC (Fri) by bluca (subscriber, #118303) [Link]

Email workflows are absolutely terrible for anybody who is not already used to email workflows. And, in some cases (like me), even for those who are.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 17:32 UTC (Fri) by mpldr (subscriber, #154861) [Link]

I agree insofar as a UI is needed, especially for code reviews, at least for seeing all comments at once; in most cases, though, switching to email was rather pain-free for me. The only somewhat new thing was using send-email and am correctly.
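The send-email/am round trip can be sketched end to end (the repository, file names, and addresses below are invented; the actual mailing step is commented out because it needs SMTP configuration):

```shell
# Contributor formats a patch; maintainer applies the emailed mbox with git am.
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.org
git commit -q --allow-empty -m "base"
echo "hello" > greeting.txt && git add greeting.txt
git commit -q -m "add greeting"
git format-patch -1 -o outbox/          # contributor: one mbox-format .patch file
#   git send-email outbox/*.patch       # ...would mail it to the list
git reset -q --hard HEAD~1              # now play the maintainer's side
git am -q outbox/*.patch                # apply the emailed patch, metadata intact
git log --oneline -1                    # the "add greeting" commit is back
```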

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 17:51 UTC (Fri) by bluca (subscriber, #118303) [Link]

Responding to old threads from before you were subscribed (most mailing-list archives do not show the Message-ID, so even if you know what that is, you can't reply), tracking the status of things, mountains of spam, corporate email servers mangling plain-text emails, etc. - these are all extremely painful with projects using email, and trivially easy and intuitive on GitHub or GitLab.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 22:34 UTC (Fri) by mpldr (subscriber, #154861) [Link]

> most mailing list archives do not show message-id so even if you know what that is, you can't reply

The message ID is usually in the URL, but (at least nowadays) there's usually a "Reply to thread" button.

> tracking status of things

That's kind of a "you" problem. Some people do, and some people don't. And those who don't can usually follow along better through the web UI and use the aforementioned "Reply to thread" button.

> mountains of spam

Reject text/html and you don't have spam.

> corporate email servers mangling plain text emails

What in the everloving bleep?! No mail server should modify an email's content unless there's a good reason (like a virus), because that will mess up PGP or – in the corporate world – S/MIME signatures, so I somewhat doubt there's a lot of this behaviour.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 22:56 UTC (Fri) by bluca (subscriber, #118303) [Link]

> The message ID is usually in the URL, but (at least nowadays) there's usually a "Reply to thread" Button.

Most mailing list archives I see have neither.

> That's kind of a "you" problem. Some people do, and some people don't. And those who don't can usually follow better through the WebUI and use the aforementioned Reply to thread button

Well no, it's not a 'me' problem. It's blindingly obvious whether a PR on GitHub has been merged or not. Instead, open a random patch email in a mailing-list archive and try to guess.

> Reject text/html and you don't have spam.

Have you actually ever seen a mailing list? I'm starting to doubt it, with statements like that.

> What in the everloving bleep?! No mailserver should modify an emails content unless there's good reason (like a virus) because that will mess up PGP or – in corpo world – S/MIME signatures, so I somewhat doubt there's a lot of this behaviour.

What they 'should' do according to you is irrelevant - the vast majority of corporate email servers do exactly that, and there's diddly squat you can do about it as an employee.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 9:16 UTC (Sat) by ddevault (subscriber, #99589) [Link]

Since this thread is ostensibly about SourceHut, I can comment on the specifics of SourceHut's mailing list UI. We have a "reply to thread" button, and we also have a "forward this thread to me" button, which both make it pretty easy to participate in old discussions. We also have mbox exports for the past 30 days of a list, or the complete archives, which you can pull into your mail client and retroactively filter into folders or whatever you wish.

SourceHut also tracks the review status of patches:

https://lists.sr.ht/~sircmpwn/hare-dev/patches/

We also take responsibility for managing spam across the whole site, and remove that burden from list maintainers. Spam is exceedingly rare on SourceHut -- I think I've only ever seen one spam email make it past our filters in the past 3 years.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 13:45 UTC (Sun) by vimpostor (guest, #159442) [Link]

> SourceHut also tracks the review status of patches: https://lists.sr.ht/~sircmpwn/hare-dev/patches/

That looks really nice, but I wonder how Sourcehut actually knows whether a patch has been applied? Surely it can only reliably detect this if you apply it using the web interface.

I, however, like to apply from the command line, and sometimes edit the patch before applying it, for minor code-style changes. Obviously this changes the patch, so how would Sourcehut know in that case that the patch was applied?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 13:55 UTC (Sun) by ddevault (subscriber, #99589) [Link]

Right now it's done semi-manually. You can change the patchset status through special email headers (I have my mail client configured to add these when I reply to say "thanks" for the patch), or through the API.

But we intend to make it automatic. The essential heuristic is the commit date, which matches the Date header and survives amending and rebasing.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 15:17 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

> The essential heuristic is the commit date, which matches the Date header and survives amending and rebasing.

Wouldn't the author date be the one to trust?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 15:35 UTC (Sun) by ddevault (subscriber, #99589) [Link]

Right, my mistake.
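The author date is indeed the field that survives history rewriting; a small illustration of why it makes a stable key (this is hypothetical demonstration code, not SourceHut's actual implementation):

```shell
# The author date survives an amend, while the committer date changes.
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.org
GIT_AUTHOR_DATE="2022-07-01T12:00:00+0000" git commit -q --allow-empty -m "patch"
before=$(git log -1 --format=%aI)       # author date, strict ISO format
git commit -q --amend --allow-empty -m "patch v2"
after=$(git log -1 --format=%aI)
[ "$before" = "$after" ] && echo "author date unchanged across amend"
```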

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 23:48 UTC (Sat) by mpldr (subscriber, #154861) [Link]

> Well no, it's not a 'me' problem. It's blinding obvious if you open a PR on Github whether it has been merged or not. Instead open a random patch email on a mailing list archive and try to guess.

I'd say the Linux kernel, Git, various GNU projects, and pretty much all Sourcehut projects tell a different story.

> Have you actually ever seen a mailing list? I'm starting to doubt it, with statements like that.

Sourcehut lists (chef's kiss), Mailman, Google Groups… I've gotten around a bit; I may not have seen all possible solutions, but at least some.

> What they 'should' do according to you is irrelevant - vast majority of corporate email servers do exactly that, and there's diddly squat you can do about it as an employee.

Then I'm just glad that I have not had the displeasure of experiencing this… it has mostly been Outlook/O365 and Gmail with custom domains so far.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 12:20 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

> I'd say the Linux kernel, Git, various GNU projects, and pretty much all Sourcehut projects tell a different story.

I beg to differ. I didn't know *my own patches* had been pulled until I manually did a `git log --author=` check out of curiosity and found that my patches had finally made it. This is with Linux and Git, at least.
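That check looks like this in miniature: filter upstream history by author to see whether your patches landed (the names and commit messages below are invented):

```shell
# Simulate an upstream with commits from two authors, then filter by author.
cd "$(mktemp -d)" && git init -q .
git -c user.name="Maintainer" -c user.email=m@example.org \
    commit -q --allow-empty -m "merge window churn"
git -c user.name="Contributor" -c user.email=c@example.org \
    commit -q --allow-empty -m "fix: handle NULL frobnicator"
# After fetching the upstream branch, this shows only your own commits:
git log --oneline --author="Contributor"
```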

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 13:01 UTC (Sun) by brunowolff (guest, #71160) [Link]

> What in the everloving bleep?! No mailserver should modify an emails content unless there's good reason (like a virus) because that will mess up PGP or – in corpo world – S/MIME signatures, so I somewhat doubt there's a lot of this behaviour.

I complain about this every couple of months. We use O365 and have enabled Safelinks. Safelinks corrupts email messages by replacing what appear to be URLs with proxy URLs for the same resource. It does this in at least text/html and text/plain parts. The replacement has a pattern that you can use to undo it in your mail reader (using a preprocessing script) with a low false-positive rate.

Another broken O365 feature is temporarily replacing attachments with dummy ones while the originals are scanned for viruses. If you notice this in time, you can go back and undelete the message (even if it was expunged) and get the attachment after it has been cleared.

There are other broken features of this service not related to corrupting messages as well. I think they intentionally support only limited uses of email, which they think are common, and don't care much about whatever they break for less common cases. Good luck trying to get an exception from your security people to opt out of this brokenness, even if, in your case, the threat is extremely small and the brokenness causes more grief on average.

There is more brokenness coming. They recently notified people about an attachment-blocking feature that purports to block attachments of types on a blocklist, without providing a definition of how they determine an attachment's type. They don't say whether they use Content-Type, Content-Disposition (the filename), and/or actually scan the attachment to determine the type, nor how that data maps to their list, which did not use standard MIME type names. For people using the web interface there is more brokenness related to charsets, and no support for format=flowed.
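The "undo it with a preprocessing script" step can be sketched in bash. The wrapped URL below is a made-up example following the Safelinks pattern (the original target is carried, percent-encoded, in the url= query parameter); real messages may need more robust parsing:

```shell
# Recover the original link from a Safelinks-style proxy URL (bash).
wrapped='https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flwn.net%2FArticles%2F899530%2F&data=05|01|'
encoded=${wrapped#*url=}        # drop everything up to url=
encoded=${encoded%%&*}          # drop the trailing &data=... blob
original=$(printf '%b' "${encoded//\%/\\x}")   # decode %XX escapes
echo "$original"                # https://lwn.net/Articles/899530/
```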

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 23:29 UTC (Sat) by ceplm (subscriber, #41334) [Link]

One side note: I have learned today that SourceHut (being hosted on Cloudflare) doesn't allow access via tor.

Could anybody confirm?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 23:39 UTC (Sat) by mpldr (subscriber, #154861) [Link]

I'm not sure what led you to think that Sourcehut is running on Cloudflare. sr.ht is at 173.195.144.0/20, which is managed by Data102. In fact, Sourcehut Pages explicitly denies traffic from Cloudflare proxies, so it would be unreasonable to assume they'd use it themselves. Sourcehut is also available via Tor (at least with my current exit node; maybe the one you were using happened to be abused).

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 6:55 UTC (Sun) by ceplm (subscriber, #41334) [Link]

OK, then I am sorry. The current maintainer of https://sr.ht/~martanne/vis/ (@ninewise on Libera IRC) complained that it didn't work for him.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 22:10 UTC (Thu) by bpearlmutter (subscriber, #14693) [Link]

How about deploying tools that generate bad code and upload it to GitHub, complete with forking repos, starring, following, downloading, filing and resolving issues: all sorts of traffic? People could run it as a daemon.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 23:16 UTC (Thu) by dullfire (subscriber, #111432) [Link]

IANAL, but it sounds like it could lead to CFAA charges.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 13:07 UTC (Fri) by bpearlmutter (subscriber, #14693) [Link]

Perhaps it is a general-purpose testing tool for checking how a GitLab system stands up to substantial load, for use in regression testing and tuning of GitLab instances. It could also back-end onto other similar systems: Gitorious, GitHub, etc.

Not our fault if people use it for other purposes.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jun 30, 2022 23:48 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

How about a tool that would subscribe to FSF projects' mailing lists and post fake patches? Perhaps generated by the AI.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 1:31 UTC (Fri) by sfeam (subscriber, #2841) [Link]

Makes total sense that an AI would subscribe to a forum devoted to software freedom! Google's LaMDA should join also.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 0:48 UTC (Fri) by bluca (subscriber, #118303) [Link]

I find this a very weird rant coming from the SFC, which usually does great work. Let's look at the three questions which are supposed to be the smoking gun:

> 1. What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO, that: “(1) training ML systems on public data is fair use, (2) the output belongs to the operator, just like with a compiler”? In the interest of transparency and respect to the FOSS community, please also provide the community with your full legal analysis on why you believe that these statements are true.

I assume this must be borne out of a US-centric view - there's no need to invoke fair use here in Europe; data mining on publicly available text and data bodies is exempt from copyright rules as per the copyright directive from a couple of years back. Whether a repository is proprietary or under a FOSS license is completely irrelevant: anyone can data mine all day long, as long as it's publicly and legally accessible.

> 2. If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?

It is not trained only on FOSS. It's been stated many times that it's trained on what is publicly available on Github (which includes proprietary repositories with no license) because... that's what the law allows. Moreover, it is common knowledge that Windows and Office are not stored on Github, so they aren't even accessible to it.

> 3. Can you provide a list of licenses, including names of copyright holders and/or names of Git repositories, that were in the training set used for Copilot? If not, why are you withholding this information from the community?

This makes no sense as a question. The license, if any, is in each repository on Github. What does it mean that it is being "withheld"? It's all public repositories that anyone can freely clone without even an account...

All in all, Github provides an absolutely fantastic service, free for OSS maintainers and including CI time, and now with fancy autocomplete as an extra feature, wrapped in a very nice interface - although that's obviously subjective. Gitlab's interface comes very close (and for some things it's even better), but the huge fragmentation (having to set up and use dozens of accounts, one for each project that has an instance, is just a pain) coupled with no free CI for OSS projects means it's just not enough. Everything else outside of these two is just atrocious to use for anybody who hasn't got used to it (and sometimes not even then - looking at you, Gerrit).

Given all of this, personally I find these appeals very unconvincing. It's a proprietary service... so what? So is my internet connection, my phone service, my bank's website, the local tram ticket machine... these are all running on somebody else's machines and providing a service for external users, and if the owners are fine with running proprietary software on them, I certainly won't lose any sleep over it. It's my machines that I care about, so that I can tinker with them, fix issues that bother me, and so on.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 5:32 UTC (Fri) by rsidd (subscriber, #2582) [Link]

These are all fair points. Github and its servers are based in the US, so US law, not EU law, would apply. But the reality is that copyright claims on AI-derived work have not been tested. If copilot emits a code snippet that is identical or highly similar to a copyrighted work, is that copyright infringement? If you ask an AI art machine to generate artwork of a soup can, and it outputs something similar to an Andy Warhol (because that's in the training set), is that infringement? How similar is too similar?

For developers and creators, this is a legal minefield and in my opinion people should stay away from copilot in any work that is going to be shared with others. But it is not a reason to boycott github.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 5:52 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

There are problems at both ends. For example, suppose a music-generating AI (not Copilot) generates a song which is very similar to John Cage's 4'33", but has a few extra notes. Does that infringe? The problem is, 4'33" consists entirely of rests, so it's hard to say whether it's even possible to infringe on that piece in the first place. Obviously, that's an extreme example, but it is analogous to the problem of "boilerplate" code. Nobody owns for(int i = 0; i < MAX; i++), either.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 13:36 UTC (Fri) by dan_a (subscriber, #5325) [Link]

> The problem is, 4'33" consists entirely of rests, so it's hard to say whether it's even possible to infringe on that piece in the first place.

An out of court settlement suggests it might be: https://www.billboard.com/music/music-news/musician-settl...

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 7:31 UTC (Fri) by farnz (subscriber, #17727) [Link]

I'm just going to note that I'd expect caselaw around humans remembering code snippets and reproducing them to be relevant here: it's not the worst analogy for what such a machine learning model does, and it's reasonably likely that at some point, there have been copyright cases based around the human ability to remember something they've seen before and reproduce it.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 8:07 UTC (Fri) by rsidd (subscriber, #2582) [Link]

Not in the software domain, but George Harrison was famously found guilty of "subconscious plagiarism" in his "My Sweet Lord", which resembled the Chiffons' "He's So Fine" which he had undoubtedly heard. This was quite blatant, so the ruling was against him; a short tune fragment, like a code fragment, may escape consequences (jazz musicians consciously "quote" tune fragments all the time and aren't expected to pay royalties on the quoted bits).

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 8:33 UTC (Fri) by Karellen (subscriber, #67644) [Link]

> 1. What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO, that: “(1) training ML systems on public data is fair use, (2) the output belongs to the operator, just like with a compiler”? In the interest of transparency and respect to the FOSS community, please also provide the community with your full legal analysis on why you believe that these statements are true.

> I assume this must be borne out of US-centric view - there's no need to invoke fair use here in the Europe, data mining on publicly available text and data bodies is exempt from copyright rules as per the copyright directive from a couple of years back. Whether a repository is proprietary or under an FOSS license, it's completely irrelevant, anyone can data mine all day long, as long as it's publicly and legally accessible.

I note that you (along with many other CoPilot defenders) always focus heavily on the data mining (or "model training") side of the legal implications, and tend to ignore or gloss over the code generation side of things.

I have no issues at all with anyone gathering, analysing and performing computations on whatever FOSS source code they can get their hands on. It's out there with a license that explicitly states you're free to read it, analyse it and learn from it, for your own benefit.

Where I do have issues is where CoPilot outputs source code which is distributed to others. I fail to understand how the source code it produces can not be considered a "derivative work" of its source code inputs, as without those inputs it would produce no output at all. And producing and distributing a derivative work does require a license - or (as SFC ask for) some kind of explanation why the distributor feels a license is not needed.

It is strange that CoPilot's authors invoke the comparison with a compiler, where the output is owned by the operator, because that's only true if the inputs are owned by the operator. You can't run someone else's source code through a compiler and then claim copyright ownership of the object code just because you invoked the compiler. I am not a copyright lawyer, but I know that isn't how copyright law works.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 10:12 UTC (Fri) by bluca (subscriber, #118303) [Link]

I am neither a judge nor a lawyer - the obligatory disclaimer.

To me it seems pretty obvious: the work is not consumed under the terms of the license, whatever it might be, so the license doesn't apply to anything that is produced from it. If there's a dual-licensed project, GPL+commercial (as is quite common), and I buy the commercial license, anything I do with it is not affected by the terms of the GPL, because that's not how I got the project. In the same way, TDM copyright exceptions are what allow me to train a model on anything publicly accessible, which means I do not see how any claims that the output of the model is subject to the original licenses of the input hold water. The original license is irrelevant, because the law gives me an exception. That is a good thing, by the way; we need more exceptions to our ever-more-draconian copyright laws.

Now, on the question of whether the output of the model is a derived work - under copyright law, not under the terms of whatever the original license was - that sounds complicated, but it definitely does not seem as clear cut as the "Infringement!" of some maximalist takes makes it sound. When Copilot was first announced, Felix Reda (who was actually an MEP when these laws were written) wrote an excellent article that touched on this, and it still applies today:

https://felixreda.eu/2021/07/github-copilot-is-not-infrin...

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:06 UTC (Fri) by Karellen (subscriber, #67644) [Link]

Thanks for the link, it's a very interesting read.

Going to the "Machine-generated code is not a derivative work" section, I don't think it's as clear-cut as the author of that piece makes out.

Firstly:

> copyright conflicts would constantly arise when two authors use the same trivial statement independently of each other, such as “Bucks beats Hawks and advance to the NBA finals”, or “i = i+1”. The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality.

I think there's an important difference here. Obviously, it's possible for two small snippets of work to be identical, and still have been generated independently, with neither snippet having its origins in the other. But the output of CoPilot does have its origins in the code it is trained on. It has colour.

From the classic 2004 essay (written by a lawyer) What Colour are Your Bits?:

> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!

> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from. The scrambled file still has the copyright Colour because it came from the copyrighted input file. It doesn't matter that it looks like, or maybe even is bit-for-bit identical with, some other file that you could get from a random number generator. It happens that you didn't get it from a random number generator. You got it from copyrighted material; it is copyrighted. The randomly-generated file, even if bit-for-bit identical, would have a different Colour. The Colour inherits through all scrambling and descrambling operations and you're distributing a copyrighted work, you Commie Mutant Traitor.

Emphasis in original - it matters where the bits came from. But the whole thing is worth reading, if you've not seen it already.

Secondly:

> On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either. The output of a machine simply does not qualify for copyright protection – it is in the public domain. That is good news for the open movement and not something that needs fixing.

(Emphasis mine.) Going back to the compiler analogy, this paragraph seems to imply that the output of compilers does not qualify for copyright protection - which is clearly absurd. And just because CoPilot doesn't produce output which corresponds to all of its input, that shouldn't matter either. Compilers throw away comments. And dead code. And redundant instructions. (Given suitably clever optimisation passes.) But that machine-generated output still qualifies for copyright protection.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:23 UTC (Fri) by Karellen (subscriber, #67644) [Link]

Correction: The author of "What Colour are Your Bits?" is not a lawyer. I got confused where they said "We lawyers think about colour..." in a place that was not a (hypothetical) quote. My bad. I think the essay still stands though.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:41 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Thanks for the link, it's a very interesting read.

> Going to the "Machine-generated code is not a derivative work" section, I don't think it's as clear-cut as the author of that piece makes out.

Assuming the law were applied correctly (which it usually isn't :-(), machine-generated code is a derivative work (it's a translation) of the original input. The transformation applied by the machine does not create or destroy copyright. So the machine-generated output is, FOR COPYRIGHT PURPOSES, IDENTICAL to the original input.

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:37 UTC (Fri) by Wol (subscriber, #4433) [Link]

> To me it seems pretty obvious: the work is not consumed under the terms of the license, whatever it might be, so the license doesn't apply to anything that is produced from it.

But as I keep hearing repeated, to consume the work outwith the licence IT HAS TO BE FOR ACADEMIC / LEARNING PURPOSES.

So if Copilot is used to produce academic research papers or teaching material (ie books etc), then that's fine.

But if it's used to provide programming prompts and snippets of code to copy and use, THEN THE EXCEPTION DOES NOT APPLY, AND COPYRIGHT DOES APPLY.

So ANY AND ALL code supplied to the general populace is suspect. In other words, if I work for any programming shop, be it software house or end user, and I incorporate code from Copilot into my work, that code is copyright the original author. And that author's copyright applies! Which means I damn well better know where Copilot got it from!!! Okay, many snippets may be too small for copyright to apply, but that's a completely different argument.

tldr; if you're using Copilot to help you WRITE code (as opposed to providing you with study material), you are almost certainly breaking Copyright Law.

And if you're using Copilot to provide study material you're an idiot. It's teaching you the consensus method, not the correct method.

So just don't use copilot :-)

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 12:37 UTC (Fri) by bluca (subscriber, #118303) [Link]

> But as I keep hearing repeated, to consume the work outwith the licence IT HAS TO BE FOR ACADEMIC / LEARNING PURPOSES.

That is factually wrong, and using all caps doesn't make it right. Consuming legally accessible public corpora for TDM is allowed for any purpose under the EU directive. The only difference is that academic institutions are allowed to ignore generic opt-outs.

There is currently no mechanism to express such an opt-out, as you can for scrapers with robots.txt. The W3C is working on a common spec for that: https://www.w3.org/2022/tdmrep/
Of course it's a generic opt-out; you can't pick and choose the parsers you don't like.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 14:50 UTC (Fri) by bluca (subscriber, #118303) [Link]

Let's quote verbatim to nip this in the bud once and for all:

> TITLE II
>
> MEASURES TO ADAPT EXCEPTIONS AND LIMITATIONS TO THE DIGITAL AND CROSS-BORDER ENVIRONMENT
>
> <...>
>
> Article 4
>
> Exception or limitation for text and data mining
>
> 1. Member States shall provide for an exception or limitation to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, Article 4(1)(a) and (b) of Directive 2009/24/EC and Article 15(1) of this Directive for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining.
>
> 2. Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.
>
> 3. The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.
>
> 4. This Article shall not affect the application of Article 3 of this Directive.

https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790&from=EN

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 15:05 UTC (Fri) by Wol (subscriber, #4433) [Link]

In that case, does that mean copilot is over-reaching itself? Okay, in the EU data is covered differently from text, but the exception appears to be for DOING the mining.

Which search engines like Google will love. It (and I'm quite happy with this) makes it legal for them to have huge search databases.

But there's a very big difference between using that mined data to direct people back to the original source document, and outputting something based on that source (essentially creating a derived document) to be passed on to a third party without the first party knowing anything about it.

Maybe the grounds for feeling that way have changed, but I still feel that actually *using* the output from Copilot for pretty much anything other than study is a very dangerous occupation, and maybe even using it for study ...

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 15:10 UTC (Fri) by bluca (subscriber, #118303) [Link]

During the proceedings legislators explicitly talked about AI/ML applications and development benefiting from the changes and clarity brought forward by this directive. So no, it's definitely not "over-reaching" to use this beyond indexing purposes.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 15:18 UTC (Fri) by ldearquer (subscriber, #137451) [Link]

Not an expert, but I think you may be confusing training and real-world operation.
For training, it seems OK to use whatever is lawfully accessible, and to retain copies for as long as training lasts, for that purpose.
But I don't see how that adds any copyright exception for real-world usage of your trained system.

As an example, if you have a neural network that identifies the music style of an input song, I understand you may use copyrighted stuff for training your system. In real world usage, a user may input some song, and your system may respond "that's likely country music". But the inverse, where user input is "country music" and your system starts giving away excerpts of copyright protected songs...

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 22:58 UTC (Fri) by bluca (subscriber, #118303) [Link]

'real world operation' doesn't really mean much. The output of Copilot is quite clearly either transformative (and as such not a copyright issue) or so small that it most likely doesn't qualify for originality. In either case, the original license is irrelevant, so the outrage about supposed 'GPL infringement' is moot.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 9:58 UTC (Sat) by ballombe (subscriber, #9523) [Link]

This is not established. Nothing prevents copilot from returning verbatim copies of files it learned from. And renaming variable names is not transformative.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 13:12 UTC (Sat) by bluca (subscriber, #118303) [Link]

Verbatim copies of files? Never seen that? Before you paste the link to the usual inverse square root gif, that's not a file, and see another reply above about it.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 17:09 UTC (Sat) by ballombe (subscriber, #9523) [Link]

You are reversing the burden of proof. It is up to Microsoft to prove this cannot happen, not us.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 21:45 UTC (Sat) by kleptog (subscriber, #1183) [Link]

> You are reversing the burden of proof. It is up to Microsoft to prove this cannot happen, not us.

That's ridiculous. You can't prove a negative. It's the same as asking someone to prove they're not beating their spouse.

If Microsoft came out with a statement that as far as they can tell it doesn't happen, people will just claim they're lying. The only relevant evidence is if someone comes up with actual examples.

This is even leaving aside code formatters like Black, which are so opinionated it's almost to the point that for any piece of code there is only one way it can be formatted, so you couldn't even tell the difference between an actual copy and an accidental one if you wanted to.

If you take a step back and think about what it would take to build such an AI model, if the model has any understanding of the structure of code, there's no reason at all to think that it will randomly copy entire blocks of text literally from the input. It's going to be working at a completely different level.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 15:13 UTC (Sun) by ballombe (subscriber, #9523) [Link]

> If you take a step back and think about what it would take to build such an AI model, if the model has any understanding of the structure of code, there's no reason at all to think that it will randomly copy entire blocks of text literally from the input. It's going to be working at a completely different level.

Nobody outside MS really knows how copilot actually works, so you cannot make any claims about it. 'AI model' is just a buzzword.

I do not see how the Math.isPrime example can occur without literal copying.
Copilot seems to be less transformative than a C compiler generating machine code, and so far binaries have always been considered derivative of the source.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:38 UTC (Mon) by nye (guest, #51576) [Link]

> Verbatim copies of files? Never seen that?

This back and forth kind of misses the point IMO. If copilot outputs something which is a verbatim copy of a substantial piece of code, then *of course* it shouldn't magically have had its copyright removed. Similarly, if a person with an exceptionally accurate memory writes down some copyrighted code that they memorised last year, the fact that they didn't literally copy/paste it has no real bearing on its copyright status. It feels like this shouldn't be controversial.

It seems you assert that it isn't or shouldn't be possible for copilot to do this, but however accurate that is, I don't think it's particularly important - partly because it's hard to prove and partly because it could be subject to change.

All of the talk about verbatim outputs seems like a largely pointless distraction from the important part: the infinite set of outputs which are *not* a verbatim copy of a substantial piece of code, and which the copyright maximalists argue must be considered a derivative of all of its training inputs.

Here is what it boils down to: if I, as a programmer, either A) perform a sequence of steps, or B) write a program to perform a sequence of steps, then assuming that all inputs and outputs are the same, does the choice of A vs B affect the legality of the outcome? I don't believe that there's a logically coherent argument for the answer being "yes".

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:48 UTC (Mon) by bluca (subscriber, #118303) [Link]

You are assuming one-liners or boilerplate that everybody else is also using in the exact same way pass the threshold of originality (or whatever it is called in legalese). That is one big assumption to make.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 13:35 UTC (Mon) by nye (guest, #51576) [Link]

> You are assuming one-liners or boilerplate that everybody else is also using in the exact same way pass the threshold of originality (or whatever it is called in legalese). That is one big assumption to make.

I am definitely not assuming that. If we're talking about "a verbatim copy of a substantial piece of code", then that's essentially my definition of "substantial", but I specifically said "if" in that section, and my point was that IMO it's not at all the important part of the discussion; it's just a distraction (this is why I considered it unimportant to define "substantial" in that context).

FWIW, while we're further entertaining the distraction anyway, I'm not even convinced that the repeatedly-cited fast inverse square root should be eligible for copyright protection - on the grounds that the only bit of creative work in it is the choice of a magic constant, which isn't typically something that would be considered copyrightable. It would be interesting to see if a court is ever asked to rule on this specific piece of code (although I think it's basically always a sad day when we get to the point that a court is required to rule on anything, so "interesting" should not be construed as "good").

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 11:27 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

The examples so far are not one liners.

But even if they were, because the model is asked to predict what someone else would have written given the same program structure, they are far less independent from this very same structure than random snippets found on the web.

And, general structure is one of the things that distinguish fair use from plagiarism.

You can not have it both ways: mimic accurately what others would have done, and pretend you are not deriving from their work. (This is especially striking where people have used ML to complete damaged works of art, more accurately than the best forger. Who cares that the forgery was done one stroke at a time?)

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 13:37 UTC (Tue) by bluca (subscriber, #118303) [Link]

Except in real world usage the similarity is with either the same project in which it is being used (so it's moot), or with something obvious and standard like boilerplate used in the same way by every user of a given library or api, which means the test of originality would not be met.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 16:47 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

If you’ve found a reliable legal way to determine originality automatically, why are you even posting here? It would be worth its weight in gold (printed in extra large font on lead plates) to every single legal department of the whole Fortune 500.

And if you did not, how can you claim the tool never outputs anything original?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 17:55 UTC (Tue) by bluca (subscriber, #118303) [Link]

It is a very simple trick: actually use the tool you are talking about, to see what it does outside demos and funny gifs.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:56 UTC (Mon) by Wol (subscriber, #4433) [Link]

> All of the talk about verbatim outputs seems like a largely pointless distraction from the important part: the infinite set of outputs which are *not* a verbatim copy of a substantial piece of code, and which the copyright maximalists argue must be considered a derivative of all of its training inputs.

Not just the maximalists. Taking the word "derivative" at face value, all the output is derivative of the training data.

The question isn't whether it's derivative, the question is whether it's sufficiently *trivial* not to be copyright, or sufficiently complex and derived from just one or two training items to be a blatant copyright violation. And that will probably have to be determined on a case-by-case basis.

tldr; don't assume because it comes from Copilot that it's copyright-free... (don't assume that it isn't, either).

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 13:42 UTC (Mon) by nye (guest, #51576) [Link]

> Not just the maximalists. Taking the word "derivative" at face value, all the output is derivative of the training data.

That is the maximal possible interpretation, so yes, just the maximalists, by definition. You haven't even added so much as any vague handwaving about transformative use!

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 14:19 UTC (Mon) by Wol (subscriber, #4433) [Link]

> just the maximalists, by definition

Except the quote I was replying to said COPYRIGHT maximalists.

And I certainly didn't claim that the output was - or even should be - copyright. I just said that it was - BY DEFINITION OF THE WORD - derivative.

If I openly said that *some* output is too trivial to copyright, how does that make me a copyright maximalist? And again, isn't "transformative use" - by definition - derivative? FFS, it's a *transformation* - it's the same thing but altered ...

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 16:06 UTC (Sat) by NAR (guest, #1313) [Link]

What if we replace (in this argument) the artificial intelligence with a natural one? Humans learn differently than computers, but we still see a lot of code and that influences our behaviour. If I see only GPL'd code, if I learn coding only using GPL'd examples, shall my code also be under the GPL?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 13:18 UTC (Fri) by eduperez (guest, #11232) [Link]

> Where I do have issues is where CoPilot outputs source code which is distributed to others. I fail to understand how the source code it produces can not be considered a "derivative work" of its source code inputs, as without those inputs it would produce no output at all. And producing and distributing a derivative work does require a license - or (as SFC ask for) some kind of explanation why the distributor feels a license is not needed.

> It is strange if CoPilot's authors invoke the comparison with a compiler, where the output is owned by the operator. Because that's only true if the inputs are owned by the operator. You can't run someone else's source code through a compiler and then claim copyright ownership of the object code just because you invoked the compiler. I am not a copyright lawyer, but I know that isn't how copyright law works.

You do not need to go that far: there are demonstrations showing that CoPilot can output code that is a verbatim copy of one of its sources; they cannot pretend that a verbatim copy of some code is not a copyright infringement just because it passed through some AI algorithm.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 13:40 UTC (Fri) by bluca (subscriber, #118303) [Link]

The threshold of originality is not a single character, and almost never a single word (you have trademark for that), and rarely a single line. If you type "for (" and it autocompletes to "for (i = 0; i < n; ++i) {" is that copyright infringement? It's a non-trivial question how big a snippet needs to be before a verbatim copy is infringement. It's not that black and white.
Even for larger chunks it's murky, the usual example being the fast inverse square root - but that snippet has been copied so many times in so many places that it's almost folklore now. And in all cases I've seen, someone was _intentionally_ steering the autocomplete engine toward that answer. Would it pop up in a completely unrelated case when nobody was actively trying to make it appear? And would a copyright lawsuit from the original author based solely on that snippet win in court? I don't know, I'm not a judge, but it's really not as clear cut.

Certainly, large verbatim copies of less-common code should not happen, and the team was adding some checks for that IIRC. There might even be a config option to disable that from happening now?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 14:09 UTC (Fri) by LtWorf (subscriber, #124958) [Link]

You know as well as I do that Copilot has generated entire sections of GPL-licensed code.

https://news.ycombinator.com/item?id=27710287

So talking about "a single word" or "a single byte" or "a ⅓ of a bit" is just misleading. Copilot just copies entire functions… and you have no way of knowing if it copy-pasted an entire module from somewhere or "created" something original.

That it "should not happen" doesn't really matter. It has been shown that it does happen.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 14:15 UTC (Fri) by bluca (subscriber, #118303) [Link]

That's the fast inverse square root I mentioned. It's not entire random modules, it's a single snippet that is so well known it has its own Wikipedia entry. Also, the original license is irrelevant.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 13:42 UTC (Fri) by Karellen (subscriber, #67644) [Link]

To be fair to CoPilot, I think that argument is only solid for output which is a verbatim copy of one specific input. Given that most of CoPilot's suggestions are not verbatim copies of one specific input, I'm not sure how useful it is in the general case. Just because a few of its outputs are blatant copyright infringement, I don't think that it necessarily follows that all of its outputs are.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 14:14 UTC (Fri) by anselm (subscriber, #2796) [Link]

Just because a few of its outputs are blatant copyright infringement, I don't think that it necessarily follows that all of its outputs are.

But nor can one conclude that the particular output in front of one is necessarily free of any copyright infringement.

As far as I'm concerned this lack of provenance is one major problem with the approach. The other major problem with the approach is that it is by no means guaranteed (AFAIK) that Copilot output actually does what it is supposed to do. I wonder whether it is usually less work to validate, debug, and clean up something that came out of Copilot than it is to come up with the same thing from scratch and avoid the entire minefield in the first place – i.e., whether Copilot is “worth it” in daily practice.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 13:13 UTC (Sat) by bluca (subscriber, #118303) [Link]

Have you used it?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 22:31 UTC (Sat) by anselm (subscriber, #2796) [Link]

Doesn't really matter. According to its own FAQ (and not just according to common sense), Copilot output code should be “rigorously tested”, “reviewed and vetted”, and “checked for security vulnerabilities”. (The Copilot FAQ also says that Copilot's output “may contain insecure coding patterns, bugs, or references to outdated APIs or idioms” and that it “may not always work, or even make sense”. Yep. Sounds just what we need. Bring it on.)

As a programmer I probably spend more time writing tests for my code and ensuring that it does what it is supposed to do than I do to come up with the code in the first place; if I need to write the tests and debug the code, anyway, then having to write it first is really the least of my worries, and if I write the code myself then at least copyright is much less likely to be an issue. Also, writing the code from scratch will probably be more creative and fun than having to bang dubious Copilot output into shape if it contains “insecure coding patterns” or “references to outdated APIs”, let alone subtle errors that render it inappropriate for the actual use case at hand.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 23:20 UTC (Sat) by bluca (subscriber, #118303) [Link]

So you haven't - that is quite obvious, given you don't really seem to understand what it's really good at. Try it out and you'll see.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 23:51 UTC (Sat) by anselm (subscriber, #2796) [Link]

Whatever. The examples on their web site leave me underwhelmed. E.g., the “IsPrimeTime.java” example takes a comment that reads

// Math.isPrime(int) returns whether the given number is prime or not
@Test

and completes that to

// Math.isPrime(int) returns whether the given number is prime or not
@Test
public void testIsPrime() {
  assertTrue(Math.isPrime(2));
  assertTrue(Math.isPrime(3));
  assertTrue(Math.isPrime(5));
  assertTrue(Math.isPrime(7));
  assertTrue(Math.isPrime(11));
  assertTrue(Math.isPrime(13));
  assertTrue(Math.isPrime(17));
  assertTrue(Math.isPrime(19));
  assertTrue(Math.isPrime(23));
  assertTrue(Math.isPrime(29));
}

which is obviously an impressive blob of code but fails completely at its stated purpose. If that is really the best Copilot can do, to a point where they feel they must put it out as an advertisement, then please explain to me again why I should want to pay for drivel like that.
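To make that failure concrete: here's a hypothetical, deliberately wrong implementation of that isPrime (my own sketch, not anything from the Copilot site) which still passes every one of those generated assertions, because the test never checks a single composite number:

```java
// Hypothetical sketch: a broken isPrime that the generated test cannot
// catch, since the test only asserts that primes are prime and never
// asserts that a composite is not.
public class IsPrimeDemo {
    static boolean isPrime(int n) {
        return n >= 2; // wrong: also claims 4, 6, 8, ... are prime
    }

    public static void main(String[] args) {
        // Exactly the values the generated test checks.
        int[] tested = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};
        for (int p : tested) {
            if (!isPrime(p)) {
                throw new AssertionError("generated test would fail on " + p);
            }
        }
        // Every generated assertion passes, yet the implementation is wrong:
        System.out.println(isPrime(4)); // prints "true" although 4 is composite
    }
}
```

A test suite that an implementation like this satisfies tells you nothing about whether the function actually distinguishes primes from composites.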

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 10:47 UTC (Sun) by bluca (subscriber, #118303) [Link]

It's a fancy autocomplete tool, not a replacement for an engineer. I've been using it since the preview started, and it's incredibly good at saving time when dealing with boilerplate and other repetitive chunks. It really excels when, for example, dealing with a library with a very verbose interface, or when adding yet another unit test to a file full of tests with small variations between them, or when reusing the same pattern across a project. It doesn't conjure up creative new algorithms on its own; that's not its job, it's the coder's. And of course you need to check what comes up, you don't blindly accept anything it throws at you, just like with ye olde autocomplete, and there's nothing wrong with that.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 15:45 UTC (Sun) by anselm (subscriber, #2796) [Link]

If what Copilot does is worth $100/year to you (or your employer), then more power to you. From the examples on the web site – which I presume are showing Copilot at its best, because why else pick them as examples? –, I personally don't see that for me, and in any case my favourite editor is not among the ones Copilot supports, so getting to where I could actually use Copilot in the first place would be too much of a hassle as far as I'm concerned, so I think I'll pass.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 17:12 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

> there's no need to invoke fair use here in the Europe, data mining on publicly available text and data bodies is exempt from copyright rules as per the copyright directive from a couple of years back

And anyone here thinks that will survive the first person who trains an ML model on the output of some first-rank singer, makes it produce a song, and tries to make money out of it? Especially if it ends up a success? That’s not science fiction: people are already managing to generate pretty convincing painting forgeries from museum archives.

Hollywood would have none of it.

Data will be quickly redefined to exclude anything copyrightable. Or some other clause will clarify that the exemption applies to building the model, not to using the model to create the same form of work.

Exactly like the “we are only neutral publishers” protection cloud providers fought for, which lasted no longer than it took YouTube to start making big money from other people’s creations.

Lawmakers may not understand tech but they understand the money trail plenty fine.

Copilot is free (as in beer) to use? No? I thought so. The usual unethical behavior of big companies that knowingly cash in on shady behavior, hoping the law takes its time to catch up with them.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 17:18 UTC (Tue) by Wol (subscriber, #4433) [Link]

> > there's no need to invoke fair use here in the Europe, data mining on publicly available text and data bodies is exempt from copyright rules as per the copyright directive from a couple of years back

> And anyone here thinks that will survive the first person that trains a ML on the production of some first-rank singer, makes it output a song, and tries to make some money out of it? Especially if it ends up a success? That’s not science fiction people are already succeeding to generate pretty nice paint forgeries from museum archives.

Data MINING. I.e. putting copyright materials *IN*to the model. And using the output for bug hunting, learning, that sort of stuff. But the output is a work, it's copyrightable, and it inherits the parent copyright.

The GP needs to stop confusing *IN*put with *OUT*put. Otherwise he's likely to spend the rest of his life paying off the lawyers who couldn't defend him ...

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 17:29 UTC (Tue) by farnz (subscriber, #17727) [Link]

The laws as written today in the EU do not cover the output of a model - only the data mining for training, and the model itself.

Therefore, I'd expect that the output of a model can infringe copyright - just as a human who reads something doesn't infringe copyright by reading, nor by remembering chunks of what they read, but copyright does kick in if they reproduce something protected by copyright verbatim from memory.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 8:23 UTC (Fri) by flussence (subscriber, #85566) [Link]

I don't begrudge them for trying but they've picked an impossible battle:

You can *not* convince people to care about minutiae like this unless they were already predisposed to it. Even on a site like GitHub, because it's less about Social Coding™ these days and more of a forum where people go to complain that their freeware isn't working while others roleplay as Nick Krause every October to win promotional merchandise.

Microsoft's empire was built on pandering to those who think of the computer as a nuisance first and would prefer not to think about it at all, and in selling them products that reinforce that apathy. GitHub fits right into that market, and suggesting alternatives like GitLab or SourceHut misses the point of why people use this and not those in the first place (it's not because they're technically or morally superior in some way - GitLab isn't even either of those).

I mean, just look at everything that's happened to Twitter in the last 6 years that some people consider crossing an unacceptable boundary. There are a billion people still there (for reasons other than using it as an advertising space), who just don't care enough to leave, and nothing will make them.

Stuff like this is just Evangelical Christian preaching about sinners going to hell with some of the words changed. That's all it's ever been.

Actually after writing all that maybe I do begrudge them for trying. This campaign won't work, there's four decades of documentation of it not working, please try something different already.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 8:42 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> Microsoft's empire was built on pandering to those who think of the computer as a nuisance first and would prefer not to think about it at all

That's a pretty healthy attitude in general, actually. You don't think about the plastic used in your light switch every time you turn on the lights, do you? Unless you're a light-switch enthusiast, you just want it to work and not bother you.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 9:15 UTC (Fri) by flussence (subscriber, #85566) [Link]

>That's a pretty healthy attitude in general, actually.

And I never said it wasn't.

My continued existence on this earth is mostly paid for by the work of sparing others from having to think too hard about the computers they're using. When I say this isn't going to move the needle one bit, it's because I've seen into enough real people's lives to understand why.

If Copilot is automated piracy, then maybe this is a good time for a reminder that piracy is a service problem. By pretending things like this are effective, we're failing to learn not just from GNU's historic mistakes, but from those of the likes of SCO, the RIAA, and Microsoft itself. People want better service, not lectures and scaremongering. The free-as-in-freedom is just a side effect.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 8:24 UTC (Fri) by vegard (subscriber, #52330) [Link]

I don't think giving up GitHub for free/open source software is really a solution here, because all the code is still out there and can still be fed to a machine learning algorithm. Finding the code is not the difficult part; even if GitHub didn't create CoPilot, anybody else could do so with the same data. In my opinion the thing to fight is the use of machine learning to launder code (since this undermines FOSS licensing), and this fight is much bigger than just GitHub.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 9:28 UTC (Fri) by danpb (subscriber, #4831) [Link]

There are many good reasons to not host OSS projects on a fully closed source service like GitHub, but avoiding the risk of Copilot is a pretty weak one that will achieve nothing in practice. Regardless of where a project is officially hosted, by the very nature of being open source licensed, any person / organization can import the code to GitHub at any time they wish. IME, essentially any software package that has ever existed has already been imported into GitHub by someone.

The debate over Copilot is important to have as the answer is not entirely clear either way. There is no practical way for any open source projects to avoid being imported into Copilot though, and moving project hosting makes no difference to this. Using Copilot as a justification for moving hosting service is at most a statement of unhappiness with their approach.

GiveUpGitHub misses the practical details, like the need to maintain an official mirror

Posted Jul 5, 2022 0:02 UTC (Tue) by abartlet (subscriber, #3928) [Link]

Also, a project will want to maintain an official mirror on GitHub - it being the modern freshmeat.net, for those who remember that - so that the official mirror is kept current rather than some out-of-date or, worse, malicious one.

Given that, I don't see what can be done.

Why is Copilot so bad?

Posted Jul 1, 2022 23:04 UTC (Fri) by SLi (subscriber, #53131) [Link]

I like to think I'm rather pro free software. However it seems I view this very differently from many other people.

To me, the ideological background for free software is that the world would be a better place if software was free and there would not be restrictive copyright laws. Because this is not the world we live in, the decision was made to use the tools of the copyright law itself to work against the world order that uses it to suppress software freedom. I think it has worked reasonably well.

From this perspective, it rubs me the wrong way when people seem to argue that copyright should restrict the creation and use of AI models. It seems antithetical to what I thought were the goals of the movement. Granted, Copilot is not an open model, and that is not a good thing for free software. But please, for the sake of humanity, don't try to expand the copyright madness to AI models, which (to me) anyway seem closely equivalent to a human having read the code in question and producing code based on what they learned.

For once I feel that copyright law seems (based on what I judge to be the most believable expert opinions) to be in the state where I want it to be. Of course, as an AI practitioner, I might also be biased... But I really think it would be a very silly world where you couldn't use publicly available data to train models without copyright ruining any hope of technological progress.

Why is Copilot so bad?

Posted Jul 2, 2022 1:14 UTC (Sat) by pabs (subscriber, #43278) [Link]

A world without copyright is not one in which most software is free software. The essential ingredient of free software is equality of access to a work between upstream and downstream. Most of the restrictions on downstream access go away without copyright, but one does not: access to source code. In a world without copyright law, source code doesn't magically become public. Trade-secret source code will still exist, and NDAs and other contracts will protect it.

Why is Copilot so bad?

Posted Jul 2, 2022 9:09 UTC (Sat) by SLi (subscriber, #53131) [Link]

Ok, but is it seriously a good idea to prevent people from creating (including free and open) AI models from publicly available data? To me that's the kind of copyright maximalism I did not expect from the free software movement.

In fact, this feels somewhat like a knee-jerk reaction based on a hated company being behind this. Among the masses, I think that's a large part of it. I'm not arrogant enough to think that genuinely knowledgeable and philosophical people like those at SFC have that attitude, though, so that's what leaves me confused.

To me, it seems that the argument is essentially that it should not be realistically legally possible to create good AI models (for which you need at least hundreds of gigabytes of source code) because of copyright reasons and the impossibility of vetting a copyright-safe set of such code. And I think this is a very counterproductive argument.

Training even one such model is expensive (and bad for the environment) enough that it really, really should not be done separately for each mutually incompatible free software license, let alone retrained every time you discover licensing ambiguities in some small part of the input (I know, unheard of in the free software world...).

Why is Copilot so bad?

Posted Jul 2, 2022 11:49 UTC (Sat) by Wol (subscriber, #4433) [Link]

> Ok, but is it seriously a good idea to prevent people from creating (including free and open) AI models from publically available data? To me that's the kind of copyright maximalism I did not expect from the free software movement.

There's nothing wrong with MODEL. But there's everything wrong with the USES THAT MAY BE MADE of the output.

The output is - must be - a derivative work of the inputs used to create it. That's what the word "derivative" means.

This then brings copyright into play. The output may be a conglomerate of multiple similar works, in which case the copyright status is probably "too trivial to be eligible". Or the output may be the sole match for a complex piece of code someone is trying to write, tempting them just to take Copilot's output verbatim as the solution to their problem. In that case the copyright status is "blatant piracy". And there's all the other points on the spectrum between them.

Mining publicly available code and using it for education is fine - why else would it be publicly available? It's generally accepted that stuff is out there for people to read and learn from.

But it's NOT generally acceptable that stuff is put out there for others to make a fast buck from. Using Copilot output for commercial purposes is NOT an acceptable default status - a lot of it has been shared on a "share and share alike" basis and people who don't "play the game" are Stealers of the Commons. Dunno about other countries, but "Finders Keepers" could land you in jail for theft over here (unlikely, but perfectly possible - you have a duty to try and find the rightful owner).

Cheers,
Wol

Why is Copilot so bad?

Posted Jul 2, 2022 12:40 UTC (Sat) by SLi (subscriber, #53131) [Link]

The model is pretty much useless if you cannot use it for anything without fouling copyright. I really think that would be a very harmful development.

Luckily, I also think that that understanding of what a derivative work means in the copyright context is pretty wild and likely incorrect. Well, whatever the law turns out to mean, I wish people stopped advocating for such harmful interpretations. It may be that the law ends up preventing any real AI code models, but it definitely should not.

Why is Copilot so bad?

Posted Jul 2, 2022 13:10 UTC (Sat) by bluca (subscriber, #118303) [Link]

Thank you for clearly expressing these points - the future some of the commentators wish for is very bleak. It seems to me mostly a knee-jerk 'Microsoft bad!' reaction; they don't realize that if using public code repositories to train an AI model were not allowed, in practice only giant corporations with huge caches of proprietary internal code (like... Microsoft!) would be able to legally build and sell an AI product such as Copilot. That would be a really sad and bleak outcome. Fortunately the law, at least in Europe, doesn't seem to be going in that direction.

Why is Copilot so bad?

Posted Jul 4, 2022 9:23 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

I seem to understand that you work for Microsoft. However most of us don't, so we can speak our minds more freely, since we are not afraid of getting fired :)

Copyleft only exists because copyright exists. I think we can agree on this point.

People use copyleft licenses because they want their work to remain free.

If copyright didn't exist and all software source was public domain, we'd all be very glad that Copilot was there to help write code that would be free.

However copyright does exist, and copilot is going to be used mostly to write copyrighted proprietary software, using copyleft software. This is clearly something that the authors of copyleft software didn't want.

Not using github is not a solution because anyone (including microsoft itself) has every right to just mirror whatever on github.

Now you claim (and have an economic interest in claiming so) that copilot does not infringe. However you aren't a judge. And while I do agree that creating the model does not infringe, the output generated from the model is another thing entirely, and that might be infringing.

In any case people who wrote GPL code know that their work is going to be used in proprietary code, which goes against the license and against their wishes when they chose that license.

You are just betting that a future lawsuit will say that you are right. But even if you are wrong it will be the users of copilot being in violation, so microsoft is betting that it will be very hard to find who to sue and no lawsuit will ever happen.

To respond to your comment, no, having your license terms respected is not "bleak". Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.

The law allowing the training of an ML model doesn't say anything about using that model to generate new content.

Why is Copilot so bad?

Posted Jul 4, 2022 11:47 UTC (Mon) by bluca (subscriber, #118303) [Link]

> I seem to understand that you work for Microsoft. However most of us don't, so we can speak our minds more freely, since we are not afraid of getting fired :)

Ahah if that were an actual issue, I'd have been fired a long time ago, I can assure you

> (and have an economical interest in claiming so)

I am not in GH and I am not a shareholder, so you can park this nonsensical tinfoil-hattery straight away - I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes I might add.

> To respond to your comment, no, having your license terms respected is not "bleak".

It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software beside some boring indexing or suchlike software, as it would be de-facto impossible to compile a legal training corpus unless you have a metric ton of private code available to you. That would be dreadful, and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.

> Microsoft would be very free to train copilot on their internal code but didn't… don't you find that interesting? Instead they chose to build copilot on other people's works, which are indeed copyrighted.

It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code, that is to say, all of it that is publicly available on GitHub (there's loads), as the team has said multiple times in public, because that's where the training data comes from. If code is on different systems (external or internal), it wasn't used, it's as simple as that - I don't even know if the GH org can access other systems, but from my own experience, I'm pretty sure they cannot even if they wanted to.

> The law to train a ML model doesn't say anything about using that model to generate new content.

Lawmakers were clearly and openly talking about AI applications, and not just indexing or other such activities. A giant chunk of AI R&D is in the field of generating content, like GPT and so on. It seems like a bold assumption to think that the lawmakers weren't aware of all that.

Why is Copilot so bad?

Posted Jul 4, 2022 13:03 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

> Ahah if that were an actual issue, I'd have been fired a long time ago, I can assure you

You claim that, but here you are with 26 comments defending microsoft's actions.

> I am not in GH and I am not a shareholder

I'm sure you have vested or will vest stock. It's common practice. And you do get a salary, I hope?

> I am simply a free software developer and a happy user of Copilot for a year, unlike the vast majority of commentators here who have obviously never seen it outside of a couple of memes I might add.

Most people would give it a try, but getting it to work is non-trivial (using a specific proprietary editor, setting up a VM to isolate said editor, handing over a credit card number). So it's not like it's easy to test and form an opinion.

> It would be incredibly bleak, as nobody outside of a few major corporations would ever be able to build AI/ML software

Uhm… Microsoft is a major corporation building an AI/ML software violating the licenses of probably millions of smaller fishes. It's happening now.

> That would be dreadful

It is dreadful indeed. I'm not sure why you are considering microsoft to be this little innocent startup company.

> and I am happy the law is going in a different direction and the original license is irrelevant for AI training, as it's better for everyone.

That's your personal opinion that you keep repeating but there is no agreement. And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.

> It's not interesting at all, in fact it's quite boring and obvious - it is trained on plenty of MSFT's own code

The open source one… not the proprietary one… Be intellectually honest, please. I talked about proprietary code and you replied with something entirely off-topic.

> If it's on different systems (external or internal), it wasn't used

And why is that? Why didn't Microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output, perhaps?

> Lawmakers were clearly and openly talking about AI applications

Generating code is not the only ML application that can exist. Classifiers are ML.

I'm sure the lawmakers were aware, and that's why they talked about "training data" but not about "spitting out the training data verbatim".

You are reading what you would like to be written rather than what is actually written.

Why is Copilot so bad?

Posted Jul 4, 2022 18:16 UTC (Mon) by bluca (subscriber, #118303) [Link]

> You claim that, but here you are with 26 comments defending microsoft's actions.

And...?

> Most people would give it a try but getting it to work is non trivial (using a specific proprietary editor, setting up a vm to isolate said editor, giving up the credit card number). So it's not like it's easy to test and form an opinion.

You forgot hand-carving new silicon behind a blast door in a hazmat suit. Also, TIL that Neovim is a proprietary editor. And there's no need for a credit card if you are an open source maintainer; you get it for free.

> Uhm… Microsoft is a major corporation building an AI/ML software violating the licenses of probably millions of smaller fishes. It's happening now.

You are both failing to see the point (major corporations would be fine if the law worked like the maximalists want it to; it's everyone else who would be worse off) and talking nonsense: there is no license violation anywhere. Feel free to point to the court cases if not. Just because a few trolls and edgy teenagers shout "violation!" doesn't mean it's actually happening; you need to prove it. Can you?

> And in this case it is not better for the authors, as you can see by the fact that the authors are indeed complaining.

The fact that some are complaining doesn't mean the alternative, if the law were different, would be better. There are plenty of anti-vaxxers complaining about vaccination programs worldwide; it doesn't mean we'd be better off without vaccines.

> And why is that? Why didn't Microsoft use its own internal git repos for training? I'm sure there is a lot of code there… is there some fear about the license of the output, perhaps?

It's because of the aliens trapped in those repos, duh! Now if you take off your tin foil hat for a moment and go read other replies, I've already given my uninformed guess on why only public repos on Github are used.

> You are reading what you would like to be written rather than what is actually written.

I'm not the one claiming that training a model violates copyright when it's explicitly allowed by law.

Can we stop here?

Posted Jul 4, 2022 18:30 UTC (Mon) by corbet (editor, #1) [Link]

I'm thinking that perhaps this particular subthread has gone as far as it needs to; let's stop it here.

Thank you.

Can we stop here?

Posted Jul 5, 2022 14:35 UTC (Tue) by nye (guest, #51576) [Link]

It reflects badly on you that you post this as a reply to someone responding to repeated baseless personal attacks.

Can we stop here?

Posted Jul 5, 2022 14:52 UTC (Tue) by corbet (editor, #1) [Link]

Perhaps you have the time to watch an out-of-control comment thread - on a holiday - to find the perfect point at which to intervene. I apologize, but I lack that time.

Why is Copilot so bad?

Posted Jul 4, 2022 9:01 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

The model can be used to find the probability of bugs or whatever. Not all models are used to generate.

For example, in my ML course at university we trained a thing to recognise handwriting. We didn't use it to generate a new font.
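A toy illustration of that distinction, sketched in pure Python (the "stroke feature" vectors and labels are invented for the example, not real handwriting data): a nearest-centroid classifier can only assign labels to inputs; nothing in it can produce new samples.

```python
# Hypothetical toy classifier: it learns one centroid per class from
# labelled examples and can only label new points. There is no machinery
# here for generating new samples - it is a pure discriminative model.

def train(examples):
    """examples: list of (label, (x, y)) pairs -> dict of label -> centroid."""
    sums, counts = {}, {}
    for label, (x, y) in examples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab])
            for lab, (sx, sy) in sums.items()}

def classify(model, point):
    """Return the label whose centroid is nearest to the point."""
    x, y = point
    return min(model,
               key=lambda lab: (model[lab][0] - x) ** 2 +
                               (model[lab][1] - y) ** 2)

# Invented 2-D "stroke features" for two handwritten glyph classes.
examples = [("a", (0.0, 0.1)), ("a", (0.2, 0.0)),
            ("b", (1.0, 1.1)), ("b", (0.9, 1.0))]
model = train(examples)
```

The trained model is just two centroids; asking it to "generate a new font" is not even a well-formed request against this interface.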

Why is Copilot so bad?

Posted Jul 3, 2022 4:28 UTC (Sun) by pabs (subscriber, #43278) [Link]

I don't believe that you can create a free and open ML model from publicly available data of unspecified provenance and licensing; you need free and open data for that. Training an ML model against publicly available proprietary data creates what the Debian Machine Learning Team calls a ToxicCandy model: one where Debian could not, even in theory, redistribute the training data in main, upload the training code to main, train the model using the code and data from main, redistribute the resulting model in main, and then have downstream users create an offline mirror of Debian and make completely offline modifications to the training data (removing data bias, for example) followed by completely offline retraining of the model.

https://salsa.debian.org/deeplearning-team/ml-policy

Of course, the prohibitively large sizes of most training data sets and the prohibitively large costs of training make this scenario infeasible for various actually useful models, but maybe if there were a group working on and funding libre ML (training data storage, compute, and reproducible training), it would become feasible to have actually libre ML.

I don't believe that ToxicCandy models or proprietary models are a good idea. I also believe that the purposes many ML models are put to are very unethical, and that ML researchers need to think carefully about what the model they are creating will enable.

I haven't thought about Copilot enough to comment on the rest of your post.

Why is Copilot so bad?

Posted Jul 3, 2022 6:27 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

free and open data

We need to be careful with the usage of that phrase. "Data" can refer to any set of information under the sun. But many information sets are not subject to copyright protection in the US (see Feist v. Rural), and may instead be subject to sui generis database rights (similar but not identical to copyright) in the EU. This is further complicated by the fact that US law allows you to copyright the "selection or arrangement" of data that would otherwise not be subject to copyright.

In the case of Copilot, the inputs are, of course, subject to copyright. But, if you'll excuse my use of US law (it's the legal system I know best), there are a whole bunch of unanswered questions:

  • Are the model's parameters, as a whole, subject to copyright protection?
    • Probably not; the parameters are not the product of human creativity, but are instead produced by an automated process. US law generally doesn't like giving copyright to non-humans. Furthermore, they don't directly encode meaningful human creativity as such - the parameters are basically a big pile of statistical information.
  • Assuming, for the sake of argument, that the model is not subject to copyright protection, can it nevertheless be a "derivative work" within the meaning of the copyright statute?
    • Bizarrely, the copyright statute provides a reasonable definition of "derivative work," but no definition of the word "work" (17 USC 101), so this remains unclear to me. It probably depends on whether the model counts as a "work."
  • If the model is not a derivative work, does creating the model violate any of the other exclusive rights in 17 USC 106?
    • I strongly doubt it. If it's not a derivative work, it doesn't look like any of the other rights are applicable. It's derivative work or nothing.
  • Is the output of the model a derivative work of its inputs?
    • IMHO this is the question that most likely matters, and the one that everyone has been stubbornly ignoring in order to pontificate about random other stuff that is functionally irrelevant. I don't think any of the other questions have much bearing on how this comes out. Instead, you'd probably get the usual "substantial similarity" test - the plaintiff would come up with *one* *particular* piece of code that is allegedly infringed, and the output which allegedly infringes it, and then the judge would look at the two bits of code and see how similar they are, on a case-by-case basis. Enjoy litigating the crap out of that...
  • If some particular output of the model is indeed a derivative work of one of its inputs, can GitHub's TOS indemnify itself for the infringement?
    • I have no idea, but I assume that Microsoft's lawyers looked at this and said "yes." Otherwise, it probably wouldn't be available.

Your legal system will probably have a different set of unanswered questions, which may in turn have different answers. Regardless, trying to make strong claims about what is or is not legal is a fool's errand at this point.

Why is Copilot so bad?

Posted Jul 3, 2022 6:32 UTC (Sun) by pabs (subscriber, #43278) [Link]

My comment was unrelated to legal systems, but about what could be considered "free and open ML". I consider the Debian ML policy linked above a good start towards defining such a thing.

Why is Copilot so bad?

Posted Jul 4, 2022 13:15 UTC (Mon) by LtWorf (subscriber, #124958) [Link]

> Enjoy litigating the crap out of that...

If you were a company… would you buy Copilot knowing that, after that, every single GitHub user can hit you with an infringement lawsuit?

I'm sure there are copyright trolls interested in acquiring the rights to some GitHub projects and going around making claims :)

If I were a CTO in charge of a company, I'd just not buy into it, because the potential cost in legal fees and complete bankruptcy seems to greatly outweigh the time we could save.

Why is Copilot so bad?

Posted Jul 5, 2022 8:19 UTC (Tue) by cortana (subscriber, #24596) [Link]

Worse, C-levels now need to concern themselves with liability from their employees using Copilot themselves (after all, if an employee thinks they'll perform better at their job if they pay for it themselves...) and not disclosing that they are committing unlicensed code to their employer's repos...

Why is Copilot so bad?

Posted Jul 5, 2022 8:49 UTC (Tue) by geert (subscriber, #98403) [Link]

People can commit unlicensed code to their employer's repos without Copilot, too.

Why is Copilot so bad?

Posted Jul 3, 2022 23:21 UTC (Sun) by SLi (subscriber, #53131) [Link]

I think the problem with a strong desire or mandate to have a model trained only on a large (Debian-sized or GitHub-sized) dataset of strictly free software, or even license-compatible code, is the practical impossibility of coming up with such a body of code. Although, if I understand your argument correctly, you don't mind license incompatibility (like the old-style BSD woes or the OpenSSL saga) as long as it's legal and FOSS?

You observed correctly that training such a model costs millions. It may be possible in the future that training will become less expensive due to algorithmic or hardware improvements. In practice what you can expect to happen is that people (and companies) will train larger and more useful models than the current ones.

So, assume you have trained such a model on, say, all the code in Debian. Now it turns out there's a small piece of code there that is actually not free software, perhaps not even distributable (happens all the time, I think?). What are you going to do, retrain it from scratch?

Why is Copilot so bad?

Posted Jul 4, 2022 3:36 UTC (Mon) by pabs (subscriber, #43278) [Link]

My comment was more general than just code generation models.

I hadn't thought of license incompatibility, but presumably it would indeed be a concern.

I know approximately zero about ML, but AFAIK retraining is the only option when it comes to deficiencies in a model due to bad input data. For example, if a model is indirectly biased against certain groups of people, the procedure is presumably to analyse the bias in the input data, then discard some subset of that data or add more data, and then retrain the model from scratch. If an ML chatbot is racist because it was trained on internet comments from various sites, you either just delete all the ones from 4chan and hope there are no racist comments on Twitter etc :), or manually comb through all the millions of comments and delete the racist ones. Or just give up on the internet as a source of input data :) So yeah, retraining is the only option in the face of non-free or non-redistributable code input.

Why is Copilot so bad?

Posted Jul 4, 2022 9:49 UTC (Mon) by SLi (subscriber, #53131) [Link]

Yup... Well, I can maybe see a community, perhaps with some public funding, coming together and coughing up a few million for the training of a significant model.

I have a very hard time seeing that repeating whenever someone discovers there were a few kilobytes of non-free code in the input.

Why is Copilot so bad?

Posted Jul 3, 2022 5:39 UTC (Sun) by oldtomas (guest, #72579) [Link]

Defending Microsoft's position against (perceived?) "copyright maximalism" is... spicy.

Why is Copilot so bad?

Posted Jul 2, 2022 1:20 UTC (Sat) by pabs (subscriber, #43278) [Link]

Ignoring the copyright concerns, here is a pretty terrible argument that I think is probably true: Copilot is bad for free software because it makes it easier to write software, and most of its users will be writing proprietary software, so it increases the rate of creation of proprietary software and thus the amount of proprietary software that exists :)

Why is Copilot so bad?

Posted Jul 2, 2022 13:16 UTC (Sat) by bluca (subscriber, #118303) [Link]

A world in which it is not legal to train an AI on public corpora is a world in which only giant corporations with huge caches of proprietary code (or text or whatever) can build AIs. That's not a good outcome for free software.

Why is Copilot so bad?

Posted Jul 2, 2022 14:03 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

Admirable!

But the world we have right now is one where a giant corporation is using public domain (fine), permissively licensed (fine-ish) and copyleft (not so fine) code, rather than its own proprietary code, to train its AI model.

Why is Copilot so bad?

Posted Jul 2, 2022 14:22 UTC (Sat) by SLi (subscriber, #53131) [Link]

Yes. And I believe a good world is one where anyone can use any publicly available data to train an AI and use it freely, or at least without the copyright of the training material preventing it (ethical reasons may be a good reason to regulate some uses of AI).

Now it sounds like some want to throw the baby out with the bathwater and prevent all code AI models, apart from giants like Google or Microsoft training their own models on their own code (possibly for in-house use only, if they are scared of information leaks).

Society shouldn't adopt a copyright-maximalist stance and stifle uses of AI merely because the first available models were proprietary.

The idea that "any use" of copyrighted works should require a license is a typical maximalist idea, and one I expect to hear from the entertainment industry rather than from free software proponents. Training an AI and using it to produce code is, rather clearly to me, one of those things that are only extremely tangential to any traditional purpose of the copyright system. It's fundamentally not at all different from a human reading publicly available code and using the memories formed that way to write more. I don't think even the craziest copyright maximalists claim that the products of that are typically derivative works.

Why is Copilot so bad?

Posted Jul 2, 2022 16:02 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

> It's fundamentally not at all different from a human reading publicly available code and using the memories formed that way to write more.

If a corporation is allowed to bleach the copyleft off of your code by using it as feedstock for an incomprehensibly complex computer algorithm and then asking the algorithm to solve that problem, copyleft is gravely wounded.

Why is Copilot so bad?

Posted Jul 2, 2022 22:12 UTC (Sat) by kleptog (subscriber, #1183) [Link]

> If a corporation is allowed to bleach the copyleft off of your code by using it as feedstock for an incomprehensibly complex computer algorithm and then asking the algorithm to solve that problem, copyleft is gravely wounded.

How is this different to anyone looking at copylefted code in Github for inspiration to solve a problem they're having, and then using that idea, written in their own way, in their own program? Copyright is focussed on the copying of expression, not the copying of ideas. As long as you can argue the model is copying the idea, not the expression, copyright is completely irrelevant.

The whole issue comes down to the distinction we've made in copyright law between what compilers do (which is considered pure manipulation having no effect on copyright), and what people do (which is looking at pieces of source code to learn and use that to make more source code). Isn't the rule of thumb: if you're copying from one source it's plagiarism, if you're copying from two it's research?

I don't really see how a model built on examining lots of source, some copylefted, producing code reduces the value of the input code. If a computer model can actually come up with code that does something you've typed, perhaps it wasn't so original and it's the kind of thing we want to automate away anyway.

TBH, the idea of a model writing code for you to solve a problem sounds nice. But what would be really valuable is something that could see where many programs are solving a similar problem, that it makes a library for that and refactors all the other programs to use that.

Why is Copilot so bad?

Posted Jul 5, 2022 9:38 UTC (Tue) by farnz (subscriber, #17727) [Link]

Treating it as comparable to a human is what I suspect the courts will do, and rsidd has pointed out that music precedent in the case of George Harrison's "My Sweet Lord" suggests that if Copilot does output snippets of its training data unchanged, then unless that snippet is "purely functional", it'll be found to be a copyright infringement by the user of Copilot.

That's a risk for any user of Copilot to assess - are they OK about a possible infringement suit caused by the fact that Copilot has access to code owned by Alphabet, Meta, Microsoft and other entities whose code is on GitHub?

Why is Copilot so bad?

Posted Jul 4, 2022 8:48 UTC (Mon) by nim-nim (subscriber, #34454) [Link]

It would be utterly trivial to use machine learning on FLOSS code *legally*.

There are not a thousand different FLOSS licenses.

There are not a thousand different combination rules.

Compared to the number of files the model ingests to output suggestions, determining what the license of a project is, which other licenses it can be combined with, and what the possible licensing of the result is, is *TRIVIAL*. No need for special magic research exemptions, no need to anger people; just apply the original licensing, legally safe by design in all jurisdictions.
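A minimal sketch of what such a pre-training check could look like (the license table, compatibility sets, and repository records here are invented for illustration; real tooling would use full SPDX identifiers and a proper compatibility matrix):

```python
# Hypothetical sketch: partition a training corpus by declared SPDX
# license identifier, so suggestions derived from each bucket can carry
# that bucket's licensing. All license names and repo records below are
# illustrative assumptions, not real data.

# Licenses whose terms we know how to propagate to model output.
TRAINABLE = {
    "MIT": "permissive",
    "BSD-2-Clause": "permissive",
    "Apache-2.0": "permissive",
    "GPL-2.0-only": "copyleft",
    "GPL-3.0-or-later": "copyleft",
}

def partition_corpus(repos):
    """Split repos into per-license buckets; skip unknown/unlabelled ones."""
    buckets = {}
    skipped = []
    for repo in repos:
        lic = repo.get("license")
        if lic in TRAINABLE:
            buckets.setdefault(lic, []).append(repo["name"])
        else:
            # No recognised license metadata: exclude from training.
            skipped.append(repo["name"])
    return buckets, skipped

repos = [
    {"name": "libfoo", "license": "MIT"},
    {"name": "barutils", "license": "GPL-3.0-or-later"},
    {"name": "mystery-code", "license": None},
]
buckets, skipped = partition_corpus(repos)
```

The point of the sketch is only scale: one dictionary lookup per repository is negligible next to the cost of ingesting the files themselves.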

Pretending everything is public domain is not just laziness, it’s *opinionated* laziness, that tries to blur the lines so everything not “protected” by bigcorp lawyers is free to pillage, and everything produced by this pillaging can be safely put out of bounds.

Why is Copilot so bad?

Posted Jul 4, 2022 10:27 UTC (Mon) by SLi (subscriber, #53131) [Link]

Only if all the licensing metadata is correct and the authors understood what they were doing and had the right to do so. Hint: it isn't, and they didn't... I have followed Debian actively enough to know that a team of a few interested people can rather easily stumble on copyright violations, even if the majority of people might prefer those not to be discovered.

Why is Copilot so bad?

Posted Jul 4, 2022 14:10 UTC (Mon) by nim-nim (subscriber, #34454) [Link]

Well, guess what, the real world is imperfect.

“Mister judge, some goods in that store are probably mislabeled, therefore I decided that paying for what I picked up was unnecessary.” How do you think that would work out?

The law does not let you off the hook because others may have made mistakes. Everyone makes mistakes. There's a difference between making an honest mistake (trying and failing to achieve perfection) and not trying at all.

Why is Copilot so bad?

Posted Jul 4, 2022 16:17 UTC (Mon) by SLi (subscriber, #53131) [Link]

My point exactly. But would you have the world be such that every time you discover there was some small piece of non-free code in the ton of code you used to train the model, you have to retrain it at a cost of a few million?

As a practical matter, no such large corpus of code without any copyright violations to be discovered exists. I suspect the large corporations come closest. For the free software world, this idea would kill the last hope of training such models.

I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but being careful to not divulge trade secrets—which is obviously a non-issue with any code that is freely accessible.

Why is Copilot so bad?

Posted Jul 4, 2022 16:37 UTC (Mon) by bluca (subscriber, #118303) [Link]

> I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but being careful to not divulge trade secrets—which is obviously a non-issue with any code that is freely accessible.

I'm not in GH so I don't know, but if I had to take a wild guess I'd say it's much simpler than that. The non-GH internal SCM systems are such a horrendous pain in the back to use, and even to get access to, that I'm willing to bet the team working on the model, even if given permission to use those sources, would "nope" the heck out very, very fast and never look back.

Why is Copilot so bad?

Posted Jul 4, 2022 17:53 UTC (Mon) by Wol (subscriber, #4433) [Link]

Given that I've worked with SourceSafe, I'm inclined to agree with you ... :-)

Cheers,
Wol

Why is Copilot so bad?

Posted Jul 5, 2022 6:39 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

> For the free software world, this idea would kill the last hope of training such models.

Not at all.

JUST APPLY THE ORIGINAL LICENSING

We spent decades streamlining FLOSS licensing to make sure the number of actual licenses in play is small and their effects are clearly understood. That is the kind of legal effort proprietary companies skimped on, as is shown every time some software giant tries to relicense its own (?) code and spends years clearing up the effects of not having done due diligence.

THERE IS NO VALID EXCUSE TO IGNORE FLOSS LICENSES.

Easily 60% of GitHub content is governed by a dozen or so FLOSS licenses. That is more than enough to train a model on. Distinguishing between a dozen different sets of terms is not hard.

This is especially galling from an extra-wealthy company that had the means for years to clear the legal status of its own code (but did not), spent years mocking people who wasted time arguing about the exact effect of FLOSS license terms, and then starts pillaging this very code without even trying to comply with the hard-won, simple licensing state.

This is especially galling from a division (GitHub) that has been asked for years to help committers navigate legalities, made half-hearted efforts, and then proceeds to ignore the result of those efforts.

Stop finding ridiculous excuses. FLOSS is about the only software trove where ML can work legally *because* of its licensing simplicity (which took a lot of effort to achieve). ASSUMING YOU APPLY THIS LICENSING. Otherwise it's no better than proprietary software, and Microsoft has plenty of its own to play with; it's not welcome to play with other people's software while not abiding by the legal conditions.

No better than the people who ignore Creative Commons terms because their own legal status is an utter mess and they expect others to be just as bad. It's not an honest mistake once they've been told repeatedly that it's not the case. They can stomp on their own licensing, not on the licensing of others.

Why is Copilot so bad?

Posted Jul 5, 2022 6:48 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

(Also we *do* remember that Microsoft filed Amicus briefs on Oracle's side, when it spent years suing Google for 9 lines of rangecheck implementation, and now wants us to accept that copying of FLOSS code on industrial scale is not protected provided it’s mediated by a black box ML model).

Why is Copilot so bad?

Posted Jul 5, 2022 10:14 UTC (Tue) by SLi (subscriber, #53131) [Link]

So are you seriously telling me distributions like Debian do not find out regularly that they have been distributing something which is a copyright violation? Because if you are, you clearly just do not know.

There is no such thing as a truly massive corpus of code with a known license and guaranteed freedom from copyright issues. There just isn't. It's not an excuse.

Why is Copilot so bad?

Posted Jul 5, 2022 10:28 UTC (Tue) by amacater (subscriber, #790) [Link]

Debian and licenses - yes, licensing and copyright checking is one of the things that Debian maintainers do. If software is found whose licence has changed, it's removed. It's also one of the things that goes into Debian packaging checks, SPDX, reproducible builds ... there's a good-faith effort to do this for every Debian package. Jokingly, I refer to Debian licence "fascism" as one of the saving graces of Debian, because you _can_ be as sure as feasible that someone has checked.

This is not necessarily the case for other distributions - which may have other priorities, commercial pressures, or whatever - but that's their world. Disclaimer: I have been a Debian developer since about 1998 but don't currently package software, though I do keep note of the tools and processes that do.

Why is Copilot so bad?

Posted Jul 5, 2022 10:51 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

Irrelevant

The law does not require perfection, it deals with the real world.

The law requires good faith efforts, ie you do not get a free pass to appropriate stuff clearly labeled under someone else’s license, and you make efforts to fix things once you’re notified the labeling was in error.

Nothing more AND NOTHING LESS.

Why is Copilot so bad?

Posted Jul 5, 2022 10:58 UTC (Tue) by SLi (subscriber, #53131) [Link]

So are you saying that, if using models like Copilot is a copyright violation, the law would still not require you to stop using a model trained from Debian's source code once you have realized you trained it with unlicensed material? Because they did a good enough effort? Even if they could, at a significant cost, retrain it?

Why is Copilot so bad?

Posted Jul 5, 2022 11:58 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

It is definitely NOT a copyright violation if you apply the licensing terms of the code you are copying.

If you ignore those terms, it MAY be a copyright violation, depending on the extent and originality of the copying and depending on how much it is linked to overall program structure (ie the more accurate the model will be, the more likely it will be to infringe).

The instrument you use for this copying (Ctrl+C or fancy ML) is pretty much irrelevant in the eyes of the law. The law cares about effects (you killed your neighbor), not the instrument used (a knife like your distant ancestor's, a printed gun, a fancy sci-fi laser, or Harry Potter's magic wand). But tech people keep thinking they will fool a judge just by using an instrument never encountered before.

Also, the law deals with the real world, not absolutes, so infringing accidentally in good faith (the code was mislabelled) is not the same thing as deliberately ignoring the code license prominently displayed on the GitHub project landing page. In one case you are condemned to pay a symbolic dollar (provided you did due diligence to fix your mistake); in the other it can reach billions.

As for the “significant cost of retraining”, just try that in front of a judge and the peanut gallery; we all know those models are periodically retrained for lots of different reasons, and licensing mistakes in the data set are no less worthy a reason than any other.

Notwithstanding the fact that Microsoft is the operator of one of the world's biggest clouds, something the judge will find hard to ignore.

Why is Copilot so bad?

Posted Jul 5, 2022 12:20 UTC (Tue) by SLi (subscriber, #53131) [Link]

Ok, but that's my point exactly: There's not much hope for a free model in a world where you have to retrain it every time you discover it was tainted by freely available code which a human could read on the net but could not legally copy.

It may be, barely, possible for a large corporation like Google or Microsoft with their internal code bases, which tend to be better curated (but it will still be hard).

You do realize that training a model at the scale of Copilot costs a few million every time you do it?

Good luck getting funding for retraining the free model every time Debian finds a copyright violation. I could see public or donated funding for a single training, but not for that.

So, if the law is what you claim it is, we can possibly still have proprietary models, but it's quite unlikely to have significant models trained on free software.

I think your rhetoric about tech people trying to fool judges is a bit misplaced and incendiary. I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough. And it's not like this is some device designed purely to try to circumvent law.

Why is Copilot so bad?

Posted Jul 5, 2022 12:50 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

> Ok, but that's my point exactly: There's not much hope for a free model in a world where you have to retrain it every time you discover it was tainted by freely available code which a human could read on the net but could not legally copy.

First, computing power is dirt cheap, and what was prohibitively expensive yesterday is wasted on ad processing and crypto mining today.

Second, the law does not deal with absolutes; it deals with the real world and proportionality.

It does not require instantaneous systematic compliance. That would be pretty much impossible to achieve in the material world. It requires speedy, realistic compliance (as soon as you can, not as soon as it is convenient or cheap for you).

Periodic retraining would be fine, as long as you do not delay it unduly to avoid consequences. And you *will* retrain periodically, if only because computing languages keep evolving and you will need to make the model aware of new variants.

In the meanwhile, it is computationally cheap to filter the output to ignore suggestions found in code you've been informed is tainted.
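A sketch of what such output filtering could look like (the window size, whitespace tokenization, and snippets are invented assumptions; a real system would use a proper lexer and a much larger fingerprint store):

```python
# Hypothetical sketch of post-hoc output filtering: fingerprint
# overlapping token windows of known-tainted code, then suppress any
# model suggestion that shares a window with the blocklist.
import hashlib

WINDOW = 8  # tokens per fingerprint window (illustrative choice)

def fingerprints(code):
    """Hash every overlapping WINDOW-token slice of the code."""
    tokens = code.split()
    return {
        hashlib.sha256(" ".join(tokens[i:i + WINDOW]).encode()).hexdigest()
        for i in range(max(1, len(tokens) - WINDOW + 1))
    }

def build_blocklist(tainted_snippets):
    """Union of fingerprints from all code reported as tainted."""
    block = set()
    for snippet in tainted_snippets:
        block |= fingerprints(snippet)
    return block

def is_tainted(suggestion, blocklist):
    """True if the suggestion shares any token window with tainted code."""
    return bool(fingerprints(suggestion) & blocklist)

tainted = ["for i in range ( n ) : total += values [ i ] * weights [ i ]"]
blocklist = build_blocklist(tainted)
```

The check is set membership on hashes, so it runs in time proportional to the length of the suggestion, independent of model size.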

And if you are convinced the amount of tainted code will largely exceed your capacity to filter, and you proceed with your ML project anyway, it will be hard to take it as anything but willful copyright infringement.

And it is all terribly inconvenient, I know. The law is not about your individual convenience.

> I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough.

”Even for copyrightable platforms and software packages, the determination whether infringement has occurred must take into account doctrines like fair use that protect the legitimate interests of follow-on users to innovate. But the promise of some threshold copyright protection for […] elements of computer software generally is a critically important driver of research and investment by companies like amici and rescinding that promise would have sweeping and harmful effects throughout the software industry”

Gregory G. Garre, Counsel for Microsoft Corporation, BRIEF FOR AMICI CURIAE MICROSOFT CORPORATION […] IN SUPPORT OF APPELLANT

That’s what Microsoft thinks when the code in question is not produced by Joe Nobody on GitHub.

Why is Copilot so bad?

Posted Jul 5, 2022 14:16 UTC (Tue) by SLi (subscriber, #53131) [Link]

Computing power dirt cheap? You clearly haven't moved into the world of AI yet. Seriously, training those models costs millions in electricity and computer time alone, per training run.

In the future, it may become possible to train, for less, the models that people today train for millions, but even that is a bit speculative (I think the biggest advancements are likely to come from algorithmic development, though it's probably still possible to squeeze out some more computation per watt). And you still won't be able to train the better models of that time for dirt cheap.

Why is Copilot so bad?

Posted Jul 5, 2022 14:32 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

Then it was utterly foolish to spend those millions before writing the small amount of code necessary to check the legal metadata. Behaving foolishly is a general consequence of thinking rules apply to others, not to you.

Why is Copilot so bad?

Posted Jul 5, 2022 11:01 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

Also, the default state of something you find on the street or on the web is not "free to use"; it's protected. You are not allowed to steal the pile of furniture lying in the street during a relocation just because every single table is not tagged off-limits.

Why is Copilot so bad?

Posted Jul 5, 2022 12:15 UTC (Tue) by pabs (subscriber, #43278) [Link]

You are correct that Debian does have to remove code fairly regularly that was found to be non-free or even non-redistributable. Most instances are caught by maintainers before they enter Debian, but sometimes mistakes are made.

https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=ftp.deb...
https://snapshot.debian.org/removal/

Why is Copilot so bad?

Posted Jul 4, 2022 12:08 UTC (Mon) by nye (guest, #51576) [Link]

> Pretending everything is public domain is not just laziness, it’s *opinionated* laziness, that tries to blur the lines so everything not “protected” by bigcorp lawyers is free to pillage, and everything produced by this pillaging can be safely put out of bounds.

Nobody ever claimed that.

Why is Copilot so bad?

Posted Jul 2, 2022 14:23 UTC (Sat) by bluca (subscriber, #118303) [Link]

No, it's one where _anyone_ can do that, because building a model from public repos is not subject to copyright restrictions (and thus the license is irrelevant). That's a good thing, and it levels the playing field.

Why is Copilot so bad?

Posted Jul 2, 2022 15:59 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

No, it doesn't level the playing field.

Because the big capitalist proprietors still have more money and smart people to throw at the exercise than any other non-state actor.

Why is Copilot so bad?

Posted Jul 2, 2022 16:41 UTC (Sat) by bluca (subscriber, #118303) [Link]

Making it de facto impossible for anyone not in a gigantic corporation to build an AI model doesn't change any of that; it makes things strictly worse. Fortunately it is not illegal, at least in Europe.

Why is Copilot so bad?

Posted Jul 5, 2022 9:47 UTC (Tue) by farnz (subscriber, #17727) [Link]

Building the model isn't subject to copyright restriction (which I agree is right and proper - we don't place copyright restrictions on people picking up information from code they read), but using it might be, just as I might be infringing copyright if I accidentally type in a byte-for-byte identical copy of something I read during code review at a past job.

There's precedent for this in human creativity - former Beatle George Harrison lost a case for "subconscious plagiarism" (hat tip to rsidd) because he listened to a song several years before writing a song that happened to be almost exactly the same melody. No copyright restrictions applied to George Harrison listening to the song he later infringed copyright on, but they did come into play once he created a "new" work that happened to be too similar to an existing work he knew about.

The same could well apply to Copilot - creating the model is OK (human analogy is consuming media), holding the model itself is OK (human analogy is having a memory of past work), but using the output of the model is infringement if it's regurgitated copyrightable code from its input ("subconscious plagiarism" in the Harrison case).

Why is Copilot so bad?

Posted Jul 5, 2022 10:54 UTC (Tue) by SLi (subscriber, #53131) [Link]

Code tends to be much more functional than "pure arts" like music. I doubt what you describe is possible with code for a human (coming up with identical code might be, but in that case it's unlikely to be significant enough to be a copyright violation—and, you know, you are actually allowed to apply the generally useful things you have learned in previous jobs).

The copyright violation would have to be in the parts that remain to be filled in once you have copied the parts not protected by copyright—for example:

- the purpose ("what the code does", for example "reciprocal square root")
- how it does it, especially if it is the best or one of a limited number of good ways to do it (so, yes, perhaps counterintuitively the expression in a particularly clever code snippet might enjoy less protection)
- whatever is dictated by external factors (the magic numbers in the reciprocal square root code? There are probably other reasons why they are not protected, but they also aren't because they need to be exactly those numbers to work, as dictated by a mathematical law); this also applies to what the coding style dictates
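For concreteness, the "reciprocal square root" example presumably refers to the famous Quake III fast inverse square root trick; a rough Python re-creation (the `q_rsqrt` name and structure here are my own sketch, not any party's code) shows how the magic constant is dictated by the IEEE-754 float format and the mathematics, not by creative choice:

```python
import struct

def q_rsqrt(number):
    """Approximate 1/sqrt(number) via the classic bit-level trick."""
    # Reinterpret the float's bits as a 32-bit unsigned integer.
    i = struct.unpack('<I', struct.pack('<f', number))[0]
    # The magic constant is forced by the float layout and an
    # error-minimisation argument; any independent reimplementation
    # converges on essentially the same number.
    i = (0x5f3759df - (i >> 1)) & 0xFFFFFFFF
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson step sharpens the initial guess.
    return y * (1.5 - (number * 0.5 * y * y))
```

That convergence is the sense in which such elements are "dictated by external factors" and therefore enjoy little or no protection on their own.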

So, in practice, for a small enough snippet that such an accident is plausible, what might remain, and what must pass the originality threshold to attain copyright protection, is things like:

- variable names—but if they are "normal" and not very creative (using "i" for a loop counter or "number" for a number), it doesn't contribute a whole lot
- Stylistic things that do not come directly from coding style or the way things are commonly done. How you group your code. Perhaps the order of some lines of code, where you insert blank lines (in cases where it would be unlikely for two coders to do it the same way), etc.
- Comments. Short, purely descriptive technical snippets are probably unlikely on their own to meet the originality threshold, but if you copy enough similar technical prose, even in the form of multiple short comments that aren't individually original enough, I think this might be your best bet for a copyright violation.

The threshold for originality (in the US) is "low", but not nonexistent. Some things that have been deemed to not meet the threshold are (and remember that with code you need to meet it with what is left once you remove the substantial unprotected elements):

- Simple enough logos, even when there clearly is *some* creativity involved: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...
- Blank forms
- Typefaces
- This vodka bottle: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...

Why is Copilot so bad?

Posted Jul 5, 2022 11:08 UTC (Tue) by farnz (subscriber, #17727) [Link]

It's unlikely to happen with a human coding, simply because I'm not going to copy any copyright-significant decisions from a colleague - I may have a very similar snippet, but the details will change, because that's the nature of a human copying out code from memory. It's more likely to happen with Copilot, since it sometimes regurgitates complete snippets of its input, unchanged, and in a very literal manner.

This is why I suspect the legality of Copilot is currently a lot greyer than either side would like us to think; where it copies code that's not eligible for copyright protection, it may be obvious that it's copied something, but not an infringement because there's no protection to infringe (just as me copying #define U8_MAX ((u8)~0U) from the Linux kernel is not infringing, because there's nothing in there to protect). The risk, however, comes in when the snippet is something that's eligible for copyright protection; I note, for example, that Copilot sometimes outputs the comments that go with a code snippet from its input, which are more likely to be protected than the code itself.

My guess is that if it comes to court, the training process and model will be non-infringing definitionally, because the law says so in the first case, and because in the second case, it's not a reproduction of the copyrighted inputs. The output, however, will face the tests for whether it meets the bar for protection, and if it does and is a reproduction of someone else's work, then it could be deemed infringing; the fact that the model and its training process are not infringements does not guarantee that the output of Copilot is also non-infringing.

So on the GitHub side, the thing they're skating over is that the training process and the tool can be non-infringing without guaranteeing that the output is also non-infringing. On the SFC side, they're skating over the fact that a direct copy does not guarantee infringement, since not all code is eligible for protection. The truth all depends on what a judge says if such a case comes before them - and I'd expect to see that appealed to the highest legal authorities (Supreme Court in the USA).

Why is Copilot so bad?

Posted Jul 5, 2022 11:52 UTC (Tue) by SLi (subscriber, #53131) [Link]

Yeah, I'm not sure there are lots of people who both understand something about the law and would be willing to declare that it's clear cut either way. I definitely am not. My gut feeling is that it will be deemed legal, possibly with some minor changes or post-processing, but I wouldn't bet my life on it.

My more important point is that it *should* be legal, as a matter of sane policy; that is also the outcome that benefits free software, just like most pushback against copyright maximalism.

Why is Copilot so bad?

Posted Jul 5, 2022 15:26 UTC (Tue) by farnz (subscriber, #17727) [Link]

I disagree that it should be legal - taking that position to an absurd extreme, if I train an ML model on Linux kernel versions alone, I could have an ML model that's cost me a few million dollars but that outputs proprietary kernels that are Linux-compatible and work on the hardware I care about. Effectively, copyright becomes non-existent for big companies who can afford to do this.

My position therefore depends strongly on what the tool actually outputs; if the snippets are such that they are not protected by copyright in their own right, and the tool only outputs unprotected snippets, then I'm OK with it; this probably needs some filtering on the output of the tool to remove known infringing snippets, which I'm also fine with ensuring is legal (it should not be infringement to include content purely for the purpose of ensuring that that content is not output by the tool - fair use sort of argument).

I also very strongly believe that the model itself should not be copyright infringement in and of itself - it's the output that may or may not be infringing, depending on how you use it, and it's the user of the model who infringes if they use infringing output from the model. That may sound like splitting hairs, but it means that Copilot and similar systems are fine, legally speaking, as are any other models trained from publicly available data. It's only the use you put them to that needs care - you could end up infringing by using a tool that is capable of outputting protected material, and it's on the tool user to watch for that and not accept infringing outputs from their tools.

Why is Copilot so bad?

Posted Jul 5, 2022 17:37 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

> I also very strongly believe that the model itself should not be copyright infringement in and of itself - it's the output that may or may not be infringing, depending on how you use it, and it's the user of the model who infringes if they use infringing output from the model. That may sound like splitting hairs, but it means that Copilot and similar systems are fine, legally speaking, as are any other models trained from publicly available data.

I suspect that would very much depend on whether someone manages to find a business model where a model, trained on someone else's copyrighted production, makes a lot of money on its own (not via copycats of the original work). People and lawmakers tend to take a dim view of someone making a lot of money from other people's belongings without the owners getting a cut.

I doubt, for example, that the pharmaceutical companies will manage to escape forever paying back the countries whose fauna/flora they sampled to create medicines. The pressure will only grow with climate change and such natural products becoming harder to preserve.

Why is Copilot so bad?

Posted Jul 5, 2022 15:35 UTC (Tue) by nye (guest, #51576) [Link]

> My guess is that if it comes to court, the training process and model will be non-infringing definitionally, because the law says so in the first case, and because in the second case, it's not a reproduction of the copyrighted inputs. The output, however, will face the tests for whether it meets the bar for protection, and if it does and is a reproduction of someone else's work, then it could be deemed infringing; the fact that the model and its training process are not infringements does not guarantee that the output of Copilot is also non-infringing.

This seems eminently reasonable and appears (from the outside, of course) to be the same conclusion that Microsoft's lawyers have reached. So far as I'm aware they haven't made an explicit statement on the matter, but I think it's reasonable to infer the first part (training process and model) from the fact that they approved the release of the software, and the second part (the status of the output) from the fact that they recommend "IP scanning" of any output that you use.

At least in the EU the law is clear enough that it's hard to see how there could really be any other interpretation. I'm not sure whether we have the same laws regarding ML data collection here in Brexit Britain, or if that came too late.

Why is Copilot so bad?

Posted Jul 2, 2022 18:41 UTC (Sat) by Karellen (subscriber, #67644) [Link]

> From this perspective, it rubs me the wrong way when people seem to argue that copyright should restrict the creation and use of AI models.

It's not the creation of AI models that I have a problem with. It's the creation of non-Free software, which is based on AI models, which is based on Free Software, that I have a problem with.

If people use Free Software to train AI models, which they then use to do things other than create more code (e.g. to find bugs, and report them), I'm fine with that.

If people use Free Software to train AI models to create new code which is licensed under terms compatible with the licenses of the original code, as is required for other derivative works, I'm fine with that too.

An AI model isn't magic. Even if the people who wrote it and trained it don't entirely understand all the internal connections and weightings, it's still just a bit of software that takes a big pile of code as input, does a bunch of processing on it, and spits out some more code as output. Like a compiler, or a transpiler, or a linter. Requiring that the code output by a bit of software called "an AI model" has to follow the same rules as code output by any of those other software tools, in terms of respecting the licenses of its inputs, is not "expanding copyright madness to AI models"; it's just copyright.

Why is Copilot so bad?

Posted Jul 2, 2022 19:03 UTC (Sat) by SLi (subscriber, #53131) [Link]

Would you say that a human brain is in some sense certainly more magic than that?

But, I believe your understanding of "the rules" is wrong. Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would 1) either need a specific license to do so, or 2) produce outputs that are legally derived works or require a license of the input, even if the outputs are both complex and useful.

Why is Copilot so bad?

Posted Jul 2, 2022 22:57 UTC (Sat) by Wol (subscriber, #4433) [Link]

But that's what the word "derivative" means!

The output of Copilot is derived from its inputs. Therefore, by the definition of the word "derive", any and all output is a derivative of the input that was used to create it.

The only question is, to what extent does copyright either consider it a legal derivative work and hence subject to licence, or trivial and hence not subject to licence.

Any attempt to argue otherwise is basically playing Humpty Dumpty. The law does not define the word "derivative" as far as I know, so it means (approximately) what it means in common English. To argue that the output is not a derivative work is to argue that the English language is meaningless ...

(Oh, and while I don't know what the legal implications are, remember that the EU treats "works" and "data" separately. Saying that it's perfectly acceptable to treat works in public view as data fits nicely with the EU directive saying you can *train* an AI on public "works" by treating them as data. But if you then treat the output as a work, you are promptly putting it back under copyright rules ...)

Cheers,
Wol

Why is Copilot so bad?

Posted Jul 3, 2022 6:38 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

US law does indeed define "derivative work" in 17 USC 101 as follows:

> A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a “derivative work”.

Do not ask why they offer two completely different definitions back-to-back, because I have no idea. That's how it is in the statute book, and (presumably) how Congress wrote it.

Unfortunately, they do not define the word "work," and that's really the sticking point here. If I count the number of "E"s in a novel, and publish that number on a website, surely the number is not a derivative work of the novel, despite the fact that it has been "transformed" from the novel. A number is not a (creative) work, so it can't be a derivative work. But where do you draw the line? This definition does not tell us.

Why is Copilot so bad?

Posted Jul 4, 2022 9:06 UTC (Mon) by nim-nim (subscriber, #34454) [Link]

What you produce with the ML suggestions is definitely a work in the eye of the law.

You’re trying to nitpick by claiming that if you add a sufficient number of indirections and hair-splitting to the derivation steps, it's not (legally) a derivation.

But even if a judge agreed to follow this kind of reasoning (most would reject it out of hand; that is basically muddying the waters, and a judge's core job is to untangle what the parties present in order to reach a verdict), it also works the other way:

“if splitting steps till a suggestion is too small to be considered a derivation in law works, how many of those tidbits can you combine the other way till you reach the critical mass and the end result is definitely protected?”

You can’t exempt yourself from legal obligations via technical foolery. It does not work that way law-side.

Why is Copilot so bad?

Posted Jul 2, 2022 23:11 UTC (Sat) by Karellen (subscriber, #67644) [Link]

> Would you say that a human brain is in some sense certainly more magic than that?

I would say that the human brain is legally distinct from computer software, in a way that calling the computer software "AI" does not change.

> Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would 1) either need a specific license to do so

I thought I specifically pointed out that I do not think this, when I said that a program which takes code as input and produces bug reports as output, would not have issues. (Because "code" and "bug reports" are different things.) A tool that takes code as input and outputs lines-of-code metrics, or cyclomatic complexity metrics, would also not need a specific license to do so. (Even though some proprietary software licenses attempt to deny the rights to that sort of activity!)

> Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would [...] produce outputs that are legally derived works [...] of the input

Not in general, or necessarily, no. But in the case of Copilot, which takes code as inputs, processes it, and then outputs more code which is generated from the processing of those inputs, what possible definition of "derived work" could there be that excludes it? The outputs are a result of the inputs - change the inputs and you get different outputs. Take the "training" inputs out, and you get no output at all. And, the outputs are the same type of thing as the inputs - code. The output code is generated from - derived from - the input code.

Why is Copilot so bad?

Posted Jul 3, 2022 0:15 UTC (Sun) by SLi (subscriber, #53131) [Link]

I believe there are quite a few possible ways for it not to be a derivative work—in the legal sense (I don't really care about the common meaning of the word). For example, if the copying of expression, as opposed to ideas, from any single work protected by copyright is de minimis, then the new work is not a derivative work of the original. So, some amount of copying of expression can happen without copyright implications.

Another way is where essentially no copying of expression, only copying of ideas, is going on.

Why is Copilot so bad?

Posted Jul 3, 2022 22:41 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> For example, if the copying of expression, as opposed to ideas, from any single work protected by copyright is de minimis, then the new work is not a derivative work of the original work. So, some amount of copying of expression can happen without copyright implications.

Just to clarify for other commenters, this is a very complicated and jurisdiction-dependent legal analysis. As an example, in the 2nd Circuit of the US, they would do this: https://en.wikipedia.org/wiki/Abstraction-Filtration-Comp...

But in other jurisdictions, other tests will be used instead.

Why is Copilot so bad?

Posted Jul 2, 2022 23:15 UTC (Sat) by anselm (subscriber, #2796) [Link]

> Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would 1) either need a specific license to do so, or 2) produce outputs that are legally derived works or require a license of the input, even if the outputs are both complex and useful.

No. But if the device produces output that can be identified as a nontrivial part of a copyrighted work (e.g., a function definition), then the fact that it used an “AI model” does not mean it is somehow magically exempt from infringing on the copyright of that work.

In other words, if I produced that output myself by cutting and pasting the part in question from the original copyrighted work, I would obviously be infringing on its copyright. If Copilot produced the same output by passing the original copyrighted work through an AI model, why should that not be a copyright issue?

Why is Copilot so bad?

Posted Jul 4, 2022 12:06 UTC (Mon) by nye (guest, #51576) [Link]

> In other words, if I produced that output myself by cutting and pasting the part in question from the original copyrighted work, I would obviously be infringing on its copyright. If Copilot produced the same output by passing the original copyrighted work through an AI model, why should that not be a copyright issue?

Did anyone claim that it wouldn't?

Why is Copilot so bad?

Posted Jul 4, 2022 21:24 UTC (Mon) by anselm (subscriber, #2796) [Link]

Microsoft seems to think so. (They also claim that it doesn't happen very often, as if that was a valid excuse.)

Why is Copilot so bad?

Posted Jul 5, 2022 15:12 UTC (Tue) by nye (guest, #51576) [Link]

> Microsoft seems to think so

No they do not. They have not claimed that. They will not claim that. This straw man is *ridiculous* and seeing it repeated so often makes me want to scream.

The fundamental assertion that they're implicitly making by publishing copilot is that output from copilot is not automatically, ipso facto, an infringement of the license on its training data.

You[0] seem to be claiming that this further implies an assertion that the output from copilot is automatically, ipso facto, not an infringement of the license on its training data. Rather like claiming that "not all people are men" implies "all people are not men".

But not only are Github/MS not saying that, they are saying the opposite. In fact, what they *actually* say is this:

> You should take the same precautions as you would with any code you write that uses material you did not independently originate.
> These include rigorous testing, *IP scanning*, and checking for security vulnerabilities
(emphasis mine)

[0] In the plural sense. I imagine you *personally* have just been misled by the people making up this straw man, since it's so common.

Why is Copilot so bad?

Posted Jul 5, 2022 15:58 UTC (Tue) by anselm (subscriber, #2796) [Link]

> These include rigorous testing, *IP scanning*, and checking for security vulnerabilities (emphasis mine)

In other words, they want us to perform the due diligence that they're not prepared to do themselves. This does not detract from the fact that they're misleading Copilot users about the copyright status of the code that Copilot emits, so they're potentially violating licenses such as the GPL or BSD license which stipulate that code covered by them can only be passed on if the license grant is also passed on.

Why is Copilot so bad?

Posted Jul 4, 2022 22:42 UTC (Mon) by sfeam (subscriber, #2841) [Link]

Can text generated by random selection of sequential characters constitute a copyright violation? An unequivocal "yes" answer seems to sound a death-knell for clean-room implementations. An unequivocal "no" answer lets Copilot off the hook.

This starts to sound very close to the classic "infinite number of monkeys typing at random" scenario. Are the monkeys inevitably guilty of copyright violation?

The stereotypical madly-typing monkeys generate text strings where each character c is generated with uniform probability P(c) and accepted into the output text independent of that monkey's previous typing history. What if we bias P(c) to favor more readable text (give the monkeys Dvorak keyboards?). What if we filter acceptance by previous history (Markov filters? trained monkeys?). What if we house the monkeys in a black box and label it "Copilot"? What if we replace the monkeys with a neural net? Where in this process of refining the scenario does the possibility of copyright violation creep in, if anywhere?
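The progression described above, from uniform typing to a Markov-filtered "trained" monkey, can be sketched in code (a toy illustration; the function names are invented for this sketch):

```python
import random
from collections import defaultdict, Counter

def train_markov(text):
    """First-order model: for each character, count the characters
    that follow it in the training text."""
    model = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        model[a][b] += 1
    return model

def uniform_monkey(alphabet, length, rng):
    """The classic monkey: every character equally likely."""
    return "".join(rng.choice(alphabet) for _ in range(length))

def markov_monkey(model, length, seed, rng):
    """A 'trained' monkey: the next character is drawn with probability
    proportional to how often it followed the previous one in training."""
    out = seed
    for _ in range(length - 1):
        counts = model.get(out[-1])
        if not counts:  # no known successor; the monkey stops typing
            break
        chars, weights = zip(*counts.items())
        out += rng.choices(chars, weights=weights)[0]
    return out
```

As the bias is sharpened toward a particular corpus, the chance of emitting a verbatim, copyright-significant passage from it rises, which is precisely where the thread locates the legal question.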

Why is Copilot so bad?

Posted Jul 5, 2022 0:57 UTC (Tue) by anselm (subscriber, #2796) [Link]

> Where in this process of refining the scenario does the possibility of copyright violation creep in, if anywhere?

It doesn't really matter exactly how the monkeys came up with the copy. Your copyright problem starts where you take the monkeys' output, which is demonstrably identical to a preexisting copyrighted work, and pass it off as something you're entitled to dispose of as you please, because the original copyright holder's claim that – never mind those monkeys – you just ripped off their stuff will be difficult for you to refute. (In the case of Copilot, this is, if anything, more difficult, because you effectively showed the monkeys the original copyrighted work first, so their coming up with a verbatim copy eventually will surprise nobody.)

Why is Copilot so bad?

Posted Jul 5, 2022 22:19 UTC (Tue) by hummassa (subscriber, #307) [Link]

1. IANAL (but at another point in my life I was a paralegal in a DA's office and participated in legal research about prosecuting copyright violators)

2. the answer to
> Are the monkeys inevitably guilty of copyright violation?

is: the monkeys are never guilty (monkeys are not people, only people can be guilty)... but if you copy, distribute, or perform the work received from the monkeys in public, then you are.

3. clarifying the last part: if the monkeys (or the ML model) produce a copyrightable piece of a copyrighted work, then the monkeys are nothing more than another medium in which the copyrighted work is fixed. The monkeys, or the ML model, are just like an HD or a DVD-RW or a physical, printed book.

4. so: if a random number generator generates a number whose binary representation is identical to the image of a Blu-ray of "Avengers: Endgame", that does ABSOLUTELY NOTHING to the copyright status of the work. You can't burn it to a Blu-ray and play it in a public setting. You can't even copy it. Only Disney/Marvel and their licensees can.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 2:44 UTC (Sat) by developer122 (subscriber, #152928) [Link]

So... what are their other complaints, besides Copilot?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 2:54 UTC (Sat) by developer122 (subscriber, #152928) [Link]

Ah, OK. According to https://sfconservancy.org/GiveUpGitHub/ :

1) github has a services contract with ICE?

2) github is proprietary software

3) github discredits copyleft?

4) github is owned by microsoft

1) is the only one above that I really find motivating. 2) and 4) are pretty much "well duh, it's a modern website", while 3) might be plausible but smells of conspiracy theory. TBH, I default to permissive licenses wherever possible, so maybe I'm just not the target audience.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 7:11 UTC (Sat) by gdt (subscriber, #6284) [Link]

Coming from Australia -- a "fair dealing" country rather than a "fair use" country -- I can't see how Copilot is not making unauthorised reproductions of my work. Fair dealing is a 'black-letter law' list of uses of works for which you do not need a copyright license, and training an AI simply is not on that list. Without fair dealing, the only way to reproduce the code to train the AI is via a license, and if that is a license like the GPL then the AI output must meet the terms of the license.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 6, 2022 3:02 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

> Coming from Australia -- a "fair dealing" country rather than a "fair use" country -- I can't see how Copilot is not making unauthorised reproductions of my work.

Copilot isn't relying on "fair use" in countries where that applies either; it's relying on the fact that it isn't designed to reproduce the works it was trained on at all. Despite a few noteworthy corner-cases where Copilot did not work as intended, which are being addressed with filters, the goal of the system is extraction of common (i.e., not creative) elements and synthesis from many sources—not storage and retrieval of specific works.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 9:57 UTC (Sat) by ale2018 (subscriber, #128727) [Link]

Sadly, the IETF is just completing the transition to host standard development on Github. I don't think it's going to give it up any time soon...
