Amazon's CodeWhisperer

source link: https://lwn.net/Articles/900045/

[Posted July 5, 2022 by corbet]
There has been a fair amount of concern recently about Microsoft's Copilot system, which many see as possibly putting its users in violation of free-software licenses. But, naturally, Copilot is not the only offering of this type; Amazon has put out a preview version of "CodeWhisperer", which is also a machine-learning-based coding tool that was trained on (unspecified) open-source code. From the FAQ:
CodeWhisperer’s reference tracker detects whether a code recommendation may be similar to particular CodeWhisperer training data, and can provide those references to you. This allows you to easily find and review that reference code and how it is used in the context of another project.

Amazon's CodeWhisperer

Posted Jul 5, 2022 14:45 UTC (Tue) by NightMonkey (subscriber, #23051) [Link]

Howdy. So, convoluted question from a non-lawyer about these code Mad Lib(TM) engines and software patents:

So, let's say that a developer uses GitHub's Copilot or Amazon's CodeWhisperer or other similar code Mad Lib tools. They love the MIT or Apache-licensed code (maybe even some GPL2?) that they see and use lots of it. Six months later, a court finds that the 'training data' code is patented and is, therefore, no longer Free. What then for the developer? How are they alerted to this problem? Or is it only a problem for the services, not the developer? Cheers.

Amazon's CodeWhisperer

Posted Jul 5, 2022 16:16 UTC (Tue) by dskoll (subscriber, #1630) [Link]

Also not a lawyer, but I know a little about patents from a previous job. A patent is different from copyright. To infringe copyright, you have to distribute a work contrary to the terms of its license, or derive a work from a copyrighted work and distribute it contrary to the original work's license.

For a patent, the only thing that matters is what you do, not how you got there. So for example, when the LZW compression algorithm was patented, it wouldn't matter if you copied a reference implementation, created a brand-new implementation on your own, or used a Copilot-derived implementation... you'd still be infringing the patent.

If you do infringe on a patent, it's sometimes better not to know, because willful infringement carries a much higher penalty than inadvertent infringement.

I doubt Amazon or MSFT would be responsible for notifying users of their AI code-generating software about potential patent infringement... that risk lies entirely with the users.

Amazon's CodeWhisperer

Posted Jul 5, 2022 17:22 UTC (Tue) by Wol (subscriber, #4433) [Link]

CodeWhisperer looks better than Copilot here - as I read it (I could be wrong) it does not present you with a suggested *completion* of what you're doing, it presents you with *examples* of what seem to be the same thing - WITH PROVENANCE.

So when CodeWhisperer makes a suggestion, it looks like it tells you where it got it from, and you have the information you need to do due diligence.

It seems Copilot doesn't bother ...

Cheers,
Wol

Amazon's CodeWhisperer

Posted Jul 5, 2022 16:21 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

It’s a problem for the services if they commingle code with different legal terms and do not present you the terms attached to whatever they suggest.

That service aspect apart, it changes nothing for you as a consumer or publisher of code. The service can be sued as accessory to copyright infringement, but the infringement is still yours (unless the service promises legal insurance as part of its terms of use).

As a consumer, you’re still supposed to perform legal due diligence on the third party code you integrate.

As a publisher, you’re supposed to make sure your legal terms are clearly written and clearly notified.

Copyright is still the same dangerous hairball as when AT&T published Unix (Lions book and all) and everyone involved ended up in court due to general carelessness.

Amazon's CodeWhisperer

Posted Jul 5, 2022 18:30 UTC (Tue) by nickodell (subscriber, #125165) [Link]

The patent issue is clearer than the copyright issue.

For a patent, if you invent the same thing as a previous patent, then you're infringing on that patent. It doesn't matter if you invented it independently. (However, the penalties for willful infringement are higher.)

For copyright, if you come up with the same idea, the way you came up with it matters. One interpretation is that language models are doing some form of reasoning, so a similar work appearing in the training data isn't necessarily proof that the language model is copying that previous work. Another interpretation is that a language model is just copying part of its input and changing a few things.

There are awkward effects for both possible interpretations. If you accept the first interpretation, then how do you measure whether a model is doing "enough" reasoning? If you accept the second interpretation, that implies that the output of e.g. GPT-3 is jointly owned by every person who's written anything on the internet. Practically speaking, it would become illegal to train an AI on common crawl data.

I don't think any court has ruled on it either way.

Amazon's CodeWhisperer

Posted Jul 5, 2022 18:46 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

One thing that people have problems understanding is that copyright is not just about literal copying.

You can take all the words in a text, and arrange them in sentences meaning something else, and the result will be non-infringing.

You can take the same text, and replace every single word with a synonym, and the result will be definitely infringing. None of the words survived but the structure is still the same.

That makes models that analyze the structure of the code being written, and suggest bits to make it closer to someone else’s structure, especially problematic.
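To make the "synonym" point concrete for code, here is a minimal Python sketch. It is only an illustration: it is not a legal test for infringement and not how Copilot or CodeWhisperer work. The helper names (`structure`, `same_structure`) and the sample snippets are made up for this example. It renames every identifier to a placeholder and compares the resulting syntax trees, so a snippet with every name swapped still compares equal, while a snippet reusing the same names in a different arrangement does not.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace every identifier with a placeholder based on first appearance."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        # The first distinct name becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)
        return node


def structure(source):
    """Return a string describing the syntax tree with names erased."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)


def same_structure(a, b):
    return structure(a) == structure(b)


# Hypothetical sample snippets, invented for this illustration.
ORIGINAL = """
def total(items):
    acc = 0
    for x in items:
        acc = acc + x
    return acc
"""

# Every "word" replaced by a synonym; the shape is untouched.
RENAMED = """
def sum_all(values):
    result = 0
    for v in values:
        result = result + v
    return result
"""

# Same words as ORIGINAL, arranged to mean something else.
REARRANGED = """
def total(items):
    acc = 0
    for x in items:
        acc = x
    return acc + acc
"""

if __name__ == "__main__":
    print(same_structure(ORIGINAL, RENAMED))
    print(same_structure(ORIGINAL, REARRANGED))
```

A plain text diff of the first two snippets would flag nearly every line as changed, which is why surface-level comparison says little about whether the structure was carried over.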

Amazon's CodeWhisperer

Posted Jul 5, 2022 21:11 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link]

The classic example with writing is that you can change the medium or genre of a work and it can still be a derivative. All those comic book movies are still derivatives of the original comics, even if they don't directly swipe story lines. Similarly, The Magnificent Seven is still a derivative of Seven Samurai even though the setting, character names, and even the language all changed.

That said, the functional nature of code makes it a more difficult case than something purely expressive like fiction or poetry. If there are few enough ways of achieving the same purpose efficiently, it's possible to argue the code is determined purely by functional constraints and therefore isn't expressive. This is especially true if the code is implementing a published algorithm, like quicksort or the sieve of Eratosthenes.

Amazon's CodeWhisperer

Posted Jul 6, 2022 5:48 UTC (Wed) by nim-nim (subscriber, #34454) [Link]

However, the model will make sure that, out of the few ways to achieve a purpose, you select one that others already chose. It is hard to see how that can square with copyright if you intend to ignore the original licensing.

Amazon's CodeWhisperer

Posted Jul 6, 2022 7:18 UTC (Wed) by LtWorf (subscriber, #124958) [Link]

Hollywood constantly rips off old sci-fi stories passing them off as original. I guess since most of the original authors are dead they don't fear repercussions.
