So you know how AI models will take GPL code and mix it with completely incompatible licences. In the past, this was done with code snippets, small functions, and various other independent parts of a code base. But as fully agentic programming becomes more and more popular, we're going to see more of this. This will not be the last case, but at least right now, I'm not aware of a situation as crazy as this one. I'm not a lawyer, and I'm pretty sure the project maintainer isn't a lawyer either, but today we're here to talk about a project called Chardet, a Python character encoding detector. Okay, sounds fair enough. Let's scroll down just a little bit. Chardet 7.0 is a ground-up, MIT-licensed rewrite of Chardet: same package name, same public API, a drop-in replacement for Chardet 5.x and 6.x, just much faster and more accurate. Python 3.10 plus, zero runtime dependencies, works on PyPy. Okay, all of that sounds fine. Now the bit of drama is with the old license. It's now MIT licensed, but it was originally an LGPL project, and the current maintainer is not the guy who started the project. In case it's unclear, re-licensing is usually not a thing you can do unless, one, you get permission from all the contributors, or two, you remove their code from the code base, which means a full rewrite: get rid of everything but your own code, and then you're good to go. And that is what is being claimed here: a full rewrite that doesn't have any of the old code. And you might have noticed the em-dash. So this was spotted by Morten Linderud of the Arch Linux project. Apparently Chardet got Claude to rewrite the entire code base from LGPL to MIT. That is one way to launder GPL code, I guess. And this isn't some "oh, we don't know if Claude's in the code base" situation. Claude is the second biggest contributor.
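For context on what the library actually does: a character encoding detector takes raw bytes and guesses how to decode them. Here's a minimal, stdlib-only sketch of the simplest layer of that job, byte order mark (BOM) sniffing. The dict shape loosely mirrors what `chardet.detect()` returns; the real library goes far beyond this with trained statistical models, and this sketch is purely illustrative, not chardet's code.

```python
# Minimal sketch of the easiest layer of charset detection: byte order
# mark (BOM) sniffing. The return shape loosely mirrors chardet.detect();
# real detectors layer statistical models on top of this.
import codecs

# Checked longest-first so a UTF-32 BOM isn't misread as UTF-16
# (BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8-SIG"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def detect_bom(data: bytes) -> dict:
    """Return a best guess based only on a leading BOM, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return {"encoding": name, "confidence": 1.0}
    return {"encoding": None, "confidence": 0.0}

print(detect_bom(codecs.BOM_UTF8 + b"hello"))
# -> {'encoding': 'UTF-8-SIG', 'confidence': 1.0}
```

The hard part, and the part the whole dispute is about, is everything after this: when there is no BOM, the detector has to fall back on statistical models of byte frequencies per encoding.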
So like, yeah, Claude is being used in this code base, and we'll get to that more in a bit. So this is the release for 7.0, and a lot of what is here is duplicated in the pull request talking about the change as well. And just so we're clear, there is no secret about Claude being used here. This isn't speculation: "I put this together using Claude Code with Opus 4.6 with the amazing Obra Superpowers plugin in less than a week. It took a fair amount of iteration to get it dialed in quite like I wanted, but it took a project I've been putting off for many years and made it take four days." Now, my main focus here isn't the technical merits of porting the code. We recently talked about the Ladybird browser going from C++ over to Rust. In that case, they didn't change the license; it's still the license it always was, and that's totally fine. There the main focus was: is this a good use of the tooling? Does it actually make it easy to migrate languages? In this case, though, I don't care if the code is good or bad. The main focus here is changing the license, effectively trying to launder the code from one license over to a more permissive license. Now, this can be done with a project, but it has to be done in a very careful way. When you're doing a rewrite of code that you don't own, it is very similar to doing reverse engineering: it is incredibly important that you are not deriving code from any of the existing incompatibly licensed code. Asahi Linux is a great example of this. This is why they want absolutely nothing to do with Apple engineers in the project. If you've worked for Apple, if you're an Apple contractor, you cannot touch the project, because Apple could use that as an excuse. And when you're the person doing this porting, it is difficult, especially if you're an active maintainer on the project, because this is code you've seen over and over again, and trying not to derive from code you've seen that much is very difficult.
But when we're talking about an AI, you can say, oh, don't use GPL code, don't use LGPL code. But we know for a fact that is not a guarantee that it's not going to do that. You don't know exactly what it was trained on unless you trained it yourself, which in this case was not done. And you don't know whether or not the code in your project is even in that training data. If it is, good luck trying to get it to be something entirely clean. We'll talk a bit more about this problem, but first I want to show you an issue created by a certain someone, Mark Pilgrim, the original creator of the project. You may remember me
Segment 2 (05:00 - 10:00)
from such classics as Dive Into Python and Universal Character Encoding Detector. "I am the original author of Chardet. First off, I would like to thank the current maintainers and everyone who has contributed to and improved this project over the years. Truly a free software success story. However, it has been brought to my attention that in the release of 7.0.0 the maintainers claim to have the right to re-license the project. They have no such right. Doing so is an explicit violation of the LGPL. Licensed code, when modified, must be released under the same LGPL license. Their claim that it is a complete rewrite is irrelevant, since they had ample exposure to the originally licensed code, i.e. this is not a clean room implementation. Adding a fancy code generator into the mix does not somehow grant them any additional rights. I respectfully insist that they revert the project to its original license." This right here is the main crux of the argument: is an AI capable of doing a clean room implementation? That is still a legal question to be answered. As there was quite a lot of discussion in this thread, a couple hundred comments, the current maintainer did respond. "Hi Mark, and thank you for reaching out. I've been the primary maintainer and contributor to this project for over 12 years, and I don't take that stewardship lightly. I want to address your concerns directly. The core claim: that this is not a clean room implementation and therefore must remain LGPL. I understand the concern and I think it's worth engaging with seriously. You're right that I have had extensive exposure to the original code base; I've been maintaining it for over a decade. A traditional clean room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here." And I would reasonably argue that is basically the end of the discussion, but he does keep going.
"However, the purpose of clean room methodology is to ensure the resulting code is not a derivative work of the original. It is a means to an end, not the end itself. In this case, I can demonstrate the end result is the same." Now, what is your bet? What is your bet here? What is... I don't know for sure, but I have a feeling. I can demonstrate the end result is the same. "The new code is structurally independent of the old code through direct measurement rather than process guarantees alone. I ran JPlag across every major release in this repository. JPlag parses Python source code into syntactic tokens (function definitions, assignments, control flow, etc.), discarding all variable names, comments, whitespace, and formatting, then finds maximal matching subsequences using greedy string tiling; renaming variables or reformatting code does not change the score." And as you'll notice, the similarity goes down and down over the years. It is worth noting, however, that this rewrite also involved removing 540,000 lines of code. So... I would expect it to be very different, because most of the code isn't there. "I want to draw your attention to a critical distinction in this data. Version 6.0.0 had major changes to how single-byte charset detectors were trained and stored, and that version shows only 3.3% average similarity to 5.2.0, but its max similarity of 80% tells you files were carried forward from the prior release. It was still clearly part of the same lineage, still a derivative work, and still rightfully LGPL. Version 7.0.0 is qualitatively different. Its max similarity against every prior version is under 1.3%. No file in the 7.0.0 code base structurally resembles any file from any prior release. The matched tokens are common Python patterns that appear in any project: argparse boilerplate, dict literals, import blocks. This is not a case of 'rewrote most of it, carried some files forward'; nothing was carried forward."
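To make the measurement concrete, here is a toy sketch of the token-based similarity idea JPlag is built on. This is not JPlag itself, and `token_types`, `similarity`, and the `min_run` threshold are all made up for illustration: source is reduced to a sequence of token *types* (so names, comments, and formatting vanish), and similarity is the fraction of one sequence covered by runs shared with the other, a crude stand-in for greedy string tiling.

```python
# Toy sketch of token-based structural similarity, in the spirit of
# JPlag's approach (NOT JPlag itself). Source is normalized to a
# sequence of token TYPES, so variable names, comments, whitespace
# and formatting are all discarded before comparison.
import io
import tokenize

def token_types(source: str) -> list[int]:
    """Reduce Python source to a sequence of token type codes."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.type for t in toks if t.type not in skip]

def similarity(src_a: str, src_b: str, min_run: int = 4) -> float:
    """Fraction of src_a's token windows that also occur in src_b
    (a crude stand-in for greedy string tiling)."""
    a, b = token_types(src_a), token_types(src_b)
    if not a or not b:
        return 0.0
    # Index every run of min_run token types in b for quick lookup.
    grams = {tuple(b[i:i + min_run]) for i in range(len(b) - min_run + 1)}
    covered = sum(1 for i in range(len(a) - min_run + 1)
                  if tuple(a[i:i + min_run]) in grams)
    return covered / max(len(a) - min_run + 1, 1)

# Renaming every identifier does not change the token-type sequence,
# so these two functions score as structurally identical:
f1 = "def add(a, b):\n    return a + b\n"
f2 = "def plus(x, y):\n    return x + y\n"
print(similarity(f1, f2))  # -> 1.0
```

That renaming-invariance is exactly why the maintainer's sub-1.3% numbers are an interesting data point; whether a structural measurement like this settles the *legal* question of derivation is another matter entirely.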
Whilst that might be your argument, and that is a fair argument to make, are you certain? Are you certain that nothing at all was carried forward, no implementation details? This is the concern you have when doing any sort of rewrite like this. If anything is left over, that's still LGPL code. And that link he has there is just deleting all of the old files, but that doesn't mean that none of the implementation details were carried forward. All it means is the old files they were in have now been deleted. Which, if anything, kind of just makes it look like you are trying to
Segment 3 (10:00 - 15:00)
hide any similarities that were there, making it harder to find that older code and actually do that comparison. I'm not saying that is what he is doing, but it is now much harder for a third-party auditor to actually go and do that audit. "On the LGPL scope: I'm not a lawyer, but my understanding of the LGPL is that its copyleft provisions apply to derivative works, code that is derived from the licensed code, and do not extend to independent implementations of the same idea." This is true, and the argument is that this is a derivative of the code. "Character encoding detection via BOMs, statistical modeling, and candidate elimination are well-established techniques described in publicly available research, predating both your Chardet and this Chardet. My understanding is that these ideas are not copyrightable expression, and that re-implementing them independently does not create a derivative work." Yes, that's true, but the specific implementation would be copyrightable. That's the distinction there. He's arguing that what he has done is not a derivative of the existing work, but is instead a derivative of the publicly available research. I don't know whether that argument would hold up, but that's the argument he's making. "I'd also note that if prior exposure alone were enough to disqualify a rewrite, it would be very difficult for any maintainer of an LGPL project to ever write a new implementation of the same functionality under a different license, regardless of how different the resulting code is. I don't believe that's what the LGPL requires, but I'm open to hearing a different interpretation. The core question, as I see it, is whether the new work is derived from the old work, and the evidence above suggests that it is not." Again, the evidence that he has provided himself. It's basically: I did an audit on myself, and I proved that I'm not guilty. "For full transparency, here's how the rewrite was conducted."
"I used the Superpowers brainstorming skill to create a design document specifying the architecture and approach I wanted, based on the following requirements I had for the rewrite: public API compatibility" — which, as we've seen a million times, is totally fine — "should still be called Chardet, as the plan is to replace Chardet; not based on any GPL or LGPL code" — because, as we know, telling an AI not to base things on GPL or LGPL code always works, and it never just pretends like it doesn't know what you're talking about; that's never happened before — "high charset detection accuracy on test data; language detection, not a hard requirement, but if it's easy or a byproduct of other design, do it; fast and memory efficient; use multiple cores efficiently; no runtime dependencies; must work in PyPy and CPython; clean and modern design; if using trained statistical models, use data available via the Hugging Face load_dataset API; any training code should cache data locally so we can retrain often during the development process; benchmark often; and do not use tons of large dict literals, which choke CPython 3.12, where it takes forever to import such things." This is part of the reason why I think a lot of this is AI generated. When he writes himself, you get lines like this: "I understand this is a new and uncomfortable area, and that using AI tools in the rewrite of a long-standing open source project raises legitimate questions. But the evidence here is clear. 7.0 is an independent work, not a derivative of the LGPL-licensed code base. The MIT license applies to it legitimately. I'm happy to discuss this further, and I genuinely appreciate you raising concerns." Now, when he said that this is something he wanted to do for many years, he actually does mean years. There is an issue from going on 12 years ago about changing the license.
At the time there was quite a bit of pushback about doing so, because obviously that is not something you just go and do to a project when you don't own the code, but it is something he has wanted to do. Now, he's been not only the maintainer but also basically the sole contributor for a very long time. So a lot of that code has changed, a lot of things had already been removed from the code base, and a lot of what may have been left from early developers like Mark may have just been non-copyrightable work: boilerplate and basic things that are not copyrightable whatsoever. But this takes us into an interesting question about AI and code and copyright, because it's not at all clear what's going to happen worldwide. Specifically in the US, they are leaning towards AI work probably not being copyrightable. Now, this is specifically regarding AI art, and maybe when it comes to code, things are going to go differently. But also, AI-generated patents are being rejected. So it is highly likely
Segment 4 (15:00 - 19:00)
that the same thing is going to happen with code as well. And then we're left to answer questions like: if it is not copyrightable, what do you do with the AI-generated work? Is it owned by the person who goes and makes a modification to it? Is it in the public domain, which is itself very loosely defined, especially outside the United States? A lot of countries don't actually have a definition of the public domain at all. And a lot of these questions get very, very complicated, and you basically start needing lawyers to answer seemingly simple questions like: if I use AI in my corporate project, what happens there? Who owns that? Is it public domain? What happens here? But also, if the new AI-generated work is not copyrightable, is there a copyright violation here? Or is this new AI-generated work just considered to be a derivative of the original work, and therefore licensed under the original license, in this case the LGPL? But if that is not the case, and the new work is not copyrightable, where does that actually put the license? If you have public domain code and just stick an MIT license on it, that is not now MIT-licensed code. You have an MIT-licensed version of it, but that code is still public domain. So, assuming this is not a copyright violation, does this license even apply? Or is the new rewrite just in the public domain? These licenses we use for code are built around human work, and built around the fact that copyright does exist. So if there is no copyright, are you able to apply any additional restrictions to that code? I don't know the answer to that, but I would guess that if it's in the public domain, which seems to be where AI work is going, the answer is probably no.
But also important to this case: if you take something that was LGPL and then use AI to effectively launder it into the public domain, presumably that would be a legal problem, because otherwise you could take any project, do a rewrite of it with AI, put it in the public domain, and then just do anything you want with it. But I am certainly not qualified to answer that question, and most of you probably aren't either. However, if you think you are, feel free to let me know down below why your specific interpretation of the law is the correct one. Anyway, this rewrite also didn't come without bugs, so there's also that. So we have an LGPL-to-MIT rewrite by AI to launder the code from one license to another, and it's also kind of buggy. So, yeah. Um, here's the thing though, right? We can talk about license violations and whether or not something can and can't be done, but until things like this are actually tested in court, there's nothing that can really be done about it, right? Licenses don't actually exist on their own; legal teams exist that enforce license terms. Amongst regular people, licenses only exist as a common courtesy, because most people don't have the money or the time to go and test them. And that is why organizations like the EFF and various other places exist. I'm not saying this specific case should be tested in court, but this is going to have to be tested in court with various projects in the future, because otherwise it's going to be done many, many more times after this. And again, I don't know exactly how it would go. Anyway, let me know your thoughts down below. I just realized I forgot to change this for the video. That's fine. It is what it is. Um, yeah. Let me know your thoughts down below. What do you think? What do you think is the right path here? Do you think something wrong was done? Do you think this is a totally reasonable use of AI? Do you think it violated the license? I would love to know.
So if you really liked the video and you want to become one of these amazing people over here, check out the Patreon, SubscribeStar, Liberapay, linked in the description down below. That's going to be it for me. And this might be an L for the GPL. I'm sorry, that was bad.