Data poisoning tool eyed to prevent AI copyright infringement


From Coinbase:

Amid the heated row between artificial intelligence (AI) companies and creators over alleged copyright breaches, researchers are developing a new tool aimed at protecting digital artists’ intellectual property (IP) rights.

The tool, dubbed Nightshade, is designed to poison the data sets used to train generative AI models, causing them to malfunction. Per an MIT Technology Review report, Nightshade works by tweaking the pixels of digital art in a way that is invisible to the naked eye but changes how trained generative AI models interpret the image.
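The article does not describe Nightshade’s actual algorithm, but the general idea of an imperceptible perturbation can be sketched roughly as follows. This is a minimal illustration only; the `compute_poison_gradient` callable is a hypothetical stand-in for whatever signal pushes an image’s learned features toward a different concept.

```python
import numpy as np

def poison_image(image, compute_poison_gradient, epsilon=4 / 255,
                 steps=50, step_size=1 / 255):
    """Add a small, visually imperceptible perturbation to an image.

    `image` is a float array with values in [0, 1]. `compute_poison_gradient`
    is a hypothetical callable (a stand-in, not part of Nightshade) that
    returns a gradient pushing the image's learned features toward a
    different concept, e.g. from "dog" toward "cat".
    """
    original = image.copy()
    poisoned = image.copy()
    for _ in range(steps):
        grad = compute_poison_gradient(poisoned)         # assumed external signal
        poisoned = poisoned + step_size * np.sign(grad)  # small signed step
        # Clamp to an L-infinity budget so no pixel drifts more than
        # epsilon from its original value; this keeps the edit invisible.
        poisoned = np.clip(poisoned, original - epsilon, original + epsilon)
        poisoned = np.clip(poisoned, 0.0, 1.0)           # remain a valid image
    return poisoned
```

The key point is the epsilon clamp: the poisoned image never differs from the original by more than a few intensity levels per pixel, which is why the change is invisible to the naked eye yet can still skew what a model learns from it.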

Early test results from the researchers at the University of Chicago have shown significant promise in contaminating machine learning data sets. For example, prompting a “poisoned” model with the word “dogs” generates images of cats, compounding the inaccuracies of affected models.

“The poisoned data is very difficult to remove, as it requires tech companies to painstakingly find and delete each corrupted sample,” read the review.

Ben Zhao, a leading researcher on the project, hinted at plans to make Nightshade open source so that other teams can create their own versions of the tool. Zhao says the end goal is not to stunt the development of AI but to “tip the power balance” between AI developers and artists “by creating a powerful deterrent” against the violation of copyrights.

The researchers also developed Glaze, a tool designed to label the style of digital art differently from the original in order to mislead AI models. Zhao revealed that Nightshade will be integrated into Glaze to give artists stronger copyright control over their creations.

Despite the promise shown by the tools, attempts at poisoning large-scale generative AI models will require “thousands of poisoned samples,” with experts predicting AI developers will begin working on defenses.

“We don’t yet know of robust defenses against these attacks. We haven’t yet seen poisoning attacks on modern [machine learning] models in the wild, but it could be just a matter of time,” said Vitaly Shmatikov, a professor at Cornell University. “The time to work on defenses is now.”

There are fears that bad actors could leverage the tools to carry out malicious attacks against machine learning models. Still, the researchers say bad actors would need a boatload of poisoned samples to inflict real damage.

Link to the rest at Coinbase

PG says this is an unwise overreaction to new technology by entrenched incumbents who benefit from the status quo. AI is an extraordinarily important technology and stunting its growth is a bad idea.

As PG has said before, from his understanding of how the creators of an AI program build their program’s corpus of image elements, copies of the originals are not created.

Instead, the originals are basically chopped up and broken down into a stew of words, image elements, etc., from which the AI program constructs a group of words or images in response to a user’s inputs, a group that is not a duplicate of any of the works that were fed into the digital meat grinder in the first place.
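As a deliberately oversimplified analogy for that point (a toy sketch, not how any real generative model is built), the example below reduces a text corpus to adjacent-word counts: the “model” that comes out is aggregate statistics, and the original sentences are not stored in it.

```python
from collections import Counter

def build_bigram_counts(corpus):
    """Toy 'training': reduce texts to adjacent-word counts, keeping no copies."""
    counts = Counter()
    for text in corpus:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))  # only word-pair statistics survive
    return counts

# The resulting "model" is a bag of counts; the original sentences are not
# stored in it and, in general, cannot be reproduced verbatim from it.
model = build_bigram_counts([
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps in the sun",
])
print(model[("the", "lazy")])  # prints 2
```

Real generative models learn far richer statistics than word pairs, but the structural point PG is making is the same: what is retained is derived from the originals rather than a stored copy of them.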

If the AI program is not able to create a copy (or multiple copies) of an original work regardless of user prompts, where’s the copy of prior works? Where are the damages to the creator of the original work? How does the creator prove damages?

4 thoughts on “Data poisoning tool eyed to prevent AI copyright infringement”

  1. “The poisoned data is very difficult to remove, as it requires tech companies to painstakingly find and delete each corrupted sample,” read the review.

    Sounds like a job for AI.

    • Figuring out how to “poison” the data so that it corrupts the AI is not a trivial task.

      The result will almost certainly create a signature in the data, which will allow it to be detected (and probably corrected) automatically; a rough sketch of that kind of automated screening appears after the comments.

      I do predict that, if this starts being used, it will generate a great deal of money for some lawyers. This is rather akin to the prohibition against deliberately setting “man traps” on your property. If the AI product of a big tech company is damaged by this malicious “AI trap,” you can bet that they will bring a massive lawsuit against the perpetrator, if they can be identified (which they surely can be, if they are trying to make money from their altered creation).

  2. Regardless of the legality or ethics of training LLMs and other forms of Automated Data Analysis software, this approach, even if widely adopted and effective (two big ifs), will achieve nothing…
    …other than *help* the existing players by protecting them from latecomers: they already trained their models with more than enough data; their challenge now is controlling outputs.

    Said players being Microsoft, Google, Facebook, IBM, and, if they survive, OpenAI.
    Late challengers? Amazon, Apple, Oracle, X AI (Musk) and a dozen startups in Europe.

    This is not a game for two guys in a garage; it takes massive datacenters and big money. (Amazon coughed up $4B just to get a foot in.) Making it harder for challengers to get in only gives more power to the first movers. And while being first is no guarantee of survival, locking out would-be challengers will be a big help.

    So yes, limit competition to just the biggest tech companies. See how that works out for you.

  3. I remain amazed at how many otherwise reasonable people don’t even want to hear that training on a dataset might be legal or not theft, let alone any details about why. Including my own sister. Who’s decided I’m closed-minded and unwilling to hear reason simply because I said there’s a difference between training and output, and that how you implement it determines whether it’s infringing.
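
Picking up on the earlier comment about poisoned samples leaving a detectable signature: here is a minimal sketch, assuming a hypothetical feature extractor has already been run over the training images, of the kind of automated outlier screening that comment predicts. It illustrates the general idea only; it is not a known or proven defense against Nightshade.

```python
import numpy as np

def flag_outliers(features, z_threshold=3.0):
    """Flag samples whose feature vectors sit unusually far from the dataset's center.

    `features` is an (n_samples, n_dims) array produced by a hypothetical
    feature extractor run over the training images. Samples whose distance
    from the mean exceeds `z_threshold` standard deviations are flagged
    for manual review or removal.
    """
    center = features.mean(axis=0)
    distances = np.linalg.norm(features - center, axis=1)
    z_scores = (distances - distances.mean()) / distances.std()
    return np.where(z_scores > z_threshold)[0]  # indices of suspect samples
```

Whether real poisoned samples would stand out this cleanly is exactly the open question the researchers and Shmatikov raise above.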
