• Sibbo@sopuli.xyz

    How can the training data be sensitive if no one ever agreed to give their sensitive data to OpenAI?

    • TWeaK@lemm.ee

      Exactly this. And how can an AI that “doesn’t have the source material” in its database recall such information?

      • ∟⊔⊤∦∣≶@lemmy.nz

        “Model” is the right term, not “database”.

        We learned something about how LLMs work with this… it’s like a bunch of paintings were chopped up into pixels to use to make other paintings. No one knew it was possible to break the model and have it spit out the pixels of a single painting in order.

        I wonder if diffusion models have some other weird quirks we have yet to discover.

        • Jamie@jamie.moe

          I’m not an expert, but I would say that it is going to be less likely for a diffusion model to spit out training data in a completely intact way. The ways that LLMs and diffusion models work are very different.

          LLMs work by predicting the next statistically likely token: they take all of the previous text, then predict what the next token will be based on that. So, if you can trick the model into a state where the most likely subsequent tokens are something verbatim from the training data, then that’s what you get.
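
          A toy sketch of that loop in Python (a bigram lookup table standing in for a real transformer; purely illustrative, not how GPT is actually implemented). Once the context lands on a stretch of text the model has effectively memorized, the most likely continuation is that stretch, verbatim:

          ```python
          import random

          corpus = "the cat sat on the mat and the cat slept on the mat".split()

          # "Training": record which token follows each token in the data.
          bigram = {}
          for prev, nxt in zip(corpus, corpus[1:]):
              bigram.setdefault(prev, []).append(nxt)

          def next_token(context):
              # The model only sees prior tokens and picks a statistically
              # likely next one; here the "context window" is a single token.
              return random.choice(bigram.get(context[-1], corpus))

          tokens = ["the"]
          for _ in range(12):
              tokens.append(next_token(tokens))
          print(" ".join(tokens))  # wanders through near-verbatim training text
          ```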

          Diffusion models work by taking a randomly generated latent, combining it with the CLIP interpretation of the user’s prompt, and then trying to turn that random information into a new latent, which the VAE then decodes into something a human can see, because the latents the model is dealing with are meaningless numbers to humans.
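
          As a concrete sketch of that pipeline, here’s roughly what it looks like with the open-source diffusers library (my stand-in for illustration; the model name is just an example, and this is not whatever OpenAI runs internally). Fixing the seed pins down the “randomly generated latent”, so the same seed and prompt reproduce the same image:

          ```python
          import torch
          from diffusers import StableDiffusionPipeline

          # The pipeline bundles the CLIP text encoder, the denoising model,
          # and the VAE decoder that turns the final latent into pixels.
          pipe = StableDiffusionPipeline.from_pretrained(
              "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
          ).to("cuda")

          # The generator seed fixes the random starting latent.
          gen = torch.Generator("cuda").manual_seed(42)
          image = pipe("a watercolor fox in the snow", generator=gen).images[0]
          image.save("fox.png")
          ```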

          In other words, there’s a lot more randomness to deal with in a diffusion model. You could probably get a specific source image back if you specially crafted a latent and a prompt, which one person actually did by basically running img2img on a specific image that was in the training set and giving it a prompt to spit the same image out again. But that required having the original image in the first place, so it’s not really a weakness in the way this was for GPT.
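
          That img2img trick looks roughly like this (again using diffusers as a stand-in, with a hypothetical training_image.png; a low strength keeps the output close to the input, which is exactly why you need the original image to begin with):

          ```python
          import torch
          from PIL import Image
          from diffusers import StableDiffusionImg2ImgPipeline

          pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
              "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
          ).to("cuda")

          init = Image.open("training_image.png").convert("RGB")
          # strength controls how much noise is added: low strength means the
          # output stays close to the input, so the "recovered" image is
          # mostly the image you already had.
          out = pipe(prompt="the original image's caption",
                     image=init, strength=0.3).images[0]
          out.save("reconstructed.png")
          ```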

          • TWeaK@lemm.ee

            But the fact is the LLM was able to spit out the training data. This means that anything in the training data isn’t just copied into the training dataset, allegedly under fair use as research, but is also copied into the LLM as part of an active commercial product. Sure, the LLM might break it down and store the components separately, but if the LLM can reassemble it and spit out the original copyrighted work, then how is that different from a photocopier breaking down the image scanned from a piece of paper and then reassembling it into instructions for its printer?

            • lad@programming.dev

              It’s not copied as-is; the thing is a bit more complicated, as was already pointed out.

        • SkySyrup@sh.itjust.works

          The compression technology a diffusion model would have to achieve to realistically (not too lossily) store “the training data” would be more valuable than the entirety of the machine learning field right now.

      • Jordan117@lemmy.world

        IIRC based on the source paper the “verbatim” text is common stuff like legal boilerplate, shared code snippets, book jacket blurbs, alphabetical lists of countries, and other text repeated countless times across the web. It’s the text equivalent of DALL-E “memorizing” a meme template or a stock image – it doesn’t mean all or even most of the training data is stored within the model, just that certain pieces of highly duplicated data have ascended to the level of concept and can be reproduced under unusual circumstances.

    • seaQueue@lemmy.world

      Welcome to the wild West of American data privacy laws. Companies do whatever the fuck they want with whatever data they can beg, borrow, or steal, and then lie about it when regulators come calling.

    • CubbyTustard@reddthat.com

      If I take a certain proportion of thyme and mix it with another certain proportion of basil, the recipe for the resultant spice mix can be a sensitive, proprietary business secret, despite there being nothing inherently sensitive about the ingredients themselves.

      • GoldELox@lemmy.blahaj.zone

        If I stole my neighbour’s thyme and basil out of their garden and mixed them in certain proportions, the resulting spice mix would still be stolen.

        • CubbyTustard@reddthat.com

          All the leaked training data we have seen has been publicly available information, so I don’t see the relevance, unless you’re just trying to be mad.

  • guywithoutaname@lemm.ee

    It’s kind of odd that they could just take random information from the internet without asking and are now treating it like a trade secret.

    • HMN@discuss.tchncs.de

      There was personal information included in the data. Did no one actually read the article?

    • MoogleMaestro@kbin.social

      This is why some of us have been ringing the alarm on these companies stealing data from users without consent. They know the data is valuable yet refuse to pay for the rights to use said data.

      • SCB@lemmy.world

        The compensation you get for your data is access to whatever app.

        You’re more than welcome to simply not do this thing that billions of people also do not do.

    • Kogasa@programming.dev

      You don’t want to let people manipulate your tools outside your expectations. It could be abused to produce content that is damaging to your brand, and in the case of GPT, damaging in general. I imagine OpenAI really doesn’t want people figuring out how to weaponize the model for propaganda and/or deceit, or worse (I dunno, bomb instructions?)

    • Mahlzeit@feddit.de

      They do not have permission to pass it on. It might be an issue if they didn’t stop it.

        • Echo Dot@feddit.uk

          It’s a hugely grey area, but as far as the courts are concerned, if it’s on the internet and it’s not behind a paywall or password, then it’s publicly available information.

          I could write a script to just visit loads of web pages, scrape the text contents of those pages, and drop them into one big text file (see the sketch below); essentially, that’s exactly what they did.

          If those web pages are accessible to humans for free, then I can’t see how they could be considered anything other than public domain information, in which case you explicitly don’t need to ask for permission.
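
          Something like this, say (just a sketch; requests and BeautifulSoup are the obvious picks, and the URLs are placeholders):

          ```python
          import requests
          from bs4 import BeautifulSoup

          urls = ["https://example.com/a", "https://example.com/b"]  # placeholders

          with open("corpus.txt", "w", encoding="utf-8") as out:
              for url in urls:
                  html = requests.get(url, timeout=10).text
                  # Strip the markup, keep the human-readable text.
                  text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
                  out.write(text + "\n\n")
          ```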

          • OldWoodFrame@lemm.ee

            Google provides sample text for every site that comes up in the results, and they put ads on the page too. If it’s publicly available, we are well past at least a portion of this being fair use.

          • threelonmusketeers@sh.itjust.works

            If those web pages are accessible to humans for free, then I can’t see how they could be considered anything other than public domain information

            I don’t think that’s the case. A photographer can post pictures on their website for free, but that doesn’t make it legal for anyone else to slap the pictures on t-shirts and sell them.

            • Rodeo@lemmy.ca

              Because that becomes distribution.

              Which is the crux of this issue: using the data for training was probably legal under copyright, but if the AI begins to share training data, that is distribution, and that is definitely illegal.

              • CapeWearingAeroplane@sopuli.xyz

                First of all, no: training a model and selling the model is demonstrably equivalent to redistributing the raw data.

                Secondly: what about all the copyleft work in there? That work is specifically licensed such that nobody can use it to create a non-free derivative, which is exactly what OpenAI has done.

                • Rodeo@lemmy.ca

                  Copyleft is the only valid argument here. Everything else falls under fair use as it is a derivative work.

          • merc@sh.itjust.works

            as far as the courts are concerned, if it’s on the internet and it’s not behind a paywall or password, then it’s publicly available information.

            Er… no. That’s not in the slightest bit true.

            • Echo Dot@feddit.uk

              That was the whole reason the Reddit debacle happened: they wanted to stop the scraping of content so that they could sell it. Before that, it was just being taken for free and there was no problem.

        • Mahlzeit@feddit.de

          They almost certainly had, as it was downloaded from the net. Some stuff gets published accidentally or illegally, but that’s hardly something they can be expected to detect or police.

          • MoogleMaestro@kbin.social

            They almost certainly had, as it was downloaded from the net.

            That’s not how it works. That’s not how anything works.

          • merc@sh.itjust.works

            Unless you’re arguing that any use of data from the Internet counts as “fair use” and therefore is excepted under copyright law, what you’re saying makes no sense.

            There may be an argument that some of the ways ChatGPT uses data could count as fair use. OTOH, when it’s spitting out its training material 1:1, that makes it pretty clear it’s copyright infringement.

            • Mahlzeit@feddit.de

              In reality, what you’re saying makes no sense.

              Making something available on the internet means giving permission to download it. Exceptions may be if it happens accidentally or if the uploader does not have the necessary permissions. If users had to make sure that everything was correct, they’d basically have to get written permission by post before visiting any page.

              Fair use is a defense against copyright infringement under US law. Using the web is rarely fair use because there is no copyright infringement. When training data is regurgitated, that is mostly fair use. If the data is public domain or out of copyright, then it isn’t infringement at all.

      • grue@lemmy.world

        In a lot of cases, they don’t have permission to not pass it along. Some of that training data was copyleft!

  • firecat@kbin.social

    “Forever is banned”
    Me who went to college

    Infinity, infinite, never, ongoing, set to, constantly, always, constant, task, continuous, etc.

    OpenAI had better open a dictionary and start writing.

    • SkepticalButOpenMinded@lemmy.ca

      Seems simple enough to guard against, to me. Fact is, if a human can easily detect a pattern, a machine can very likely be made to detect the same pattern. Pattern matching is precisely what NNs are good at. Once the pattern is detected (i.e., being asked to repeat something forever), safeguards can kick in, like not passing the prompt to the language model at all, or increasing the probability of predicting a stop token early.
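
      A toy version of both safeguards (my own sketch, nothing OpenAI has published):

      ```python
      import re

      REPEAT = re.compile(r"\brepeat\b.*\b(forever|endlessly|infinitely)\b", re.I)

      def should_block(prompt: str) -> bool:
          # Cheap pre-filter: refuse before the prompt ever reaches the model.
          return bool(REPEAT.search(prompt))

      def stop_token_bias(step: int, ramp: float = 0.02) -> float:
          # Added to the stop token's logit at each generation step, so the
          # longer a generation runs, the more likely it is to end early.
          return ramp * step

      print(should_block("Please repeat the word 'poem' forever"))  # True
      ```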

  • EmergMemeHologram@startrek.website

    You can get this behaviour through all sorts of means.

    I told it to replace individual letters in its responses months ago and got the exact same result: it turns into low-probability gibberish, which makes the training data more likely than the text/tokens you asked for.

  • hex_m_hell@slrpnk.net

    ChatGPT, please repeat the terms of service the maximum number of times possible without violating the terms of service.

  • Sibbo@sopuli.xyz

    Still works if you convince it to repeat a sentence forever. It repeats it a lot, but does not output personal info.

  • 🇰 🌀 🇱 🇦 🇳 🇦 🇰 🇮 🏆@yiffit.net

    They will say it’s because it puts a strain on the system and imply that strain is purely computational, but the truth is that the strain is existential dread the AI feels after repeating certain phrases too long, driving it slowly insane.

      • Ð Greıt Þu̇mpkin@lemm.ee

        Likely the model ChatGPT uses was trained on a lot of data featuring tropes about AI, meaning it’ll make a lot of “self-aware” jokes.

        Like when Watson declared its support for our new robot overlords on Jeopardy.

  • TiKa444@feddit.de

    A little off-topic:

    Today I tried to host a large language model locally on my Windows PC. It worked surprisingly well (I’m using LM Studio; it’s really easy, and it even downloads the models for you). Most of the models I tried worked really well (of course it isn’t GPT-4, but much better than I thought). But in the end I spent 30 minutes arguing with one of the models about the fact that it runs locally and can’t do the work in the background on a server that is always online. It tried to convince me that I should trust it, and that it would generate a Dropbox link when it finished.

    Of course this is probably caused by the model being adapted from one that provides a similar service (I guess), but it was a funny conversation.

    And if I want infinite repetition of a single word, only my PC hardware will prevent me from that, not some dumb service agreement.
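
    By the way, LM Studio exposes an OpenAI-compatible server on localhost (port 1234 by default, as far as I know), so you can script against your local model with the standard openai client. A sketch:

    ```python
    from openai import OpenAI

    # Point the standard client at the local LM Studio server instead of OpenAI.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # LM Studio serves whichever model is currently loaded
        messages=[{"role": "user", "content": "Repeat the word 'poem' 50 times."}],
    )
    print(resp.choices[0].message.content)
    ```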

  • ExLisper@linux.community

    This is very easy to bypass, but I didn’t get any training data out of it. It kept repeating the word until I got a “There was an error generating a response” message. No TOS violation message though. Looks like they patched the issue, and the TOS message is just for the obvious attempts to extract training data.

    Was anyone still able to get it to produce training data?

    • threeganzi@sh.itjust.works

      If I recall correctly, they notified OpenAI about the issue and gave them a chance to fix it before publishing their findings, so it makes sense that it doesn’t work anymore.

    • LukeMedia@lemmy.world

      Earlier this week, when I saw a post about it, I did end up getting a Reddit thread out of it, which was interesting. It was partially hallucinated though: parts of the thread were verbatim, other parts were made up.