Social Sci LibreTexts

6.1: Where does all that Data come from?

    Developers of large language models, such as those behind ChatGPT, often scrape their training data indiscriminately from the web, with little regard for individual rights. Because these models are trained on vast swathes of internet data, the training sets often include personal information that was collected without consent or is used in violation of privacy laws. This has raised concerns about the ethics of building AI models on data gathered without regard for individual privacy.

    The lack of transparency and accountability around the collection and use of personal data in AI development is a longstanding issue. Because training these models requires so much data, personal information is often collected without the explicit consent, or even the knowledge, of the individuals affected. Critics argue that developers of large language models prioritise building powerful algorithms over individual privacy rights, and that the industry is not sufficiently regulated.

    These concerns have landed OpenAI in trouble with European regulators, particularly under the General Data Protection Regulation (GDPR). In March 2023, the Italian data protection regulator issued a temporary emergency decision demanding that OpenAI stop using the personal information of millions of Italians included in its training data, citing the lack of a legal justification for using people’s personal information in ChatGPT. The GDPR protects the data of over 400 million people across Europe, and it applies even to personal data that is freely available online. The Italian decision highlights growing concern about the development of large AI models and the use of personal information in their training data.

    In the US, the federal privacy commission is also investigating OpenAI, following a complaint alleging that the company has been unlawfully using personal and private data.

    6.1: Where does all that Data come from? is shared under a CC BY-NC 4.0 license and was authored, remixed, and/or curated by LibreTexts.
