Artificial Intelligence: Generative AI Training, Development, and Deployment Considerations

Fast Facts

Commercial development of generative artificial intelligence technologies has rapidly accelerated. Industry releases new models multiple times per year, attracting more than 100 million users worldwide.

This acceleration has raised concerns about potential harm—for example, models may produce factual errors that spread misinformation or generate explicit content.

Our technology assessment describes common development practices, including testing for vulnerabilities and the limitations of this testing.

In future reports, we plan to assess societal and environmental effects of AI, federal AI research, and other related topics.

person holding tablet with AI icons superimposed over the image

Highlights

What GAO Found

Commercial developers use several common practices to facilitate responsible development and deployment of generative artificial intelligence (AI) technologies. For example, they may use benchmark tests to evaluate the accuracy of models, employ multi-disciplinary teams to evaluate models prior to deployment, and conduct red teaming to test the security of their models and identify potential risks, among others. These practices focus on quantitative and qualitative evaluation methods for providing accurate and contextual results, as well as preventing harmful outputs.

In addition, commercial developers face some limitations in responsibly developing and deploying generative AI technologies to ensure that they are safe and trustworthy. Primarily, developers recognize that their models are not fully reliable, and that user judgment should play a role in accepting model outputs. In various white papers, model cards, and other documentation, they have noted that despite the mitigation efforts, their models may produce incorrect outputs, exhibit bias, or be susceptible to attacks. Such attacks include prompt injection attacks, jailbreaks, and data poisoning. Prompt injection attacks and jailbreaks rely on text prompt inputs that may change the behavior of a generative AI model that could be used to conduct misinformation campaigns or transmit malware, among other malicious activities. Data poisoning is a process by which an attacker can change the behavior of a generative AI system through manipulation of its training data or process.

Overview of a prompt injection attack on a generative artificial intelligence (AI) model

A picture containing diagram Description automatically generated

Generative AI typically requires a large dataset for training—ranging from millions to trillions of data points. Training information is used to help models learn about language and how to respond to questions. The quantity of data can vary based on the specific type of model. However, information regarding the specifics of training datasets is not entirely available to the public. The commercial developers GAO met with did not disclose detailed information about their training datasets beyond high-level information identified in model cards and other relevant documentation. For example, many stated that their training data consists of information publicly available on the internet.

Foundation models are especially susceptible to poisoning attacks when training data are scraped from public sources. Commercial developers are taking measures to safeguard sensitive information by undergoing privacy evaluations at various stages of training and development. Before training a model, developers can filter and curate training data to reduce the use of harmful content, such as sites that collect personal information.

Why GAO Did This Study

For this technology assessment, GAO was asked to describe commercial development of generative AI technologies. This report is the second in a body of work looking at generative AI. In future reports, GAO plans to assess societal and environmental effects of the use of generative AI and federal research, development, and adoption of generative AI technologies. This report provides an overview of the common practices, limitations, and other factors affecting the responsible development of generative AI technologies and the processes commercial developers follow to collect, use, and store training data for generative AI technologies.

To conduct this assessment, GAO conducted literature reviews and interviewed representatives from several leading companies developing generative AI technologies. GAO selected these companies based on known information about generative AI technologies they have created and the voluntary commitments they made to the White House to manage risks posed by AI. Through these interviews, GAO gathered information regarding common practices these companies use to develop and deploy generative AI technologies. GAO also reviewed relevant publicly available documentation, such as white papers, model cards, and guidance documents to identify further information regarding training data and the types of attacks used to compromise generative AI models.

For more information, contact Brian Bothwell at (202) 512-6888 or bothwellb@gao.gov or Kevin Walsh at (202) 512-6151 or walshk@gao.gov.

Full Report

Full Report (12 pages)

Accessible PDF (13 pages)

GAO Contacts

Kevin Walsh

Director

Information Technology and Cybersecurity

walshk@gao.gov

Brian Bothwell

Director

Science, Technology Assessment, and Analytics

bothwellb@gao.gov

Media Inquiries

Sarah Kaczmarek

Managing Director

Office of Public Affairs

media@gao.gov

Public Inquiries

Topics

Information Technology

Artificial intelligenceCopyright protectionsCybersecurityJoint venturesPrivacyPublicly available informationSensitive dataPoisoningSocial mediaPersonally identifiable information