The Value of Data in Artificial Intelligence

A comparison of original works with AI-generated content demonstrates the extent to which source material may (accidentally or purposefully) inform machine-generated output. From top to bottom: (a) The Bedroom - Vincent van Gogh; (b) Girl with Balloon - Banksy; (c) Howl’s Moving Castle - Studio Ghibli.[12]

Artificial intelligence (“AI”), machine learning (“ML”), and large language model (“LLM”) products have catapulted to the forefront of the tech world. Beneath this surge in commercial interest, however, lie questions regarding the use and value of the training data necessary to develop these programs. High-quality training data, with inputs reflective of the content the algorithm will be expected to handle, is critical to developing a useful model. For generative AI products that seek to replicate “human” behaviors and responses (like ChatGPT), a sufficiently robust set of training data ensures that the resulting product can generalize across different contexts.

Moreover, AI/LLM products require not only high-quality data but also a lot of it. As AI models grow in complexity and sophistication, the demand for scalable training datasets with both breadth and depth of knowledge and perspective escalates. Curating, annotating, and managing large-scale training data repositories represent a significant challenge. Reports indicate OpenAI’s GPT-3 was trained on a dataset of approximately 45 terabytes of text data (roughly a quarter of the entire Library of Congress). Even more astonishing, some estimates suggest nearly 1 petabyte of data was used to develop GPT-4, including web text, books, news articles, social media posts, code snippets, and more.[1]
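
To put that scale in perspective, the rough back-of-envelope sketch below (in Python) estimates how many words 45 terabytes of raw text might contain; the average of about six bytes per English word is an illustrative assumption, not a figure drawn from the sources above.

```python
# Back-of-envelope estimate of the word count in GPT-3's reported training corpus.
# The ~45 TB figure is cited above; the bytes-per-word average is an illustrative
# assumption (~5 characters plus a space), not a sourced figure.

GPT3_CORPUS_BYTES = 45 * 10**12   # ~45 terabytes of raw text
AVG_BYTES_PER_WORD = 6            # assumption: ~5 characters + 1 space

estimated_words = GPT3_CORPUS_BYTES / AVG_BYTES_PER_WORD
print(f"Roughly {estimated_words:,.0f} words (~{estimated_words / 10**12:.1f} trillion)")
```

Under that assumption, the corpus would run to several trillion words.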

Where does this data come from?

Datasets of the size and caliber required are somewhat rare. Several 501(c)(3) non-profit organizations, such as the Internet Archive (provider of the popular “Wayback Machine”), have long scraped website data for non-profit use.[2] Common Crawl, another well-known organization, maintains billions of pages of regularly updated, openly available web-crawl data provided free of charge to researchers.[3]
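
As an illustration of how researchers access such data, Common Crawl publishes a public index that can be queried for captures of a given URL. The sketch below is a minimal example; the crawl label is an assumed placeholder, and current labels are listed at index.commoncrawl.org.

```python
# Minimal sketch: query Common Crawl's public CDX index for captures of a URL.
# The crawl label below is an assumed example; current labels are listed at
# https://index.commoncrawl.org/.
import json
import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl label (assumption)

resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "example.com", "output": "json"},
    timeout=30,
)
resp.raise_for_status()

# Each line is a JSON record pointing into the underlying WARC archive files.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["timestamp"], record["url"], record.get("mime"))
```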

However, generative AI has run headlong into questions of fair use of copyrighted works. In December 2023, The New York Times (“NYT”) filed suit against Microsoft and OpenAI, claiming that generative AI tools developed by both entities were trained “using millions of [NYT’s] copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more,” without permission.[4] In a blog post about the case, OpenAI responded by asserting fair use and pointing to its ongoing efforts to eliminate “regurgitation” of the underlying materials used to train its products.[5] The case remains ongoing and is likely to establish an important precedent for both LLM developers and data intellectual property (“IP”) rights holders.

In some instances, parties have successfully acquired rights to usable data from willing licensors. Recently, Shutterstock entered into licensing agreements with Apple, Meta, Google, and Amazon for use of its millions of images, music tracks, and video files. The initial agreement terms included payments ranging from $25 million to $50 million. A Reuters investigation into these and other licensing deals found that pricing for AI training data can vary widely based on the type of content and the buyer. For example, according to Daniela Braga, CEO of AI data firm Defined.ai, companies are generally willing to pay $1 to $2 per image, $2 to $4 per short-form video, and $100 to $300 per hour of film. Pricing for text averages approximately $0.001 per word.[6]
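
Using the per-unit figures reported by Reuters, a prospective buyer could sketch a rough licensing budget for a mixed-media training set. The dataset volumes below are purely hypothetical:

```python
# Rough licensing-cost estimate using the per-unit prices cited above
# ($1-$2 per image, $2-$4 per short-form video, $100-$300 per hour of film,
# ~$0.001 per word of text). The dataset volumes are hypothetical; text is
# priced at the flat average in both the low and high scenarios.

dataset = {
    "images": 10_000_000,          # count
    "short_videos": 1_000_000,     # count
    "film_hours": 5_000,           # hours of film
    "text_words": 10_000_000_000,  # words of text
}

low = (dataset["images"] * 1 + dataset["short_videos"] * 2
       + dataset["film_hours"] * 100 + dataset["text_words"] * 0.001)
high = (dataset["images"] * 2 + dataset["short_videos"] * 4
        + dataset["film_hours"] * 300 + dataset["text_words"] * 0.001)

print(f"Estimated licensing cost: ${low:,.0f} to ${high:,.0f}")
```

Even under these modest hypothetical volumes, the estimate quickly reaches tens of millions of dollars, illustrating how per-unit pricing scales for large corpora.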

From an IP standpoint, Reddit’s recent $60 million-per-year licensing agreement with Google is one of the more intriguing developments, as it offers an early look into how key parties on both sides are valuing training data in AI.[7] Reddit’s decentralized approach to moderation, hyper-specific subforum-based organization, and user-driven “upvote/downvote” system for posts and replies combine to produce a valuable social media dataset. A licensing agreement worth $60 million annually would represent approximately 7.5% of the company’s 2023 revenue and 6.2% of the 2024 revenue guidance it provided to investors prior to its March 2024 IPO.[8]
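
Working backward from those percentages gives a sense of the revenue base the deal implies. This is a rough consistency check derived only from the figures cited above, not from Reddit’s filings:

```python
# Implied revenue figures, derived solely from the percentages cited above.
deal_value = 60_000_000  # $60 million per year

implied_2023_revenue = deal_value / 0.075   # deal ~= 7.5% of 2023 revenue
implied_2024_guidance = deal_value / 0.062  # deal ~= 6.2% of 2024 guidance

print(f"Implied 2023 revenue:  ${implied_2023_revenue:,.0f}")   # ~$800 million
print(f"Implied 2024 guidance: ${implied_2024_guidance:,.0f}")  # ~$968 million
```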

What about value apportionment?

While data is undoubtedly a key contributor to AI products, other critical elements combine to enable the development and deployment of these applications. From an IP valuation standpoint, we must also apportion between the value created by the quality of the training data used to develop the model and the value attributable to other factors, such as model design and chipsets/processing power.

Parameters are sometimes described as the building blocks of the neural networks that underpin AI models. They are the weights and biases that are adjusted during the training process to optimize performance. Simple AI tools may use straightforward linear models, while more complex programs rely on neural networks with millions, or even billions, of parameters. Parameter sets require tuning to balance trade-offs among depth and complexity of responses, training time, and computational resource requirements.[9] Already, companies must weigh trade-offs between advancing the capabilities of their AI models and ensuring the confidentiality of their trade secret-protected machine learning formulas, parameters, and training guidelines.[10] Apportioning value to such technologies may require consideration of the time and resources necessary for their development.
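
As a concrete, toy-scale illustration, the sketch below builds a small feed-forward network in PyTorch and counts its trainable weights and biases. The layer sizes are arbitrary; commercial LLMs apply the same idea across billions of such values.

```python
# Toy example: a model's "parameters" are its trainable weights and biases.
# Layer sizes here are arbitrary and for illustration only.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),  # weights: 784 x 256, biases: 256
    nn.ReLU(),
    nn.Linear(256, 10),   # weights: 256 x 10,  biases: 10
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {total_params:,}")  # 203,530
```

During training, an optimizer adjusts each of these values to minimize a loss function, which is the tuning process described above.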

Abstract microscopic photograph of a graphics processing unit (GPU), resembling a floor plan or fractal art. Image credit: Fritzchens Fritz / Better Images of AI / GPU shot etched 1 / CC-BY 4.0.

Processing power, meanwhile, could be viewed as the “engine” of machine learning models. Chipset hardware provides the computational power necessary to satisfy the intense demands of AI training and operation. These chipset components may include traditional processing accelerators such as GPUs, as well as newer offerings such as Google’s Tensor Processing Units (“TPUs”). Google’s TPUs were the focus of a recently settled high-profile patent lawsuit in which the plaintiff sought roughly $1.67 billion in damages. In that matter, both parties’ experts evaluated the potential value of the chipset technology based on the cost improvements offered by newer TPU modules compared to other available alternatives.[11]

Conclusion

The use of large, complex, and often IP-protected datasets to train generative AI models presents unique valuation issues to consider. In many cases, the intrinsic value of the data used may not appropriately reflect its importance in enabling and improving the capabilities of LLMs. The parameters and algorithms that power AI models produce useful results, but without large, clean datasets to train them, what is their value? As with any IP rights, the commercial context matters. Experts will likely continue to debate the relative importance of the various components of AI products, as well as the data used to train them.

Truest consultants have worked on behalf of multiple national and multinational firms regarding the risks and opportunities posed by artificial intelligence and the technology underpinning it.  Our AI-related experience includes engagements related to the hardware, software, and data elements of AI, in industries ranging from social media to healthcare.   Please reach out to the Truest team at www.truestconsulting.com if you’re interested in learning more about the role of data and intellectual property in the evolution of AI.


Sources:

[1] https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-generative-ai; https://medium.com/predict/gpt-4-everything-you-want-to-know-about-openais-new-ai-model-a5977b42e495.

[2] https://archive.org/about/.

[3] https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize.

[4] The New York Times Company v. Microsoft Corporation, 1:23-cv-11195 (S.D.N.Y.), pp. 1-3.

[5] https://openai.com/blog/openai-and-journalism.

[6] https://www.reuters.com/technology/inside-big-techs-underground-race-buy-ai-training-data-2024-04-05/.

[7] https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/.

[8] https://finance.yahoo.com/news/reddit-sees-more-20-sales-213428151.html.

[9] See, e.g., https://medium.com/@theDrewDag/introduction-to-neural-networks-weights-biases-and-activation-270ebf2545aa; https://h2o.ai/wiki/weights-and-biases/; https://www.turing.com/kb/necessity-of-bias-in-neural-networks.

[10] https://www.mintz.com/insights-center/viewpoints/2231/2022-02-10-benefits-and-best-practices-protecting-artificial.

[11] https://www.reuters.com/technology/google-settles-ai-related-chip-patent-lawsuit-that-sought-167-bln-2024-01-24/.

[12] https://digitalandaiart.com/ai-art-in-the-style-of-vincent-van-gogh/; https://www.dazeddigital.com/art-photography/article/50872/1/ai-bot-taught-to-create-surreal-banksy-esque-artworks-artificial-intelligence; https://www.deviantart.com/eoin127/art/Ai-generated-artwork-Studio-Ghibli-925747100. See also Memo and Order on Motions to Exclude Expert Testimony, C.A. No. 1:19-cv-12551, filed December 20, 2023.
