GeForce GPU giant has been data scraping 80 years’ worth of videos every day for AI training to ‘unlock various downstream applications critical to Nvidia’

Leaked documents, including spreadsheets, emails, and chat messages, show that Nvidia has been using millions of videos from YouTube, Netflix, and other sources to train an AI model to be used in its Omniverse, autonomous vehicles, and digital avatar platforms.

The astonishing, but perhaps not surprising, scope of the data scraping was reported by 404 Media, which investigated the documents. It discovered that an internal project codenamed Cosmos (which shares a name with, but is distinct from, Nvidia’s Cosmos Deep Learning service) had staff use dozens of virtual machines on Amazon Web Services (AWS) to download so many videos per day that Nvidia accumulated over 30 million URLs in the space of one month.

Copyright laws and usage rights were repeatedly discussed by the employees, who found some creative ways to avoid directly violating them. For example, Nvidia used Google’s cloud service to download the YouTube-8M dataset, as directly downloading the videos isn’t permitted by the terms of service.

In a leaked Slack channel discussion, one person remarked that “we cleared the download with Google/YouTube ahead of time and dangled as a carrot that we were going to do so using Google Cloud. After all, usually, for 8 million videos, they would get lots of ad impressions, revenue they lose out on when downloading for training, so they should get some money out of it.”

404 Media asked Nvidia to comment on the legal and ethical aspects of using copyrighted material for AI training and the company replied that it was “in full compliance with the letter and the spirit of copyright law.”

With some datasets, their use is only permitted for academic purposes and although Nvidia does conduct a considerable amount of research (internally and with other institutions), the leaked materials clearly show that this data scraping was intended for commercial purposes.

Nvidia isn’t the only firm to be doing this, of course—OpenAI and Runway have both been accused of knowingly using copyrighted and protected material to train their AI models. Interestingly, one source of video content that you’d think Nvidia would have no problem using is gameplay footage from its GeForce Now service—but the leaked documents show that’s not the case.

A senior research scientist at Nvidia explained why to other employees: “We don’t yet have statistics or video files yet, because the infras is not yet set up to capture lots of live game videos & actions. There’re both engineering & regulatory hurdles to hop through.”

AI models have to be trained on billions of data points and there’s no way around this. Some datasets have very clear rules for their use, whereas others have fairly loose restrictions, but when it comes to laws on the use of copyrighted materials, it’s very clear what can and can’t be done, even if how those laws apply to AI training isn’t 100% transparent.

It’s not just about copyright, either, as video content often contains personal data. While there isn’t a single, overriding federal law in the US that is directly applicable here, there are plenty of regulations concerning collecting and using personal data. In the EU, the General Data Protection Regulation (GDPR) is a law that is expressly clear on how such data can be used, even outside of the EU.

One might also wonder what would happen if a company such as Nvidia is found to have breached various regulations whilst training its AI models—if that system is being used across the globe, would it then be blocked in specific countries? Would the likes of Nvidia be willing to make a new model, trained with all permissions granted, just for those locations? Is it even possible to ‘untrain’ a system and start afresh with legally compliant data?

Whatever one feels about AI, it’s clear that there needs to be a more urgent push for transparency, especially when it concerns the use of copyrighted and personal data for commercial purposes. Because if tech companies aren’t held accountable, then data scraping will continue ad hoc.
