
An online activist collective’s claim to have copied tens of millions of songs from Spotify into a 300TB “preservation” archive has emerged as one of the most consequential music and tech stories of the year, raising urgent questions about streaming security, copyright, and the future of AI training datasets.
An organisation known as Anna’s Archive claims to have scraped just under 300TB of material from Spotify, including approximately 86 million audio files and roughly 256 million rows of metadata, such as artist names, albums, and track information. The group, already notorious in publishing for linking to pirated books and academic papers, describes the operation as an attempt to create the “world’s first preservation archive for music” by backing up the tracks people listen to most on the platform.
Spotify, which hosts more than 100 million tracks and has over 700 million users worldwide, has confirmed that unauthorised scraping took place but says the data accessed does not represent its full catalogue. The company says it has shut down accounts it believes were involved, and has stressed that user data was not part of the incident.
Anna’s Archive claims the collection covers about 99.6% of all listening activity on Spotify, effectively mirroring the platform’s most‑played music rather than every obscure upload. The group has indicated that it intends to prioritise releases based on popularity metrics taken from Spotify itself, gradually expanding the set with additional metadata and artwork.
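The reported figures allow a rough back-of-envelope check. The sketch below is illustrative only: it assumes decimal terabytes, treats the entire 300TB as audio (ignoring metadata), and uses Spotify's "more than 100 million tracks" as a lower bound on catalogue size. The function and constant names are hypothetical.

```python
# Figures as reported in the article; all are claims or approximations.
ARCHIVE_BYTES = 300 * 10**12     # "just under 300TB", assuming decimal terabytes
AUDIO_FILES = 86_000_000         # "approximately 86 million audio files"
CATALOGUE_TRACKS = 100_000_000   # "more than 100 million tracks" (a lower bound)


def average_file_size_mb(total_bytes: int, file_count: int) -> float:
    """Mean size per file in decimal megabytes, treating the whole archive as audio."""
    return total_bytes / file_count / 10**6


def catalogue_share(file_count: int, catalogue_size: int) -> float:
    """Fraction of the catalogue covered by the copied files."""
    return file_count / catalogue_size


avg_mb = average_file_size_mb(ARCHIVE_BYTES, AUDIO_FILES)
share = catalogue_share(AUDIO_FILES, CATALOGUE_TRACKS)

print(f"Average size per file: about {avg_mb:.1f} MB")
print(f"Share of catalogue: at most about {share:.0%}")
```

Under these assumptions the average works out to a few megabytes per file, and the copied files would account for no more than about 86% of the catalogue by track count, even though the group says they cover about 99.6% of listening activity.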
Framing the project as a cultural safeguard, the activists argue that entrusting the bulk of recorded music history to a handful of private streaming services leaves it vulnerable to corporate collapse, policy changes or geopolitical crises. In a recent blog post, Anna’s Archive said that by circulating complete copies of widely streamed songs outside those ecosystems, it is protecting cultural heritage in the same way that underground text archives claim to preserve knowledge. The post also lists what the group sees as shortcomings of existing music-sharing efforts:
- Over-focus on the most popular artists. There is a long tail of music which only gets preserved when a single person cares enough to share it. And such files are often poorly seeded.
- Over-focus on the highest possible quality. Since these are created by audiophiles with high end equipment and fans of a particular artist, they chase the highest possible file quality (e.g. lossless FLAC). This inflates the file size and makes it hard to keep a full archive of all music that humanity has ever produced.
- No authoritative list of torrents aiming to represent all music ever produced. An equivalent of our book torrent list (which aggregate torrents from LibGen, Sci-Hub, Z-Lib, and many more) does not exist for music.
Critics in the music business see something far closer to large‑scale piracy, warning that free, high‑quality torrents of mainstream catalogues could undercut subscription platforms that now underpin artist and label revenues.
Beyond lost streams, one of the biggest concerns is that the 300TB trove could become an enormous unlicensed training dataset for generative AI systems. The combination of audio files and metadata such as artist names, albums, and track information could fuel machine‑learning pipelines without the need for negotiated licences.
Spotify has said it is actively investigating and working to prevent further dissemination of the scraped material, while reiterating that using illicit methods to access protected audio is a breach of its terms and of copyright law.
