Academic Torrents – A Platform for Sharing Data

Academic Torrents is a platform for researchers to share data. It consists of two pieces: a site where users can search for datasets, and a BitTorrent backbone that makes sharing data scalable and fast. The goal is to facilitate the sharing of datasets among researchers. The site provides access to over 32 TB of data and delivers over 3 TB per day to researchers all over the world. I started this project with another grad student in 2013. Since then we have founded the Institute for Reproducible Research (a U.S. 501(c)(3) non-profit) to continue operating the site.

The site provides access to popular machine learning datasets, including all of UCI, ImageNet, and Wikipedia. Though some of these datasets are available elsewhere, Academic Torrents stitches multiple hosting locations together, so downloading is much faster and also fault-tolerant. Downloaders face no sign-up or verification process, and the collection is more comprehensive than anywhere else. Datasets whose original hosting location is no longer available, such as the Netflix dataset, remain available through Academic Torrents.

As data gets bigger, peer-to-peer file transfer becomes increasingly attractive, since it is the only way distribution scales with the number of users. Academic Torrents currently facilitates the transfer of over 3 TB per day and serves over 30,000 users per month.

The guiding principle of Academic Torrents is to ensure that the data the community needs is always available and can be obtained quickly. To be always available, data needs to be stored in more than one location in case the initial location goes offline. Typically, when users download data from a secondary website, it is unclear whether they have found the correct data. BitTorrent allows data to be mirrored transparently in a peer-to-peer fashion while maintaining the correctness and authenticity of the data. A speed increase is gained because a user can download from all the mirrors at once.
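
The correctness guarantee comes from hashing: in the original BitTorrent protocol, a file is split into fixed-size pieces and the torrent metadata lists the SHA-1 hash of each piece, so a piece received from any mirror can be checked before it is accepted. The following is a minimal sketch of that idea only, not the code Academic Torrents or any particular client runs. The function names and the 256 KiB piece size are assumptions chosen for illustration.

```python
import hashlib
from typing import List

# Assumed piece size for illustration; real torrents choose their own.
PIECE_LENGTH = 256 * 1024


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> List[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


def verify_piece(index: int, piece: bytes, expected: List[bytes]) -> bool:
    """Return True if a piece received from some mirror matches the published hash."""
    return hashlib.sha1(piece).digest() == expected[index]


if __name__ == "__main__":
    # Hypothetical dataset contents: 1 MiB of deterministic bytes.
    original = bytes(range(256)) * 4096
    published = piece_hashes(original)

    # A piece from an honest mirror passes the check.
    good = original[:PIECE_LENGTH]
    print(verify_piece(0, good, published))

    # A piece with a single altered byte is rejected.
    bad = b"\x01" + good[1:]
    print(verify_piece(0, bad, published))
```

Because every piece is verified independently, a downloader can fetch different pieces from different mirrors in parallel and discard any piece that fails the check, which is what makes transparent mirroring safe.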

 

Publications

  • Lo, Henry Z. and Cohen, Joseph P., (2015). Academic Torrents: Scalable Data Distribution. Neural Information Processing Systems 2015 Challenges in Machine Learning (CiML) Workshop. NIPS 2015. http://arxiv.org/abs/1603.04395
  • Cohen, Joseph P. and Lo, Henry Z., (2014). Academic Torrents: A Community-Maintained Distributed Repository. Annual Conference of the Extreme Science and Engineering Discovery Environment. XSEDE 2014. http://doi.org/10.1145/2616498.2616528

Use in research

To gauge the impact of the project, we can look for research papers that reference a torrent on Academic Torrents. Searching Google for “inurl:pdf academictorrents.com/details” returns a list of PDFs that contain a reference to a torrent.

Here, each spike follows a discussion of the project on the front page of Hacker News.
