Academic Torrents – A Platform for Sharing Data

Academic Torrents is a platform for researchers to share data. It consists of two pieces: a site where users can search for datasets, and a BitTorrent backbone that makes sharing data scalable and fast. The goal is to facilitate the sharing of datasets among researchers. It was created by the Institute for Reproducible Research, a U.S. 501(c)(3) non-profit.

The site provides access to over 20 TB of data, including popular machine learning datasets such as all of UCI, ImageNet, and Wikipedia. Though some of these datasets are available elsewhere, Academic Torrents stitches multiple hosting locations together, so downloading is much faster and also fault-tolerant. Downloaders face no sign-up or verification process, and the collection is more comprehensive than anywhere else. Datasets whose original hosting location is no longer available, such as the Netflix dataset, remain accessible through Academic Torrents.

As data gets bigger, peer-to-peer file transfer becomes increasingly attractive, since it is the only approach in which distribution capacity scales with the number of users. Academic Torrents currently facilitates the transfer of over 900 GB per day and serves over 30,000 users per month.

The guiding principle of Academic Torrents is to ensure that the data the community needs is always available and can be obtained quickly. To be always available, data must be stored in more than one location in case the initial location goes offline. Typically, when a user downloads data from a secondary website, it is unclear whether they have found the correct data. BitTorrent allows data to be mirrored transparently in a peer-to-peer fashion while maintaining the correctness and authenticity of the data. Downloads are also faster because a user can fetch from all the mirrors at once.
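To illustrate how correctness can be maintained when data comes from many mirrors, here is a minimal Python sketch of per-piece hash checking in the style of the original BitTorrent protocol, where the torrent's metadata lists a SHA-1 digest for each fixed-size piece. The function names, the tiny piece size, and the sample data are hypothetical and chosen only for illustration; this is not Academic Torrents' own code, and real clients use much larger pieces.

```python
import hashlib

# Hypothetical, unrealistically small piece size so the example is easy to follow.
PIECE_LENGTH = 4


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each piece."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(index: int, piece: bytes, expected: list[bytes]) -> bool:
    """Return True if a piece received from any peer matches the published digest."""
    return hashlib.sha1(piece).digest() == expected[index]


# The publisher computes the digests once and distributes them with the torrent.
original = b"example dataset bytes"
expected = piece_hashes(original)

# A downloader checks every piece it receives, whichever mirror sent it.
good_piece = original[0:PIECE_LENGTH]
bad_piece = b"XXXX"
print(verify_piece(0, good_piece, expected))
print(verify_piece(0, bad_piece, expected))
```

Because each piece is checked independently, a downloader can accept pieces from any mirror and discard any that fail the check, which is what lets mirroring stay transparent without requiring trust in an individual host.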

 

Publications

  • Lo, Henry Z. and Cohen, Joseph P. (2016). Academic Torrents: Scalable Data Distribution. Neural Information Processing Systems 2015 Challenges in Machine Learning (CiML) Workshop. http://arxiv.org/abs/1603.04395
  • Cohen, Joseph P. and Lo, Henry Z. (2014). Academic Torrents: A Community-Maintained Distributed Repository (pp. 2:1–2:2). New York, NY, USA: ACM. http://doi.org/10.1145/2616498.2616528
