What has Academic Torrents been working on? Tools! We are making a bunch of tools! Here is the first of many documents describing our suite of utilities.
One major problem when working with data is moving it onto shared computing systems in order to process it. Tools like scp, rsync, and ftp move data from point to point and have facilities for partial transmissions. The BitTorrent (BT) protocol offers many advantages for transferring data, but those advantages are currently hard to leverage on shared computing systems. The tool atdown addresses these shortcomings by:
- Downloading from many locations at once; cache nodes can be set up close to the system to increase speed further.
- Supporting partial transmissions as part of the protocol, which makes restarting a download trivial.
- Validating existing data to ensure integrity and completeness.
This tool is still in development, but it is in a working state! Instructions for downloading and installing it are here: https://github.com/AcademicTorrents/AcademicTorrents-Downloader
Shared computing systems typically present an SSH/SCP server running a POSIX system, and typically also include curl, wget, java, gcc, and other build tools. The tool atdown is a pure-Java download tool designed to run on every shared computing system. It is tightly integrated into the Academic Torrents index and aims to act as a storage-abstracted filesystem. For example, we can ask the tool to list all collections in the system, which can be thought of as folders.
Each collection's entries can be listed with the ls command. For example, we can run
atdown gnu-radio-rf-captures ls
to list the RF captures that have been included in this collection.
One reason for having a command-line tool is to compose it with other command-line tools! Here we list all the NOAA datasets with
atdown noaa-datasets ls
and then grep for a specific USAF number, 010980. We will soon build this filtering directly into the program, to allow selective downloading of files matching a regular expression or a simple * selector.
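Because the listing is plain text on stdout, the filtering step is an ordinary pipe. A minimal sketch of the idea (the canned two-line listing below stands in for the live atdown output, since the exact listing format is an assumption):

```shell
# In practice: atdown noaa-datasets ls | grep 010980
# Here a canned listing stands in for the live atdown output;
# the real listing format may differ.
listing='noaa_usaf_010980.tar
noaa_usaf_722860.tar'
printf '%s\n' "$listing" | grep 010980
```

Any other text tool (awk, sort, xargs) composes the same way, which is the point of keeping the interface on the command line.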
Now let's go over using the tool to download files. It can be run by simply specifying an infohash or a collection url-name, such as:
atdown gnu-radio-rf-captures #Download all files in the gnu-radio-rf-captures collection
atdown 30ac2ef27829b1b5a7d0644097f55f335ca5241b #Wikipedia 20130805
atdown e3e68948b2e01b01a415740cb6fa6fe918c971ac #NOAA Weather 2011
The tool will start up and go through the following steps:
- Obtain metadata (cached locally)
- Check existing data (using cryptographic hashing to detect corruption)
- Contact mirror locations to receive the needed data
- Print an inline status update every second showing the domains of the fastest mirrors (works in GNU screen)
- Verify that all data was downloaded correctly
Let's explore a use case for atdown. The dataset known as “Wikipedia English Official Offline Edition (version 20130805)” contains an offline version of Wikipedia in a compressed XML format; compressed, the data is 9.38GB.
Hosting for this dataset is provided by many HTTP, FTP, and BT mirrors, both Wikimedia-run and community-run. The data is dispersed globally and plotted below as yellow dots. Over about 8 months of 2014 there were 785 downloads, a total of 7.36TB transferred for this single file.
The global distribution of the data grants us two things:
- Persistent availability of the data, because we don't rely on a single hosting source. This is transparent to the user because the hosting is always abstracted.
- Faster download speeds, via proximity to a mirror location and aggregation of download sources.
To demonstrate the power of this tool, we ran it in two locations (Boston, MA and Atlanta, GA) to show how our model mitigates differences in distance to mirror locations.
Note: The tool displays sizes in Bytes not Bits so 8.9MB/s == 71.2Mb/s!
Here is a demo running on the XSEDE shared computing system. We were able to achieve 36.4MB/s (291.2Mbit/s) in the middle of the download:
Please leave your feedback below!
Or submit issues and feature requests here: https://github.com/AcademicTorrents/AcademicTorrents-Downloader/issues