Dealing with internet-restricted compute nodes in a cluster

There are some clusters, such as Beluga in the Compute Canada/Calcul Québec system, which restrict internet access on compute nodes. One justification can be to prevent GPU nodes from participating in bitcoin mining pools. However, in the world of Python and deep learning most scripts need to fetch data as part of the primary script, so this makes everything more difficult for researchers. It would make more sense to ban anyone caught doing something bad from all compute clusters, which is a better deterrent than making life difficult for the valid users of a cluster.

Anyway, here is an easy way to get around this restriction. We will use proxychains, which sets the LD_PRELOAD variable so that its library overrides the network calls a program makes and routes them through a proxy instead. This way every command that uses the internet goes through the proxy without any change to the program that needs internet access. It can be run directly with python to give any script internet access.
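
To make the LD_PRELOAD trick a bit less magical, here is a rough sketch of what the proxychains4 launcher does for you. The run_preloaded name and the library path are my own placeholders, and the real launcher also takes care of its config file and quiet flag, so treat this as an illustration and not a replacement.

```shell
# Assumed location of the library (matches the install step further down).
PROXY_LIB="$HOME/lib/libproxychains4.so"

# run_preloaded CMD [ARGS...]
# Hypothetical helper: start CMD with LD_PRELOAD set, so the dynamic linker
# loads the proxychains library first and its versions of the network
# functions take precedence over the libc ones.
# Child processes inherit the variable, so they are proxied too.
run_preloaded() {
    env LD_PRELOAD="$PROXY_LIB" "$@"
}

# Example (needs the library and a reachable proxy):
#   run_preloaded python script.py
```

Because the override happens in the dynamic linker, statically linked programs are not affected by it and will not go through the proxy.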

First, install proxychains-ng. With the source in a folder like ~/software/proxychains-ng, build it from that folder like this:

$ ./configure --prefix=$HOME
$ make
$ mkdir ~/lib
$ cp libproxychains4.so ~/lib/

Then set up your path and point proxychains at its config file:

$ PATH=~/software/proxychains-ng:$PATH
$ export PROXYCHAINS_CONF_FILE=~/software/proxychains-ng/proxychains.conf
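
Before going further it can save some head scratching to check that both settings took effect. This is just a sketch; check_proxychains_setup is a name I made up, and it only checks that the binary is findable and the config file is readable.

```shell
# check_proxychains_setup
# Hypothetical sanity check: returns 0 if proxychains4 can be found on PATH
# and PROXYCHAINS_CONF_FILE points at a readable file, 1 otherwise.
check_proxychains_setup() {
    if ! command -v proxychains4 >/dev/null 2>&1; then
        echo "proxychains4 not found on PATH" >&2
        return 1
    fi
    if [ ! -r "${PROXYCHAINS_CONF_FILE:-}" ]; then
        echo "config file not readable: ${PROXYCHAINS_CONF_FILE:-<unset>}" >&2
        return 1
    fi
    return 0
}

# Example:
#   check_proxychains_setup && echo "setup looks fine"
```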

Second, write this script and save it as something like run-with-proxy.sh:

$ ssh -N -D 9050 beluga1 &
$ proxychains4 -q "$@"

Note: port 9050 is used because it is the default in the proxychains config.
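
If you would rather not rely on the shipped default, you can write your own minimal config. This is a sketch: write_proxychains_conf is a hypothetical helper name, and socks5 is my choice here because ssh -D speaks SOCKS5. The directives themselves (strict_chain, proxy_dns, [ProxyList]) are standard proxychains-ng ones.

```shell
# write_proxychains_conf PATH [PORT]
# Hypothetical helper: writes a minimal proxychains-ng config that sends all
# traffic through a SOCKS5 proxy on localhost (the ssh -D tunnel).
# PORT defaults to 9050 to match the ssh command in the wrapper script.
write_proxychains_conf() {
    conf_path=$1
    port=${2:-9050}
    mkdir -p "$(dirname "$conf_path")"
    cat > "$conf_path" <<EOF
# Use every proxy in the list, in order.
strict_chain
# Resolve host names through the proxy instead of locally.
proxy_dns
[ProxyList]
socks5 127.0.0.1 $port
EOF
}

# Example:
#   write_proxychains_conf "$HOME/.proxychains/proxychains.conf" 9050
#   export PROXYCHAINS_CONF_FILE="$HOME/.proxychains/proxychains.conf"
```

Whatever port you pick here has to be the same one you pass to ssh -D.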

Then you can proxy any command with:

$ run-with-proxy.sh python script.py

You can also run bash through the proxy so that every command in that shell has internet access:

$ run-with-proxy.sh bash
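
One thing the short wrapper glosses over is that ssh is started in the background and nothing waits for the tunnel to come up or shuts it down afterwards. Here is a slightly more careful sketch, assuming an OpenSSH client and that you can log in to the login node without a password prompt. The LOGIN_NODE, SOCKS_PORT, and CTL_SOCKET variables and the function names are my own, so adjust to taste.

```shell
#!/bin/sh
# run-with-proxy.sh (sketch): start a SOCKS tunnel, run a command through
# proxychains4, then shut the tunnel down again.

LOGIN_NODE=${LOGIN_NODE:-beluga1}   # login node that has internet access
SOCKS_PORT=${SOCKS_PORT:-9050}      # must match the proxychains config
CTL_SOCKET=${CTL_SOCKET:-$HOME/.ssh/proxy-tunnel-$$.sock}

start_tunnel() {
    # -f: go to the background once the connection is set up
    # -N: no remote command, -D: dynamic (SOCKS) forwarding
    # -M/-S: open a control socket so the tunnel can be stopped later
    # ExitOnForwardFailure: fail instead of running without the forward
    ssh -f -N -M -S "$CTL_SOCKET" \
        -o ExitOnForwardFailure=yes \
        -D "$SOCKS_PORT" "$LOGIN_NODE"
}

stop_tunnel() {
    # Ask the backgrounded ssh to exit via its control socket.
    ssh -S "$CTL_SOCKET" -O exit "$LOGIN_NODE" 2>/dev/null
}

run_with_proxy() {
    start_tunnel || return 1
    trap stop_tunnel EXIT       # clean up when the script exits
    proxychains4 -q "$@"
}

# Only run when a command was given, e.g. run-with-proxy.sh python script.py
if [ "$#" -gt 0 ]; then
    run_with_proxy "$@"
fi
```

Using -f instead of a trailing & means the command only starts once ssh has finished connecting, and the control socket gives the script a way to stop that particular tunnel without hunting for its PID.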

More reading:

https://stackoverflow.com/questions/31639742/how-to-pass-all-pythons-traffics-through-a-http-proxy