Jupyter Notebook integration

Onedata can be easily integrated with Jupyter Notebooks via the OnedataFS Python library and the OnedataFS-Jupyter plugin. The integration covers two separate scenarios:

  • Storing the notebooks directly in Onedata data spaces
  • Accessing Onedata spaces from within the notebooks

This section explains the first of these scenarios. The second can be achieved by simply installing the OnedataFS Python library on the Jupyter Notebook server.
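
For the second scenario, a notebook cell can access a space directly through the OnedataFS PyFilesystem interface. The following is a minimal sketch, assuming the fs-onedatafs API, with the Oneprovider host, access token, and space name as placeholders:

from fs.onedatafs import OnedataFS

# Placeholders: replace with your Oneprovider host and access token
odfs = OnedataFS('<ONEPROVIDER_HOST>', '<ACCESS_TOKEN>')

# Standard PyFilesystem2 calls operate on the user's spaces
print(odfs.listdir('/'))                          # list available spaces
with odfs.open('/my-space/notes.txt', 'w') as f:  # '/my-space' is a placeholder space name
    f.write(u'written from a notebook cell')
odfs.close()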

The OnedataFS-Jupyter plugin implements the Jupyter Contents API, making an Onedata space the default storage backend for notebooks in a particular Jupyter instance.

Installation

The OnedataFS-Jupyter extension can be installed from the packages we provide for both Python 2 and Python 3, depending on the Python version used to start the Jupyter server.

Ubuntu

# Download the Onedata package installation script
wget https://get.onedata.org/oneclient.sh

# For Python2
pip install fs
sh oneclient.sh python-onedatafs-jupyter

# For Python3
pip3 install fs
sh oneclient.sh python3-onedatafs-jupyter

CentOS

Please note that the CentOS packages are distributed according to the Software Collections (SCL) standard, hence the additional `scl enable` step below, which activates the Onedata collection in the current shell.

# Download the Onedata package installation script
wget https://get.onedata.org/oneclient.sh

# For Python2
pip install fs
sh oneclient.sh onedata1802-python2-onedatafs-jupyter
scl enable onedata1802 bash

# For Python3
pip3 install fs
sh oneclient.sh onedata1802-python3-onedatafs-jupyter
scl enable onedata1802 bash

Anaconda

Since versions 18.02.2 and 19.02.0-rc1, the Onedata Jupyter plugin can be installed using Anaconda from the official Onedata conda channel:

conda install -c onedata onedatafs-jupyter

or, to install a specific version of onedatafs-jupyter:

conda install -c onedata onedatafs-jupyter=18.02.2

Furthermore, to ensure that conda does not upgrade Python in a given environment, it is possible to specify the exact build string of the onedatafs-jupyter package, which includes the Python version, e.g.:

conda install -c onedata onedatafs-jupyter=18.02.2=py36_0
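
Regardless of the installation method, it is worth checking that the plugin is importable by the Python interpreter that will run the Jupyter server. A minimal sanity check (the onedatafs_jupyter module name matches the one used in the configuration below; fs.onedatafs is assumed to be installed as its dependency):

# Run with the same Python interpreter that starts the Jupyter server
import fs.onedatafs       # OnedataFS PyFilesystem plugin
import onedatafs_jupyter  # the Jupyter Contents API plugin

print(fs.onedatafs.__file__)
print(onedatafs_jupyter.__file__)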

Usage

To configure Jupyter Notebook to work directly in an Onedata space, add the following lines to the Jupyter configuration file:

import sys

c = get_config()

c.NotebookApp.contents_manager_class = 'onedatafs_jupyter.OnedataFSContentsManager'

# Hostname or IP of the Oneprovider to which Jupyter should connect
c.OnedataFSContentsManager.oneprovider_host = u'datahub.egi.eu'

# The Onedata user access token
c.OnedataFSContentsManager.access_token = u'MDAzN2xvY2F00aW...'

# Name of the space where the notebooks should be stored
c.OnedataFSContentsManager.space = u'/experiment-1'

# Path within the data space, for instance to a subdirectory where the Jupyter
# notebooks should be stored; must be relative (i.e. cannot start with `/`)
c.OnedataFSContentsManager.path = u''

# When True, allow connection to Oneprovider instances without trusted certificates
c.OnedataFSContentsManager.insecure = True

# When True, disables internal OnedataFS buffering; set this to False for
# use cases handling larger files
c.OnedataFSContentsManager.no_buffer = True

# With these settings, all data transfers between Jupyter and Onedata are performed
# in ProxyIO mode, without direct access to the backend storage. This is fine for
# testing and for use cases with small files; for high-performance workloads,
# reverse the two values below
c.OnedataFSContentsManager.force_proxy_io = True
c.OnedataFSContentsManager.force_direct_io = False

# Set the log level
c.Application.log_level = 'DEBUG'

# The following line disables Jupyter authentication; for production deployments,
# remove it or provide a custom token
c.NotebookApp.token = ''
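
Before starting the server, it can be useful to verify that the Oneprovider host and access token are valid. The sketch below feeds the same values to the OnedataFS constructor directly; the argument names assume the fs-onedatafs API, and the token is the placeholder from the configuration above:

from fs.onedatafs import OnedataFS

# Same connection parameters as in the Jupyter configuration above
odfs = OnedataFS(
    'datahub.egi.eu',
    'MDAzN2xvY2F00aW...',
    insecure=True,
    force_proxy_io=True,
)
print(odfs.listdir('/'))  # should include the space configured above
odfs.close()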

When starting Jupyter using Docker (assuming the container contains all necessary dependencies), the configuration file can be easily mapped into the container using the volume option, e.g.:

docker run -v $PWD/my_jupyter_notebook_config.py:/root/.jupyter/jupyter_notebook_config.py -it onedata/onedatafs-jupyter

If you do not have a Jupyter configuration file yet, generate one using:

jupyter notebook --generate-config