Metadata

In the Onedata system, every file and directory has metadata organized into two levels:

  • Filesystem attributes — built-in filesystem metadata such as file size, creation and modification timestamps, POSIX permissions, etc.
  • Custom metadata — arbitrary, user-defined information:
    • extended attributes — a.k.a. basic attributes or xattrs; key-value pairs, compatible with POSIX extended attributes,
    • JSON document,
    • RDF document (XML).

All types of custom metadata can coexist for the same file/directory.

 


See the documentation here.

Extended attributes example

sha256_checksum=3a79b5d0caca97e7da16822f9a6458297c32708ce4b031c4260fa6957bf491f7
md5_checksum=30a19c0212685c096ed8620ebca1e598
checksum_generation_timestamp=1700732456
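Attributes like these could be produced with a few lines of Python. The following is a minimal sketch using only the standard library; the helper name checksum_xattrs is hypothetical, and attaching the resulting pairs to a file is left to whichever Onedata interface you use.

```python
import hashlib
import time
from typing import Dict, Optional


def checksum_xattrs(data: bytes, now: Optional[int] = None) -> Dict[str, str]:
    """Build key-value pairs shaped like the example above (hypothetical helper)."""
    # Use the current Unix time unless a fixed timestamp is supplied.
    timestamp = int(time.time()) if now is None else now
    return {
        "sha256_checksum": hashlib.sha256(data).hexdigest(),
        "md5_checksum": hashlib.md5(data).hexdigest(),
        "checksum_generation_timestamp": str(timestamp),
    }


if __name__ == "__main__":
    for key, value in checksum_xattrs(b"example file content").items():
        print(f"{key}={value}")
```

All values are kept as strings here, since extended attributes are plain key-value pairs.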

JSON metadata example

{
  "imageWidth": 1024,
  "imageHeight": 768,
  "imageCategory": "nature",
  "keywords": [
    "mountains",
    "sky",
    "winter"
  ]
}

RDF metadata example

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:si="https://www.mypage.fake/rdf/">

   <rdf:Description rdf:about="file">
      <si:title>Mountains during the winter</si:title>
      <si:author>John Doe</si:author>
   </rdf:Description>

</rdf:RDF>
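Before pasting an RDF document into the editor, it can help to check that it is well-formed XML. The sketch below uses Python's standard xml.etree.ElementTree to parse a document mirroring the example above and list the properties of its rdf:Description element. The function name describe is hypothetical, and the si namespace is the placeholder from the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_DOCUMENT = """<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:si="https://www.mypage.fake/rdf/">

   <rdf:Description rdf:about="file">
      <si:title>Mountains during the winter</si:title>
      <si:author>John Doe</si:author>
   </rdf:Description>

</rdf:RDF>"""


def describe(rdf_xml: str) -> dict:
    """Return the properties of the first rdf:Description element."""
    root = ET.fromstring(rdf_xml)  # raises ParseError if the XML is not well-formed
    description = root.find(f"{{{RDF_NS}}}Description")
    properties = {"about": description.get(f"{{{RDF_NS}}}about")}
    for child in description:
        # ElementTree reports tags as "{namespace}name"; keep only the name.
        properties[child.tag.split("}", 1)[1]] = child.text
    return properties


if __name__ == "__main__":
    print(describe(RDF_DOCUMENT))
```

This only checks XML well-formedness and reads the values back; it does not validate the document against any RDF vocabulary.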

Exercise

Assigning custom JSON metadata to files

In this exercise, we will assign custom metadata to several image files, describing the properties of the images they contain.

  1. Go to the file browser of the space alpha-11p, and enter the directory $yourName/FirstDataset/Photos.
  2. Right-click on the file artem.jpg and choose Metadata.
  3. Switch to the JSON editor and enter:
    {
       "imageAuthor": "???",
       "imageCategory": "nature",
       "imageWidth": 1680,
       "imageHeight": 1200
    }
    
    but change the imageAuthor to your first name.
  4. Click the Save button.
  5. Repeat the procedure for a few more files in your directory (at least 4). While doing so:
    • enter a random imageWidth (between 500 and 2000),
    • enter a random imageHeight (between 300 and 1500),
    • change the imageCategory to one of the three that best fits the content: nature, space, or city.

The made-up image sizes will be used later in the data discovery exercises, so try to vary the numbers.
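If you prefer to generate the values rather than invent them, the sketch below produces JSON documents in the shape used in this exercise, with random sizes inside the suggested ranges. The function name make_image_metadata and the author "Alice" are illustrative; the output still has to be pasted into the JSON editor by hand.

```python
import json
import random

CATEGORIES = ("nature", "space", "city")


def make_image_metadata(author: str, category: str, rng: random.Random) -> dict:
    """Build one metadata document with random sizes in the exercise's ranges."""
    if category not in CATEGORIES:
        raise ValueError(f"category must be one of {CATEGORIES}")
    return {
        "imageAuthor": author,
        "imageCategory": category,
        "imageWidth": rng.randint(500, 2000),   # inclusive on both ends
        "imageHeight": rng.randint(300, 1500),  # inclusive on both ends
    }


if __name__ == "__main__":
    rng = random.Random()
    for category in CATEGORIES:
        print(json.dumps(make_image_metadata("Alice", category, rng), indent=2))
```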

Exercise

Assigning custom RDF metadata to a directory

  1. Go to the file browser of the space alpha-11p and enter your directory.
  2. Right-click on the directory FirstDataset and choose Metadata.
  3. Switch to the RDF editor and enter:
    <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:si="https://www.mypage.fake/rdf/">
    
       <rdf:Description rdf:about="directory">
          <si:title>Dataset with a choice of photos</si:title>
          <si:author>Various</si:author>
       </rdf:Description>
    
    </rdf:RDF>
    
  4. Click the Save button.

Question

Proper annotation of data is just half of the story.

What if there are thousands or millions of files and we need to find those whose metadata meets specific constraints?

Harvesters and Data discovery

A harvester acts as a metadata sink that brings together multiple sources of metadata. It continuously scans associated spaces and collects the file metadata, which is fed into a harvesting backend (e.g. Elasticsearch) for indexing.

The Data discovery feature allows browsing the collected metadata and querying indices using a graphical interface or REST API. See the documentation here.

How harvesters work

  1. A space is attached to the harvester (multiple spaces can be attached to one harvester).
  2. The harvesting process collects metadata from all files in the space and places them in Elasticsearch indices assigned to the harvester.
  3. Any later changes (metadata modifications, additions, deletions, or file movements) are dynamically reflected in the indices.
  4. The harvester provides a ready-to-use web application that allows querying the indices and presenting results to the user.
  5. The web app is pluggable; a custom implementation can be tailored to the needs of a specific user group.
  6. The harvester can be made public. In that case, anyone — including non-Onedata users — can visit the metadata browser and query for the aggregated metadata.
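The flow described above can be pictured with a toy in-memory model. This is a conceptual sketch only, not Onedata's implementation: the class ToyHarvester, its methods, and the event shape are all hypothetical, and a plain dictionary stands in for the Elasticsearch indices.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHarvester:
    """Conceptual stand-in for a harvester; not Onedata's implementation."""

    spaces: set = field(default_factory=set)
    index: dict = field(default_factory=dict)  # (space, file_id) -> metadata

    def attach_space(self, space: str) -> None:
        self.spaces.add(space)

    def on_change(self, space: str, file_id: str, metadata) -> None:
        """Apply one change event; metadata=None means the entry was deleted."""
        if space not in self.spaces:
            return  # changes in spaces that are not attached are ignored
        if metadata is None:
            self.index.pop((space, file_id), None)
        else:
            self.index[(space, file_id)] = metadata

    def query(self, predicate) -> list:
        """Return all indexed documents for which predicate(document) is true."""
        return [doc for doc in self.index.values() if predicate(doc)]


if __name__ == "__main__":
    harvester = ToyHarvester()
    harvester.attach_space("alpha-11p")
    harvester.on_change("alpha-11p", "artem.jpg", {"imageCategory": "nature"})
    print(harvester.query(lambda doc: doc.get("imageCategory") == "nature"))
```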

Exercise

Creating a new harvester

  1. Click on the Discovery item in the main menu, then in the top right corner of the sidebar, click on the + icon.
  2. Fill in the Name field with a name of your choice.
  3. Keep the default settings for the rest: leave the use default backend and auto setup toggles checked.
  4. Click the Create button.

Note: Elasticsearch is not a built-in service of Onedata. It needs to be configured separately. Typically, it is deployed alongside the Onezone service.

Harvester indices

  • indices allow performing database-like queries against the fields defined by the index schema (either provided by a user or generated automatically),
  • a harvester may contain any number of indices,
  • each index will only accept objects that match its schema, effectively splitting the harvested metadata into subsets,
  • during harvester creation, Onedata automatically creates the generic-index index associated with that harvester — it is based on a schema that accepts any data format,
  • the Indices view shows the progress of harvesting for each provider-index pair (collectively for all attached spaces).
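The idea that each index accepts only the objects matching its schema can be illustrated with a toy model. This sketch rests on simplifying assumptions: a schema is reduced to a mapping from field name to expected Python type, and the index named images is hypothetical. In a real deployment, acceptance is decided by the harvesting backend and its index schema.

```python
def matches(schema: dict, document: dict) -> bool:
    """True if every schema field present in the document has the expected type."""
    return all(
        isinstance(document[name], expected)
        for name, expected in schema.items()
        if name in document
    )


def route(indices: dict, document: dict) -> list:
    """Return the names of all indices whose schema accepts the document."""
    return [name for name, schema in indices.items() if matches(schema, document)]


INDICES = {
    "generic-index": {},  # empty schema: accepts any document
    "images": {"imageWidth": int, "imageHeight": int},  # hypothetical index
}

if __name__ == "__main__":
    print(route(INDICES, {"imageWidth": 1024, "imageHeight": 768}))
    print(route(INDICES, {"imageWidth": "wide"}))
```

In this model, a document with a textual imageWidth lands only in generic-index, which is how harvested metadata ends up split into subsets.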

Exercise

Attaching a space and harvesting metadata

  1. Navigate to the harvester you have just created. Click on the Spaces tab in the sidebar.
  2. The list of harvested spaces is empty. Click the Add one of your spaces button to attach one.
  3. Choose the space alpha-11p from the dropdown and then click the Add button.
  4. Go to the Indices view via the sidebar and click on the generic-index entry to expand it.
  5. Wait until all progress indicators reach 100%.
  6. Click on the Data discovery tab in the sidebar to bring up the metadata browser GUI.

Metadata browser GUI

[Screenshot: the metadata browser GUI]

Exercise

Querying harvested metadata

  1. Locate the query builder at the top of the metadata browser. Use it to define the following query: __onedata.fileName.keyword is artem.jpg.

  2. Click the Query button next to the query builder.

    The results will contain all entries matching your query. There should be multiple results showing files with the same name. The duplication is caused by every participant using the same set of files.

  3. Try to build another query: imageCategory.keyword is city AND imageWidth > 1000. Click on the Query button to see the results.

    Play around with the constraints in the above query. You can adjust the values in the query builder: simply click on a value to modify it. After applying changes, click outside of the editor to finalize.
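For readers curious what such a query looks like outside the query builder, the second query above could be written in Elasticsearch's query DSL roughly as follows. This is a sketch, not the exact request the GUI sends: the query body uses standard Elasticsearch bool, term, and range clauses, while the helper matches_locally is a hypothetical plain-Python equivalent for checking expectations against sample documents.

```python
import json

# "imageCategory.keyword is city AND imageWidth > 1000" in Elasticsearch query DSL.
QUERY_BODY = {
    "query": {
        "bool": {
            "must": [
                {"term": {"imageCategory.keyword": "city"}},   # exact match
                {"range": {"imageWidth": {"gt": 1000}}},       # strictly greater
            ]
        }
    }
}


def matches_locally(document: dict) -> bool:
    """Plain-Python equivalent of QUERY_BODY for a single JSON metadata document."""
    return (
        document.get("imageCategory") == "city"
        and document.get("imageWidth", 0) > 1000
    )


if __name__ == "__main__":
    print(json.dumps(QUERY_BODY, indent=2))
    print(matches_locally({"imageCategory": "city", "imageWidth": 1680}))
```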

Exercise

Updating harvested metadata

  1. Go to the file browser of the space alpha-11p and enter your directory.

  2. Open the metadata information for the file baskin.jpg (in FirstDataset/Photos), set a new basic metadata entry: favourite = true, and save changes.

  3. Go back to the metadata browser GUI of your harvester.

  4. Build a query that finds the file you have just modified (use the __onedata.xattrs.favourite.__value.keyword property), then click on the Query button.

    Your file should be present in the results list. If it isn't, try again after a few seconds, because the harvesting process aggregates changes with a slight delay. Take a look at the Indices view to see the current aggregation status.
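When querying from a script rather than the GUI, the same "try again after a few seconds" advice can be turned into a small retry loop. This is a sketch under stated assumptions: run_query is a hypothetical callable you would supply (for example, a wrapper around a REST call), the helper wait_for_results is illustrative, and the value "true" assumes the attribute was stored as the text true.

```python
import time

# Term query on the property used in this exercise; the value "true" is an assumption.
FAVOURITE_QUERY = {
    "query": {"term": {"__onedata.xattrs.favourite.__value.keyword": "true"}}
}


def wait_for_results(run_query, query, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call run_query(query) until it returns a non-empty list or attempts run out."""
    for attempt in range(attempts):
        results = run_query(query)
        if results:
            return results
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give the harvesting process time to catch up
    return []


if __name__ == "__main__":
    # Fake backend that "sees" the change only on the third call.
    calls = {"count": 0}

    def fake_run_query(query):
        calls["count"] += 1
        return ["baskin.jpg"] if calls["count"] >= 3 else []

    print(wait_for_results(fake_run_query, FAVOURITE_QUERY, delay_seconds=0.1))
```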

Exercise

Attaching additional spaces

  1. Go to the Spaces view of your harvester.
  2. Click on the "three dots" menu trigger in the top-right corner of the page.
  3. Using the option Add one of your spaces, attach 2-3 more spaces (whichever you like).
  4. Go to the Indices view of your harvester, click on the generic-index item, and wait until all progress indicators reach 100%.
  5. Go to the metadata browser GUI of your harvester and try to discover new files.

    Tip: try building a query with the space property.

Exercise

Making the harvester public

  1. Go to the Configuration tab of your harvester.
  2. Click the Edit button under the form.
  3. Toggle the Public option and then save changes.
  4. Copy the Public URL and open it in a new private window of your browser (to make sure you are not signed in).
  5. Test some queries.

Harvester privileges

Privilege        | Actions
View harvester   | view harvester contents and members
Modify harvester | edit harvester details
Remove harvester | remove harvester
Add space        | add space to harvester
Remove space     | remove space from harvester
...              | manage harvester's memberships in regard to other resources

Space privileges regarding harvesters

Privilege        | Actions
Add harvester    | add harvester to space
Remove harvester | remove harvester from space

Next chapter:

Datasets, archives — practice