# File popularity

As a prerequisite for understanding this chapter, we recommend that you familiarize yourself with the concept of views.

The file popularity mechanism enables tracking of usage statistics for files in a space. It allows listing File IDs sorted in ascending order by the popularity function, so that the least popular files are at the beginning of the list.

NOTE: Usage statistics can be collected only for local storage supporting the space. It is impossible to obtain file popularity statistics gathered by a remote provider.

The mechanism can be enabled for a chosen space in the Spaces -> "Space Name" -> File popularity tab of the Oneprovider panel GUI (as shown below), or using the REST API.

[Screenshot: the File popularity tab]

Internally, the mechanism creates the file popularity view. All notes presented in the Views chapter also apply to the file popularity view.

NOTE: The file popularity view is a special view, so creating another view with the same name is forbidden. It is also impossible to modify or delete the file popularity view using the Views API.

# Querying the file popularity view

The file popularity view can be queried using the following request:

curl -sS -k -H "X-Auth-Token:$TOKEN" -X GET https://$HOST/api/v3/oneprovider/spaces/$SPACE_ID/views/file-popularity/query

An example of such a request is presented in the file popularity configuration tab of the Onepanel GUI. The example request returns the 10 least popular files in the space.

For more information on querying views, see the Views chapter.
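For scripted use, the same request can be assembled in Python. The following is a minimal sketch using only the standard library; the function name and the host, space ID, and token values are hypothetical placeholders, and the sketch only builds the request without sending it.

```python
import urllib.request


def build_popularity_query(host: str, space_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the GET request that queries the file popularity view."""
    url = (
        f"https://{host}/api/v3/oneprovider/spaces/"
        f"{space_id}/views/file-popularity/query"
    )
    # Same header as in the curl example: X-Auth-Token carries the access token.
    return urllib.request.Request(url, headers={"X-Auth-Token": token}, method="GET")


# Hypothetical values, for illustration only.
request = build_popularity_query("provider.example.com", "SPACE_ID", "TOKEN")
print(request.get_method(), request.full_url)
```

The request can then be passed to `urllib.request.urlopen`. Note that the curl example uses `-k` to skip certificate verification; the sketch above leaves verification enabled.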

# Advanced topics

# The popularity function

The key that is emitted to the file popularity view is the value of the popularity function for a given file. The function is defined as follows:

P(lastOpenHour, avgOpenCountPerDay) = w1 * lastOpenHour + w2 * min(avgOpenCountPerDay, MAX_AVG_OPEN_COUNT_PER_DAY)

where:

  • lastOpenHour — parameter which is equal to the timestamp (in hours since 01.01.1970) of the last open operation on the file,
  • w1 — the weight of the lastOpenHour parameter,
  • avgOpenCountPerDay — parameter equal to the moving average of the number of open operations on the file per day — the value is calculated over the last 30 days,
  • w2 — weight of the avgOpenCountPerDay parameter,
  • MAX_AVG_OPEN_COUNT_PER_DAY — upper boundary for the avgOpenCountPerDay parameter.
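The definition above can be written as a short Python function. This is an illustrative sketch, not the provider's actual implementation; the function and argument names are chosen here for readability, and the default argument values are the defaults documented in this chapter.

```python
def popularity(last_open_hour, avg_open_count_per_day,
               w1=1.0, w2=20.0, max_avg_open_count_per_day=100.0):
    """Evaluate P(lastOpenHour, avgOpenCountPerDay) as defined above."""
    # The average is capped so that extremely busy files do not dominate the ranking.
    capped_avg = min(avg_open_count_per_day, max_avg_open_count_per_day)
    return w1 * last_open_hour + w2 * capped_avg


# Two files last opened in the same hour: the cap makes 250 opens/day
# count the same as 100 opens/day.
busy = popularity(480000, 250)
capped = popularity(480000, 100)
print(busy, capped)
```

Lower values sort first in the view, so files with the smallest result are reported as the least popular.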

Entries in a view are modified only when the associated document in the database is modified. An entry in the file popularity view is therefore modified only when the file popularity model document is updated, which happens on each file close operation. The downside of this approach is that avgOpenCountPerDay may not be recalculated in certain circumstances, so a file may remain indexed as “popular” indefinitely, contrary to its actual popularity. This happens when a file was used intensively for some time but has not been opened since, so no recalculation was triggered to update its popularity. This is why the lastOpenHour parameter is used in the popularity function: it balances the influence of the avgOpenCountPerDay parameter.
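To see how lastOpenHour offsets a stale average, the sketch below estimates how much later a rarely used file must be opened to reach the same popularity as a file whose average is frozen at a high value. The helper name `hours_until_overtaken` is hypothetical, and the constants are the default parameter values documented in this chapter.

```python
# Default parameter values, as documented in this chapter.
W1, W2, MAX_AVG = 1.0, 20.0, 100.0


def popularity(last_open_hour, avg_open_count_per_day):
    return W1 * last_open_hour + W2 * min(avg_open_count_per_day, MAX_AVG)


def hours_until_overtaken(stale_avg, fresh_avg):
    """Gap between last opens (in hours) at which both files score the same.

    Solves W1 * (T + d) + W2 * fresh = W1 * T + W2 * stale for d.
    """
    return W2 * (min(stale_avg, MAX_AVG) - min(fresh_avg, MAX_AVG)) / W1


# Stale file: average frozen at the cap. Fresh file: opened once in 30 days.
gap_hours = hours_until_overtaken(100, 1 / 30)
print(round(gap_hours), "hours, i.e. about", round(gap_hours / 24), "days")
```

Under these assumptions, a file opened only once ranks above the stale file as soon as its last open is roughly 2000 hours (about 83 days) later, so the stale file does not stay at the top of the ranking forever.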

# Default parameters

The default parameter values of the popularity function are as follows:

  • w1 = 1.0
  • w2 = 20.0
  • MAX_AVG_OPEN_COUNT_PER_DAY = 100

With the default value of MAX_AVG_OPEN_COUNT_PER_DAY, all files with avgOpenCountPerDay > 100 are treated as equally popular in terms of open frequency.

With the above values of w1 and w2, the following two files have similar calculated popularity:

  • a file that has been opened just once,
  • a file that was opened about 1000 times in the month preceding its last open, where that last open took place a month before the former file was opened.

These weights were estimated using the following approach:

Assume that we have 2 files: F1 and F2.

F1 was opened at timestamp (in hours) T1.
F1 - lastOpenHour1 = T1
   - number of opens in the month preceding last open: opensCount1 = 1
   - avgOpenCountPerDay1 = avg1 = opensCount1 / 30 = 1 / 30
   
F2 was opened a month earlier than T1 for the last time.
F2 - lastOpenHour2 = T2 = T1 - 30 * 24
   - number of opens in the month preceding last open: opensCount2 = 1000
   - avgOpenCountPerDay2 = avg2 = opensCount2 / 30 = 1000 / 30

Calculate popularity for both files:

P1 = P(lastOpenHour1, avgOpenCountPerDay1)
P1 = w1 * T1 + w2 * min(avg1, 100)
P1 = w1 * T1 + w2 * avg1

P2 = P(lastOpenHour2, avgOpenCountPerDay2)
P2 = w1 * T2 + w2 * min(avg2, 100)
P2 = w1 * T2 + w2 * avg2
P2 = w1 * (T1 - 720) + w2 * avg2

We want P1 = P2:

w1 * T1 + w2 * avg1 = w1 * (T1 - 720) + w2 * avg2
w1 * T1 + w2 * avg1 = w1 * T1 - w1 * 720 + w2 * avg2
w1 * 720 = w2 * (avg2 - avg1)
w1 / w2 = (avg2 - avg1)/720
w1 / w2 = 999 / 21600

We can set w1 := 1 and therefore we have:

w2 = 21600 / 999 ~= 21.62

Finally, for simplicity, we round w2 down and set:
w1 := 1.0
w2 := 20.0
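The arithmetic of this derivation can be checked with exact fractions. The sketch below is only a verification aid; the helper name `popularity_gap` is hypothetical. It relies on the fact that neither average reaches the cap of 100, so the min() in the popularity function has no effect here.

```python
from fractions import Fraction

HOURS_PER_MONTH = 30 * 24  # 720, the distance between T1 and T2

avg1 = Fraction(1, 30)      # F1: one open in the preceding month
avg2 = Fraction(1000, 30)   # F2: 1000 opens in the preceding month

# From w1 * 720 = w2 * (avg2 - avg1), with w1 = 1.
exact_w2 = Fraction(HOURS_PER_MONTH) / (avg2 - avg1)


def popularity_gap(w1, w2):
    """Return P1 - P2 for the two files; the T1 term cancels out."""
    return w1 * HOURS_PER_MONTH - w2 * (avg2 - avg1)


print(exact_w2, float(exact_w2))    # exact ratio and its decimal approximation
print(popularity_gap(1, exact_w2))  # gap with the exact weight
print(popularity_gap(1, 20))        # gap with the rounded default weight
```

With the exact weight the two files score identically. With the rounded default w2 = 20, F1 scores 54 higher than F2, which is equivalent to 54 hours of lastOpenHour, so the two files remain close in the ranking.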

# Tuning the file popularity function

The three parameters of the function: w1, w2, and MAX_AVG_OPEN_COUNT_PER_DAY can be modified in the file popularity configuration panel.

NOTE: Modifying the popularity function parameters changes the mapping function of the file popularity view. This means that all already indexed files need to be re-indexed. Such an operation can be very time-consuming, depending on the number of files in the space.

NOTE: The same applies to disabling and re-enabling the mechanism. Disabling the mechanism deletes the view, so re-enabling it results in re-indexing of all files in the space.

WARNING: Modification of the popularity function must be performed with care.

# REST API

All operations related to file popularity can be performed using the REST API. Refer to the linked API documentation for detailed information and examples.

| Request | Link to API |
|---|---|
| Get file popularity configuration | API |
| Update file popularity configuration | API |