- Querying the file-popularity view
- Advanced topics
The file-popularity mechanism tracks usage statistics for files in the
selected space. It allows listing file IDs sorted in ascending order of the
popularity function value, so that the least popular files
are at the beginning of the list.
NOTE: Usage statistics can be collected only for local storage supporting the space. It is impossible to obtain file-popularity statistics for a remote provider.
For a given space, the mechanism can be enabled in the Oneprovider's Onepanel, in the space configuration tab.
Querying the file-popularity view
The file-popularity view can be queried using the following request:
curl -sS -k -H "X-Auth-Token:$TOKEN" -X GET https://$HOST/api/v3/oneprovider/spaces/$SPACE_ID/views/file-popularity/query
An example of such a request is presented in the file-popularity configuration tab. The example request returns the 10 least popular files in the space.
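The same request can also be issued from a script. The following is a minimal Python sketch using only the standard library. The helper names build_query_url and query_file_popularity are hypothetical, and passing limit as a query parameter is an assumption that should be checked against the views API documentation. The insecure flag mirrors curl's -k option and is meant for testing only.

```python
import ssl
import urllib.parse
import urllib.request


def build_query_url(host, space_id, params=None):
    """Return the file-popularity query URL for the given space."""
    url = (
        f"https://{host}/api/v3/oneprovider/spaces/"
        f"{urllib.parse.quote(space_id, safe='')}/views/file-popularity/query"
    )
    if params:
        # Query parameter names (e.g. "limit") are assumptions, not confirmed here.
        url += "?" + urllib.parse.urlencode(params)
    return url


def query_file_popularity(host, space_id, token, params=None, insecure=False):
    """Send the GET request and return the raw response body as text."""
    request = urllib.request.Request(
        build_query_url(host, space_id, params),
        headers={"X-Auth-Token": token},
        method="GET",
    )
    context = ssl.create_default_context()
    if insecure:
        # Equivalent of curl's -k: skip certificate verification (testing only).
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode("utf-8")
```

For example, query_file_popularity(HOST, SPACE_ID, TOKEN, params={"limit": 10}) would request the first 10 entries, assuming the limit parameter is supported by the endpoint.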
Advanced topics
As a prerequisite for understanding this section, we advise familiarizing yourself with the concept of Onedata views.
Internally, the mechanism creates the file-popularity view. All notes presented in the Advanced metadata queries section also apply to the file-popularity view.
NOTE: The file-popularity view is a special view, so it is forbidden to create a view with this name. Furthermore, the view cannot be modified or deleted using the Onedata Views API.
The popularity function
The key emitted to the file-popularity view is the value of the
popularity function for a given file.
The function is defined as follows:
P(lastOpenHour, avgOpenCountPerDay) = w1 * lastOpenHour + w2 * min(avgOpenCountPerDay, MAX_AVG_OPEN_COUNT_PER_DAY)
where:
- lastOpenHour: timestamp (in hours since 01.01.1970) of the last open operation on the given file
- w1: weight of the lastOpenHour parameter
- avgOpenCountPerDay: moving average of the number of open operations on the given file per day, calculated over the last 30 days
- w2: weight of the avgOpenCountPerDay parameter
- MAX_AVG_OPEN_COUNT_PER_DAY: upper bound for the avgOpenCountPerDay parameter
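The definition above can be written as a short Python function. This is an illustrative sketch, not the code that runs inside the view. The function and argument names are hypothetical, and the default arguments are the default values listed in the configuration section of this page.

```python
def popularity(last_open_hour, avg_open_count_per_day,
               w1=1.0, w2=20.0, max_avg_open_count_per_day=100):
    """Evaluate P(lastOpenHour, avgOpenCountPerDay) as defined above."""
    # The average is capped, so very heavily used files all get the same bonus.
    capped_avg = min(avg_open_count_per_day, max_avg_open_count_per_day)
    return w1 * last_open_hour + w2 * capped_avg


# A file last opened at hour 450000, with 5 opens per day on average.
print(popularity(450000, 5))    # 450100.0
# Averages above the cap contribute as if they were exactly 100.
print(popularity(450000, 250))  # 452000.0
```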
Entries in a view are modified only when the associated document
in the database is modified. This means that a file's entry in the file-popularity view
is modified only when that
document is updated, which happens on each close operation on the file.
It is possible that a file that has been used intensively will never be opened again.
In that case avgOpenCountPerDay will not be recalculated and will stay
at a very high value. If the popularity of the file were estimated based only on
this parameter, such a file would stay "popular" forever. To cope with this issue,
the lastOpenHour parameter is used in the popularity function.
This parameter is responsible for "balancing" the importance of the avgOpenCountPerDay parameter.
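To illustrate the balancing effect, the sketch below compares a file that was used heavily but last opened a year ago with a file that was opened once just now. The weights are the documented defaults. The timestamp, file labels and helper names are hypothetical.

```python
def popularity(last_open_hour, avg_open_count_per_day, w1, w2, max_avg=100):
    return w1 * last_open_hour + w2 * min(avg_open_count_per_day, max_avg)


NOW_HOUR = 480_000  # hypothetical current timestamp, in hours since 01.01.1970

# (lastOpenHour, avgOpenCountPerDay) for two hypothetical files
FILES = {
    "stale_heavy": (NOW_HOUR - 365 * 24, 100),  # average stuck at a high value
    "fresh_light": (NOW_HOUR, 1 / 30),          # opened once, just now
}


def least_popular_first(w1, w2):
    """Return file labels in ascending order of the popularity function."""
    return sorted(FILES, key=lambda name: popularity(*FILES[name], w1, w2))


# Without lastOpenHour (w1 = 0), the abandoned file still looks the most popular.
print(least_popular_first(w1=0.0, w2=20.0))  # ['fresh_light', 'stale_heavy']
# With the default weights, the abandoned file becomes the least popular one.
print(least_popular_first(w1=1.0, w2=20.0))  # ['stale_heavy', 'fresh_light']
```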
Configuration of the mechanism
The three parameters of the function (w1, w2 and MAX_AVG_OPEN_COUNT_PER_DAY)
can be modified in the file-popularity configuration panel.
NOTE: Modification of the popularity function parameters results in modification of the mapping function of the file-popularity view. This means that all already indexed files will be re-indexed. Such an operation can be very time-consuming, as its duration depends on the number of files in the space.
NOTE2: The same applies to disabling and re-enabling the mechanism. Disabling the mechanism deletes the view, so re-enabling it results in re-indexing all files in the space.
MODIFICATION OF THE popularity FUNCTION MUST BE PERFORMED WITH CARE!
The default values of the popularity function parameters are as follows:
w1 = 1.0
w2 = 20.0
MAX_AVG_OPEN_COUNT_PER_DAY = 100
The default value of MAX_AVG_OPEN_COUNT_PER_DAY means that we assume that
files with avgOpenCountPerDay > 100 have the same popularity as far as this parameter is concerned and should
be treated as equally popular.
The above values of w1 and w2 mean that a file that has been opened just once
has "similar" popularity to a file that was opened about 1000 times in the month
preceding its last open, where that last open was performed a month (30 days) before
the open of the former file.
The estimation of the default weights is presented below.
Estimation of default weights
In the default configuration, we would like to keep a balance between the lastOpenHour and
avgOpenCountPerDay parameters in such a way that a file that has been opened just once has
the same popularity function value as a file that was
opened for the last time a month earlier and
was opened about 1000 times in the month preceding that last open.
Assume that we have 2 files: F1 and F2.

The file F1 was opened at timestamp (in hours) T1.
F1
- lastOpenHour1 = T1
- number of opens in the month preceding the last open: NO1 = 1
- avgOpenCountPerDay1 = avg1 = NO1/30 = 1/30

The file F2 was opened for the last time a month earlier than T1.
F2
- lastOpenHour2 = T2 = T1 - 30 * 24
- number of opens in the month preceding the last open: NO2 = 1000
- avgOpenCountPerDay2 = avg2 = NO2/30 = 1000/30

Let's compute the popularity of both files:
P1 = P(lastOpenHour1, avgOpenCountPerDay1)
P1 = w1 * T1 + w2 * min(avg1, 100)
P1 = w1 * T1 + w2 * avg1

P2 = P(lastOpenHour2, avgOpenCountPerDay2)
P2 = w1 * T2 + w2 * min(avg2, 100)
P2 = w1 * T2 + w2 * avg2
P2 = w1 * (T1 - 720) + w2 * avg2

We want P1 = P2:
w1 * T1 + w2 * avg1 = w1 * (T1 - 720) + w2 * avg2
w1 * T1 + w2 * avg1 = w1 * T1 - w1 * 720 + w2 * avg2
w1 * 720 = w2 * (avg2 - avg1)
w1/w2 = (avg2 - avg1)/720
w1/w2 = 999/21600

We can set w1 := 1 and therefore we have:
w2 = 21600/999 ~= 21.62

Finally, to make it simpler, we set:
w1 := 1.0
w2 := 20.0
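The arithmetic of this estimation can be checked with exact fractions. The sketch below is only a verification aid. The value of T1 and the helper names are hypothetical, and T1 cancels out of the result.

```python
from fractions import Fraction

HOURS_PER_MONTH = 30 * 24    # 720
avg1 = Fraction(1, 30)       # F1: one open in the preceding month
avg2 = Fraction(1000, 30)    # F2: 1000 opens in the preceding month


def popularity(last_open_hour, avg, w1, w2, max_avg=100):
    return w1 * last_open_hour + w2 * min(avg, max_avg)


w1 = 1
# From w1 * 720 = w2 * (avg2 - avg1)
w2_exact = Fraction(HOURS_PER_MONTH) / (avg2 - avg1)


def gap(w2, t1=480_000):
    """Return P1 - P2 for the two files described above."""
    p1 = popularity(t1, avg1, w1, w2)
    p2 = popularity(t1 - HOURS_PER_MONTH, avg2, w1, w2)
    return p1 - p2


print(float(w2_exact))  # approximately 21.62
print(gap(w2_exact))    # 0
print(gap(20))          # 54
```

With the exact weight the two popularity values are equal. With the rounded default w2 = 20 they differ by 54, which corresponds to 54 hours of lastOpenHour, so the two files are "similar" rather than identical in popularity.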
All operations presented in the GUI can also be performed using the REST API. Links to the documentation are presented below.
| Request | Link to API |
|---|---|
| Get file-popularity configuration | API |
| Update file-popularity configuration | API |