Computing a Rolling Median¶
Rolling Median in C++¶
A powerful use case for a skip list is in the computation of a rolling fraction, for example a rolling median.
Here is a reasonable C++ attempt at doing that with the arguments:
data
- A vector of data of typeT
of lengthL
.win_length
- a ‘window’ size. The median is computed over this number of values.result
- a destination vector for the result. This will either end up withL - win_length
values, alternatively is will beL
long and start withwin_length
NaN
s.
Rolling median code using a skip list might look like this, error checking is omitted:
#include "SkipList.h"
template <typename T>
void rolling_median(const std::vector<T> data,
size_t win_length,
std::vector<T> &result) {
OrderedStructs::SkipList::HeadNode<T> sl;
result.clear();
for (size_t i = 0; i < data.size(); ++i) {
sl.insert(data[i]);
if (i >= win_length) {
result.push_back(sl.at(win_length / 2));
sl.remove(data[i - win_length]);
}
}
}
If you are working with C arrays (such as Numpy arrays) then this C’ish approach might be better, again error checking omitted:
#include "SkipList.h"
template <typename T>
void rolling_median(const T *src, size_t count, size_t win_length, T *dest) {
OrderedStructs::SkipList::HeadNode<T> sl;
const T *tail = src;
for (size_t i = 0; i < count; ++i) {
sl.insert(*src);
if (i + 1 >= win_length) {
*dest = sl.at(win_length / 2);
++dest;
sl.remove(*tail);
++tail;
}
++src;
}
}
Multidimensional Numpy arrays have a stride value which is omitted in the above code but is simple to add. See RollingMedian.h and test/test_rolling_median.cpp for further examples.
Rolling percentiles require a argument that says what fraction of the window the required value lies. Again, this is easy to add.
Even Window Length¶
The above code assumes that if the window length is even that the median is at (window length - 1) / 2
. A more plausible median for even sized window lengths is the mean of (window length - 1) / 2
and window length / 2
. This requires that the mean of two types is meaningful which it will not be for strings.
Rolling Median in Python¶
Here is an example of computing a rolling median of a numpy
1D array.
This creates an array with the same length as the input starting with window_length
NaN
s:
import numpy as np
import orderedstructs
def simple_python_rolling_median(vector: np.ndarray,
window_length: int) -> np.ndarray:
"""Computes a rolling median of a numpy vector returning a new numpy
vector of the same length.
NaNs in the input are not handled but a ValueError will be raised."""
if vector.ndim != 1:
raise ValueError(
f'vector must be one dimensional not shape {vector.shape}'
)
skip_list = orderedstructs.SkipList(float)
ret = np.empty_like(vector)
for i in range(len(vector)):
value = vector[i]
skip_list.insert(value)
if i >= window_length - 1:
# // 4 for lower quartile
# * 3 // 4 for upper quartile etc.
median = skip_list.at(window_length // 2)
skip_list.remove(vector[i - window_length + 1])
else:
median = np.nan
ret[i] = median
return ret
This can be called thus:
np_array = np.arange(10.0)
print('Original:', np_array)
result = simple_python_rolling_median(np_array, 3)
print(' Result:', result)
And the result will be:
Original: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Result: [nan nan 1. 2. 3. 4. 5. 6. 7. 8.]
Handling NaNs¶
Not-a-number (NaN) values can not be inserted into a Skip List as they are not comparable to anything (including themselves).
An attempt to call insert()
, index()
, has()
, remove()
with a NaN will raise an error.
In C++ this will throw a OrderedStructs::SkipList::FailedComparison
.
In Python it will raise a ValueError
.
This section looks at handling NaNs in Python.
Here are several ways of handling NaNs:
- Propogate the Exception.
- Make the Median NaN.
- Forward Filling.
Propogate the Exception¶
Here is a rolling median that will raise a ValueError
if there is a NaN in the input.
def rolling_median_no_nan(vector: typing.List[float],
window_length: int) -> typing.List[float]:
"""Computes a rolling median of a vector of floats and returns the results.
NaNs will throw an exception."""
skip_list = orderedstructs.SkipList(float)
ret: typing.List[float] = []
for i in range(len(vector)):
value = vector[i]
skip_list.insert(float(value))
if i >= window_length:
median = skip_list.at(window_length // 2)
skip_list.remove(vector[i - window_length])
else:
median = math.nan
ret.append(median)
return ret
Make the Median NaN¶
Here is a rolling median that will make the median NaN if there is a NaN in the input. Incidentally this is the approach that numpy takes.
def rolling_median_with_nan(vector: typing.List[float],
window_length: int) -> typing.List[float]:
"""Computes a rolling median of a vector of floats and returns the results.
NaNs will be consumed."""
skip_list = orderedstructs.SkipList(float)
ret: typing.List[float] = []
for i in range(len(vector)):
value = vector[i]
if math.isnan(value):
median = math.nan
else:
skip_list.insert(float(value))
if i >= window_length:
median = skip_list.at(window_length // 2)
remove_value = vector[i - window_length]
if not math.isnan(remove_value):
skip_list.remove(remove_value)
else:
median = math.nan
ret.append(median)
return ret
The first row is the input, the second the output. Window length is 5:
[0.0, math.nan, 2.0, 3.0, 4.0, 5.0, 6.0, math.nan, 8.0, 9.0],
[math.nan, math.nan, math.nan, math.nan, math.nan, 3.0, 4.0, math.nan, 4.0, 5.0],
Forward Filling¶
Another approach is to replace the NaN with the previous value. This is very popular in FinTech and is commonly know as Forward Filling. Here is an implementation:
def forward_fill(vector: typing.List[float]) -> None:
"""Forward fills NaNs in-place."""
previous_value = math.nan
for i in range(len(vector)):
value = vector[i]
if math.isnan(value):
vector[i] = previous_value
if not math.isnan(value):
previous_value = value
def rolling_median_with_nan_forward_fill(vector: typing.List[float],
window_length: int) -> typing.List[float]:
"""Computes a rolling median of a vector of floats and returns the results.
NaNs will be forward filled."""
forward_fill(vector)
return rolling_median_no_nan(vector, window_length)
The first row is the input, the second the output. Window length is 5:
[0.0, math.nan, 2.0, 3.0, 4.0, 5.0, 6.0, math.nan, 8.0, 9.0],
[math.nan, math.nan, math.nan, math.nan, math.nan, 2.0, 3.0, 4.0, 5.0, 6.0],
Another example, the first row is the input, the second the output. Window length is 5:
[0.0, math.nan, 2.0, math.nan, 4.0, 5.0, 6.0, math.nan, 8.0, 9.0],
[math.nan, math.nan, math.nan, math.nan, math.nan, 2.0, 2.0, 4.0, 5.0, 6.0],
There is no ‘right way’ to handle NaNs. They are always problematic. For example what is the ‘right way’ to sort a sequence of values that may include NaNs?