I: Avoiding unnecessary computations with scan
About the planned series
It presents how F# along with .NET Interactive can improve the analytical workflow experience. In particular, I will describe how .NET Interactive works under the cover, how to extend it, and … how to host it.
I‘m starting here by providing some functional programming concepts related to data computations. Next, we will see how easily people can share code with the help of notebooks.
We need non-trivial visualization/animations, so there is an opportunity to extend F# interactive kernel with Fable to combine a huge JS ecosystem, exceptional F# domain modeling, and awesome F# (frontend) libraries.
In the end, I’m going to show how to embed .NET Interactive in a custom website (blog) that is similar to VS Code notebooks (or Github Codespaces) but with a totally owned UI written with Feliz.
Life without notebooks
This first blog post is normal because you can‘t easily run presented examples with no setup. This is how a typical blog post without .NET Interactive looks like. I still need this entry to introduce not too basic - not too advanced F# concept that gradually moves into the full interactive experience.
It is tempting to use discriminated unions or pattern matting as an eye-catching concept. Still, I will focus rather on collection functions which are crucial in day-to-day F# programming activities. Not the one collection function you may expect thought …
Meet the scan (Scanning Canada)
Scan function (located in most collection modules) is one of the less used.
It might be underestimated, being often described solely as the monitoring extension for fold, which is not true (although it justifies the ‘scan’ name).
Indeed scan is almost the same as fold with one subtle difference. Instead of returning the only final result of a state accumulation, it gives all intermediate steps as well:
Let's look at an example of scan function that calculates factorial, so we can monitor how the accumulative number evolves:
One may notice that despite obvious proximity to fold, scan can be even closer to map (although it is harder to decipher it from the definition):
The clue is that the state in the folder function doesn’t have to be accumulated (previous steps reduced to a single value) as it happens in fold.
It can be the plain last computed result of mapping that can play with the current item in any desired fashion. In our case, we may use that result to optimize the next mapping loop.
This still might be unclear so let put it into a valid example.
Scenario: assign areas to addresses
Let's say we have a collection of addresses, and our goal is to assign each address to the corresponding area it belongs to. The area can be anything, but for simplification, I will assume it is some polygon: district, Ward, electoral sector, community, anything you can draw with geojson.io or proprietary geospatial software.
This is an essential task in urban planning and in the smart citizens/ collaborative governance domain that I’m very into.
The count of areas may vary from dozen to hundreds (even thousands), so we may be interested in avoiding unnecessary not-trivial ‘executions’ when possible.
Of course, there are proven algorithms that do that for us (based on branch and bound strategies used to form the spatial index structure), but they have flaws too:
- they don’t bring value if the number of areas is small
- spatial index is costly during instantiation
- most likely third party library is necessary (not than bad overall)
- it might be hard to use it with our other domain
I’ve been playing with those algorithms via NetTopologySuite package, and I’m using it for other purposes like finding n closest city facilities to the particular address.
Here I wanted to use some human-advised clues: there is a very high chance that the following (sorted) addresses from the same street will belong to the same district ( even if they occasionally cross the states). We can guide the program to start address/district matching with the very last matched district.
How many comparisons can we avoid? For instance, Calgary (I’ve never been in North America, but I like their open data volume) has:
- 8105 streets
- 379071 addresses
- 777 census communities
If we look at addresses per street distribution (median 32, mean 47), we can assume a lot of invocations can be omitted despite having a big standard deviation (52)
So our human-guided algorithm is as follows:
- while checking if the point is within a particular area, try the last matched area index first
- in other case, check all areas one by one ( you may think of other optimizations here, but it doesn’t matter in this blog post context)
For using that last matched area for the consecutive address, I’m going to use scan function.
Let's first obtain some core data for our computation: addresses and districts.
Depending on your CPU/cores, you may be interested in limiting addresses to not endlessly wait for the computation results during experiments (or you can temporarily provision VM with dozens of cores, I will show that later — here I’m omitting parallelism at all for clarity). Without the limits~350k addresses and ~800 districts will make your CPU busy. Alternatively, you can choose the Ward dataset that only has 12 areas. In theory, the fewer the areas, the more computation to avoid (more consecutive addresses fit into the single area).
Our districts are plain geojson files, but what we really need in our computation is just their indexes; everything else is opaque for our goal right now. The index number is what we actually pair with the address at the end. If we want more data to project in our UI, we should use that index to extract it on demand.
We also need a function to put the last matched index to the head for comparisons. Unfortunately, there is no good functional approach (please community point me in the right direction if I’m mistaken) to swap the elements in the collection in an efficient manner (and with different collection modules as we may be later interested in using Seq in place of List or Array).
If we wanted the one-liner, it could be similar to the following, but it has a huge performance cost:
To avoid declarative swap, we can implement comparisons directly:
Ok, so how much is left to the actual scanning?
Nothing is left as our ‘hintSearch’ matches folder from scan definition:
The remaining boilerplate code is related to setup the init state. Init state is required for most of the fold family of functions but can be hidden with custom module extension for scenarios when it doesn’t bring to much value as our (but I will keep it here nevertheless)
So how many unnecessary computations we’ve actually avoided? Do we have some bugs in the code or misconceptions in our thinking? Can we visualize our process somehow? Can other people try the same with their home or beloved city? To answer those questions, we must have an environment that helps running experiments and modifications quickly.