Avoiding unnecessary computations with scan (II)

Paweł Stadnicki
8 min readMay 31, 2021

Master .NET Interactive with F# / Master F# with .NET Interactive

About this lesson

Initial scan implementation that will evolve this lesson

We will take the initial scan implementation from the introduction lesson and add a few extensions to see how code can “deteriorate” with functional programming. You can follow and play with all that “deteriorations” with notebooks available here.

How much our initial implementation will grow by:

  • adding simple domain-monitoring code that informs us about unnecessary computations we’re actually avoiding?
  • Introducing parallelism?
  • Supporting missing values?

Answering the aforementioned questions, we will see how .NET Interactive supports programming experiments. This lesson is still more about F# than .NET Interactive, but in the next lesson, we will hit .NET Interactive with full power by writing Fable Kernel Extension.

Last but not least, you don’t have to stick to Calgary open data anymore. Why not play with other cities like Toronto, San Francisco, Stockholm, or Florence? How to support even more beloved cities?

Btw, Florence is twice the hero of this episode. Read the post for all the details …

What is .NET Interactive all about?

Well, first of all, .NET Interactive is hard to explain!

The architecture itself is easy to follow ( I encourage you to check out the code), but potential use cases challenge your imagination. And it by no means is not a new concept. It just makes certain things much simpler than they used to be.

.NET Interactive is only about notebooks? Not at all, too, although it is the most natural application. So what is it?

.NET Interactive application is a frontend that sends commands to the language services (kernels) and awaits outputs. The tricky thing is that all bold words are quite abstract and even optional (not to mention involved transport layers and other middleware).

Mind that this overcomplicated definition is more for people who desire to adjust .NET Interactive to their needs (as I do). For anyone else, .NET Interactive is just unleashed .NET that can be embedded in notebooks, chats (MS Teams), blogs, wherever you can imagine to place it.

What is a notebook? You can think of a notebook as UI/UX extension for the code (run by .NET Interactive) that can be shared. It also enforces some structure and takes care of serialization/interoperability aspects.

I scenario (notebook)

Ok, let’s play again with my beloved scan function!

We’re going to put our both existing and updated requirements into the notebooks. For example, the first notebook covers address-to-area matching without any human-guided optimization. For this scenario, I include full notebook screenshots here. For any following, I’m going only to mark the differences here, but can you see the entire picture in the repository.

Note that I’m not putting too much attention on markdown capabilities for now. Ultimately I have the desire to host those notebooks on a personal blog so all work about aesthetics is postponed until then.

Scenario II: avoid unnecessary computations

Here goes scan implementation of our human-guided optimization. You may notice there is a little boilerplate code that deconstructs the list to satisfy scan init seed value. We will get rid of it later.

Scenario III: count avoided computations

We can notice that the execution time in the first two scenarios is ~250s vs. ~50s. So does a 5x gain in speed means that around 80% of executions have been avoided? Let’s check it.

Start by adding a type for our domain-driven “naive” monitoring.

How much our scan implementation will deteriorate when we include it in our algorithm?

Minimally! And counting avoided computations is as simple as:

So the result is almost exactly 80%.

Scenario IV: introducing parallelism

We have a lot of data, and despite avoiding as much as possible computations, they still take time. It seems that our operation on addresses is isolated, so we may try to apply some parallelism here.

Let me repeat our hint: the following addresses on the same street most likely belong to the same area. It means streets are de facto our logical split for parallelism. So we can potentially speed up the execution with more cores, where each core takes care of each street. In theory, we can temporarily provision a machine with ~400 cores (800 threads). For an average large city of ~2000 streets, thanks to parallelism, we can speed up the process from ~2000 “execution units” to ~3 “units” (if 100$ /hour justifies it!)

Unfortunately, my laptop is not the best one to check out that ( I don’t have too many free CPU resources for my disposal), as I only have 2x time improvement:

This time I stuck to arrays, however, all that destructuring was kept only to satisfy the scan definition. It is blurred, so let’s extract that boilerplate to the domain file and leave it as:

Notice that we have a little side effect here: the addresses within streets are processed in a non-deterministic way due to introduced asynchrony, so we lost indexing track! You can find full implementation in the repository, I will leave it here as it is for the clarity of parallelism gains.

Scenario V: dealing with missing values

When you want to play with open data for Florence or San Francisco, you may encounter one trouble: there might be addresses that don’t match any district!

It may happen for several reasons, and it has nothing to data quality governance in a particular city. It is not always possible to keep them in sync or they don’t match for a justified reason.

Supporting missing values is as simple as making our hint result …an option!

How many missing values were there?

Supporting more cities

To support more cities all you need to do is to gather a few publicly available open data files and create your configuration file similarly to

There are 2 mandatory files (in geojson format):

  1. addresses

2. at least one area split (district, census, neighborhoods, wards or any other)

You should also mind putting appropriate attribution or adjust mapping if addresses are taken from another source that openaddresses.io

Not all data are typically available in geojson format. Often they come as a shapefile or similar. You can convert them with many online or geospatial tools. I’m personally using https://mygeodata.cloud/ for that purpose.

Florence

In the beginning, I said that Florence is two times hero in this post.

I showed Florence open data as an example of how scan will deteriorate while supporting missing values. However, in all examples, I used Domain.fsx script that covers common code, especially for spatial data processing. It works ok with Visual Studio or another editor. However, with notebooks, such a development experience is not good at all. That is why in forthcoming posts, I will move all the functionality from that file to a separate library called … florence.

Summary

The most important outcome of this lesson (not mentioned earlier) should be the way typical functional program or script looks like:

  • you write some domain-specific functionality as a plain function
  • you try to figure out an appropriate collection module to support the required processing of that function (which I initially have named ironically “deteriorations” on purpose).

I’m aware that starting learning functional programming from collection modules can be tricky but it is crucial. If you like the DRY principle (don’t repeat yourself) then collection modules are a must-have to apply it in practice. You may not see it at first glance but collection modules also encourage you to learn function composition, partial application and avoid reinventing the wheel (by at least knowing the definitions).

I also showed that notebooks are good for code sharing because to run samples, you need (almost) no setup. Also, you have the capabilities to add descriptions (markdown), so code snippets are similar to typical lessons. We used .NET Interactive under the cover, but it was opaque in this lesson. This is going to change very quickly. Next, I will show how to write Fable Kernel Extension and how to host .NET Interactive for your own blogging purposes.

This is due to the fact that in this post I still used screenshots. Why not use .NET Interactive between the lines? (Ok, maybe not with Medium as a host). I would like to apply some fancy visualizations and animations as well (to do so I need to incorporate into notebooks the entire js ecosystem with a help of Fable).

I hope you liked the article. If yes, stay tuned for the… actual F#/.NET Interactive superpower details.

--

--