Category Archives: python

Calculate wind from thermals

Category : python shell script web

This post is a follow-up to the last one about Paragliding data gems.

We have collected lots of flights and their GPS location data. From this, several million thermals were extracted and shown on a heatmap. A step forward is to classify these thermals into meaningful groups. Some parameters are easy to extract. For example:

  • Time of the day
  • The month of the year
  • The year
  • Change in altitude
  • Vertical velocity

I have shown in the previous post how the time of the day affects where a thermal is located. The other parameters are also nice to play around with.

One very interesting parameter would be the wind situation. As every pilot knows, the wind plays a crucial role in where turbulence occurs and where good conditions can be expected.

Finding the wind in our data

The wind consists of two values: wind speed and direction. And because both values can differ surprisingly with altitude, the wind could be calculated several times per thermal. For the sake of simplicity, I am only calculating one wind speed and direction per thermal for now. This could be improved in a later version, once we know there is enough data to fine-grain the selection even more.

How can we calculate wind speed and direction without having access to wind stations, based only on the GPS track? It is not an easy task because the aircraft can turn in any direction and slow down or speed up. Let’s have a look at what a typical thermal with wind looks like:

You can see part of a flight track starting on the lower left and ending on the upper right. There was significant lift and also significant wind drift to the (south-)east. In this case, simply comparing start and end would provide a good estimation of the wind speed and direction.

However, we cannot be sure that the pilot simply drifts with the wind. Many other scenarios are possible, for example pushing into the wind while flying through several smaller thermal areas. Have a look at this thermal:

Was this change in location caused by wind drift or by the pilot recentering into the core? We cannot tell by comparing the entry location with the exit location. But we can compute ground speed and bearing between each pair of consecutive GPS points. Plotting speed and bearing for another thermal with a strong west wind gives the following:

Here you see ground speeds between ~20 km/h and ~60 km/h depending on the bearing. The highest speed occurs at a bearing of ~120° and the lowest at ~270°. If a pilot were to steer against the wind for a longer time, these values wouldn’t change; we would merely see more points close to the existing ones in a certain area.
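For reference, here is a minimal sketch of how ground speed and bearing can be derived from two consecutive GPS fixes, using plain haversine and bearing formulas (not the project's exact implementation):

import math

def speed_and_bearing(lat1, lon1, t1, lat2, lon2, t2):
    # Ground speed (km/h) and initial bearing (degrees) between two GPS fixes.
    # lat/lon in degrees, t1/t2 as timestamps in seconds.
    R = 6371.0  # mean earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)

    # haversine distance in km
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    distance = 2 * R * math.asin(math.sqrt(a))

    # initial bearing from the first to the second fix, 0..360 degrees
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360

    speed = distance / ((t2 - t1) / 3600.0)
    return speed, bearing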

The plot above can be improved visually by using polar coordinates:

Here the bearing values are mapped around a circle, and the distance from the center is the ground speed. If there were no wind at all, the points would form a circle around the center. In the case above, we can still fit a circle: its center gives us wind speed and direction, while the radius of the best-fitting circle corresponds to the aircraft’s airspeed.

The circle can be approximated with some scientific help from Dr. Koch and the mind-blowing scipy.optimize.leastsq function.
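A minimal sketch of such a least-squares circle fit (my reconstruction, not the exact code used in the project):

import numpy as np
from scipy.optimize import leastsq

def estimate_wind(speeds_kmh, bearings_deg):
    # Turn the polar points (ground speed, bearing) into x/y vector components.
    theta = np.radians(bearings_deg)
    x = np.asarray(speeds_kmh) * np.sin(theta)  # east component
    y = np.asarray(speeds_kmh) * np.cos(theta)  # north component

    def residuals(center):
        cx, cy = center
        radii = np.hypot(x - cx, y - cy)
        return radii - radii.mean()  # deviation from a perfect circle

    (cx, cy), _ = leastsq(residuals, x0=(x.mean(), y.mean()))
    wind_speed = np.hypot(cx, cy)                                 # km/h
    wind_drift_direction = np.degrees(np.arctan2(cx, cy)) % 360   # where the drift goes
    airspeed = np.hypot(x - cx, y - cy).mean()                    # radius of the fitted circle
    return wind_speed, wind_drift_direction, airspeed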

Map it on the heatmap

So let’s calculate this for lots of thermals and see how they are affected by which wind:

First of all, you can see a lot more red for an east wind than for a north(-east) wind. This could be either because an east wind produces better thermals or because this wind situation occurs more often. The heatmap does not reason about the data, but the fact is that far more thermals appear with an east or a west wind than with all the other directions. The database shows how significant that is:

There are far more thermals drifting to the east or west than in any other direction.
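For this grouping, each computed wind direction has to be mapped to one of eight compass sectors. A simple sketch (not necessarily the project's exact bucketing):

SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def sector(bearing_deg):
    # each sector spans 45 degrees, centered on N = 0, NE = 45, ...
    return SECTORS[int(((bearing_deg + 22.5) % 360) // 45)]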

When zooming in on a certain area, we can see how frequently its thermals are flown depending on the wind. For example, the Wallberg is a well-known west-wind mountain:

North and south winds show very little thermal action. Even an east wind does not come close to what happens with a west wind. For the Wallberg, this states the obvious for experienced local pilots, but it might not be obvious for beginners. And it can be helpful for understanding which cross-country routes work with which wind direction. Does the Baumgartenschneid (across the valley to the north) work with a north wind?

Wrap up

We can now calculate wind speed from GPS fixes and use that data for filtering and grouping. The calculation is compute-intensive, and we keep only one value per thermal regardless of altitude.

Keep in mind that this data is based on wind calculations from the raw thermals. It is not necessarily the same as the overall wind of the day. And there is no statement about turbulence or difficulty. All we can see is that someone found a thermal lift that happened to be in the computed wind drift. We don’t know whether the drift was caused by a valley wind, some lee side rotor, or the overall weather.

Have a look at the Wind based thermal heatmap for yourself.

Related read: Calculating wind speed from the GPS track

Parameters used:

  • Only thermals with a wind speed >= 5 km/h are shown
  • Each thermal has an altitude gain of at least 300 m
  • Wind directions are grouped into [N, NE, E, SE, S, SW, W, NW]
  • Data is based on ~600,000 thermals
  • Public flight database of the DHV, mostly from the German community

Technology used:

  • Python with pipenv
  • PyDev
  • Folium, NumPy, SciPy

Paragliding data gems

Category : python shell script web

Paragliding is my beloved hobby and besides offering stunning views and perfect days outside, it also provides a huge amount of flight data to process and play around with. Sites like xc.dhv.de and XContest contain millions of flights from thousands of pilots. These documented flights are gems of data waiting to be investigated by algorithms.

A recorded flight of 4 hours: starting in the Stubai valley, climbing up to 3800 m, flying to the Öz valley and back.

Outsiders to this sport tend to believe it is about getting up a mountain and simply gliding down from there. This might be true for the first few flights, but it can be so much more than that. People stay in the air for up to 12 hours and cover distances of over 500 km. This is possible for the same reason birds can fly such long distances despite the tiny supply of energy they have in their bodies. It is the thermal activity of the air mass.

We all know the sideways wind on the ground, but vertical winds can be just as strong as a horizontal breeze. These upward winds are used to climb as high as possible (and safe) and then glide to the next thermal.

Many books have been written about which factors are important for upward winds and where/when you can expect them to be optimal. But with lots of flight data at hand, it should be possible to see this by example. The data just has to be processed and visualized in the right way. This is what Dr. Maximilian Koch and I worked on over the lockdown period.

Get the data

We started with simple scripts for downloading flight data, including all the geo-coordinates. For now, they operate on a limited data set because scraping more than a million flights takes time and the process needs to be evaluated first. We don’t want to spider everything and then figure out that something is missing or handled incorrectly. The following examples are based on ~36,000 flights, flown with a paraglider or a hang glider.

All the data is processed first to extract basic information like the takeoff site, the date, and the pilot’s name. In a second step, all the geo-coordinates are processed for thermal activity. I am using this igc-lib for parsing the flights. Judging from some examples, it is not perfect, but it works well enough on big data sets.

It evaluates the geo-coordinates and extracts thermals and glides from the flights. This information is stored in a local database. Each thermal contains information about the start and end time, the height at both times, the vertical velocity, etc.
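A minimal sketch of that extraction step, assuming igc_lib’s Flight.create_from_file API as shown in its README; the per-thermal attribute names and the table layout below are my assumptions, not the project’s actual schema:

import sqlite3
import igc_lib

def store_thermals(igc_path, conn):
    # Parse one IGC file and store its thermals in a local SQLite table.
    # Flight.create_from_file, flight.valid and flight.thermals come from the
    # igc_lib README; enter_fix, exit_fix and alt_change() follow my reading
    # of the library and may need adjusting.
    flight = igc_lib.Flight.create_from_file(igc_path)
    if not flight.valid:
        return
    for thermal in flight.thermals:
        conn.execute(
            "INSERT INTO thermals (lat, lon, start_time, end_time, alt_gain) "
            "VALUES (?, ?, ?, ?, ?)",
            (thermal.enter_fix.lat, thermal.enter_fix.lon,
             thermal.enter_fix.rawtime, thermal.exit_fix.rawtime,
             thermal.alt_change()))
    conn.commit()

conn = sqlite3.connect("thermals.db")
conn.execute("CREATE TABLE IF NOT EXISTS thermals "
             "(lat REAL, lon REAL, start_time REAL, end_time REAL, alt_gain REAL)")
store_thermals("some_flight.igc", conn)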

Thermal heatmap

Thermal heatmap based on ~36.000 paragliding and hang gliding flights

What can we use that data for? An obvious use case is to show all the thermals on a map, as in the image above. You can see the typical flying routes marked in red. Areas with lots of data appear completely colored, but this is only an effect of the zoom level.

Thermal heatmap zoomed in. The dots appear close to peaks, only a few are above the valleys.

When zooming in further, we can see in more detail where to expect upward winds. As the theory states, it is mostly above the peaks and ridges. So far this is similar to other work in the same direction. For example, the paragliding maps show a similar pattern.

Our basic heatmap can be seen and navigated here.
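Rendering such a map is straightforward with Folium’s HeatMap plugin. A minimal sketch, where loading the thermal coordinates from the database is only indicated:

import folium
from folium.plugins import HeatMap

# placeholder: load the (lat, lon) entry points of all thermals from the database
points = [(47.43, 11.77), (47.08, 11.30)]

m = folium.Map(location=[47.3, 11.5], zoom_start=9)
HeatMap(points, radius=8, blur=6).add_to(m)
m.save("thermal_heatmap.html")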

Time based activity

One factor that changes thermal activity is the time of the day. Depending on the sunlight, different areas of the ground are heated and generate a warm airflow.

Therefore, it is interesting to see how thermal activity changes during the day. Here is a time-based heatmap in which you can step through all thermals of the day on an hourly basis.
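Folium’s HeatMapWithTime plugin provides exactly this kind of stepping. A minimal sketch, with the hourly grouping of thermals only indicated:

import folium
from folium.plugins import HeatMapWithTime

# placeholder: one list of [lat, lon] points for every hour of the day
frames = [
    [[47.43, 11.77]],                  # e.g. 09:00-10:00
    [[47.43, 11.78], [47.08, 11.30]],  # e.g. 10:00-11:00
]
hours = ["09:00", "10:00"]

m = folium.Map(location=[47.3, 11.5], zoom_start=9)
HeatMapWithTime(frames, index=hours, radius=8).add_to(m)
m.save("hourly_heatmap.html")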

The hourly information can be useful to see when and where it is possible to launch in the morning. For long-distance flights, you need to start as early as possible and gain height.

As mentioned before, the data is not complete and it will always have a bias. There are certain areas and routes which pilots typically take. When stepping through the hours, these routes are made visible. Those routes are often used because they are the best possible options. Therefore, even the limited data set is useful for flight planning, as it shows the most relevant information on the map.

Fun facts

In this dataset of ~36,000 flights, the strongest thermals have an average climb rate of 7 m/s from start to end. There are some outliers showing more than 10 m/s, but all of them can be explained by hardware issues at the start of a flight. These 7 m/s are an average over the whole climb, so there were seconds of stronger climb as well.

The maximum height gain is 2282 m, up to an exit height of over 4100 m. This height was reached with an uplift of just below 2 m/s. Within the 36,000 flights, more than 200,000 thermals with a height gain of at least 100 m are reported, an average of 5.5 such thermals per flight. The number is probably much higher in summer than in winter.

Next steps

The data set is still very limited as mentioned a few times. So one goal is to improve the data and download more flights from the respective sites.

There are other interesting questions to ask and possibly answer:

  • Can we calculate the wind conditions out of the tracks?
  • How does it affect thermal activity?
  • Is there something interesting to see in the glides? Can we figure out the best gliding routes, maybe based on other factors?
  • Is it possible to make the data more relevant during a flight? For example, a pilot would only be interested in thermal data that helps them gain height. At the same time, only thermals that can be reached from the current position are of interest.

Can you think of more interesting questions that might be answered by that data? Send me an email, and if it is easily possible, I can look into it.


A PyPy Runtime for AWS Lambda

Category : AWS python

Python is a great language for the on-demand style of Lambda, where startup time matters. In terms of execution speed, though, there are better choices available. Where computational performance matters, one improvement is to use PyPy, a Python interpreter with a JIT compiler. It can execute the same code much faster, with only a slight penalty in startup time compared to CPython.

I was curious how Python and PyPy would compare on AWS Lambda. As Amazon recently announced, it is now possible to provide your own runtime for Lambdas.

Creating a Custom AWS Lambda Runtime

Before you start creating your own runtime, you should have a simple Lambda function. I recommend starting by creating a serverless application. This way you not only get a plain Lambda but also an API Gateway and CloudWatch logs set up. And it is much quicker to edit code, iterate, and put it under version control.

A custom runtime is based on a bootstrap shell script you have to provide. This script does initialization work, calls an interface to request the next task/incoming request, dispatches it to whatever runtime you are providing, and responds to another interface with either a success or an error message.

Starting from the example is easy. You can quickly set up a test project using serverless that will execute the example bootstrap code. The example runs in an endless loop, working on one task in each iteration. AWS presumably starts and kills this loop based on how many tasks are waiting for execution and probably some other factors.

PyPy does not run out of the box. The interpreter has to work in Amazon’s Linux environment. Unfortunately, downloading a compiled binary didn’t just work for me, and I couldn’t find a version built specifically for Amazon Linux. The problem is that the libbz2 library was not available. In fact, it is present in the environment, but PyPy does not find it. The recommended solution of creating a symbolic link to the library is not an option either, because the environment is read-only (except for /tmp/). To not spend too much time on this, I fired up an EC2 instance with the default Amazon image and copied that library next to the PyPy interpreter into my package.

To get a first “Hello World from PyPy” application running, you need to call PyPy from within the shell script and send the response back to the Runtime Interface. There is no error handling yet, and starting a new PyPy process on every request is far from optimal, but it is already a working solution.

A better way is to move the processing loop from the shell script into PyPy. This way there is already a running process, all imports have been done, and if parts of the code use initializers or caching, this state is kept for the next request. The bootstrap script looks a lot simpler now:

The logic is now located in Python code and run by PyPy. With the custom runtime code in place, we can switch between the standard Python 3.7 runtime and our own in the AWS Lambda web interface:
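For illustration, a minimal version of such a processing loop could look like the following. This is a sketch, not the code from my repository; the endpoint paths and the request-id header follow AWS’s documented Runtime API, and the inlined handler is just a placeholder:

import json
import os
import urllib.request

RUNTIME_API = os.environ["AWS_LAMBDA_RUNTIME_API"]
BASE_URL = "http://{}/2018-06-01/runtime/invocation".format(RUNTIME_API)

def handler(event, context):
    # placeholder handler; the real one would be loaded based on the _HANDLER variable
    return {"statusCode": 200, "body": "Hello World from PyPy"}

while True:
    # Block until the Runtime API hands over the next invocation.
    with urllib.request.urlopen(BASE_URL + "/next") as response:
        request_id = response.headers["Lambda-Runtime-Aws-Request-Id"]
        event = json.loads(response.read())

    result = handler(event, None)

    # Report the result for exactly this invocation back to the Runtime API.
    data = json.dumps(result).encode("utf-8")
    request = urllib.request.Request(BASE_URL + "/" + request_id + "/response", data=data)
    urllib.request.urlopen(request).close()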

Comparing PyPy and Python 3.7

So how does the simple PyPy runtime compare to the default Python 3.7 implementation? Let’s create an example where the Lambda has to use its CPU. I wrote a simple one-liner to calculate the prime numbers from 2 to 200000:
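A prime calculation along these lines (my reconstruction, not necessarily the original one-liner) looks like this:

def handler(event, context):
    # trial division: keep every n that no d in [2, sqrt(n)] divides
    primes = [n for n in range(2, 200000)
              if all(n % d for d in range(2, int(n ** 0.5) + 1))]
    return {"statusCode": 200, "body": "found {} primes".format(len(primes))}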

There are certainly better algorithms for this, but it serves the purpose well. Calculating the prime numbers takes considerably longer with CPython than with PyPy: on my machine, around 1 second with PyPy and 3 seconds with CPython.

In the AWS Lambda environment, both runtimes execute the same handler.py and calculate the prime numbers. This is how long they take to run the code:

Calculating the primes takes more than 5 seconds when executed with python3.7, but only slightly above 2 seconds with PyPy. We have found a case where PyPy is a lot better than CPython, which was my hope when starting to build the runtime.

But, as you can see in the first request, the PyPy runtime takes a long time to initialize. This is logged in CloudWatch as “Init Duration: 1641.69 ms”. In this test scenario it does not matter, because a request takes many seconds to finish; with its better computational performance, the PyPy runtime still comes in first. In a more typical scenario, this Init Duration will matter much more. And this brings us to the downsides of this approach.

Downsides of the Custom Runtime

The initialization phase takes far too long. It is not visible what exactly happens during that time, but the bulky size of the code package is most likely part of the reason.

Let’s rerun the same test as before but without the heavy calculation and ignore the Init Duration:

The execution time for a “Hello World” application is higher than with Python, and I don’t understand why. Monitoring the PyPy runtime_interface gives me sub-millisecond times for what my code executes. Still, the Lambda execution Duration is reported to be somewhere between 10 and 30 ms. In contrast, executing the function with Python 3.7 gives Durations close to 1 ms with only a few spikes. These should be more or less equal, so there is either a problem in my implementation or in how AWS handles a custom runtime. If you have an idea what goes wrong here, please add a comment. In any case, this diagram is much closer to real-world usage, and Python is faster here.

Also, the deployment package is big. Nearly 30 MB are uploaded to S3, even though there is hardly any function code inside. For many cases, this is going to be a showstopper. I believe the package size can still be reduced by specifying in more detail which PyPy files are necessary. If Amazon ever offers this as a default choice, the issue would disappear, because then you would not have to upload the interpreter within your package.

Wrap-up

Running PyPy on AWS Lambda as a custom runtime is possible and not very complicated. There is a clear advantage over CPython when it comes to long-running computations. However, packaging the whole interpreter bloats your Lambda package and increases the initial startup time. Typically, being lightweight and starting quickly is more important than raw computational speed, so I can only recommend this approach for exceptional cases.

If Amazon decides to provide PyPy as a default runtime, this could change. You would not have to bundle the interpreter, the startup time might become a lot better, and the computational advantage of PyPy would still be there.

You can find all the code in my GitHub repository.


Building a very easy text classifier in Python

Category : python

Some of the developers at match2blue are creating a text-interest-matcher. Leaving buzzword bingo aside, that means the software calculates whether a text is interesting based on users’ interests. So basically you, as a user, have to enter some interests and will be presented with pieces of data in order of their relevance. You can also think of it as text classification into either good or bad.

This software has become quite complicated because it needs some kind of semantic knowledge about the interests. But there are different methods of text classification, and I was curious how hard it would be to write the simplest text classifier that still gives decent results. Well, it turns out to be remarkably simple. Creating the classifier took less than two hours. Here is the source code:

import os
j = os.path.join

def train(text):
    """
    Train a dictionary with the given text. Returns a dictionary of dictionaries
    that describes the probabilities of all word-following occurrences in the text.

    For example, the string "a test" gives this result:
    >>> train("a test")
    {'': {'a': 1}, 'a': {'test': 1}}

    Meaning that the empty string '' has been followed once by 'a' and 'a' has been
    followed once by 'test'.

    A longer example leads to a more complex dictionary:
    >>> train("this is a test oh what a test")
    {'': {'this': 1}, 'this': {'is': 1}, 'is': {'a': 1}, 'a': {'test': 2}, 'test': {'oh': 1}, 'oh': {'what': 1}, 'what': {'a': 1}}
    """
    c = {}
    lastword = ""
    for word in text.split():
        word = word.lower()
        if lastword in c:
            inner = c[lastword]
            if word in inner:
                inner[word] += 1
            else:
                inner[word] = 1
        else:
            c[lastword] = {word: 1}
        lastword = word
    return c

def probability_of(category, lastword, word):
    """
    Helper function for calculating the probability of word following lastword
    in the given category dictionary.

    >>> category = train("this is a test")
    >>> probability_of(category, "a", "test")
    1.0

    >>> probability_of(category, "any", "words")
    0
    """
    word = word.lower()
    if lastword in category:
        inner = category[lastword]
        sumvalues = sum(inner.values())
        if word in inner:
            return inner[word] / sumvalues
    return 0

def classify(text, category):
    """
    Returns the probability that a text is from the given category. For every pair of
    words the probability_of value is calculated, summed up and divided by the number
    of words in the text.

    >>> category = train("this is a test")
    >>> classify("a test with some words", category)
    0.2

    >>> classify("just writing test or a doesn't improve the ranking", category)
    0.0
    """
    lastword = ""
    probabilities = 0
    for word in text.split():
        word = word.lower()
        probabilities += probability_of(category, lastword, word)
        lastword = word
    return probabilities / len(text.split())

if __name__ == "__main__":
    # Calculate the category that the text in ../test matches best.
    ranking = []
    os.chdir("..")
    for filename in os.listdir("categories"):
        trained = train(open(j("categories", filename)).read())
        value = classify(open("test").read(), trained)
        print("test is", filename, "with", value, "% probability")
        ranking.append((value, filename))

    ranking.sort()
    print()
    print("The test text is probably from", ranking[-1][1])
    print("(second guess is", ranking[-2][1] + ")")

How does it work? There are two very simple steps. First, the classifier has to be trained with existing text files. The result is a dictionary that consists of many inner dictionaries. Let’s feed it some text to see what happens.

In [5]: train("a test a good test a good good test")

Out[5]:
{'': {'a': 1},
 'a': {'good': 2, 'test': 1},
 'good': {'good': 1, 'test': 2},
 'test': {'a': 2}}

In the result we can see that three words followed ‘a’: two times it was ‘good’ and one time ‘test’. This is all we need for classifying. Now we can apply the classify function. It goes through the text to classify and looks for known word follow-ups. If there is a known one, the probability of this follow-up is added. So, in our example the probability of ‘a good’ is 2/3 and of ‘a test’ it is 1/3.

In [2]: a = train("a test a good test a good good test")

In [3]: classify("is it a good test", a)

Out[3]: 0.26666666666666666

In [4]: classify("text good but different style", a)

Out[4]: 0.0

The first example has similar words and ordering to the trained text. The second one also contains some of the exact same words, but they are in a different order. Therefore, the probability of this text being from category a is 0.0.

If you want to try it with some longer text, you can download the classifier from Google Code. It is easy to create your own categories by adding a file in the appropriate folder. Just make sure you have a decent amount of data. Three sentences are not enough for a good classification.

In my tests I got quite good results even for classifying authors writing about the same topic.

cd classify/src/
python classify.py
