Category Archives: python

A Pypy Runtime for AWS Lambda

Category : AWS , python

Python is a great language for the on-demand style of Lambda, where startup time matters. In terms of execution speed, there are better choices available. Where computational performance matters one improvement is to use Pypy, the Python interpreter with a JIT compiler. It can execute the same code much faster. There is just a slight penalty in startup time compared to CPython.

I was curious how Python and Pypy would compare on AWS Lambda. As Amazon announced recently, it is now possible to provide your own Runtime for Lambdas.

Creating a Custom AWS Lambda Runtime

Before you can start with creating your own runtime you should have a simple Lambda function. I recommend you start by creating a serverless application. This way you not only get a plain Lambda but also an API Gateway and Cloudwatch logs set up. And it is much quicker to edit code, iterate and put it to version control.

Creating a new runtime is based on a shell script you have to provide. This script will do initialization work, call an interface for requesting the next task/incoming request, dispatch it to whatever runtime you are providing and respond to another interface with either a success or an error message.

Starting from the example is easy. You can quickly set up a test project using serverless that will execute the example bootstrap code. The example runs in an endless loop, working on one task in each iteration. AWS must be starting/killing this loop based on how many tasks are waiting for execution and probably some other factors.

Pypy does not run out of the box. The interpreter has to work on Amazons Linux environment. Unfortunately, downloading a compiled binary didn’t just work for me. And I couldn’t find a version specifically for the Amazon Linux. The problem is that a libbz2 library was not available. In fact, it is available in the environment but Pypy does not find it. The recommended solution to create a symbolic link to the library is also not an option, because the environment is read-only (except for /tmp/). To not spend too much time on this I fired up an EC2 instance with the Amazon default image and copied that library next to the Pypy interpreter into my package.

To create a first “Hello World from Pypy” application running you need to call Pypy from within the shell script and send the response back to the Runtime Interface. There is no error handling yet and starting a new Pypy process on every request is far from optimal, but this is already a working solution.

A better way is to move the processing loop from shell script into Pypy. This way there already is a running process, all imports have been done and if parts of the code use initializers or caching this state will be kept for the next request. The bootstrap script looks a lot simpler now:

The logic is now located in Python code and run by Pypy. With the custom runtime code in place, we can switch between a standard Python3.7 runtime and our own in the AWS Lambda web interface:

Comparing Pypy and Python3.7

So how does the simple Pypy runtime compare to the default Python3.7 implementation? Let’s create an example where the Lambda has to use its CPU. I wrote a simple one-liner to calculate prime numbers from 2 to 200000:

For sure there are better algorithms to do the same thing, but it serves the purpose well. Calculating the prime numbers takes considerably longer with CPython than with Pypy. On my machine, it takes around 1 second with Pypy and 3 seconds with CPython.

In an AWS Lambda environment, both runtimes are executing the same handler.py and calculate prime numbers. This is how long they take to run the code:

Calculating the primes takes more than 5 seconds when executed with python3.7 but only slightly above 2 seconds with Pypy. We have found a case where Pypy is a lot better than CPython. This was my hope when starting to build the runtime.

But, as you can see in the first request, the Pypy runtime takes a long time to initialize. This is logged in CloudWatch as “Init Duration: 1641.69 ms”. In this test scenario, it does not matter because a request takes many seconds to finish. With its better computational performance, the Pypy runtime still comes in first. In a more typical scenario, this Init Duration will be much more important. And this brings us to the downsides of this approach.

Downsides of the Custom Runtime

The initialization phase takes way too long. It is not visible what exactly happens during that time. But the bulky size of the code package will most likely be part of it.

Let’s rerun the same test as before but without the heavy calculation and ignore the Init Duration:

The execution time for a “Hello World” application is higher than with Python. I don’t understand why this is the case. Monitoring Pypy runtime_interface gives me sub-millisecond times for what my code executes. Still, the Lambda execution Duration is reported to be somewhere between 10 and 30 ms. In contrast, executing the function with Python3.7 gives Durations close to 1 ms with only a few spikes. This should be more or less equal. There either is a problem in my implementation or in how AWS handles a Custom Runtime. If you have an idea what goes wrong here please add a comment. In any way, this diagram is much closer to real-world usage. And Python is faster here.

Also, the deployment package is big. Nearly 30 MB are uploaded to S3, even though there is hardly any function code inside. For many cases, this is going to be a showstopper. I believe the package size can still be reduced by specifying in more detail which Pypy files are necessary. If Amazon ever considers this as a default choice it would solve the issue, because then you would not have to upload the interpreter within your package.

Wrap-up

Running Pypy on AWS Lambda as a custom runtime is possible and not very complicated. There is a clear advantage over CPython when it comes to long-running computations. Packaging the whole interpreter bloats up your Lambda package and increases your initial startup time. Typically, being lightweight and having a quick startup is more important than raw computational speed. Therefore, I can only recommend this approach for exceptional cases.

If Amazon decides to provide Pypy as a default Runtime, this could be different. You would not have to bundle the interpreter and the startup time might become a lot better than now while the computational advantage of Pypy will still be there.

You can find all the code in my Github repository.


Building a very easy text classifier in python

Category : python

Some of the developers at match2blue are creating a text-interest-matcher. Leaving buzzword bingo aside, that means the software calculates whether a text is interesting based on users’ interests. So basically you, as a user, have to enter some interests and will be presented some pieces of data in order of their relevance. You can also think of it as text classification into either good or bad.

This software has become quite complicated, because it is necessary to have some kind of semantic knowledge about the interests. But there are different methods of text classification. I was curious about how hard it is to write the most simple text classifier that gives you decent results. Well, turns out it is remarkably simple. Creating the classifier took less than two hours. And here is the source code:

import os
j = os.path.join

def train(text):
	"""
	Train a dictionary with the given text. Returns a dictionary of dictionaries,
	that describes the probabilities of all word-folling-ocurrences in the text.

	For example, the string "a test" gives this result:
	>>> train("a test")
	{'': {'a': 1}, 'a': {'test': 1}}
	Meaning that the empty string '' has been followed once by 'a' and 'a' has been
	followed by 'test' as well.

	A longer example leads to a more complex dictionary:
	>>> train("this is a test oh what a test")
	{'': {'this': 1}, 'a': {'test': 2}, 'what': {'a': 1}, 'oh': {'what': 1}, 'this': {'is': 1}, 'is': {'a': 1}, 'test': {'oh': 1}}
	"""
	c = {}
	lastword = ""
	for word in text.split():
		word = word.lower()
		if c.has_key(lastword):
			inner = c[lastword]
			if inner.has_key(word):
				inner[word] += 1
			else:
				inner[word] = 1
		else:
			c[lastword] = {word: 1}
		lastword = word
	return c

def probability_of(dict, lastword, word):
	"""
	Helper function for calculating the probability of word following lastword
	in the category given by dict.

	>>> category = train("this is a test")
	>>> probability_of(category, "a", "test")
	1.0

	>>> probability_of(category, "any", "words")
	0
	"""
	word = word.lower()
	if dict.has_key(lastword):
		inner = dict[lastword]
		sumvalues = sum([v for v in inner.values()])
		if inner.has_key(word):
			return inner[word] / (sumvalues * 1.0)
	return 0

def classify(text, dict):
	"""
	Returns the probability that a text is from the given category. For every pair of
	words the probability_of value is calculated, summarized and divided by the amount
	of words in the text.

	>>> category = train("this is a test")
	>>> classify("a test with some words", category)
	0.2

	>>> classify("just writing test or a doesn't improve the ranking", category)
	0.0
	"""
	lastword = ""
	probabilities = 0
	for word in text.split():
		probabilities += probability_of(dict, lastword, word)
		lastword = word
	return probabilities / (len(text.split()) * 1.0)

if __name__ == "__main__":
	"""
	Calculate the category, that the text in ../test matches best.
	"""
	ranking = []
	os.chdir("..")
	for file in os.listdir("categories"):
		trained = train(open(j("categories", file)).read())
		value = classify(open("test").read(), trained)
		print "test is", file, "with", value, "% probability"
		ranking.append((value, file))

	ranking.sort()
	print
	print "The test text is probably from", ranking[-1][1]
	print "(second guess is", ranking[-2][1] + ")"

How does it work? There are two very simple steps it does. First, the classifier has to be trained with existing textfiles. The result is a dictionary that consists of many inner dictionaries. Let’s feed it with some text to see what happens.

In [5]: train("a test a good test a good good test")

Out[5]:
{'': {'a': 1},
 'a': {'good': 2, 'test': 1},
 'good': {'good': 1, 'test': 2},
 'test': {'a': 2}}

In the result we can see that three words followed ‘a’. Two times it was ‘good’ and one time ‘test’. This is all we need for classifying. Now we can apply the classify function. It goes through the text to classify and looks for known word follow-ups. If there is a known one, the probability of this follow-up is added. So, in our example the probability of ‘a good’ is 2/3 and for ‘a test’ it is 1/3.

In [2]: a = train("a test a good test a good good test")

In [3]: classify("is it a good test", a)

Out[3]: 0.26666666666666666

In [4]: classify("text good but different style", a)

Out[4]: 0.0

The first example has similar words and ordering as the trained text. The second one also has some exact same words, but they are in a different order. Therefore, the probability of this text beeing a is 0.0.

If you want to try it with some longer text, you can download the classifier from Google Code. It is easy to create your own categories by adding a file in the appropriate folder. Just make sure you have a decent amount of data. Three sentences are not enough for a good classification.

In my tests I got quite good results even for classifying authors writing about the same topic.

cd classify/src/
python classify.py

Recent Posts

Tweets