Scaling Python for High-Load Web Sites

David Shoemaker & Jamie Turner

Polimetrix, Inc.

At Polimetrix we conduct surveys and political polls on the web at 
pollingpoint.com.  As such, our traffic increases significantly at 
election time. As the 2006 election approached, we realized that
our web traffic would soon be about 10x what it normally is. We
set about making sure our systems could handle that load. Today
we want to share some things we learned during this process.

The Paradox of Choice

There are a lot of freakin' web frameworks...

It's an interesting--and great--time to be a python web developer..
A few years ago web frameworks were largely underdeveloped
and undocumented. Now we have mountains of frameworks and tools
available, many with excellent documentation and very active
user communities.

New ideas emerge every day about developing websites
in more elegant, more reusable, more efficient ways.  And with each of
these ideas, dozens more frameworks are born.

The Paradox of Choice

... and these are just the ones with cool logos.

...
in fairness, we were going to just list all the ones w/out
cool logos, but this is only a 25' screen.

All that Aside...

Let's focus on methods, not web frameworks


	

Fast Enough

How fast is fast enough?

Premature optimization is the root of all evil. All optimizations should
be based on an observed deficiency and should target a specific 
bottleneck in the application.  Good logging and load testing are required
to diagnose performance issues.  It is much easier to verify that an 
optimization works if one problem is addressed at a time, in serial.

This isn't Fast Enough

Oh. Okay, then.

Well how fast is "Fast"?

 * Note about "orders of magnitude" rule of thumb
 * Faster than 1000+ pages/s is out of scope for this talk, because
   it right at the outer limit of what very, very good single machine
   can do with an RDBMS.  (That gets into db replicates/clusters, 
   reads vs. writes, database partitioning, etc)
 * Maybe refer to brad@lj's talks for more on that (perl, but 
   we'll let that slide)

But "Python Can't Scale"!

Scaling isn't about languages, it's about:

 We've tried to distill the process down to a couple of main ideas.
 Many of the optimization decisions we make are based on these two
 principles.
 For our demo we will employ a few open source tools. You might choose
 to use other tools, proprietary or open source.

The Example Application

Flickr Killr

Show the app in a web browser (CGI version).
We put together a tiny web application that we will use to demonstrate 
and benchmark some of the methods we're discussing today. It's a 
database-backed, dynamically generated page that includes several static
files. The application looks kind of like photo organizing software 
(hence the name).  We've populated the database with over 100k users and 
over 1 million photos.
We have created a benchmarker script that simulates a browser requesting
this page. It makes requests to a list of URLs in rapid succession,
and is multithreaded to simulate concurrency. The benchmarker then
outputs some interesting stats, such as req/sec and average req time.

Flickr-Killr-0.1-py2.5.egg

Flickr Killr 0.1
The CGI Strawman


	

Flickr-Killr-0.1-py2.5.egg

Version 0.1 Architecture

The web server invokes the Python interpreter and runs the CGI script.
The script prints its output to stdout, which is then written back to the
client as an HTTP response.
We realize that this is a somewhat outdated architecture, but PHP coders
work with only marginally better systems on a daily basis.
Do some benchmarking.

Flickr-Killr-0.1-py2.5.egg

What's wrong with this picture?

There is some startup cost associated with the interpreter itself
and loading the modules needed by the application.
Things like database connections, parsed page templates, and other
application-specific initialization can, and should, live longer
than one request.

Flickr-Killr-0.1-py2.5.egg

Let's remedy some of these issues

With the long-running Python process we can do all initialization up
front when the server starts.

Flickr-Killr-0.1-py2.5.egg

Other options

There are a few different multi-programming models to consider here. In
general the threaded model can handle more concurrent connections than
the multiprocess model, and async can handle more than threaded. However,
concurrent connections is very often not the bottleneck. Other factors
should influence this decision. The methods we're trying to hilight in 
this talk apply no matter which model you choose.

flickr-killr-users@flickrkillr.org

Flickr Killr 0.2
A slice of CherryPy


	

flickr-killr-users@flickrkillr.org

Version 0.2 Architecture


flickr-killr-users@flickrkillr.org

What's wrong with this picture?

This is already a huge improvement over the CGI version, but it's far
from perfect. For one, we're running this on a dual core machine, but 
Python's GIL prevents us from using the second core.
Another problem worth mentioning is that we are effectively fetching
the session from the database for each request. Since the session is
such an frequently-accessed resource, we prefer a more efficient session
store. We're not going to demo this today, but Jamie has written a
network-bound session daemon called Pear, which you can check out in
the cheese shop.

flickr-killr-users@flickrkillr.org

Let's remedy some of these issues

We can run two (or more) instances of CherryPy to take advantage of
multi-core and multi-cpu architectures.
For your load balancer you almost certainly want to use the async MPM.
It has to deal with a lot of concurrent connections, and it has to do
so quickly.  
Shared Nothing means that each node is independent and doesn't keep any
application state locally. In the context of a web application this is
important because consecutive requests could be distributed to different
servers.
There is an added benefit of putting nginx out in front of CherryPy. 
Often the response needs to be slowly trickled back to a client on a low
bandwidth connection. With this setup, CherryPy can send it's response
to nginx very quickly across the local network (or on localhost) and
nginx can keep tons of connections open to the slow clients, or even
idle, between-request keepalive clients.
The CherryPy threads are free to handle new requests.

flickr-killr-users@flickrkillr.org

Other options

The multiprocess MPM on the application server also takes advantage of 
multi-core and multi-CPU machines. This setup is not as memory efficient, 
however, and we'll need to get a reverse proxy involved anyway when 
we want to involve different physical machines.

www.flickrkillr-mashups.com

Flickr Killr 0.3
Have your CherryPy and eat it too


	

www.flickrkillr-mashups.com

Version 0.3 Architecture


	

www.flickrkillr-mashups.com

What's wrong with this picture?

OS disk cache will help this problem, but it is still inefficient to have
a Python process reading and writing this data. Our total number of 
threads is finite, so we only want them working on generating dynamic 
content. This is another problem best solved by an asynchronous server.

www.flickrkillr-mashups.com

Let's remedy some of these issues

For this demo we have written a CherryPy tool that inserts static
content into memcached as it is requested. memcached keeps the files
in memory and can handle tens of thousands of requests per second.
We will instruct nginx to look for files there first, and fall back
to CherryPy if the file is not found in the cache. The vast
majority of requests for static content will never even be seen
by CherryPy.

www.flickrkillr-mashups.com

Other options

Another caching option is to put a web cache out in front of the site
that automatically caches certain resources and intercepts requests as
they come in. This method makes your application completely unaware that
the cache even exists.
If you choose to use lighttpd as your reverse proxy and load balancer,
it might make sense to just let lighttpd serve static content from disk.

NASDAQ: FLKL

Flickr Killr 0.4
Cache money


	

NASDAQ: FLKL

Version 0.4 Architecture

Story about pollster.com running on a blade, database and all.

NASDAQ: FLKL

What's wrong with this picture?


NASDAQ: FLKL

In summary


	

NASDAQ: FLKL

Don't forget to index

Version 0.4 has provided a huge improvement over the CGI 
version, but we could quickly be looking at version 0.1 performance if 
we don't have our database properly indexed. In this case, the userid 
column of the photos table is the crucial column.
Drop the index, run the benchmarking, recreate the index.
Story about how a missing index in Gryphon caused our RDBMS
to be CPU bound and we thought there was no way we could handle election 
load.

Google Photo Organizr

Automating deployment will certainly be worth your time

It gets really annoying after just 2 or 3 machines are
involved.
Jamie wrote a really slick system for deploying Python apps on
many machines. It was a fairly big project, but it has paid
huge dividends. Being able to manage deployment from one
machine and quickly add/upgrade nodes is key.

Thanks for listening

All demo code and config files at http://www.polimetrix.com/pycon/demo.tar.gz

Question time

Jamie wrote a really slick system for deploying Python apps on
many machines. It was a fairly big project, but it has paid
huge dividends. Being able to manage deployment from one
machine and quickly add/upgrade nodes is key.