At Polimetrix we conduct surveys and political polls on the web at pollingpoint.com. As such, our traffic increases significantly at election time. As the 2006 election approached, we realized that our web traffic would soon be about 10x what it normally is. We set about making sure our systems could handle that load. Today we want to share some things we learned during this process.
There are a lot of freakin' web frameworks...
It's an interesting, and great, time to be a Python web developer. A few years ago web frameworks were largely underdeveloped and undocumented. Now we have mountains of frameworks and tools available, many with excellent documentation and very active user communities. New ideas emerge every day about developing websites in more elegant, more reusable, more efficient ways. And with each of these ideas, dozens more frameworks are born.
... and these are just the ones with cool logos.
... in fairness, we were going to just list all the ones w/out cool logos, but this is only a 25' screen.
Let's focus on methods, not web frameworks
How fast is fast enough?
Premature optimization is the root of all evil. All optimizations should be based on an observed deficiency and should target a specific bottleneck in the application. Good logging and load testing are required to diagnose performance issues. It is much easier to verify that an optimization works if one problem is addressed at a time, in serial.
Oh. Okay, then.
* A note about the "orders of magnitude" rule of thumb.
* Faster than 1000+ pages/s is out of scope for this talk, because it is right at the outer limit of what a very, very good single machine can do with an RDBMS. (That gets into database replicas/clusters, separating reads from writes, database partitioning, etc.)
* Maybe refer to brad@lj's talks for more on that (Perl, but we'll let that slide).
Scaling isn't about languages, it's about:
We've tried to distill the process down to a couple of main ideas. Many of the optimization decisions we make are based on these two principles. For our demo we will employ a few open source tools. You might choose to use other tools, proprietary or open source.
Flickr Killr
Show the app in a web browser (CGI version). We put together a tiny web application that we will use to demonstrate and benchmark some of the methods we're discussing today. It's a database-backed, dynamically generated page that includes several static files. The application looks kind of like photo organizing software (hence the name). We've populated the database with over 100k users and over 1 million photos. We have created a benchmarker script that simulates a browser requesting this page. It makes requests to a list of URLs in rapid succession, and is multithreaded to simulate concurrency. The benchmarker then outputs some interesting stats, such as req/sec and average req time.
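The benchmarker itself isn't on the slide, but the idea boils down to something like the sketch below: spread a list of URLs across worker threads, time each request, and report requests/sec and average request time. Here `fetch` is a stand-in for a real HTTP request (the actual script uses urllib against the running app) so the sketch stays self-contained.

```python
import threading
import time

def benchmark(urls, fetch, concurrency=4):
    """Hit each URL with `fetch` across `concurrency` worker threads.
    Returns (requests_per_sec, avg_request_time)."""
    lock = threading.Lock()
    times = []

    def worker(chunk):
        for url in chunk:
            start = time.time()
            fetch(url)  # in the real script: an actual HTTP request
            elapsed = time.time() - start
            with lock:
                times.append(elapsed)

    # deal the URLs out round-robin to the workers
    chunks = [urls[i::concurrency] for i in range(concurrency)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    wall_start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.time() - wall_start
    return len(times) / wall, sum(times) / len(times)

# stand-in fetch: pretend every request takes about a millisecond
rps, avg = benchmark(['/photos/%d' % i for i in range(100)],
                     fetch=lambda url: time.sleep(0.001))
```

The multithreading matters: a single serial loop measures latency, but only concurrent requests reveal how throughput behaves under load.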
Flickr Killr 0.1
The CGI Strawman
Version 0.1 Architecture
The web server invokes the Python interpreter and runs the CGI script. The script prints its output to stdout, which is then written back to the client as an HTTP response. We realize that this is a somewhat outdated architecture, but PHP coders work with only marginally better systems on a daily basis. Do some benchmarking.
What's wrong with this picture?
# under CGI, this import and connect happen on every single request
import psycopg2
conn = psycopg2.connect(database='pycon', host='localhost')
There is some startup cost associated with the interpreter itself and loading the modules needed by the application. Things like database connections, parsed page templates, and other application-specific initialization can, and should, live longer than one request.
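That startup cost is easy to see by timing a cold interpreter launch (what CGI pays per request) against a repeat import in an already-running process. A rough illustration; absolute numbers vary by machine, and sqlite3 here is just a conveniently heavy stdlib module:

```python
import subprocess
import sys
import time

# cold: launch a fresh interpreter and import a module, as CGI does
# for every request
start = time.time()
subprocess.check_call([sys.executable, '-c', 'import sqlite3'])
cold = time.time() - start

# warm: in a long-running process the module is already in sys.modules,
# so a repeat import is just a dictionary lookup
import sqlite3
start = time.time()
import sqlite3  # cached: effectively free
warm = time.time() - start

print('cold start: %.4fs, warm import: %.6fs' % (cold, warm))
```

And that's before counting database connections, template parsing, and the rest of the per-request setup.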
Let's remedy some of these issues
With the long-running Python process we can do all initialization up front when the server starts.
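The demo uses CherryPy, but the principle fits any long-running server: do the expensive work once at module level, and keep request handlers cheap. A minimal WSGI sketch, with sqlite3 and string.Template standing in for the real psycopg2 connection and parsed page template:

```python
import sqlite3
import string

# one-time initialization: runs when the server process starts,
# not on every request
conn = sqlite3.connect(':memory:')  # stands in for the psycopg2 connection
conn.execute('create table users (id integer, username text)')
conn.execute("insert into users values (1, 'jamie')")
template = string.Template('<h1>Hello, $username</h1>')  # parsed once

def app(environ, start_response):
    # per-request work only: no imports, no connects, no template parsing
    userid = 1
    row = conn.execute('select username from users where id = ?',
                       (userid,)).fetchone()
    body = template.substitute(username=row[0]).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/html')])
    return [body]

# exercise the app once with a dummy WSGI environ
statuses = []
body = b''.join(app({}, lambda status, headers: statuses.append(status)))
```

Under CherryPy the same idea applies: open connections and parse templates in server setup code, and let handler methods reuse them.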
Other options
There are a few different multiprogramming models to consider here. In general the threaded model can handle more concurrent connections than the multiprocess model, and async can handle more than threaded. However, the number of concurrent connections is very often not the bottleneck. Other factors should influence this decision. The methods we're trying to highlight in this talk apply no matter which model you choose.
Flickr Killr 0.2
A slice of CherryPy
Version 0.2 Architecture
What's wrong with this picture?
cur.execute('select username from users where id = %s', (userid,))
This is already a huge improvement over the CGI version, but it's far from perfect. For one, we're running this on a dual core machine, but Python's GIL prevents us from using the second core. Another problem worth mentioning is that we are effectively fetching the session from the database for each request. Since the session is such a frequently-accessed resource, we prefer a more efficient session store. We're not going to demo this today, but Jamie has written a network-bound session daemon called Pear, which you can check out in the cheese shop.
Let's remedy some of these issues
We can run two (or more) instances of CherryPy to take advantage of multi-core and multi-CPU architectures. For your load balancer you almost certainly want to use the async MPM: it has to deal with a lot of concurrent connections, and it has to do so quickly. Shared Nothing means that each node is independent and doesn't keep any application state locally. In the context of a web application this is important because consecutive requests could be distributed to different servers. There is an added benefit of putting nginx out in front of CherryPy. Often the response needs to be slowly trickled back to a client on a low bandwidth connection. With this setup, CherryPy can send its response to nginx very quickly across the local network (or on localhost), and nginx can keep tons of connections open to the slow clients, or even to idle, between-request keepalive clients. The CherryPy threads are free to handle new requests.
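Reduced to its essence, the load balancer's job is just picking the next backend for each request; because the nodes are shared-nothing, any node can serve any request, so no session affinity is needed. A toy round-robin picker (nginx does this for us in practice; the addresses are hypothetical):

```python
import itertools

# two CherryPy instances, e.g. one per core
backends = ['127.0.0.1:8001', '127.0.0.1:8002']
next_backend = itertools.cycle(backends).__next__

# each incoming request is handed to the next node in rotation;
# shared-nothing nodes make this safe because no node holds state
# that a later request would need to find again
assigned = [next_backend() for _ in range(4)]
```

If a node did keep local state (in-process sessions, say), the balancer would need sticky sessions, and losing a node would lose data; shared-nothing avoids both problems.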
Other options
The multiprocess MPM on the application server also takes advantage of multi-core and multi-CPU machines. This setup is not as memory efficient, however, and we'll need to get a reverse proxy involved anyway when we want to involve different physical machines.
Flickr Killr 0.3
Have your CherryPy and eat it too
Version 0.3 Architecture
What's wrong with this picture?
photos = db.execute('''
select p.id, p.filename, p.description
from users u join photos p on u.id = p.userid
where p.userid = %s''', (userid,)).fetchall()
OS disk cache will help this problem, but it is still inefficient to have a Python process reading and writing this data. Our total number of threads is finite, so we only want them working on generating dynamic content. This is another problem best solved by an asynchronous server.
Let's remedy some of these issues
For this demo we have written a CherryPy tool that inserts static content into memcached as it is requested. memcached keeps the files in memory and can handle tens of thousands of requests per second. We will instruct nginx to look for files there first, and fall back to CherryPy if the file is not found in the cache. The vast majority of requests for static content will never even be seen by CherryPy.
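The tool boils down to cache-aside logic. Here is a sketch with a plain dict standing in for memcached (the real tool uses a memcached client inside a CherryPy tool hook; `read_from_disk` is a hypothetical stand-in for CherryPy's static file handling):

```python
cache = {}        # stands in for memcached
disk_reads = []   # track how often we fall through to the slow path

def read_from_disk(path):
    # hypothetical stand-in for CherryPy serving the file from disk
    disk_reads.append(path)
    return ('<file contents of %s>' % path).encode('utf-8')

def serve_static(path):
    # cache-aside: nginx asks memcached first; on a miss the request
    # falls back to CherryPy, which serves the file and fills the cache
    body = cache.get(path)
    if body is None:
        body = read_from_disk(path)
        cache[path] = body
    return body

serve_static('/static/logo.png')   # miss: hits disk, populates the cache
serve_static('/static/logo.png')   # hit: served straight from the cache
```

After the first request for a file, every subsequent request is answered from memory without touching a CherryPy thread.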
Other options
Another caching option is to put a web cache out in front of the site that automatically caches certain resources and intercepts requests as they come in. This method makes your application completely unaware that the cache even exists. If you choose to use lighttpd as your reverse proxy and load balancer, it might make sense to just let lighttpd serve static content from disk.
Flickr Killr 0.4
Cache money
Version 0.4 Architecture
Story about pollster.com running on a blade, database and all.
What's wrong with this picture?
In summary
Don't forget to index
Version 0.4 has provided a huge improvement over the CGI version, but we could quickly be looking at version 0.1 performance if we don't have our database properly indexed. In this case, the userid column of the photos table is the crucial one. Drop the index, run the benchmark, recreate the index. Story about how a missing index in Gryphon caused our RDBMS to be CPU bound and we thought there was no way we could handle election load.
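The demo drops the index against PostgreSQL, but the effect shows up anywhere; here is the same experiment with stdlib sqlite3, using EXPLAIN QUERY PLAN to see the full table scan turn into an index search (table and index names are ours, not the demo schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table photos (id integer, userid integer, filename text)')
conn.executemany('insert into photos values (?, ?, ?)',
                 [(i, i % 100, 'p%d.jpg' % i) for i in range(1000)])

def plan():
    # the last column of each EXPLAIN QUERY PLAN row describes the step
    rows = conn.execute(
        'explain query plan select filename from photos where userid = ?',
        (42,))
    return ' '.join(r[-1] for r in rows)

without_index = plan()   # reports a SCAN of the whole table
conn.execute('create index photos_userid_idx on photos (userid)')
with_index = plan()      # reports a SEARCH ... USING INDEX
```

On a million-row photos table the difference between those two plans is the difference between version 0.4 and version 0.1 performance.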
Automating deployment will certainly be worth your time
It gets really annoying after just 2 or 3 machines are involved. Jamie wrote a really slick system for deploying Python apps on many machines. It was a fairly big project, but it has paid huge dividends. Being able to manage deployment from one machine and quickly add/upgrade nodes is key.
Thanks for listening
All demo code and config files at http://www.polimetrix.com/pycon/demo.tar.gz
Question time