10 tips for speeding up Python programs

There are many ways to boost Python application performance. Here are 10 hard-core coding tips for faster Python.


By and large, people use Python because it’s convenient and programmer-friendly, not because it’s fast. The plethora of third-party libraries and the breadth of industry support for Python compensate heavily for its not having the raw performance of Java or C. Speed of development takes precedence over speed of execution.

But in many cases, it doesn’t have to be an either/or proposition. Properly optimized, Python applications can run with surprising speed—perhaps not as fast as Java or C, but fast enough for web applications, data analysis, management and automation tools, and most other purposes. With the right optimizations, you might not even notice the tradeoff between application performance and developer productivity.

Optimizing Python performance doesn’t come down to any one factor. Rather, it’s about applying all the available best practices and choosing the ones that best fit the scenario at hand. (The folks at Dropbox have one of the most eye-popping examples of the power of Python optimizations.)

In this article, I’ll discuss 10 common Python optimizations. Some are drop-in measures that require little more than switching one item for another (such as changing the Python interpreter); others deliver bigger payoffs but also require more detailed work.

10 ways to make Python programs run faster

  • Measure, measure, measure
  • Memoize (cache) repeatedly used data
  • Move math to NumPy
  • Move math to Numba
  • Use a C library
  • Convert to Cython
  • Go parallel with multiprocessing
  • Know what your libraries are doing
  • Know what your platform is doing
  • Run with PyPy

Measure, measure, measure

You can’t improve what you don’t measure, as the old adage goes. Likewise, you can’t find out why any given Python application runs suboptimally without finding out where the slowness resides.

Start with simple profiling by way of Python’s built-in cProfile module, and move to a more powerful profiler if you need greater precision or greater depth of insight. Often, the insights gleaned by basic function-level inspection of an application provide more than enough perspective. (You can pull profile data for a single function via the profilehooks module.)
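
As a minimal sketch of function-level profiling with cProfile, the snippet below profiles one call and captures the report as text. The function slow_sum_of_squares and the helper profile_call are hypothetical names invented for illustration; only cProfile, pstats, and io come from the standard library.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive work so there is something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show only the ten costliest entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

For a whole script, running python -m cProfile -s cumulative followed by the script name produces a similar report without any code changes.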

Why a particular part of the application is so slow, and how to fix it, may take more digging. The point is to narrow the focus, establish a baseline with hard numbers, and test across a variety of usage and deployment scenarios whenever possible. Don’t optimize prematurely. Guessing gets you nowhere.
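
One way to get a baseline with hard numbers is the standard library's timeit module. The sketch below compares two hypothetical string-building functions (build_with_concat and build_with_join are invented for this example); the actual timings will vary by machine and Python version, so treat the output as a measurement to compare, not a fixed result.

```python
import timeit


def build_with_concat(n):
    """Build a string by repeated concatenation."""
    text = ""
    for i in range(n):
        text += str(i)
    return text


def build_with_join(n):
    """Build the same string with a single join."""
    return "".join(str(i) for i in range(n))


RUNS = 20  # calls per timing sample

for func in (build_with_concat, build_with_join):
    # repeat() returns one total per sample; the minimum is the least noisy.
    best = min(timeit.repeat(lambda: func(10_000), number=RUNS, repeat=5))
    print(f"{func.__name__}: {best / RUNS * 1e6:.1f} us per call")
```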

The example from Dropbox (mentioned above) shows how useful profiling is. “It was measurement that told us that HTML escaping was slow to begin with,” the developers wrote, “and without measuring performance, we would never have guessed that string interpolation was so slow.”

Memoize (cache) repeatedly used data

Never do work a thousand times when you can do it once and save the results. If you have a frequently called function that returns predictable results, Python provides you with options to cache the results into memory. Subsequent calls that return the same result will return almost immediately.

Various examples show how to do this; my favorite memoization example is nearly as minimal as it gets. But Python has this functionality built in. One of Python’s standard library modules, functools, has the @functools.lru_cache decorator, which caches the results of the n most recently used calls to a function. This is handy when the value you’re caching changes but is relatively static within a particular window of time. A list of most recently used items over the course of a day would be a good example.

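Note that if you’re certain the variety of calls to the function will remain within a reasonable bound (e.g., 100 different cached results), you could use @functools.cache (available in Python 3.9 and later). It keeps every result without ever evicting one, which makes it slightly faster than a size-limited lru_cache, but its memory use grows with the number of distinct calls.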
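
Here is a small sketch of memoization with functools.lru_cache. The function expensive_square and the call_count counter are hypothetical stand-ins for a slow, deterministic computation; the counter exists only to show that the function body runs once.

```python
import functools

call_count = 0  # counts how often the real work actually runs


@functools.lru_cache(maxsize=128)
def expensive_square(n):
    """Hypothetical stand-in for a slow, deterministic computation."""
    global call_count
    call_count += 1
    return n * n


for _ in range(1000):
    expensive_square(12)

print(call_count)                     # the body ran only once
print(expensive_square.cache_info())  # hits, misses, maxsize, currsize
```

Arguments to a cached function must be hashable, and the function should have no side effects you depend on, since cached calls skip the body entirely.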

Move math to NumPy

If you are doing matrix-based or array-based math and you don’t want the Python interpreter getting in the way, use NumPy. By drawing on C libraries for the heavy lifting, NumPy offers faster array processing than native Python. It also stores numerical data more efficiently than Python’s built-in data structures.

Another boon with NumPy is more efficient use of memory for large objects, such as lists with millions of items. On average, large objects like that in NumPy take up around one-fourth of the memory required if they were expressed in conventional Python. Note that it helps to begin with the right data structure for a job—which is an optimization in itself.

Rewriting Python algorithms to use NumPy takes some work since array objects need to be declared using NumPy’s syntax. Plus, the biggest speedups come by way of NumPy-specific vectorization and "broadcasting" techniques, where an operation is applied across a whole array at once instead of element by element in a Python loop. Take the time to delve into NumPy's documentation to find out what functions are available and how to use them well.

Also, while NumPy is suited to accelerating matrix- or array-based math, it doesn't provide a useful speedup for math performed outside of NumPy arrays or matrices. Math that involves conventional Python objects won't see a speedup.
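
To make the contrast concrete, here is a minimal sketch that squares the same numbers with a plain Python list and with a NumPy array. It assumes NumPy is installed; the variable names are illustrative only.

```python
import numpy as np

# Plain Python: the interpreter handles every element one at a time.
values = list(range(100_000))
squared_list = [v * v for v in values]

# NumPy: one vectorized expression; the loop runs in compiled code.
array = np.arange(100_000, dtype=np.int64)
squared_array = array * array

# Broadcasting: the scalar 10 is stretched across the whole array.
shifted = array + 10

# Each element occupies 8 bytes in one contiguous buffer.
print(array.nbytes)
```

Note that the speedup only applies while the data stays inside the array; converting back and forth to Python lists in a loop gives most of it away.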

Move math to Numba

Another powerful library for speeding up math operations is Numba. Write some Python code for numerical manipulation and wrap it with Numba’s JIT (just-in-time) compiler, and the resulting code will run at machine-native speed. Numba not only provides GPU-powered accelerations (both CUDA and ROC), but also has a special “nopython” mode that attempts to maximize performance by not relying on the Python interpreter wherever possible.

Numba also works hand-in-hand with NumPy, so you can get the best of both worlds—NumPy for all the operations it can solve, and Numba for all the rest.

Use a C library

NumPy’s use of libraries written in C is a good strategy to emulate. If there’s an existing C library that does what you need, Python and its ecosystem provide several options to connect to the library and leverage its speed.

The most common way to do this is Python’s ctypes library. Because ctypes is broadly compatible with other Python applications (and runtimes), it’s the best place to start, but it’s far from the only game in town. The CFFI project provides a more elegant interface to C. Cython (see below) also can be used to write your own C libraries or wrap external, existing libraries, although at the cost of having to learn Cython’s markup.

One caveat here: You’ll get the best results by minimizing the number of round trips you make across the border between C and Python. Each time you pass data between them, that’s a performance hit. If you have a choice between calling a C library in a tight loop versus passing an entire data structure to the C library and performing the in-loop processing there, choose the second option. You’ll be making fewer round trips between domains.
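
As a rough sketch of the ctypes approach, the snippet below calls strlen from the platform's C runtime. The helper load_c_runtime is a hypothetical name, and the library lookup is an assumption that may need adjusting on unusual systems.

```python
import ctypes
import ctypes.util
import sys


def load_c_runtime():
    """Load the platform's C runtime library (the lookup is platform-specific)."""
    if sys.platform.startswith("win"):
        return ctypes.CDLL("msvcrt")
    # On POSIX, a None result falls back to symbols already in the process.
    return ctypes.CDLL(ctypes.util.find_library("c"))


libc = load_c_runtime()

# Declare the C signature so ctypes converts arguments and results correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# One call hands the whole buffer to C; the per-byte loop runs on the C side.
print(libc.strlen(b"hello, world"))
```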

Convert to Cython

If you want speed, use C, not Python. But for Pythonistas, writing C code brings a host of distractions—learning C’s syntax, wrangling the C toolchain (what’s wrong with my header files now?), and so on.

Cython allows Python users to conveniently access C’s speed. Existing Python code can be converted to C incrementally—first by compiling said code to C with Cython, then by adding type annotations for more speed.

Cython isn’t a magic wand. Code converted as-is to Cython, without type annotations, doesn’t generally run more than 15 to 50 percent faster. That's because most of the optimizations at that level focus on reducing the overhead of the Python interpreter. The biggest gains come when your variables can be annotated as C types, for instance a machine-level 64-bit integer instead of Python's int type. The resulting speedups can be orders of magnitude.

CPU-bound code benefits the most from Cython. If you’ve profiled (you have profiled, haven’t you?) and found that certain parts of your code use the vast majority of the CPU time, those are excellent candidates for Cython conversion. Code that is I/O bound, like long-running network operations, will see little or no benefit from Cython.

As with using C libraries, another important performance-enhancing tip is to keep the number of round trips to Cython to a minimum. Don’t write a loop that calls a “Cythonized” function repeatedly; implement the loop in Cython and pass the data all at once.

Go parallel with multiprocessing

Traditional Python apps (those running on CPython) execute only a single thread of Python code at a time, in order to avoid the problems of state that arise when using multiple threads. This restriction is enforced by the infamous Global Interpreter Lock (GIL). There are good reasons for its existence, but that doesn’t make it any less ornery.

A CPython app can be multithreaded, but because of the GIL, CPython doesn’t really allow those threads to run in parallel on multiple cores. The GIL has grown dramatically more efficient over time, and there's work underway to remove it entirely, but for now the core issue remains.

A common workaround is the multiprocessing module, which runs multiple instances of the Python interpreter on separate cores. State can be shared by way of shared memory or server processes, and data can be passed between process instances via queues or pipes.

You still have to manage state manually between the processes. Plus, there’s no small amount of overhead involved in starting multiple instances of Python and passing objects among them. But for long-running processes that benefit from parallelism across cores, the multiprocessing library is useful.
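
A minimal sketch of the pattern, using a process pool to spread CPU-bound work across cores. The function count_primes and the chosen limits are hypothetical; the pool size of 4 is an arbitrary assumption.

```python
import multiprocessing


def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def main():
    limits = [20_000, 30_000, 40_000, 50_000]
    # Each worker is a separate interpreter process with its own GIL.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    print(dict(zip(limits, results)))


# The guard is required where workers are spawned rather than forked
# (for example on Windows), because each worker re-imports this module.
if __name__ == "__main__":
    main()
```

Arguments and results are pickled to cross the process boundary, so this pays off mainly when each task does substantial work relative to the data it sends and receives.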

As an aside, Python modules and packages that use C libraries (such as NumPy or Cython) can release the GIL while their compiled code runs, so that work can proceed in parallel across threads. That’s another reason they’re recommended for a speed boost.

Know what your libraries are doing

How convenient it is to simply type import foobar and tap into the work of countless other programmers! But you need to be aware that third-party libraries can change the performance of your application, not always for the better.

Sometimes this manifests in obvious ways, as when a module from a particular library constitutes a bottleneck. (Again, profiling will help.) Sometimes it’s less obvious. For example, consider Pyglet, a handy library for creating windowed graphical applications. Pyglet automatically enables a debug mode, which dramatically impacts performance until it’s explicitly disabled. You might never realize this unless you read the library's documentation, so when you start work with a new library, read up and be informed.

Know what your platform is doing

Python runs cross-platform, but that doesn’t mean the peculiarities of each operating system—Windows, Linux, macOS—are entirely abstracted away under Python. Most of the time, it pays to be aware of platform specifics like path naming conventions, for which there are helper functions. The pathlib module, for instance, abstracts away platform-specific path conventions. Console handling also varies a great deal between Windows and other operating systems; hence the popularity of abstracting libraries like rich.
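
As a brief illustration of how pathlib hides path conventions, the sketch below renders the same path parts under both POSIX and Windows rules. The directory and file names are hypothetical.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Path picks the right flavor for whatever OS the script runs on.
config_path = Path("myapp") / "settings.toml"  # hypothetical file
print(config_path)

# The "pure" classes show how each convention renders the same parts.
parts = ("data", "logs", "app.log")
posix_path = PurePosixPath("/srv", *parts)
windows_path = PureWindowsPath("C:/", *parts)
print(posix_path)
print(windows_path)
```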

On some platforms, certain features aren't supported at all, and that can impact how you write Python. Windows, for instance, doesn't have the concept of process forking, so some multiprocessing functionality works differently there.
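
One way to see this difference is to ask multiprocessing which start methods the current platform offers. The sketch below also requests a "spawn" context explicitly, which is one possible way to get the same worker startup behavior on every operating system; whether that tradeoff suits a given application is a judgment call.

```python
import multiprocessing
import sys

# "fork" is offered only on POSIX systems; "spawn" is available everywhere.
available = multiprocessing.get_all_start_methods()
print(sys.platform, available)

# A context pins the start method without changing the global default.
context = multiprocessing.get_context("spawn")
print(context.get_start_method())
```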

Finally, the way Python itself is installed and run on the platform also matters. On Linux, for instance, pip is typically installed separately from Python itself; on Windows, it's installed automatically with Python.

Run with PyPy

CPython, the most commonly used implementation of Python, prioritizes compatibility over raw speed. For programmers who want to put speed first, there’s PyPy, a Python implementation outfitted with a JIT compiler to accelerate code execution.

Because PyPy was designed as a drop-in replacement for CPython, it’s one of the simplest ways to get a quick performance boost. Many common Python applications will run on PyPy exactly as they are. Generally, the more the application relies on “vanilla” Python, the more likely it will run on PyPy without modification.

However, taking the best advantage of PyPy may require testing and study. You’ll find that long-running apps derive the biggest performance gains from PyPy, because the compiler analyzes the execution over time to determine how to speed things up. For short scripts that merely run and exit, you’re probably better off using CPython, since the performance gains won’t be sufficient to overcome the overhead of the JIT.

Note that PyPy’s support for Python tends to lag the most current versions of the language. When Python 3.12 was current, PyPy only supported up to version 3.10. Also, Python apps that use ctypes may not always behave as expected. If you’re writing something that might run on both PyPy and CPython, it might make sense to handle use cases separately for each interpreter.
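
A minimal sketch of handling the two interpreters separately: detect the implementation at startup and branch on it. The function choose_backend and its return values are hypothetical placeholders for whatever code paths an application actually has.

```python
import platform

IMPLEMENTATION = platform.python_implementation()  # "CPython", "PyPy", ...
RUNNING_ON_PYPY = IMPLEMENTATION == "PyPy"


def choose_backend():
    """Hypothetical switch: pick a code path per interpreter."""
    if RUNNING_ON_PYPY:
        return "pure-python"  # let the JIT optimize plain Python loops
    return "c-extension"      # on CPython, lean on compiled helpers


print(IMPLEMENTATION, choose_backend())
```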

Copyright © 2024 IDG Communications, Inc.