
1 Billion Rows Challenge

·1360 words·7 mins·
data-engineering rust python duckdb


Comparing Python and Rust for Processing Large Datasets
#

In this post, I will share my experience tackling the 1 Billion Rows Challenge. It consists of reading a dataset with 1 billion rows and 2 columns (city, temp) and computing the minimum, maximum, and mean temperature for each city. I approached this problem in both Python and Rust to compare their performance and capabilities. Below, I will walk you through the different strategies I employed for each language and the results of my benchmarks.
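For reference, the input is a plain text file with one measurement per line, in the form city;temperature, where the temperature carries a single fractional digit. A few illustrative lines (the values here are made up):

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2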

Note that all the code and benchmarks are available in the GitHub repo:

RomainEconomics/1rbc


To generate the file used for the computation, go to the 1BRC repo.

Benchmarks
#

As a teaser, here are the results of the benchmarks (all timings are in seconds):

| Strategy | Language | Mean (s) | StdDev (s) | Median (s) | Min (s) | Max (s) |
|---|---|---|---|---|---|---|
| v6_multi_threading_v1 | Rust | 4.71 | 0.17 | 4.76 | 4.46 | 4.88 |
| baseline_ibis_duckdb | Python | 11.25 | 0.52 | 11.34 | 10.71 | 11.82 |
| baseline_polars | Python | 20.18 | 12.76 | 11.81 | 11.63 | 40.37 |
| v5_mmap | Rust | 29.85 | 1.12 | 29.34 | 29.27 | 31.84 |
| v5_pypy_mp_v2 | Python | 34.77 | 9.42 | 32.78 | 24.41 | 46.29 |
| v4_parse_temp | Rust | 39.34 | 0.13 | 39.28 | 39.24 | 39.56 |
| v3_fast_float | Rust | 41.97 | 0.07 | 41.94 | 41.92 | 42.08 |
| v4_pypy_mp | Python | 42.67 | 7.67 | 39.17 | 34.52 | 54.08 |
| v2_faster_hash | Rust | 44.32 | 0.15 | 44.28 | 44.19 | 44.5 |
| v1_bytes | Rust | 53.73 | 0.78 | 54.14 | 52.4 | 54.3 |
| v0_buffer_reader | Rust | 77.43 | 5.76 | 75.23 | 74.19 | 87.68 |
| v3_pypy_temp_parsing | Python | 96.24 | 0.55 | 96.32 | 95.52 | 96.81 |
| v2_pypy_bytes | Python | 124.39 | 0.78 | 124.15 | 123.7 | 125.73 |
| v1_pypy_list | Python | 181.59 | 0.37 | 181.74 | 181.09 | 182 |
| v0_builtin_with_pypy | Python | 184.76 | 1.34 | 184.33 | 182.98 | 186.19 |
| baseline_pandas | Python | 217.59 | 4.66 | 215.74 | 214.61 | 225.84 |
| v0_builtin | Python | 703.82 | 12.6 | 699.51 | 695.57 | 725.97 |

Python Strategies
#

Baseline in Pure Python
#

I started with a baseline implementation in pure Python. This approach was straightforward but not optimized for performance.

import sys

class TempStat:
    def __init__(self, temp: float):
        self.min_val = temp
        self.max_val = temp
        self.sum_val = temp
        self.count = 1

    def mean(self) -> float:
        return self.sum_val / self.count

    def update(self, temp: float):
        self.min_val = min(self.min_val, temp)
        self.max_val = max(self.max_val, temp)
        self.sum_val += temp
        self.count += 1


if __name__ == "__main__":
    file_path = sys.argv[1]

    cities: dict[str, TempStat] = {}

    with open(file_path, "r") as file:
        for line in file:
            city, temp = line.split(";")
            if city in cities:
                cities[city].update(float(temp))
            else:
                cities[city] = TempStat(float(temp))
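The snippet above only aggregates the statistics. To actually emit the result (min/mean/max per city, sorted by city name), a few lines like these can be appended; the exact output format is not the focus of this post:

    for city in sorted(cities):
        s = cities[city]
        print(f"{city}={s.min_val:.1f}/{s.mean():.1f}/{s.max_val:.1f}")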

This code runs, but is very slow (more than 700 seconds).

Several problems can be identified:

  • It uses the standard CPython interpreter, which is not the most performant, as we will see.
  • We read each line as a decoded string instead of raw bytes, which is slower.
  • Each temperature is parsed to a float, which is relatively expensive.
  • Only one thread, and therefore only one core, is used.

We will try to improve on these points in the following sections.

But first, even though the goal is to stick to plain Python and Rust as much as possible, it is interesting to compare our results to some more popular, and more optimized, libraries.

Using Common Libraries
#

Next, I leveraged some popular Python libraries known for their performance with large datasets:

  • Pandas: A powerful data manipulation library. The most used one in Python, but not the fastest.
  • Polars: A fast DataFrame library.
  • DuckDB: An in-process SQL OLAP database management system.

As an example, the code for DuckDB (using Ibis as a wrapper around it):

import sys
import ibis

if __name__ == "__main__":
    file_path = sys.argv[1]

    data = ibis.read_csv(file_path, names=["city", "temp"], sep=";")

    res = (
        data.group_by("city")
        .aggregate(
            min_val=data.temp.min(),
            max_val=data.temp.max(),
            mean_val=data.temp.mean().round(1),
        )
        .order_by("city")
    )

    df = res.to_pandas()

The advantage of these libraries is that they are really easy to use and already optimized for performance (the DuckDB version runs in about 11 seconds on my machine, as the benchmarks above show). The exception is Pandas: if you try to read the file with the default parameters, you will likely run out of memory.
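For comparison, the Polars version can be written in much the same spirit as the Ibis/DuckDB one. The snippet below is a sketch of how I would write it with the lazy CSV scanner, not necessarily the exact code behind the baseline_polars benchmark:

import sys

import polars as pl

if __name__ == "__main__":
    file_path = sys.argv[1]

    res = (
        pl.scan_csv(
            file_path,
            separator=";",
            has_header=False,
            new_columns=["city", "temp"],
        )
        .group_by("city")
        .agg(
            min_val=pl.col("temp").min(),
            max_val=pl.col("temp").max(),
            mean_val=pl.col("temp").mean().round(1),
        )
        .sort("city")
        .collect()  # the streaming engine can help if memory is tight
    )

    print(res)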

PyPy
#

I also tried running the code with PyPy, a just-in-time compiler for Python, to see if it could offer any performance improvements.

We went from 700 seconds down to 184 seconds, a nice improvement without changing a single line of code. We will thus stick with this interpreter for the following optimizations.

Optimizations
#

  • Reading file bytes: instead of reading the file line by line as strings, I read it as raw bytes.
    • About 60 seconds gained, from 184 to 124 seconds.
  • Parsing the temperature as an integer: since every temperature carries exactly one fractional digit, it can be parsed as an integer (in tenths of a degree) instead of a float, which is cheaper. A small sketch of the idea follows below.
    • Almost 30 seconds gained, from 124 to 96 seconds.
  • Multiprocessing: since Python’s Global Interpreter Lock (GIL) limits the effectiveness of multithreading, I used multiprocessing to parallelize the task.
    • About 62 seconds gained, from 96 down to 34 seconds.

Reading the file as bytes and parsing the temperature as an integer were the most significant optimizations before we were able to use multiprocessing.
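To make the integer-parsing point concrete, here is a minimal sketch of the idea, assuming the temperature bytes always have exactly one fractional digit and no surrounding whitespace (the function name is mine, not the repo's):

def parse_temp_tenths(raw: bytes) -> int:
    # raw looks like b"12.3" or b"-3.4"; return the value in tenths of a degree
    sign = 1
    start = 0
    if raw[0] == ord("-"):
        sign = -1
        start = 1
    value = 0
    for byte in raw[start:]:
        if byte != ord("."):  # skip the decimal point, keep the digits
            value = value * 10 + (byte - ord("0"))
    return sign * value


# parse_temp_tenths(b"-12.3") == -123
# Min/max/sum are then accumulated as integers and divided by 10 at the very end.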

However, to read the file and process it on multiple cores, we first need to ensure that each core sees a different chunk of the file.

For that, I defined this function:

import io
import os


def find_chunk_boundaries(filename: str, workers: int) -> list[tuple[int, int]]:
    file_size = os.path.getsize(filename)
    chunk_size = file_size // workers
    chunks = []

    def find_new_line(f: io.BufferedReader, start: int) -> int:
        # Advance from `start` to the first byte after the next newline,
        # so that a chunk boundary never falls in the middle of a line.
        f.seek(start)
        while True:
            chunk = f.read(2048)
            if b"\n" in chunk:
                return start + chunk.index(b"\n") + 1
            if len(chunk) < 2048:
                return f.tell()  # reached the end of the file
            start += len(chunk)

    with open(filename, "rb") as f:
        start = 0
        for _ in range(workers):
            # Clamp to the file size so the last chunk never points past the end
            end = min(find_new_line(f, start + chunk_size), file_size)
            chunks.append((start, end))
            start = end
    return chunks

This function returns the byte boundaries of the chunk that each worker will process. Note that the end of each chunk must fall exactly at the end of a line, to avoid splitting a line between two workers.
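To show how these boundaries are consumed, here is a sketch of the multiprocessing side, assuming find_chunk_boundaries from above is in the same module. Each worker opens the file on its own, seeks to its start offset, aggregates statistics for its slice, and the parent merges the partial results. This is a simplified version (it reads each chunk in one go and parses floats); the repo's v5_pypy_mp_v2 is more heavily optimized:

import multiprocessing as mp
import sys


def process_chunk(args: tuple[str, int, int]) -> dict[bytes, list[float]]:
    filename, start, end = args
    stats: dict[bytes, list[float]] = {}  # city -> [min, max, sum, count]
    with open(filename, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            city, _, raw = line.partition(b";")
            temp = float(raw)
            s = stats.get(city)
            if s is None:
                stats[city] = [temp, temp, temp, 1]
            else:
                if temp < s[0]:
                    s[0] = temp
                if temp > s[1]:
                    s[1] = temp
                s[2] += temp
                s[3] += 1
    return stats


if __name__ == "__main__":
    filename = sys.argv[1]
    workers = mp.cpu_count()
    chunks = [(filename, s, e) for s, e in find_chunk_boundaries(filename, workers)]

    with mp.Pool(workers) as pool:
        partials = pool.map(process_chunk, chunks)

    # Merge the per-worker results into a single dict
    merged: dict[bytes, list[float]] = {}
    for part in partials:
        for city, s in part.items():
            m = merged.get(city)
            if m is None:
                merged[city] = s
            else:
                m[0] = min(m[0], s[0])
                m[1] = max(m[1], s[1])
                m[2] += s[2]
                m[3] += s[3]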

Using all 20 cores of my machine cut the processing time by almost a factor of three.

Starting from more than 700 seconds, we are now down to 34 seconds.

But we’re still far from the performance of Polars or DuckDB.

Rust Strategies
#

Baseline in Pure Rust
#

I started with a baseline implementation in pure Rust using a buffered reader (v0_buffer_reader), which already brings the runtime down to about 77 seconds before any further optimization.

Optimizations
#

The Rust code then went through a progression similar to the Python one, which you can follow through the strategy names in the results below:

  • v1_bytes: reading the file as raw bytes instead of UTF-8 strings.
  • v2_faster_hash: switching to a faster hash function for the city map.
  • v3_fast_float: faster float parsing.
  • v4_parse_temp: parsing the temperature as an integer, as in the Python version.
  • v5_mmap: memory-mapping the file.
  • v6_multi_threading_v1: splitting the file into chunks and processing them on multiple threads (without using Rayon).

Results
#

| Strategy | Language | Mean (s) | StdDev (s) | Median (s) | Min (s) | Max (s) |
|---|---|---|---|---|---|---|
| v6_multi_threading_v1 | Rust | 4.71 | 0.17 | 4.76 | 4.46 | 4.88 |
| baseline_ibis_duckdb | Python | 11.25 | 0.52 | 11.34 | 10.71 | 11.82 |
| baseline_polars | Python | 20.18 | 12.76 | 11.81 | 11.63 | 40.37 |
| v5_mmap | Rust | 29.85 | 1.12 | 29.34 | 29.27 | 31.84 |
| v5_pypy_mp_v2 | Python | 34.77 | 9.42 | 32.78 | 24.41 | 46.29 |
| v4_parse_temp | Rust | 39.34 | 0.13 | 39.28 | 39.24 | 39.56 |
| v3_fast_float | Rust | 41.97 | 0.07 | 41.94 | 41.92 | 42.08 |
| v4_pypy_mp | Python | 42.67 | 7.67 | 39.17 | 34.52 | 54.08 |
| v2_faster_hash | Rust | 44.32 | 0.15 | 44.28 | 44.19 | 44.5 |
| v1_bytes | Rust | 53.73 | 0.78 | 54.14 | 52.4 | 54.3 |
| v0_buffer_reader | Rust | 77.43 | 5.76 | 75.23 | 74.19 | 87.68 |
| v3_pypy_temp_parsing | Python | 96.24 | 0.55 | 96.32 | 95.52 | 96.81 |
| v2_pypy_bytes | Python | 124.39 | 0.78 | 124.15 | 123.7 | 125.73 |
| v1_pypy_list | Python | 181.59 | 0.37 | 181.74 | 181.09 | 182 |
| v0_builtin_with_pypy | Python | 184.76 | 1.34 | 184.33 | 182.98 | 186.19 |
| baseline_pandas | Python | 217.59 | 4.66 | 215.74 | 214.61 | 225.84 |
| v0_builtin | Python | 703.82 | 12.6 | 699.51 | 695.57 | 725.97 |

Analysis
#

Rust
#

For plain single-threaded implementations, Rust consistently outperformed Python in every benchmark. The fastest Rust implementation, which used multithreading, completed the task in just 4.71 seconds on average.

Python
#

Python, while slower than Rust, offers a rich ecosystem of libraries that call into C, C++, or even Rust code, which makes fast solutions easy to write and maintain. The fastest Python implementation, based on DuckDB, completed the task in 11.25 seconds on average. PyPy provided a significant speedup over CPython, but it was still not as fast as Rust.

Conclusion
#

This exercise was a valuable learning experience. It highlighted the performance differences between Python and Rust and provided insights into various optimization techniques. While Rust is clearly faster, Python’s ease of use and extensive library support make it a strong contender for many applications.

Notably, I learned a lot about multithreading in Rust (without using Rayon, a popular data-parallelism library), memory-mapped files, and other optimization strategies. If performance is critical and you are comfortable with Rust, it seems like a good choice. However, for ease of development and for leveraging existing libraries, Python remains a powerful tool.