In this interview, Ralf Gommers, co-director of Quansight Labs and a key contributor to NumPy, discusses the importance of sustainability in open source projects. He explores the balance between community-driven efforts and corporate support while addressing the environmental impact of software. By encouraging better practices, he sees an opportunity to make software more sustainable. Ralf also emphasizes the need for universities to teach essential software skills and foster collaboration within open source communities, empowering the next generation of developers.
In a recent conversation with Gareth Thomas on The Inspiring Computing podcast, Ralf reflects on his journey with open source, particularly his work with NumPy. He shares insights on the project’s governance model, the value of community funding, and strategies for sustaining long-term projects. This discussion explores how collaboration and informed practices can enhance the impact of open source, benefiting developers and the broader community alike. Listen to their full conversation HERE (or on your preferred podcast platform), or read Ralf’s responses below.
Q: Can you share a bit about your journey to becoming a key influencer in scientific computing?
Ralf: I’m from the Netherlands, where I grew up and studied Applied Physics. I have no formal training in software engineering—I think I took one course in Pascal, which was incredibly boring, so I didn’t touch programming for a few years after. Then I went into research. I initially intended to get a job, but in the last year of my degree, I discovered that my research was actually very interesting. That’s when I started learning programming because I had to generate and analyze data.
At that time, there was no funding available in the Netherlands for new academic work, so I went to the UK for my PhD. I thought I would then get a job, but my professor encouraged me to apply for a postdoc position at MIT. It was the only one I applied to, and I didn’t expect to get it, but I did, and that took me to the US. Over time, I moved around and worked in a few other countries, gradually getting more involved in software engineering through my research. For example, I worked a lot with lasers, optics, and atomic physics, where precise control of experimental settings was required. This is how I basically taught myself programming.
Initially, I was using MATLAB, but when I moved and didn’t have a license, I quickly realized the drawbacks of relying on proprietary software. So, I decided to switch from Windows to Linux, from MATLAB to Python, and started using Vim as my editor. There were some unproductive and frustrating weeks, but I eventually got the hang of it.
Q: What is Quansight, and what is your role there?

Ralf: I joined Quansight in 2019; it was about a year old then, and since then, we’ve grown to about 65 people. We’re a small consulting company, so we make our money by consulting, mostly around the PyData stack. In addition to that, we have an open source lab called Quansight Labs, which I started leading when I joined Quansight, and I’m now co-leading with Tania Allard. I focus more on numerical work, while she focuses on Jupyter, accessibility, and DevOps. We both do a bit of packaging because everyone needs packaging.
Q: Given your long history with programming, what led you to transition from being a volunteer contributor to making open source your career, particularly with NumPy?
Ralf: It was about solving a problem I was facing. I worked on NumPy for 10 years as a volunteer, and as the user base grew, the demands became more intense. My professional career also became more demanding, and it reached a point where it just wasn’t sustainable anymore. Many long-time open source maintainers face this issue. I had to decide whether to stop contributing or try to make it my job. Around that time, I met Travis Oliphant, the CEO of Quansight, at a conference. We’d known each other digitally for a long time but hadn’t met in person until then. He had just started Quansight with the goal of supporting open source projects, even though he no longer actively contributes himself. He suggested that I lead Quansight Labs to shape it and figure out how to make community-driven open source more sustainable. My job is to make open source more sustainable and provide a home for maintainers to do impactful work for the community.
“My job is to make open source more sustainable and provide a home for maintainers to do impactful work for the community.”
Ralf Gommers
Q: What lessons have you learned from NumPy’s journey, and would you advise others to follow a similar path or take different approaches?
Ralf: I think it’s actually very rare for someone to start a project intending to change the world or reach millions of users. Most success stories are accidental. It usually begins with a personal interest or a problem you want to solve because you think you can improve on existing solutions. When it starts working and gaining traction, the most important thing you can do is reflect and develop conscious strategies around it. One part of that is deciding what to work on and being careful not to expand the project’s scope too much, which is something we still struggle with in both NumPy and SciPy.
Q: What challenges did NumPy face regarding decision-making and governance, and how did formalizing its governance model help improve collaboration among maintainers?
Q: When was the NumPy governance model formalized?
Q: This standardization project has spilled over into other programs. How is NumPy leading standardization efforts to guide other projects and prevent reinventing the wheel?
Ralf: That’s effectively what has been happening since around 2015. The limitations of NumPy became clear as the number of Python users grew, and they had real needs for working with very large datasets and GPUs, which were becoming important for deep learning. NumPy doesn’t handle parallelism, distributed computing, or GPUs. People liked the NumPy API and its fundamental concepts, so they began copying it, making slight tweaks where needed, and creating versions that worked on GPUs, for example. Now, there are frameworks like TensorFlow, PyTorch, JAX, CuPy, and many others that all share similarities with NumPy but address its limitations.
One of the first projects I worked on when I joined Quansight Labs was trying to create a standardized subset of NumPy, which I called R-NumPy. I wrote a detailed email about the rationale behind it, hoping it could help standardize some of these efforts. However, it immediately got filibustered because one person didn’t like the word “standard,” arguing that it would harm NumPy’s development. So, I decided not to pursue it directly within NumPy but to take a step back. In mid-2020, I reached out to developers of all these other libraries and put together a consortium to see if we could standardize a common set of features and identify where different libraries diverged.
One of the big issues was that certain design decisions in NumPy, like how indexing an array works or type promotion rules, were hard to change without breaking compatibility for existing users. After a year and a half of work, we developed the first Python Array API standard. This standardization process was much harder and more work than I expected. But now, we have three versions of the standard, and it’s implemented by NumPy 2.0. We’ve filled some of the gaps where NumPy lacked functionality that other libraries needed. It’s starting to gain traction, with libraries like SciPy, scikit-learn, and Xarray implementing it. It addresses the fragmentation problem, where each library was doing slightly different things.
“After a year and a half of work, we developed the first Python Array API standard. This standardization process was much harder and more work than I expected. But now, we have three versions of the standard, and it's implemented by NumPy 2.0”
Ralf Gommers
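To make the fragmentation problem concrete, here is a minimal sketch of what array-API-agnostic code looks like. It assumes the array-api-compat helper package (not mentioned in the interview) is installed, and `standardize` is a hypothetical example, not code from any of these projects:

```python
# Minimal sketch of array-API-agnostic code. Assumes the array-api-compat
# helper package is installed; `standardize` is a hypothetical example.
import numpy as np
from array_api_compat import array_namespace

def standardize(x):
    # Look up the array library ("namespace") that x belongs to, then use
    # only functions from the standard, so the same code runs unchanged on
    # NumPy, CuPy, PyTorch, and other conforming libraries.
    xp = array_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)

# NumPy >= 2.0 implements the Array API standard directly.
print(standardize(np.asarray([1.0, 2.0, 3.0, 4.0])))
```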
Q: How did you build a consortium in the open source community to bring together stakeholders who often work in silos?
Ralf: It was a combination of things. There were a few developers from other libraries that I knew personally, and we have a large team working on PyTorch at Quansight. Then there were libraries, like TensorFlow or JAX, where I didn’t know anyone. TensorFlow, for instance, is a large team at Google, but they tend to be a bit more insular, so that required active outreach to find and connect with the right person on that team.
For that, we needed some concerted effort, so we went through our corporate connections and asked for support. We reached out to companies that could benefit from this, like large finance companies or company-backed projects. We ended up with six funders, with Intel being the first to come on board. The TensorFlow team at Google also pitched in, along with a few others. Their support helped us manage the organizational efforts and the detailed work of writing the standard itself.
Q: You’ve also invested a lot in Python’s build and packaging systems. Why is that such a hard problem?

Ralf: To me, building and packaging are like two sides of the same coin. You first have to build something into a binary, and then you have to put it somewhere so that everyone can use it. There’s a lot of complexity in the tools and standards around installs and environments, which makes Python packaging very challenging. It’s gotten a lot better over the years, but there are still plenty of challenges.
For a very long time, pretty much everyone was building their projects with Distutils, which was part of the Python standard library. When that didn’t really work, Setuptools became the go-to extension of Distutils. But all of that was essentially monkey-patching on top of a flawed codebase. Meanwhile, NumPy had its own version of Distutils, which added better support for C++, Fortran, and other things needed for numerical computing.
Then, with Python 3.10, the CPython team decided they didn’t want to maintain Distutils anymore. They deprecated it in 3.10 and removed it in 3.12. At that point, my initial reaction was to wait it out and see if Setuptools could absorb it. But after six months, it became clear that wasn’t happening.
So, I started experimenting with other approaches, and quickly narrowed it down to Meson and CMake. I knew that NumPy’s existing setup for numerical computing was better than what either of them had, so I’d have to contribute to those packages as well. Meson had better documentation, was written in Python, and felt nicer to work with compared to CMake’s C++ and its complex domain-specific language (DSL).
I decided to go with Meson and helped develop a package called Meson-Python to create Python packages because Meson, while great for C++ and Fortran, didn’t have much built-in support for Python packaging. It took me about six months to get the first version up and running, starting with SciPy, since it was the hardest challenge. If it worked there, it would work everywhere else. The lockdown in the Netherlands during the pandemic gave me the time to focus on it, and I think that’s what helped me get it done in a reasonable timeframe.
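For a sense of what adopting that toolchain involves on the package author’s side, here is a minimal, hypothetical configuration using Meson-Python as the build backend. The backend hook names come from meson-python’s documentation; the package name is made up, and the actual build targets would live in an accompanying meson.build file:

```toml
# pyproject.toml — a minimal sketch; "mypkg" is a hypothetical package name
[build-system]
build-backend = "mesonpy"      # the backend provided by meson-python
requires = ["meson-python"]

[project]
name = "mypkg"
version = "0.1.0"
```

With this in place, standard tools like `pip install .` or `python -m build` drive Meson under the hood, rather than Distutils or Setuptools.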
Q: Would you recommend the Meson toolchain over CMake for large packages, or is it mainly suited to scientific libraries like NumPy and SciPy?
Ralf: I think people have their personal preferences, and I don’t think everything will end up using the same toolchain, which is a good thing. In the past, we had everything relying on Distutils, and it wasn’t great. Now, Python packaging has evolved to a place where if you need something better, you can build it, and there are standard hooks and ways to plug it in.
If you don’t have any compiled code, things are pretty straightforward—you’ve got at least five good choices, maybe more. If you do have compiled code or an older package, you might still be using the older toolchains, or you may have recently migrated, and then you’re set. If you’re starting a new project now, I’d say using Meson with Meson-Python or using CMake with scikit-build-core are both great options. You can’t really go wrong—they’re very similar, with different capabilities, but the choice mostly comes down to personal preference. Both are a huge improvement over what we had before.
I enjoy seeing new people use Meson-Python, and I try to help them quickly when they get stuck. My co-maintainer, Daniele, is the same, and the Meson team is quite responsive, so questions get answered quickly. But I also think it’s totally fine for people to use other tools like CMake, especially if that’s what they already know or if they have dependencies in C++.
“If you're starting a new project now, I'd say using Meson with Meson-Python or using CMake with scikit-build-core are both great options. You can't really go wrong—they're very similar, with different capabilities, but the choice mostly comes down to personal preference.”
Ralf Gommers
Q: What advice would you give to someone who wants to start contributing to open source?

Ralf: I always recommend that people start by solving a problem they genuinely care about or enjoy working on. It could be fixing a bug or adding a new feature that doesn’t exist in the language or package they’re using. As long as you’re mindful of the time you spend and it stays fun, it’s worth doing more and learning more.
Q: Do you believe the rise of generative AI, like ChatGPT, is leading to less community engagement, or is that concern overblown?
Ralf: We’ve actually been discussing the impact of code generated by ChatGPT and whether it’s a concern if that code ends up in pull requests. My take is that it’s not a big issue today, though it could be in the future. This is similar to what we saw years ago when people would go to places like MATLAB Central, grab some code that wasn’t properly licensed, and just translate it. Honestly, I don’t see a decline in engagement because of AI. The questions ChatGPT can answer are usually the types of questions I don’t want to spend my time answering. I want to engage with more complex, interesting questions, and we’re still far from the point where AI can handle those in a meaningful way.
Q: What’s your favorite feature or achievement in the recently released NumPy 2.0 that you’re particularly proud of?
Ralf: We were almost too tired to celebrate, but it was a major achievement. Getting the whole community aligned was quite a journey. If I think about the new features, we already talked about the Array API standard, which is a highlight for me personally. We finally got NumPy to adopt the standard, which is a big deal. Another aspect I’d like to mention is the effort put into reducing the binary size. That’s not something most users think about because it doesn’t show up in the documentation, but it’s actually super important.
I remember being on holiday in Norway in winter, on a train ride, and I started thinking about the environmental impact of not just me working on NumPy but of the project as a whole. I realized that by shaving off even one megabyte from a NumPy binary, we could save the equivalent of 200 intercontinental flights per year in terms of data transfer. The scale of this impact is surprising when you think about it—NumPy has about 200 million downloads a month, and the package itself is around 15 megabytes. That adds up to about 36 petabytes of data per year, which is significant from both an environmental and sustainability standpoint.
“I realized that by shaving off even one megabyte from a NumPy binary, we could save the equivalent of 200 intercontinental flights per year in terms of data transfer.”
Ralf Gommers
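The back-of-envelope arithmetic behind those figures is easy to reproduce; a quick sketch using only the numbers quoted above:

```python
# Rough bandwidth estimate from the figures quoted in the interview:
# ~200 million downloads/month and a ~15 MB package.
downloads_per_month = 200_000_000
package_size_mb = 15

total_mb_per_year = downloads_per_month * package_size_mb * 12
print(total_mb_per_year / 1e9)  # ~36 petabytes/year (1 PB = 10^9 MB)

# Shaving 1 MB off the binary avoids downloads_per_month * 12 MB per year:
print(downloads_per_month * 12 / 1e9)  # ~2.4 PB/year saved per MB trimmed
```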
Q: Would you recommend Rust for new scientific computing projects?

Ralf: If you have a new project and don’t have specific requirements that tie you to C or C++, Rust is a great choice. It’s popular for good reasons, with excellent features and a developer experience I’d love to dive into if I had the time. However, for existing projects like NumPy or SciPy, transitioning to Rust is a massive undertaking. There’s too much legacy code, and some of Rust’s capabilities, like SIMD instructions, are not yet as mature. Plus, using foreign function interfaces in Rust can be tricky, especially if you’re dealing with old Fortran APIs, which might push you into unsafe territory.
Q: For developers entering the field, is it better to learn C and C++ or a newer language like Rust?

Ralf: Both. I think you won’t be able to avoid learning C or C++ in the next decade. But you should also learn at least one modern language—Rust is a great choice. I’m also personally interested in Zig; it’s excellent for cross-compilation and can even help create standalone Rust binaries. So, both Rust and Zig have their advantages, and they’re worth exploring if you want to stay current in the field.
At Quansight Labs, we balance work we believe is essential for the community with grant-funded initiatives. We have grants from NASA, a German government agency, and non-traditional funders like the Chan Zuckerberg Imaging Institute. Additionally, companies, whether large or small, often need to engage with someone directly because they can’t always approach a community. Having a variety of funding sources and durations is crucial so that no one person’s job or a project’s health is at risk if that person becomes unavailable.
“Companies that employ maintainers could offer them additional work time. For example, they could allocate one day a week for maintainers to focus on specific open source projects they believe in.”
Ralf Gommers
Thank you, Ralf. These answers highlight the significant challenges and opportunities within the open source community, emphasizing the need for sustainable practices and collaborative engagement. Ralf underscores the importance of community-driven projects and the role of various stakeholders, including companies and universities, in fostering a healthy ecosystem. By investing in the maintenance and security of open source projects and encouraging a diverse range of contributions, we can ensure their longevity and impact.