An Interview with Tom Caswell, Matplotlib Lead Developer
Given the recent surge in popularity of open source data science projects like pandas, NumPy, and Matplotlib, it probably surprises no one that the increased interest is generating some user complaints about documentation.
To help shed some light on what’s at stake, we thought we’d talk to someone who knows a lot about the subject: Thomas Caswell, Lead Developer of Matplotlib.
Matplotlib, a flexible and customizable tool for producing static and interactive data visualizations, has been around since 2001 and is a foundational project in the scientific Python stack. Matplotlib became a NumFOCUS sponsored project in 2015.
Tom has been working on Matplotlib for the past five years and got his start answering questions about the project on Stack Overflow. Answering questions became submitting bug reports, which became writing patches, which became maintaining the project… and now he’s the Lead Developer! (Fun fact: Tom’s advancement through the open source community follows exactly the path described by Brett Cannon, a core Python maintainer.)
NumFOCUS Communications Director, Gina Helfrich, sat down with Tom recently to discuss the challenges of managing documentation on a project as massive and as fundamental as Matplotlib.
“If it’s not documented, it doesn’t exist.”
GH: Thanks so much for taking the time to talk with us about Matplotlib and open source documentation, Tom! To contextualize our conversation a bit, can you speak a little to your impression of the recent back-and-forth with Wes McKinney about pandas and user complaints about the documentation?
TC: I only kind of saw the edges of it, but I see both sides.
On one hand, I think something Mike Pope said was, “if it’s not documented, it doesn’t exist.” If you are writing open source tools, part of that is documenting them, and documenting them clearly in a way that users can discover and actually use, short of going to the source [code]. It’s not good enough to just dump code on the internet—you have to do the whole thing.
On the other hand, if you’re not paying [for the software], you don’t get to make demands. The attitude I think Wes was reacting to, which you see a lot, is: “You built this tool that is useful to me, therefore I expect enterprise-grade paid support because it’s obviously critical to what I’m doing.”
But I think the part Eric O Lebigot was responding to is the first part. Part of building a tool is the documentation, not just the code. But Wes is responding to the entitlement, the expectation of free work. So I see both sides.
GH: Looking at Matplotlib specifically, which is facing many of the same issues as pandas, I know you have some big challenges with your documentation. I get the impression that there’s this notion out there from new users that getting started with Matplotlib is super frustrating and the docs don’t really help. Can you tell me about the history there and how the project came to have this problem?
TC: So, Matplotlib is a humongous library. I’ve been working on it for 5 years, and probably once a month or every other month, there’s a bug report where my first reaction is, “Wait… we do what?”
And a lot of it is under-documented. The library survived at least 2 generations of partial conversion to standardized docstring formats. As I understand it (because I wasn’t around at the time), we were one of the first projects outside of core Python to adopt Sphinx to build our docs, possibly a little too early. We have a lot of weird customizations since Sphinx didn’t do them yet [at the time]. Other people have built better versions of those since then, but because Matplotlib is so huge, migrating them is hard.
I think if you build the .pdf version of our docs, it’s like 3,000 pages—and I would say the library has maybe half the documentation it really needs.
We are woefully under-documented, in the sense that not everything has good docs. On the other hand, we are over-documented in that what we have is not well organized, and there’s not a clear entry-point, and if you want to find out how to do something, even I have a hard time finding where something is documented. And if I [the Lead Developer] have issues finding it, there’s just no prayer of new users finding it. So in that sense, we are both drastically under-documented and drastically over-documented.
GH: Given that Matplotlib is over 15 years old, do you have a sense of who has been doing the writing of the documentation? How does your documentation actually get developed?
TC: Historically, very much like the code, the documentation has been organically developed. We’ve had a lot of investment in examples and docstrings, and a few things labeled as tutorials that teach you one specific skill. For example, we’ve got some prose on the “rough theory of colormaps” and how to make a colormap.
A lot of Matplotlib’s documentation is examples, and the examples overlap. Over the past few years, when I see interesting examples go by on the mailing list or on Stack Overflow, I’ll say, “Can you put this example in the docs?” and it will go someplace in the examples. So, I guess I’ve been actively contributing to the problem that there’s too much stuff to wade through.
Some of it is, people will do a 6-hour tutorial and then some of those examples will end up in the docs. And then someone else will do a 6-hour tutorial (you can’t cover the whole library in 6 hours) and the basics are probably similar, but they may format it differently.
GH: Wow, that sounds pretty challenging to inherit and try to maintain. What kinds of improvements have you been working on for the documentation?
TC: There’s been an effort over the past couple years to move to numpydoc format, away from the home-grown scheme we had previously. Also, Nelle Varoquaux recently did a tremendous amount of work and led the effort to move from how we were doing examples to using Sphinx Gallery, which makes it much easier to put good prose into examples. This has been picked up by Chris Holdgraf recently, as well. It will go live on our main docs with Matplotlib 2.1, which will be a huge improvement for users. Nelle also organized a distributed Docathon.
We’ve been trying to get better about new features, so that when there’s a new feature, you must add an example to the docs for that feature, which helps make things discoverable. We’ve been trying to get better about making sure docstrings exist, are accurate, and document all the parameters.
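[To make the docstring point concrete, here is a minimal sketch of a numpydoc-style docstring, followed by a rough check that every parameter is documented. The names `scale_values` and `undocumented_parameters` are hypothetical and are not part of Matplotlib, and the check is a simple heuristic for illustration, not how the project actually validates its docs.]

```python
import inspect


def scale_values(values, factor=1.0):
    """
    Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        The numbers to scale.
    factor : float, default: 1.0
        The multiplier applied to every element of `values`.

    Returns
    -------
    list of float
        A new list with each input value multiplied by `factor`.
    """
    return [v * factor for v in values]


def undocumented_parameters(func):
    """Return the parameter names with no 'name : type' line in the docstring.

    Heuristic (an assumption of this sketch): any docstring line containing
    ' : ' is treated as a numpydoc parameter entry.
    """
    doc = inspect.getdoc(func) or ""
    documented = {
        line.split(" : ")[0].strip()
        for line in doc.splitlines()
        if " : " in line
    }
    signature = inspect.signature(func)
    return [name for name in signature.parameters if name not in documented]


if __name__ == "__main__":
    print(scale_values([1.0, 2.0], factor=3.0))
    print(undocumented_parameters(scale_values))
```

[A check along these lines could run in a test suite so that a new parameter without a docstring entry is caught before the change is merged.]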
GH: If you could wave a magic wand and have the Matplotlib docs that you want, what would it look like?
TC: Well, as I mentioned, the docs grew organically, and that means we have no consistent voice across them. It also means there’s no single point of truth for various things. When you write an example, how far back down the basics do you go? So it’s not clear what you need to know before you can understand the example. Either you explain just enough, all the way back (so we’ve got a random assortment of the basics smeared everywhere), or you just have examples that, unless you’re already a heavy user, just make no sense.
So, to answer the question, having someone who can actually write and has empathy for users, to go through and basically write a 200-page “Intro to Matplotlib” book, and have that be the main entry to the docs. That’s my current vision of what I want.
GH: If you were introducing a new user to Matplotlib today, what would you have her read? Where would you point her in the docs?
TC: Well, there isn’t a good, clear, “You’ve been told you need to use Matplotlib. Go spend an afternoon and read this.” I’m not sure where I’d point people to for that right now. Nicolas Rougier has written some very good stuff, some of which has migrated into the docs.
There’s a lot out there, but it’s not collated centrally or linked from our docs as “START HERE.” I should also add that I might not have the best view of this anymore because I haven’t actively gone looking for it, so maybe I just never found it because I don’t need it. I don’t know that it exists. [This actually came up recently on the mailing list.]
The place we do point people to is, go look at the gallery and click on the thumbnail that looks closest to what you want to do.
Ben Root has presented an Anatomy of Matplotlib tutorial at SciPy several times. There’s a number of Matplotlib books that exist. It’s mixed whether the authors have been contributors [to the project]. Ben Root recently wrote one about interactive figures. I’ve been approached and have turned it down a couple times, just because I don’t have time to write a book. So my thought for getting a technical writer was to get a technical writer to write the book, and instead of publishing it as a book, put it in the online docs.
GH: Is there anyone in the Matplotlib contributor community who “specializes” in the documentation part of things, or takes a lot of ownership around documentation?
TC: Nelle was doing this for Matplotlib for a bit, but has stepped back. Chris Holdgraf is taking the lead on some doc-related things now. Nicolas Rougier has written a number of extremely good tutorials outside of the project documentation.
I mean, no one uses just Matplotlib. You don’t use us but not use SciPy, NumPy, or pandas. You have to be using something else to do the actual work that now you need to visualize. There are many “clean” introductions to Matplotlib in other places. For example both Jake VanderPlas’s new analysis book and Katy Huff and Anthony Scopatz’s book have introductions to Matplotlib that cover it to the degree they felt was needed for their purposes.
GH: I’d love to hear your thoughts on the role of Stack Overflow in all this.
TC: That actually is how I got into the project. My Stack Overflow number is large, and it’s almost all Matplotlib questions. And how I got started is I answered questions, and then, a lot of questions on Stack Overflow are, “Please read the docs for me.” Which, fine. But actually a great way to learn the library is to answer questions on Stack Overflow, because people who have problems that you don’t personally have will ask, “How do I do this?” and now you have to go figure out how to do it. It’s kind of fun.
But sometimes people will ask questions and they’ve actually found a bug. And in determining that they’ve actually found a bug, I start trying to figure out how to fix the bugs. So I started filing some reports, and some of those were, “Here’s a pull request to fix the bug I found.” And then when I started putting in a lot of PRs, they were like, “You need to start reviewing them now,” so they gave me commit rights and made me review things. And then they put me in charge. [laughter]
I do like Stack Overflow. I think to a large extent, what it’s replaced is the mailing list. If I were going to have any criticism of Stack Overflow, I think it’s convincing people who are answering questions to upstream more of the things.
There are some things on Stack Overflow which are very good examples. Like, here’s a complex thing: you have to touch these 7 different functions, each of which is relatively fully documented, but you have to put them together in just the right way. Some of those answers should probably go in the gallery with annotations from us about how it works. Basically if you go through Joe Kington’s top 50 answers, they should probably all just go in the docs.
Other ones, the question is being asked because the docstring is just not clear. And those, if you could convince people who are answering those questions to use it as basically a survey of, “where is our documentation not clear?” and instead of just answering it there [on Stack Overflow], to move that back [to the docs].
GH: What’s it like managing PRs for documentation as opposed to patches and bugfixes?
TC: We’ve tried to streamline how we do the docs PRs. On the other hand, writing docs PRs is the most painful thing ever in open source, because you get copyediting via pull request. You get picky, proofreading copyediting via GitHub comments. Like, “there’s a missing comma” or “two spaces!” And again, I keep using myself as a weird outlier benchmark, but I get disheartened when I write docs pull requests and then I get 50 comments of picky little things.
What I’ve started trying to push as the threshold on docs is, “Did it make it worse?” If it didn’t make it worse, merge it. Frequently, it takes more time to leave a GitHub comment than to fix it.
“If you can use Matplotlib, you are qualified to contribute to it.”