Thanks to a successful fundraising year in 2017, NumFOCUS is able to provide funding to help our projects improve usability, grow their communities, and shorten the time to major releases. NumFOCUS intends to distribute $60,000 in small development grants to our sponsored and affiliated projects in 2018.
This is the second round of grants made this year. You can read about the first round grant recipients in our prior blog post.
Paid Developer Time for Major Code Improvements
Statsmodels: Probability Plots and Generalized Additive Models — $3,000
(NumFOCUS Affiliated Project)
Statsmodels is the main Python package for general statistical analysis and econometrics. It covers large areas of statistics and econometrics including time series analysis, but is still missing several of the basic methods that are usually provided by general purpose statistical packages.
Statsmodels is gaining new model classes each year and current models are extended to cover new functionality. However, statsmodels also has a large number of pull requests that have stalled because of the lack of reviewers and contributors in that specific area.
The proposal is to finish and merge pull requests and address the corresponding open issues for two topics, probability plots and Generalized Additive Models (GAM).
The project proposes to improve the current code and API in probability plots and narrow the gap in missing features by adding generalized additive models and penalized splines.
Both parts of the project cover popular methods in data analysis and applied statistics and are expected to find widespread use. Generalized additive models and penalized splines provide new semiparametric methods for many use cases, and they will also provide the building blocks and test cases for extending penalized estimation to other models.
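To make the idea of a probability plot concrete, here is a minimal sketch using only the Python standard library. It is not statsmodels code: the function name `normal_probability_plot_points` and the choice of plotting position are assumptions made for illustration. It pairs each sorted observation with the standard normal quantile at the same rank; if the data are roughly normal, the pairs fall close to a straight line.

```python
from statistics import NormalDist


def normal_probability_plot_points(sample):
    """Return (theoretical_quantile, observed_value) pairs for a normal probability plot.

    Hypothetical helper for illustration only; it is not part of statsmodels.
    """
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for rank, value in enumerate(ordered, start=1):
        # Plotting position (rank - 0.5) / n is one common convention among several.
        probability = (rank - 0.5) / n
        points.append((standard_normal.inv_cdf(probability), value))
    return points


points = normal_probability_plot_points([2.1, 1.9, 3.2, 2.8, 2.5])
for quantile, observed in points:
    print(f"{quantile:+.3f}  {observed}")
```

Plotting the second column against the first gives the probability plot. A production implementation would typically also support other reference distributions and plotting-position conventions.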
SciPy: Maturing a sparse array implementation for SciPy — $3,000
(NumFOCUS Affiliated Project)
Sparse arrays are nd-arrays containing primarily zeros. These kinds of arrays can be represented very efficiently in memory, and processing tasks that involve sparse arrays can be significantly faster than equivalent operations on dense arrays. Consider, for example, the adjacency matrix of a Markov chain. In practical systems, such an adjacency matrix is quite sparse.
The “sparse” project aims to implement a subset of NumPy functionality on sparse arrays. It currently implements ufuncs and reductions, along with utility functions such as concatenate and stack, and supports the COO and DOK formats. We also plan to work on the CSD format (a generalization of CSR/CSC).
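As a rough illustration of why these formats save memory and work, the sketch below builds dictionary-of-keys (DOK) and coordinate (COO) representations in plain Python. It does not use the “sparse” package or its API; the function names and the small transition matrix are assumptions chosen for the example.

```python
def dense_to_dok(dense):
    """Dictionary-of-keys: map (row, col) -> value, storing nonzero entries only."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


def dok_to_coo(dok):
    """Coordinate format: parallel lists of row indices, column indices, and values."""
    rows, cols, data = [], [], []
    for (i, j), value in sorted(dok.items()):
        rows.append(i)
        cols.append(j)
        data.append(value)
    return rows, cols, data


def coo_matvec(rows, cols, data, vector, n_rows):
    """Multiply the sparse matrix by a dense vector, visiting stored entries only."""
    result = [0.0] * n_rows
    for i, j, value in zip(rows, cols, data):
        result[i] += value * vector[j]
    return result


# Made-up transition matrix of a 4-state Markov chain; most entries are zero.
transition = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]

dok = dense_to_dok(transition)
rows, cols, data = dok_to_coo(dok)
row_sums = coo_matvec(rows, cols, data, [1.0] * 4, n_rows=4)
print(len(dok), "stored entries out of", 4 * 4)
print(row_sums)
```

DOK is convenient for building a matrix one entry at a time, while COO's flat lists are better suited to bulk operations. The savings grow with the fraction of zeros, so they are modest here and large for realistic systems.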
The “sparse” project has reached the point where it is usable for some end-user use cases, and it is mature enough to have earned a place under the PyData umbrella on GitHub. However, it is still not close to being feature-complete or ready for adoption by the SciPy library. While the project is starting to attract new contributors, progress currently depends largely on the time that Hameer Abbasi can spend on development and maintenance. We propose a set of concrete milestones so that, with funding, he can spend more time working on “sparse”.
In terms of features, we aim to implement the CSD sparse array format as the largest deliverable. Furthermore we aim for a couple of performance improvements and the implementation of smaller features that were requested by the community.
Finally, we will use the funded development time to complete a new release, in order to get the new features in the hands of end users quickly.
The SciPy community has wanted a robust implementation of sparse arrays for a long time. Several attempts were started and then abandoned due to lack of time. The “sparse” project is the first implementation that promises to be full-featured and robust enough for the needs of the SciPy library, Scikit-learn, Dask, XArray and other consumers.
Julia: BlockBandedMatrices.jl: add support for general array backends (GPU) — $3,000
(NumFOCUS Sponsored Project)
The objective of the grant is to adapt BlockBandedMatrices.jl to support general backend storage types. This generalization makes it possible for the same code base to work on both the CPU and the GPU, simply by specifying whether the underlying storage should be a standard Julia matrix living on the CPU or a GPUArray matrix living on the GPU. Julia then automatically reroutes all operations to implementations specialized for that type. We expect that this will also allow BlockBandedMatrices to be distributed across CPUs by specifying a distributed storage type. Consequently, the work undertaken in the grant allows for efficient numerical solution of partial differential equations on the GPU or on distributed arrays, using either finite-difference or spectral-method discretizations, while hiding the details of the underlying matrix from the user. With the advent of Julia 1.0 this summer, and the increased interest it will generate, we believe this proposal is timely: it will demonstrate how foundational packages can be expanded to push computations across different hardware while retaining the same legible code base.
The funds will support Mayeul d’Avezac for a two-week sprint. Mayeul is a Senior Research Software Engineer in the Research Computing Service at Imperial College London.
The benefit for the community will be an easy-to-use framework for representing discretizations of partial differential equations that can live on a single machine, multiple machines, or the GPU. Other packages like ApproxFun.jl and DifferentialEquations.jl will be able to leverage this technology so that users can achieve high performance while only stating the problem, with minimal details about the underlying discretization or how it is stored.
The package developer, Sheehan Olver, is the primary caregiver for his 18-month-old daughter, which limits the hours that he can work on this project. Mayeul’s support will help to mitigate the effect of these child care duties on the project.
Improvements to Documentation
Pomegranate: Improving Documentation, Examples, and Tutorials — $3,000
(NumFOCUS Affiliated Project)
pomegranate is a package for flexible probabilistic modeling in Python. It is flexible because it lets users plug and play probabilistic components in a way that other packages don’t: for example, dropping an Exponential Distribution into a Mixture Model yields an Exponential Mixture Model, and dropping a hidden Markov model into a Bayes classifier yields a classifier over sequences.
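The plug-and-play idea can be sketched in plain Python without relying on pomegranate itself. The class names and the `log_probability` method below are assumptions for this illustration and should not be read as pomegranate's actual API. The point is only that any object exposing the same scoring method can be nested inside a mixture, and a mixture can in turn be nested inside a classifier.

```python
import math


class Exponential:
    """Exponential distribution with a given rate (hypothetical illustration class)."""

    def __init__(self, rate):
        self.rate = rate

    def log_probability(self, x):
        if x < 0:
            return -math.inf
        return math.log(self.rate) - self.rate * x


class Mixture:
    """Weighted mixture of any components that expose log_probability(x)."""

    def __init__(self, components, weights):
        if len(components) != len(weights):
            raise ValueError("need one weight per component")
        total = sum(weights)
        self.components = components
        # Weights are assumed to be strictly positive; normalize them to sum to 1.
        self.weights = [w / total for w in weights]

    def log_probability(self, x):
        terms = [
            math.log(w) + c.log_probability(x)
            for w, c in zip(self.weights, self.components)
        ]
        peak = max(terms)
        if peak == -math.inf:
            return -math.inf
        # Log-sum-exp keeps the sum numerically stable.
        return peak + math.log(sum(math.exp(t - peak) for t in terms))


class BayesClassifier:
    """Picks the class whose model gives the highest prior-weighted likelihood."""

    def __init__(self, models, priors):
        self.models = models
        self.priors = priors  # assumed strictly positive

    def predict(self, x):
        scores = [
            math.log(p) + m.log_probability(x)
            for p, m in zip(self.priors, self.models)
        ]
        return scores.index(max(scores))


# An exponential mixture, then dropped into a classifier as one of two class models.
mixture = Mixture([Exponential(5.0), Exponential(0.2)], [0.5, 0.5])
classifier = BayesClassifier([mixture, Exponential(1.0)], [0.5, 0.5])
print(classifier.predict(10.0), classifier.predict(1.0))
```

Because `Mixture` and `BayesClassifier` only call `log_probability`, neither needs to know what kind of model it wraps, which is the property that makes stacking possible.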
However, while this functionality is available, the documentation and usage tutorials for it are sparse. This is primarily because developer time has been spent adding useful features rather than fully explaining current usage. Unfortunately, this means that many of the great features currently available in pomegranate are underutilized simply because they’re unknown.
This grant will enable Jacob Schreiber, as the core developer of the project, to spend a significant portion of time extending the available documentation, writing stand-alone code examples that demonstrate current features, and revamping the tutorials folder to include all features. The following features will be focused on:
- Fast Bayesian Network structure learning using constraint graphs
- Learning or using estimators when data is missing
- Semi-supervised Learning using HMMs and naive Bayes classifiers
- Setting up distributed computation for any of the models
- Stacking models to produce more complicated ones, such as making a mixture of Bayesian networks or a Bayes classifier over HMMs
- Additional installation help and FAQs based on user reports from the past two years
By highlighting these lesser-known areas, we will be able to significantly improve the experience of new users who are looking for features, as well as existing users who are unaware that extended functionality exists in the tool they’re already using.
Bokeh: Bokeh Docs Modernization — $3,000
(NumFOCUS Sponsored Project)
With Bokeh 1.0 imminent, it is a good time to expand documentation sections that need extra work or are missing, and to polish some rough edges around the site and content. This proposal covers specific tasks that have recently been raised by users. It is split into several small-to-medium-effort tasks that will all be completed, plus two larger, more open-ended tasks; meaningful progress on either or both would be very useful.
- Update all Bokeh source files with common boilerplate and header format (started but not yet finished)
- Expand the User Guide section on using Bokeh with notebooks and JupyterLab
- Expand and reorganize the live plot gallery by adding:
  - simple “reference examples”
  - more sophisticated “use case” examples
Large effort / open ended
- Choose API documentation tool/standard for BokehJS and integrate into main docs
- Research implementing Bing for site search to replace the discontinued GSS
While the Bokeh documentation collection is fairly large, there are specific areas where users have asked for more or better documentation. Every item in this plan has been requested recently by different users.
Community Education and Engagement
MDAnalysis: MDAnalysis tutorial and hackathon — $2,500
(NumFOCUS Affiliated Project)
This project aims to host a two-day tutorial and hackathon to introduce new users to the MDAnalysis package and teach them how to use it. The event will be free to attend, with a travel grant available, and will be hosted at Northwestern University, IL in the fall of 2018.
A large part of the MDAnalysis user base is made up of academics, so tutorials such as these are important for growing our user base. Previous tutorial sessions, such as the 2015 CECAM workshop, have been successful in this aim. As most users of the package have a scientific background, instruction in software development practices is useful for creating and attracting the next generation of developers.
The morning of the first day will be a hands-on guided tutorial on the basic and intermediate use of MDAnalysis, followed by a more informal afternoon session where users will be helped in producing analysis code relevant to their own research or short tutorials on specific advanced topics. The second day will be focused on bridging the gap between users and developers. Attendees will be introduced to the basics of software development and how to make their first contribution to an open source software project. A hackathon guided by the present developers will follow, focused on fixing small issues blocking a 1.0 release.
Tutorial materials and recordings will be made available under a CC-BY license for any users who might not be able to attend the workshop. These contents will add to the existing materials already available to new and advanced users.
A tangential benefit of this project is some face-to-face time for developers to plan the upcoming 1.0 release. The small development grant we received from NumFOCUS last year to bring two developers together and add Python 3 support was hugely successful, and we would like to take full advantage of this occasion.
Shogun: Shogun website and UX redesign — $2,500
(NumFOCUS Sponsored Project)
The goal of this project is to release a new website for Shogun. The current website design and user experience are confusing and fail to achieve the following:
- Targeting new users and enabling them to get quickly started.
- Having a unified experience across the full content, supporting easy navigation (for example, the most content heavy sections—like Examples or API docs—provide no breadcrumbs, and almost all the sections have different design, etc.)
- Specific sections lack basic functionality (like filtering down Examples), which is especially important for new users as well as experienced users who want to find an answer quickly.
- Some important information is unnecessarily hidden or not well communicated, like activity (current development, news) and points of contact.
As our website is the first point of contact with our users, we must do a better job of welcoming new users and keeping our community members in the loop. In the last eight years we have changed our website three times: we went from using a complex CMS to a website architecture that is easy to maintain and update. During these changes we finally managed to come up with a small list of requirements for Shogun’s website:
- Easy to update: there’s absolutely no need for having a complex CMS for our needs. Namely, content based on text files, like Markdown, does a great job for us.
- Easy to navigate: users—both newcomers and advanced users—should find the content that they are looking for within two clicks. For example, how to install Shogun on their own system or how to start as a developer to extend Shogun.
The current website achieves (sort of) the first requirement, but not the second. In other words, although the content for “Getting started with Shogun” as well as “How to get started in developing Shogun” is all available on the website, our experience is that it is not easy to navigate for quick answers. For example, the latest addition to the website, the (meta) examples (http://shogun.ml/examples/latest/index.html), is very useful for newcomers who’d like to use Shogun, but it is a totally separate module. Hence, once a user navigates to that part of the site, she completely “falls out” of the scope of the original website. The same could be said of the API section.
Although the back end supports versioning of the documentation (see http://shogun.ml/examples/6.0.0/index.html, http://shogun.ml/examples/6.1.3/index.html), this is currently not exposed on the website at all. In other words, users cannot switch between different versions of the library’s documentation, as they can for Scikit-learn (http://scikit-learn.org/dev/versions.html), NumPy (https://docs.scipy.org/doc), Python (https://docs.python.org/3/library/index.html) or any of the documentation hosted at readthedocs.io.
Furthermore, there is no direct link on the website to download either the source code or the pre-compiled packages. The “contact us” section is too cluttered at the moment, and it is not clear to the user what our main point of contact is (is it the mailing list? Is it Stack Overflow?).
The output of this project is essentially a full redesign and re-implementation of Shogun’s website, based on the preliminary work provided by our designer. The new website should offer a unified experience: it should show newcomers the applications, supported technologies, and algorithms, while still providing clear-cut information for more experienced users.