As Fernando Pérez (NumFOCUS Advisory Council member) has stated, when people are expected to work on open source software for free, only the people who can afford to work for free can participate. So long as that is the case, being able to work on open source will remain a privilege for a select few. NumFOCUS awards funds to our sponsored and affiliated projects in order to help make open source work available to more than the select few.
Thanks to a successful fundraising year in 2017, NumFOCUS is able to provide funding to help the projects improve usability, grow their communities, and shorten the time to major releases. NumFOCUS intends to distribute $60,000 in small development grants to our sponsored and affiliated projects in 2018.
This is the second round of grants made this year. You can read about the first round grant recipients in our prior blog post.
Paid Developer Time for Major Code Improvements
Statsmodels: Probability Plots and Generalized Additive Models — $3,000
(NumFOCUS Affiliated Project)
Statsmodels is the main Python package for general statistical analysis and econometrics. It covers large areas of statistics and econometrics including time series analysis, but is still missing several of the basic methods that are usually provided by general purpose statistical packages.
Statsmodels is gaining new model classes each year and current models are extended to cover new functionality. However, statsmodels also has a large number of pull requests that have stalled because of the lack of reviewers and contributors in that specific area.
The proposal is to finish and merge pull requests and address the corresponding open issues for two topics, probability plots and Generalized Additive Models (GAM).
The project proposes to improve the current code and API in probability plots and narrow the gap in missing features by adding generalized additive models and penalized splines.
Both parts of the project cover popular methods in data analysis and applied statistics and are expected to find widespread use. Generalized additive models and penalized splines provide new semiparametric methods for many use cases, and will also provide the elements and test cases for extending penalized estimation to other models.
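To make the first topic concrete: a normal probability plot pairs the sorted sample values with the quantiles a normal distribution would produce at the same ranks. The following is a minimal sketch using only the Python standard library. It is not the statsmodels implementation; the function name, the sample values, and the plotting-position formula (i - 0.5) / n are assumptions chosen for illustration.

```python
from statistics import NormalDist


def normal_probability_points(sample):
    """Return (theoretical_quantile, observed_value) pairs for a normal probability plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n keep every probability strictly inside (0, 1).
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Illustrative values only; any numeric sample works.
data = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 3.9, 2.2]
points = normal_probability_points(data)

for quantile, value in points:
    print(f"{quantile:+.3f}  {value:.1f}")
```

If the sample is roughly normal, the printed pairs fall close to a straight line when plotted against each other.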
SciPy: Maturing a sparse array implementation for SciPy — $3,000
(NumFOCUS Affiliated Project)
Sparse arrays are nd-arrays containing primarily zeros. These kinds of arrays can be represented very efficiently in memory, and processing tasks that involve sparse arrays can be significantly faster than equivalent operations on dense arrays. Consider, for example, the adjacency matrix of a Markov chain. In practical systems, such an adjacency matrix is quite sparse.
The “sparse” project aims to implement a subset of NumPy functionality on sparse arrays. Currently, the project implements ufuncs and reductions, and other utility functions like concatenate and stack. It currently supports the COO and DOK formats. We also plan to work on the CSD format (a generalization of CSR/CSC).
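The two formats named above can be illustrated without any library. The sketch below stores only the nonzero entries of a small Markov chain matrix, first as COO (parallel lists of coordinates and values) and then as DOK (a dictionary keyed by coordinates). It is a pure-Python illustration under stated assumptions, not the "sparse" project's code; the function names and the example matrix are hypothetical.

```python
def dense_to_coo(dense):
    """Convert a 2-D list to COO: parallel lists of row indices, column indices, and values."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                vals.append(value)
    return rows, cols, vals


def coo_to_dok(rows, cols, vals):
    """Convert COO to DOK: a dictionary keyed by (row, column)."""
    return {(i, j): v for i, j, v in zip(rows, cols, vals)}


def coo_matvec(rows, cols, vals, x, n_rows):
    """Multiply a COO matrix by a dense vector, touching only the stored entries."""
    y = [0.0] * n_rows
    for i, j, v in zip(rows, cols, vals):
        y[i] += v * x[j]
    return y


# A 4-state chain in which each state has only two possible successors.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0, 0.5],
]

rows, cols, vals = dense_to_coo(P)
dok = coo_to_dok(rows, cols, vals)
result = coo_matvec(rows, cols, vals, [1.0, 1.0, 1.0, 1.0], n_rows=4)

print(len(vals), "stored values out of", len(P) * len(P[0]))
print(result)
```

COO suits bulk arithmetic because the work is proportional to the number of stored entries, while DOK suits incremental construction because single entries can be read or written by key.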
The “sparse” project has reached a point where it serves some use cases relevant to end users; it is mature enough to have gained a space under the PyData umbrella on GitHub. However, it is still not close to being feature-complete or ready for adoption by the SciPy library. While the project is starting to attract new contributors, progress currently depends largely on the time that Hameer Abbasi can spend on development and maintenance. We propose a set of concrete milestones, so that with funding he is able to spend more time working on “sparse”.
In terms of features, we aim to implement the CSD sparse array format as the largest deliverable. Furthermore we aim for a couple of performance improvements and the implementation of smaller features that were requested by the community.
Finally, we will use the funded development time to complete a new release, in order to get the new features in the hands of end users quickly.
The SciPy community has wanted a robust implementation of sparse arrays for a long time. Several attempts were started and then abandoned due to lack of time. The “sparse” project is the first implementation that promises to be full-featured and robust enough for the needs of the SciPy library, Scikit-learn, Dask, XArray and other consumers.
Julia: BlockBandedMatrices.jl: add support for general array backends (GPU) — $3,000
(NumFOCUS Sponsored Project)
The objective of the grant is to adapt BlockBandedMatrices.jl to support general backend storage types. This generalization makes it possible for the same code base to work on both the CPU and the GPU, simply by specifying whether the underlying storage should be a standard Julia matrix, living on the CPU, or a GPUArray matrix, living on the GPU. Julia then automatically reroutes all operations to specialized implementations for the type. Indeed, we expect that it will also allow BlockBandedMatrices to be distributed across CPUs, by specifying a distributed storage type. As a result, the work undertaken in the grant allows for efficient numerical solution of partial differential equations on the GPU or on distributed arrays, using either finite-difference or spectral-method discretizations, while hiding the details of the underlying matrix from the user. With the advent of Julia 1.0 this summer, and the increased interest it will generate, we believe this proposal is timely: it will demonstrate how foundational packages can be expanded to push computations across different hardware while retaining the same legible code base.
The funds will pay for a two-week sprint by Mayeul d’Avezac. Mayeul is a Senior Research Software Engineer in the Research Computing Service at Imperial College London.
The benefit for the community will be an easy-to-use framework to represent discretizations of partial differential equations that can live on a single machine, on multiple machines, or on the GPU. Other packages like ApproxFun.jl and DifferentialEquations.jl will be able to leverage this technology so that users can achieve high performance while only stating the problem, with minimal details about the underlying discretization or how it is stored.
The package developer, Sheehan Olver, is the primary caregiver for his 18-month-old daughter, which limits the hours that he can work on this project. Mayeul's support will help to mitigate the effect of these child care duties on the work.
Improvements to Documentation
Pomegranate: Improving Documentation, Examples, and Tutorials — $3,000
(NumFOCUS Affiliated Project)
pomegranate is a package for flexible probabilistic modeling in Python. It is flexible because it lets users plug and play probabilistic components in a way that other packages don’t, such as dropping an Exponential Distribution into a Mixture Model to get an Exponential Mixture Model, or dropping a hidden Markov model into a Bayes classifier to get a classifier over sequences.
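The plug-and-play idea can be sketched in a few lines of plain Python. The classes below are hypothetical and written only for illustration; they are not pomegranate's API, and the parameter values are made up. The one assumption is that every component exposes a `probability(x)` method, which is what lets any distribution, or another mixture, be dropped into the mixture.

```python
import math


class Exponential:
    """Exponential distribution with rate `rate` (illustrative class, not pomegranate's)."""

    def __init__(self, rate):
        self.rate = rate

    def probability(self, x):
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0


class Normal:
    """Normal distribution with the given mean and standard deviation (illustrative class)."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def probability(self, x):
        z = (x - self.mean) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2 * math.pi))


class Mixture:
    """Weighted mixture of any components that expose probability(x)."""

    def __init__(self, components, weights):
        if len(components) != len(weights):
            raise ValueError("need exactly one weight per component")
        total = sum(weights)
        self.components = components
        self.weights = [w / total for w in weights]  # normalize so weights sum to 1

    def probability(self, x):
        return sum(w * c.probability(x) for w, c in zip(self.weights, self.components))

    def posterior(self, x):
        """Probability that each component generated x; assumes x has nonzero density."""
        joint = [w * c.probability(x) for w, c in zip(self.weights, self.components)]
        total = sum(joint)
        return [p / total for p in joint]


# Dropping exponential components into the mixture gives an exponential mixture model.
exponential_mixture = Mixture([Exponential(0.5), Exponential(5.0)], [0.5, 0.5])
print(exponential_mixture.posterior(0.1))

# A mixture is itself a valid component, so models can be stacked.
stacked = Mixture([exponential_mixture, Normal(10.0, 1.0)], [0.7, 0.3])
print(stacked.probability(10.0))
```

Because the mixture depends only on the shared method, swapping or stacking components requires no change to the mixture itself.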
However, while this functionality is available, the documentation and usage tutorials for it are sparse. This is primarily because developer time has been spent adding useful features rather than fully explaining current usage. Unfortunately, this means that many of the great features currently available in pomegranate are underutilized simply because they’re unknown.
This grant will enable Jacob Schreiber, as the core developer of the project, to spend a significant portion of time extending the available documentation, writing stand-alone code examples that demonstrate current features, and revamping the tutorials folder to include all features. The work will focus on the following features:
- Fast Bayesian Network structure learning using constraint graphs
- Learning or using estimators when data is missing
- Semi-supervised Learning using HMMs and naive Bayes classifiers
- Setting up distributed computation for any of the models
- Stacking models to produce more complicated ones, such as making a mixture of Bayesian networks or a Bayes classifier over HMMs
- Additional installation help and FAQs based on user reports from the past two years
By highlighting these lesser-known areas, we will be able to significantly improve the experience of new users who are looking for features, as well as existing users who are unaware that extended functionality exists in the tool they’re already using.
Bokeh: Bokeh Docs Modernization — $3,000
(NumFOCUS Sponsored Project)
With Bokeh 1.0 imminent, it is a good time to expand documentation sections that are missing or need extra work, and to polish some rough edges around the site and content. This proposal covers specific tasks that users have recently raised. It is split into a number of small- and medium-effort tasks that will all be completed, plus two larger, more open-ended tasks, either or both of which would be very useful to get meaningfully started.
- Update all Bokeh source files with common boilerplate and header format (started but not yet finished)
- Expand the User gu