NumFOCUS is pleased to announce recipients of the third and final round of small development grants for 2018. Eligibility for these grants is limited to our Sponsored and Affiliated Projects. This year, NumFOCUS distributed $60,000 in small development grants to help our projects improve usability, grow their communities, and speed up the time to major releases. (Read about the first round and second round recipients.)
Your Support Makes a Difference
The NumFOCUS Small Development Grants program is made possible thanks to the generosity of our donors. You can support the ongoing development of the projects by becoming a Supporting or Sustaining Member of NumFOCUS.
Learn more about NumFOCUS membership here.
Community Education and Engagement
Open Journals: Open Journals website update
— $2,800
(NumFOCUS Sponsored Project)
The JOSS and JOSE websites were developed in a very short period about two years ago and are showing their age. There are a number of user-satisfaction challenges with the current website layouts including:
- (Poor) responsiveness of website on smaller screen sizes
- Lack of ability to search papers/filter content
- Generally poor usability (UX) of the content pages
- Lack of a clear style guide for generating new content views
We would like to engage with an external contractor to redevelop the JOSS (and JOSE) websites, updating the views to be based on the modern Bootstrap 4 CSS framework and solving the challenges outlined above.
JOSS is now close to accepting its 400th submission. The work we want to get done here addresses many of the general grumbles that we hear from our users about the usability and features of the JOSS website.
This proposal is to redevelop the JOSS website views, cleaning up a bunch of old code, and giving us a good foundation for future growth.
Gensim: FastText tutorials
— $3,000
(NumFOCUS Affiliated Project)
FastText is one of the most popular machine learning algorithms for unsupervised text processing, originally released by Facebook as a successor to Word2Vec and used by thousands of companies world-wide.
Gensim contains the best implementation of FastText out there: super fast, flexible and 100% in Python / PyData ecosystem (got even faster this year, following our successful GSoC 2018 project). Other implementations (Java, C#, Spark) are orders of magnitude slower.
But the Gensim implementation has warts around the API, and people are regularly confused about its I/O (there exist multiple formats, by Facebook, native Gensim, text format…) and its use of out-of-vocab inputs. This project proposes an API cleanup and a new set of tutorials focused on FastText.
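As background on the out-of-vocab behaviour mentioned above: FastText represents a word by its character n-grams, so a word never seen in training still shares pieces with known words. The following is a minimal pure-Python sketch of that idea only. The function name `char_ngrams` is hypothetical and is not part of the Gensim API; the `<` and `>` boundary markers and the 3 to 6 n-gram range follow the defaults of the original FastText release.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with boundary markers.

    Illustrative helper only (hypothetical name, not a Gensim function).
    """
    token = f"<{word}>"  # mark the start and end of the word
    return {
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    }


# A word seen in training and a word that was not.
known = char_ngrams("tutorial")
unseen = char_ngrams("tutorials")

# The unseen word still shares many n-grams with the known one, which is
# what lets a subword model build a vector for it.
shared = known & unseen
```

In a real model each n-gram maps to a learned vector, and the vector of an unseen word is assembled from the vectors of its n-grams.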
Lead Developer Gatherings
Conda-forge: conda-forge sprint at SciPy 2019 — $3,000
(NumFOCUS Sponsored Project)
conda-forge developers and maintainers are spread around the globe, with members from Australia, Brazil, the UK, and various states in the USA. That makes it quite difficult to have face-to-face meetings. The SciPy conference is one exception because most of the members can make it to the conference every year.
The goal of this proposal is to help members who have never attended SciPy, or who are having difficulty gathering funds to attend, to go and sprint with the other core maintainers who will be present. A number of issues could benefit from a live code sprint, such as:
- improving recipe re-generation automation
- creating an easy user interface to browse/query the graph metadata
- planning the next compiler migration (gcc7 to gcc8)
The community will benefit from faster and more stable package releases. conda-forge, like any other OSS project, suffers from a sluggish development pace because volunteer developers have little time to dedicate to the project. Short sprints at conferences can remedy that and move project development “months ahead” in just a few days.
Cantera: The 3rd Annual Kinetics Code Conference: Charting near- and long-term directions for Cantera software development — $3,000
(NumFOCUS Sponsored Project)
With membership spread across five states spanning the continental US, the Cantera steering committee has relied on virtual meetings (using Google Hangouts video conferencing), dubbed “The Kinetics Code Conference (KinCodeCon),” each of the past two years (2016–2017). While these virtual workshops have provided a convenient and efficient means to discuss pressing concerns and to outline broad, long-term development priorities, their efficacy is hampered by the virtual nature. Participants, who remain at their home institutions and therefore must also attend to local matters and responsibilities, are generally only able to participate on an intermittent basis over the course of the one-day virtual conference. The ability to directly follow up on identified development and software management priorities (with collaborative code sprints, for example) is generally limited.
Therefore, while virtual and electronic communication and collaboration have been generally quite successful for the development and maintenance of Cantera on a day-to-day basis, there is also a need for periodic, intensive, face-to-face collaboration between steering committee members. The proposed NumFOCUS Small Development Grant (SDG) will support travel and incidental costs for steering committee members participating in the 3rd Annual Kinetics Code Conference (KinCodeCon 2018), to be held November 16-18, 2018, in Cambridge, MA.
The dedicated in-person meeting and code sprint/development time will allow significant progress to be made on both the Cantera roadmap and codebase. This will allow us to identify important development objectives and then, importantly, follow up on these directly with concrete action. The follow-on activities will lay the groundwork for and implement new features which will increase Cantera’s utility, while also improving the usability of the software for a broad and diverse user base. This will include new software capabilities to implement material phase models for new and diverse research communities, improve the operation of existing capabilities, and in general keep refining the software implementation and documentation for a more user-friendly experience.
We will also discuss strategies for diversifying the Cantera contributor base and leadership, in part by finding ways to convert existing users into contributors. This will have immense benefits to the user community by lowering the barrier to entry for potential Cantera developers and contributors. This, in turn, will improve the software’s functionality by creating a broader class of developers able to address issues and contribute new features.
Paid Developer Time for Code Improvements
SymPy: MatchPy C++ code generator for SymPy/symengine — $3,000
(NumFOCUS Sponsored Project)
MatchPy is a pattern matching library which distinguishes itself by its support for associative and commutative matching of expressions, in a similar way to Wolfram Mathematica. It supports efficient matching of multiple patterns at one time by using a discrimination net data structure. The awareness of associative and commutative nodes in the expression tree makes it suitable for matching mathematical expressions.
SymPy has added an experimental dependency on MatchPy, which has mostly been used to port RUBI (Rule-based integrator) into Python. The main problem encountered is the slowness of MatchPy in loading sets with many rules. In order to overcome this problem, MatchPy currently has a code generator for Python, which converts sets of matching rules into a decision tree. This would potentially make the requirement of loading all rules and building MatchPy’s data structures a one time task. Unfortunately the generated code is buggy and the decision tree in the code is currently not always returning correct results.
In this project we propose to fix the Python code generator of MatchPy and add a second generator targeting the C++ programming language. In particular, we expect that the C++ code generator will make it easy to port libraries written in Wolfram Mathematica into any library supported by the symengine bindings.
PyTables: Better support for native HDF5 files
— $3,000
(NumFOCUS Sponsored Project)
PyTables revolves around advanced capabilities in handling very large heterogeneous datasets (tables). The possibility of adding indexes for out-of-core queries and of sorting tables that do not fit in memory are two appealing features that the community appreciates a lot.
However, most of the described functionality only works for HDF5 files that have been created with PyTables, so it would be nice if these features could be used for a broader range of HDF5 files: the only requirement is that these files would contain 1-dimensional datasets of compound types.
The scope of this project is to support the advanced features that PyTables provides for table entities (namely, advanced indexing, querying and sorting) for general HDF5 files created with tools other than PyTables, so that the whole HDF5 community, and not only PyTables users, can benefit from them.
Julia: Multi-Dimensional Bisection Method for finding the roots of non-linear implicit equation systems
— $3,000
(NumFOCUS Sponsored Project)
In the proposed project, an efficient root-finding algorithm will be implemented in the Julia language. It can determine whole high-dimensional submanifolds (points, curves, surfaces…) of the roots of implicit non-linear equation systems, even in cases where the number of unknowns surpasses the number of equations.
The bisection method – or the so-called interval halving method – is one of the simplest root-finding algorithms which is used to find the zero solutions of continuous non-linear functions. This method is very robust and it always converges to the solution if the signs of the function values are different at the borders of the chosen initial interval.
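The classic 1-dimensional method described above can be sketched in a few lines. This is an illustrative Python version, not the proposed Julia implementation; the function name `bisect` and its `tol` and `max_iter` parameters are assumptions made for the example.

```python
def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of a continuous function f on [lo, hi] by interval halving.

    Requires f(lo) and f(hi) to have opposite signs (or one of them to be 0).
    """
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if (f_lo > 0) == (f_hi > 0):
        raise ValueError("f(lo) and f(hi) must have opposite signs")

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        # Stop when the midpoint is a root or the interval is small enough.
        if f_mid == 0.0 or 0.5 * (hi - lo) < tol:
            return mid
        # Keep the half-interval on which the sign change still occurs.
        if (f_lo > 0) != (f_mid > 0):
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


# Example: the positive root of x^2 - 2 lies between 0 and 2.
root = bisect(lambda x: x * x - 2.0, 0.0, 2.0)
```

Each iteration halves the interval that brackets the sign change, which is why convergence is guaranteed once the initial interval has function values of opposite sign at its two ends.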
In many applications, this 1-dimensional intersection problem must be extended to higher dimensions, e.g. intersections of surfaces in a 3D space (volume), which can be described as a system of non-linear implicit equations.
In higher dimensions, the existence of multiple solutions becomes very important, since the intersections of two surfaces can create multiple intersection lines.
The proposed algorithm will handle automatically:
- multiple solutions
- arbitrary number of parameters (typically: 3-6)
- arbitrary number of implicit equations
- arbitrary number of constraints
- degenerate functions
Furthermore it will provide
- first order interpolation in higher dimensions
- the gradients of the equations at the roots
- an error estimation of the solution
Pomegranate: Adding compatibility with user-defined Python models
— $3,000
(NumFOCUS Affiliated Project)
pomegranate has a focus on being a flexible tool for probabilistic modeling. A key component to this flexibility is its modularity, with probability distributions being implemented as objects that can be “dropped” into more complex models such as hidden Markov models. The models themselves mostly serve as implementations of various algorithms, making calls to these distribution objects rather than being hard-coded. This modularity enables three key features: (1) users can use any probability distribution in their models instead of just Gaussians, (2) users can specify different probability distributions for different features, and (3) users can build complex models out of stacks of simpler ones. To our knowledge, these features make pomegranate uniquely flexible.
Unfortunately, due to the manner in which the current Cython backend is coded, users are limited to using the built-in distributions and meshing between models. This grant will support work that focuses on revamping the internal API to allow for the backend to fall back to a Python level API when no Cython API is available. Essentially, it would allow users to define their own models and distributions that would be compatible with the entirety of the current pomegranate framework as long as they exposed a few methods.
The primary challenge of this grant is to re-code the internals of pomegranate to allow this fall-back to happen with minimal effort on the part of the user. Ideally, the user should be able to define their own model or distribution in pure Python that can be plugged in without worrying about adding complex flags or inheritance. Additionally, pomegranate should continue to use the speedy Cython operations when possible, even when interacting with user-defined Python objects.
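The fall-back idea can be pictured with a small pure-Python sketch. Everything here is hypothetical and written only for illustration: the class names, the `_fast_log_probabilities` attribute standing in for a compiled fast path, and the `total_log_probability` helper are assumptions, not pomegranate's internal API. The only requirement placed on a user-defined object in this sketch is a `log_probability` method.

```python
import math


class PurePythonUniform:
    """User-defined distribution: no base class, no flags, one method."""

    def log_probability(self, x):
        return 0.0 if 0.0 <= x <= 1.0 else float("-inf")


class BuiltinLikeGaussian:
    """Stand-in for a built-in distribution that also offers a fast path."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def log_probability(self, x):
        z = (x - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma) - 0.5 * math.log(2.0 * math.pi)

    def _fast_log_probabilities(self, xs):
        # Hypothetical batch method; in a real library this would be compiled.
        return [self.log_probability(x) for x in xs]


def total_log_probability(dist, xs):
    """Use the fast batch path when available, else fall back to Python."""
    fast = getattr(dist, "_fast_log_probabilities", None)
    if fast is not None:
        return sum(fast(xs))
    return sum(dist.log_probability(x) for x in xs)
```

With this kind of dispatch, `total_log_probability(PurePythonUniform(), [0.2, 0.9])` and `total_log_probability(BuiltinLikeGaussian(0.0, 1.0), [0.0])` both work, and only the second takes the fast path.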
SciPy: An Efficient, High-Level Implementation of Linear Programming
— $2,000
(NumFOCUS Sponsored Project)
Linear programming, the optimization of a linear objective function subject to linear equality and inequality constraints, is a fundamental tool for scientific research and engineering. As the Python programming language continues to grow in popularity and is particularly well suited to scientific computing, it seems natural that there should be an efficient linear programming suite that is actually implemented in the Python language. Surprisingly, this is not the case. While there are several existing open-source linear programming solvers, the majority are implemented in low-level languages. While conda-forge has made some of these easier to access on non-UNIX operating systems, installation obstacles remain, and the ability of the general user to inspect and modify the source is impeded. Furthermore, the licenses are generally copyleft, limiting their benefit to industry and businesses. This would explain the popularity of `scipy.optimize.linprog`, one of SciPy’s most popular functions – despite its significant shortcomings.
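To make the definition above concrete, here is a small two-variable linear program solved in pure Python. This is a teaching sketch, not how `scipy.optimize.linprog` works: it enumerates the corner points of the feasible region and picks the best one, relying on the fact that a linear program with a bounded, non-empty feasible region attains its optimum at a vertex. The function name `solve_2d_lp` and the example problem are assumptions made for illustration.

```python
from itertools import combinations


def solve_2d_lp(c, constraints, eps=1e-9):
    """Minimize c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r).

    Brute-force vertex enumeration; assumes the feasible region is bounded.
    Returns (objective_value, (x, y)) or None if no feasible vertex exists.
    """
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundary lines do not intersect
        # Intersection of the two boundary lines (Cramer's rule).
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        # Keep the point only if it satisfies every constraint.
        if all(a * x + b * y <= r + eps for a, b, r in constraints):
            value = c[0] * x + c[1] * y
            if best is None or value < best[0]:
                best = (value, (x, y))
    return best


# Minimize -3x - 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
constraints = [
    (1.0, 1.0, 4.0),
    (1.0, 3.0, 6.0),
    (-1.0, 0.0, 0.0),  # x >= 0 written as -x <= 0
    (0.0, -1.0, 0.0),  # y >= 0 written as -y <= 0
]
value, (x, y) = solve_2d_lp((-3.0, -2.0), constraints)
```

For this problem the best vertex is at x = 4, y = 0 with objective value -12. Enumerating every vertex scales very poorly with problem size, which is why practical solvers use methods such as simplex or interior-point instead.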
The original implementation of the simplex method in `scipy.optimize.linprog` was inefficient, as it performed elementary row operations on NumPy arrays via Python for loops, adding substantial overhead to what should be swift, vectorized operations. It was also fraught with bugs that frequently caused the algorithm to terminate without a solution to problems known to be feasible, return a suboptimal solution, and even report solutions as optimal despite their inability to satisfy problem constraints. To partially remedy this problem, I contributed a Python implementation of an interior-point algorithm that was initially released with SciPy 1.0. It is certainly faster and more reliable than the simplex implementation, but it is still not the default routine when a user invokes `linprog` without a `method` argument because it is approximate by nature, whereas the simplex method is theoretically exact and always returns a solution at a vertex of the polytope defined by the constraints, which is often desirable for sensitivity analysis.
To fully remedy this, I have submitted a pull request for a Python implementation of the “revised simplex” method, which is also theoretically exact. This implementation is already faster and more reliable than the original “tableau-based” simplex implementation. In addition, the revised simplex method is particularly well-suited to large problems as it has the potential to exploit problem sparsity. The pull request passes the extensive battery of tests written for `linprog` and is ready to be merged. However, the merge is delayed as maximum impact on the community can only be realized if the new method is released in concert with the following, which I propose to complete and merge into SciPy with the support of this NumFOCUS Small Development Grant:
- Add callback function support to interior-point and revised simplex solvers, so that users can run custom code to monitor solver progress after each iteration of the algorithm, adding transparency to their operation.
- Enable users to provide an initial feasible solution to the linear programming problem. This gives the user greater control over the progression of the algorithm, improves solution time by eliminating the need for the first phase of the simplex algorithm, and provides users with an efficient mechanism for analyzing the sensitivity of the solution with respect to changes in the objective function.
- Carefully clean the documentation of `linprog`, which, despite recent work, still suffers from mistakes and unprofessional presentation that may hinder use of the code itself.
- Overhaul the test suite, strengthening weak tests to ensure that solved issues do not arise again, and pruning redundant tests from the rather large (>500 tests) suite for better speed.
- Using the standard NETLIB Linear Programming problems [21], benchmark the interior point and revised simplex solvers against one another and the other open source linear programming solvers that can be installed with conda-forge, and publish the results in SciPy documentation.
- Change the default solver of `scipy.optimize.linprog` to the revised simplex method.