This spring the NumFOCUS Board of Directors awarded targeted small development grants to applicants from or approved by our sponsored and affiliated projects. In the wake of a successful 2016 end-of-year fundraising drive, NumFOCUS wanted to direct the donated funds to our projects in a way that would have impact and visibility to donors and the wider community. Each grant will help the recipient project to produce a clear outcome, achievable within 2017.
Just over $13,000 was awarded in grants to support the following projects:
Widening platform availability for MDAnalysis: Full Python 3 Support
This project aims to bring full Python 3 support to MDAnalysis; at the moment, only Python 2.7 is fully supported. Although about 80% of the code passes unit tests in Python 3, we urgently need to close the remaining 20% gap in order to support our user base and to safeguard the long-term viability of the project. MDAnalysis started almost 10 years ago, when Python was around version 2.4 and interfacing with existing C code was mostly done by writing C wrappers that used the CPython C API directly. This legacy code has hampered a speedy full transition to Python 3, and consequently MDAnalysis lags behind the rest of the scientific Python community in fully supporting Python 3.
The grant will support a final focused drive to complete Python 3 support while remaining compatible with Python 2.7 for as long as it is officially supported (2020). To do this, two core developers, Richard Gowers (RG) and Tyler Reddy (TR), will visit Arizona State University to work on the issue full-time for 2 weeks. The output of this project will be merged into the development branch and included in the existing Travis CI build matrix. It will then be one of the key features of the upcoming 0.17 release, which is targeted for September 2017 (to coincide with the inclusion of the anticipated outputs from Google Summer of Code 2017 projects). For MDAnalysis it is vital to fully support Python 3 in order to maintain and grow its user and developer base. The work supported by the grant will put MDAnalysis on track with the rest of the scientific Python community, increase package interoperability, and promote the overall move towards Python 3.
h5py backend for PyTables
The goal is to define a new way to access I/O that would allow a new version of PyTables (probably v4.x) to use different backends. As h5py is a great interface for HDF5, the main priority is interfacing with h5py so as to allow HDF5 access through it. This way PyTables can leverage h5py to access the most advanced features of HDF5 while still delivering features like advanced table management, fast table queries and easy access to advanced Blosc meta-compressors (and with them, to a wide array of codecs, like LZ4, Snappy and Zstandard). You can see a more detailed blog post about our vision here. In fact, work has already started on that front: in August 2016 a handful of PyTables core developers gathered to start precisely this task, and although they made a lot of progress on the Table object (the fundamental one in PyTables), there is still quite a bit of work to do. This grant will allow PyTables to continue the work done so far and publish an alpha release with the basic Table, CArray, EArray and VLArray objects working, and hopefully gain some traction towards promptly releasing a stable version that unifies the best of the PyTables and h5py packages. The grant work is meant to address project 1) here.
With this approach, PyTables and h5py will be close to complementary instead of having overlapping functionality. The current overlap leads to redundant effort for both core developers and community users of PyTables and h5py; moreover, there are two places where bugs could be reported, two places where nasty unicode issues could come up, two handles to your HDF5 files in memory, and so on. The grant will allow a more uniform API for HDF5 files.
Text Analytics Introductory Course for Social Scientists
Text mining and machine learning are not taught to social scientists at Slovenian universities, and few students and professors in this area know about their potential for research. The workshop will focus on teaching the participants the core data mining methods and how to combine them with text analytics. The entire workshop will be hands-on: we will use our own tool, Orange, which offers components for text mining, visualization and deep learning-based embedding within an easy-to-use visual programming environment. Sections of Orange were specifically designed for teaching, and while they have been tested in workshops for engineers and biomedical researchers, this will be the first time we prepare the course for social scientists.
At the workshop, participants will actively construct analytical workflows and go through case studies with the help of the instructors. They will learn how to manage textual data, preprocess it, use machine learning, data projection and visualization techniques to expose hidden patterns and evaluate the resulting models. At the end of the workshop, the participants will know how to use visual programming to seamlessly construct data analysis workflows with textual data.
The workshop will extend our existing hands-on course materials to cover digital humanities, and two case studies prepared for the course will be made available on Orange’s YouTube channel.
The goal of this project is to support additions to the `numexpr` module. NumExpr is a core module within the PyData ecosystem. It compiles Python code passed as strings into a program which is then run through a virtual machine written in C. The virtual machine efficiently blocks and threads NumPy-like array calculations on modern, multi-core processors. Because of limitations in the original NumExpr module, we began a rewrite in late 2016, which is currently under development as the NumExpr-3.0 (NE3) branch.
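To make the compile-then-execute idea more concrete, here is a toy sketch in pure Python. It is not NumExpr's implementation: the names `compile_expression`, `run_program` and `evaluate`, the instruction format, and the `block_size` parameter are all hypothetical, and the sketch runs single-threaded over plain lists rather than threading over NumPy arrays in C. It only illustrates the general pattern of turning an expression string into a small program and running that program one block of elements at a time.

```python
import ast
import operator

# Map AST operator node types to the functions the toy VM will call.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def compile_expression(source):
    """Translate an arithmetic expression string into a flat list of
    (opcode, argument) instructions for a stack machine."""
    program = []

    def emit(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            # Post-order walk: operands first, then the operator.
            emit(node.left)
            emit(node.right)
            program.append(("binop", _BINARY_OPS[type(node.op)]))
        elif isinstance(node, ast.Name):
            program.append(("load", node.id))
        elif isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            program.append(("const", node.value))
        else:
            raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    emit(ast.parse(source, mode="eval").body)
    return program


def run_program(program, arrays, block_size=4):
    """Run the program over equal-length lists, one block at a time."""
    if not arrays:
        raise ValueError("at least one input array is required")
    length = len(next(iter(arrays.values())))
    if any(len(values) != length for values in arrays.values()):
        raise ValueError("all input arrays must have the same length")

    result = []
    for start in range(0, length, block_size):
        stop = min(start + block_size, length)
        stack = []
        for opcode, arg in program:
            if opcode == "load":
                stack.append(arrays[arg][start:stop])
            elif opcode == "const":
                # Expand a scalar to the size of the current block.
                stack.append([arg] * (stop - start))
            else:
                right = stack.pop()
                left = stack.pop()
                stack.append([arg(x, y) for x, y in zip(left, right)])
        result.extend(stack.pop())
    return result


def evaluate(source, **arrays):
    """Compile the expression string, then execute it block by block."""
    return run_program(compile_expression(source), arrays)


print(evaluate("2 * a + b", a=[1, 2, 3, 4, 5], b=[10, 20, 30, 40, 50]))
```

Working on small blocks keeps the temporaries for each intermediate result short, which is the motivation for blocking in a real engine; in the released NumExpr package the comparable entry point is `numexpr.evaluate`, which takes the expression as a string.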
At present the version 3.0 development branch of NumExpr (NE3) is in an alpha state and is not ready for production use. In spite of that, several individuals have already tried to use the alpha, due to the large number of improvements offered. R.A. McLeod proposes the following pushes be undertaken to move NE3 into a state suitable for public use:
- Analysis of the NumExpr program to determine the broadcasted size of the output array. This will be implemented within the C-module similar to `numpy.broadcast` but without the generation of Python objects, for speed reasons.
- Fixing bugs found by the automated test submodule, and getting continuous integration working on AppVeyor (Windows) and Travis CI (Linux and OS X).
- Documentation generated through Sphinx and pushed automatically to ReadTheDocs.org —