
ANN wrapper (v0.2) now LGPL, OS X egg available

With help and suggestions from Rob Hetland, I’ve made many changes to the API of our Approximate Nearest Neighbor wrapper for scipy. The API now hides much of the SWIG-yness of the old version and feels (I hope) more pythonic. On Rob’s suggestion we’ve also added C++ code to make querying multiple points much faster. Because the ANN library is LGPL, I’ve relicensed our wrapper as LGPL to avoid any LGPL/BSD conflicts and have moved the wrapper from the scikits namespace to the scigpl namespace (too bad, scikits looks so much flashier).

I’ve also made an egg for OS X that statically links the ANN library. You should be able to install via (one line):

easy_install -f http://rieke-server.physiol.washington.edu/~barry/python scigpl.ann

Of course, you can still get it via the scikits SVN.

Here’s how the new API looks:

>>> import scigpl.ann as ann
>>> import numpy as np
>>> k=ann.kdtree(np.array([[0.,0],[1,0],[1.5,2]]))
>>> k.knn([0,.2],1)
(array([[0]]), array([[ 0.04]]))
>>> k.knn([0,.2],2)
(array([[0, 1]]), array([[ 0.04,  1.04]]))
>>> k.knn([[0,.2],[.1,2],[3,1],[0,0]],2)
(array([[0, 1],
       [2, 0],
       [2, 1],
       [1, 2]]), array([[ 0.04,  1.04],
       [ 1.96,  4.01],
       [ 3.25,  5.  ],
       [ 1.  ,  6.25]]))
>>> k.knn([[0,.2],[.1,2],[3,1],[0,0]],3)
(array([[ 0,  1,  2],
       [ 2,  0,  1],
       [ 2,  1,  0],
       [ 1,  2, -1]]), array([[  4.00000000e-002,   1.04000000e+000,   5.49000000e+000],
       [  1.96000000e+000,   4.01000000e+000,   4.81000000e+000],
       [  3.25000000e+000,   5.00000000e+000,   1.00000000e+001],
       [  1.00000000e+000,   6.25000000e+000,   1.79769313e+308]]))
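
The distances in the session above look like squared Euclidean distances (0.04 is 0.2 squared). Here is a rough brute-force check in plain NumPy that reproduces the indices and distances for the first three query points. It does not use scigpl.ann at all; the function name brute_force_knn is made up for this illustration. The fourth query, [0,0], is left out on purpose: the wrapper output above does not return the tree point that coincides with the query, whereas a brute-force search would return it with distance 0.

```python
import numpy as np

def brute_force_knn(points, queries, k):
    """Return (indices, squared_distances) of the k nearest points per query.

    Hypothetical helper for checking results; not part of scigpl.ann.
    """
    points = np.asarray(points, dtype=float)
    queries = np.atleast_2d(np.asarray(queries, dtype=float))
    # Pairwise squared Euclidean distances, shape (n_queries, n_points).
    diff = queries[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Sort each row by distance and keep the k closest points.
    idx = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    return idx, np.take_along_axis(sq_dist, idx, axis=1)

points = [[0.0, 0.0], [1.0, 0.0], [1.5, 2.0]]
idx, d2 = brute_force_knn(points, [[0, 0.2], [0.1, 2], [3, 1]], 2)
print(idx)
print(d2)
```

This costs O(n) work per query, which is exactly what a kd-tree is meant to avoid, so it is only useful as a sanity check on small inputs.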

scikits.ann part deux

I’ve updated our Python wrapper for David Mount and Sunil Arya’s Approximate Nearest Neighbor (ANN) library. It now handles searching the tree for the k-nearest neighbors of a set of points. Since the loop over query points now runs in compiled code, this should be much faster than looping in Python for large sets of points. Along the way, I was able to clean up the API significantly: I got rid of the SWIG-isms and the whole thing feels much more Pythonic now.

If you need to do k-nearest neighbor searches, have a look. It’s in the scikits SVN.

scikits.ann

Our Python wrapper for David Mount and Sunil Arya’s Approximate Nearest Neighbor (ANN) library is now in the scikits repository at scipy.org. The scikits.ann module is a SWIG-generated Python wrapper for the ANN library. It provides a numpy-compatible immutable kd-tree implementation which can perform k-nearest neighbor and approximate k-nearest neighbor searches. It currently builds on unix/OS X and I’m working to incorporate Jose Martin’s contributions to get things building on Windows with MinGW.
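
For readers who haven’t met the data structure before, here is a small pure-Python sketch of a kd-tree with exact and approximate nearest-neighbor search. It is not ANN’s code and not the scikits.ann API; the names Node, build, nearest and eps are invented for this illustration, and ANN’s own trees and search routines are considerably more refined. The sketch assumes the usual approximate-search guarantee: with eps > 0, the returned point is at most (1 + eps) times farther away than the true nearest neighbor.

```python
from collections import namedtuple

# One tree node: index of the splitting point, split axis, and two subtrees.
Node = namedtuple("Node", "index axis left right")

def build(points, indices=None, depth=0):
    """Build a kd-tree over `points` (a list of equal-length tuples)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2        # split at the median point
    return Node(
        index=indices[mid],
        axis=axis,
        left=build(points, indices[:mid], depth + 1),
        right=build(points, indices[mid + 1:], depth + 1),
    )

def nearest(node, points, query, eps=0.0, best=None):
    """Return (squared_distance, index) of a (1 + eps)-approximate nearest point."""
    if node is None:
        return best
    p = points[node.index]
    d2 = sum((a - b) ** 2 for a, b in zip(p, query))
    if best is None or d2 < best[0]:
        best = (d2, node.index)
    # Signed distance from the query to this node's splitting plane.
    delta = query[node.axis] - p[node.axis]
    near, far = (node.left, node.right) if delta < 0 else (node.right, node.left)
    best = nearest(near, points, query, eps, best)
    # The far side can only matter if the plane is closer than best / (1 + eps).
    if delta ** 2 * (1.0 + eps) ** 2 < best[0]:
        best = nearest(far, points, query, eps, best)
    return best

points = [(0.0, 0.0), (1.0, 0.0), (1.5, 2.0)]
tree = build(points)
d2, i = nearest(tree, points, (0.0, 0.2))            # exact search
d2_approx, j = nearest(tree, points, (0.0, 0.2), eps=0.5)
print(i, d2, j, d2_approx)
```

Because the tree is built once from a fixed point set and never modified, nodes can be plain immutable tuples. A larger eps prunes more subtrees, trading accuracy for speed.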

ANN is licensed under the LGPL and we’ve licensed the scikits.ann wrapper under the BSD license. If you need a kd-tree implementation for Python/numpy, check it out.

What is a UI?

This year, I’d like to say a big thank you to the writers of numpy and scipy, the numerical and scientific libraries for Python. We use these open source projects very heavily in our work. The combined efforts of all of the contributors to these projects have made Python a premier language for numerical computation.

The author of numpy, Travis Oliphant, recently moved from Utah to Texas to work with Enthought. One of the many contributions Enthought has made to the scientific software community is developing a whole suite of tools (in Python) that make developing cross-platform scientific applications much easier. In a recent post, Travis talks at length about what makes a “good” UI. Travis has obviously thought a lot about UIs for scientific computing and his discussion is very interesting. Definitely worth a read. Briefly, Travis notes that the user interface of an app is not just the buttons and menus, but the entire application framework: persistence, undo/redo, safe exploration, workflow, etc.

In scientific software, I think “workflow” is the most important part of an application’s user interface. Application frameworks such as Apple’s Cocoa, Microsoft’s .NET, Trolltech’s Qt, etc. have solved many of the application framework UI issues that Travis mentions (in fact, this is why we use Cocoa and Qt for most of our work at Physion). Apple’s Cocoa framework is particularly impressive in this regard. An application with undo/redo, persistence to an SQLite database, automatic network discovery of distributed computing resources, you-name-it, is virtually code-free for the developer. But none of these frameworks has solved the scientific workflow problem. That’s because we know how undo/redo should work, and we’ve known for decades how a database should work (for the most part). Science, however, is about the new, and very often discovering something new involves creating the entire workflow anew.

One reason new workflows and new experiments often go together is that a workflow implicitly defines a world model, an idea about what objects in the world matter and how they interact, and a new experiment also defines a new world model. Therefore, new experiment, new workflow. Only by matching the workflow to the experiment can we make software that becomes invisible to the researcher while facilitating new discoveries.

We certainly haven’t solved the workflow problem either, but we’re working on it. Here’s to one more year of trying.

Database vs. scientist

One of the founding principles of Physion Consultants is to bring cutting-edge technology to scientists. When the computational tools cease to be a bottleneck in the scientific process, we feel that we’ve done our job. For many scientific applications, a relational database management system (RDBMS) is the appropriate data repository. After all, great database engineers have already solved many of the data storage and query problems typically faced by custom-written scientific software.
When I start a project, I often first consider the data repository. To a scientist, the data is everything, so it makes sense to start there. The project requirements often include searching, indexing, and retrieving data collected across many experiments or observations. Clients can often tell me pretty clearly what entities they expect to be measuring. “Well,” I think, “an RDBMS sounds ideal.” Inevitably, I run right into the RDBMS brick wall when I realize that the flexibility a scientist wants is directly at odds with the rigid certainty that an RDBMS needs to do its job.
As a researcher, I want to have unlimited flexibility in the parameters of my experiment. These parameters, of course, have to go into whatever data store I choose. We can’t store arbitrary key=>value pairs in a single table in the database because the database engine needs to know the type (and size) of the key and value columns. Sure, we could dump a huge list of key=>value pairs into a binary BLOB in the database, but then blamo, we’ve lost all ability to easily query those parameters in the database engine. And the database engine is the right place to do the query. Darn.
I spent the weekend working on a dictionary-like construct for Apple’s CoreData framework. It defines a class/entity cluster headed by an entity called a KeyValuePair that, you guessed it, stores a key and a value. The idea is to provide an API for the app developer that allows creation of an NSDictionary (key=>value map) from a set of KeyValueEntities and vice versa. In addition, the developer/user can query KeyValuePairs directly via a SUBQUERY expression in a CoreData predicate. The only hitch is that the query must specify the type of the value (like 'key == "myKey" AND intValue == 3'). It’s a hack at this point, but I haven’t found any better solution out there. If anyone knows other ways to solve this problem, I’m all ears. In the meantime, we’ll get this code cleaned up and make it available in case it saves some folks a few days of hair pulling. Stay tuned.
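
To illustrate the general idea outside of CoreData, here is a rough sketch of the same typed key/value pattern using Python’s built-in sqlite3 module. This is not the CoreData code described above. The table layout, the floatValue and stringValue columns, and the helper names (save_params, load_params, experiments_where) are all assumptions made for this example; only the key and intValue names come from the predicate quoted above. Each value goes into the column that matches its type, so the database engine can still index and compare it, at the cost of the query having to name the right column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE key_value_pair (
        experiment_id INTEGER REFERENCES experiment(id),
        key TEXT NOT NULL,
        intValue INTEGER,     -- exactly one of the three value
        floatValue REAL,      -- columns is filled in per row
        stringValue TEXT
    );
""")

# Hypothetical mapping from Python type to the typed value column.
COLUMN_FOR_TYPE = {int: "intValue", float: "floatValue", str: "stringValue"}

def save_params(conn, experiment_id, params):
    """Store a dict of parameters as one typed row per key."""
    for key, value in params.items():
        column = COLUMN_FOR_TYPE[type(value)]  # KeyError for unsupported types
        conn.execute(
            f"INSERT INTO key_value_pair (experiment_id, key, {column}) "
            "VALUES (?, ?, ?)",
            (experiment_id, key, value),
        )

def load_params(conn, experiment_id):
    """Rebuild the dict from the typed rows."""
    rows = conn.execute(
        "SELECT key, intValue, floatValue, stringValue "
        "FROM key_value_pair WHERE experiment_id = ?",
        (experiment_id,),
    )
    return {key: next(v for v in values if v is not None) for key, *values in rows}

def experiments_where(conn, key, value):
    """Names of experiments with the given parameter; the subquery is typed."""
    column = COLUMN_FOR_TYPE[type(value)]
    rows = conn.execute(
        f"""SELECT name FROM experiment
            WHERE id IN (SELECT experiment_id FROM key_value_pair
                         WHERE key = ? AND {column} = ?)
            ORDER BY name""",
        (key, value),
    )
    return [name for (name,) in rows]

conn.executemany("INSERT INTO experiment (id, name) VALUES (?, ?)",
                 [(1, "run A"), (2, "run B")])
save_params(conn, 1, {"myKey": 3, "gain": 1.5})
save_params(conn, 2, {"myKey": 4, "stimulus": "flash"})
print(experiments_where(conn, "myKey", 3))
print(load_params(conn, 1))
```

The type dispatch in experiments_where is the same hitch mentioned above: the caller, or a helper acting for the caller, has to know which typed column holds the value before the engine can run the comparison.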