Acts as Subversioned: The Future.
The world is spinning when it comes to the Acts as Subversioned plugin. Yesterday, I was about to release version 0.2; I couldn’t, but it may not have been a bad thing.
So far (this includes v. 0.2), the plugin does two things: Lets you revision ActiveRecord objects using a Subversion repository, and lets you call Subversion-only functions that add value to your Ruby on Rails project. What’s really implied in that statement is this: The plugin alters the MySQL adapter for ActiveRecord at runtime, and adds in code that “does stuff” to the repository. Since, in “doing stuff,” the repository becomes the heavy-hitter for data within the models that are subversioned, the MySQL database simply turns into a cache holding the data that is the most up-to-date. Although this sounds like a good thing, is it really needed?
Through some emails with Bill Horsman, I stumbled upon a new direction that I think the plugin should move toward (and that I want it to). Instead of “adding code” to the MySQL adapter at runtime (which, unfortunately, makes this a MySQL-and-similar-only plugin), I would write my own ActiveRecord adapter that connects directly to Subversion. What this means is that, along with the MySQLAdapter, the PostgreSQLAdapter, and the OracleAdapter… there would be a SubversionAdapter. There would be no more of this “adding code” mantra; instead, it would be Subversion through and through.
To give you more of an idea what I’m talking about, instead of saying “adapter: mysql” inside of your database.yml, you’d now say “adapter: subversion”. The model for programming using a Subversion repository would be different, but I’m pretty sure I can replicate all the “find_by_X” functions as well as the id-based model structure, in (hopefully) O(1) time.
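As a rough illustration of how id-based lookups and “find_by_X” calls could stay near constant time, here is a minimal sketch that keeps hash indexes in memory. Everything in it is an assumption for illustration: the RecordIndex class and its methods are hypothetical, it never talks to Subversion, and it is not code from the plugin.

```ruby
# Hypothetical sketch: an in-memory index over records that a real adapter
# would load from (and write back to) a Subversion repository.
class RecordIndex
  def initialize
    @by_id = {}    # id => attributes hash
    @by_attr = {}  # attribute name => { value => [ids] }
  end

  # Registers a record under its id and under each attribute value.
  def insert(id, attributes)
    @by_id[id] = attributes
    attributes.each do |name, value|
      ((@by_attr[name] ||= {})[value] ||= []) << id
    end
  end

  # Id-based lookup: one hash access.
  def find(id)
    @by_id[id]
  end

  # Emulates find_by_<attribute>(value) with two hash accesses,
  # returning the first matching record or nil.
  def method_missing(name, *args)
    match = name.to_s.match(/\Afind_by_(\w+)\z/)
    return super unless match

    ids = @by_attr.fetch(match[1].to_sym, {}).fetch(args.first, [])
    ids.empty? ? nil : @by_id[ids.first]
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("find_by_") || super
  end
end
```

The lookups are constant time only on average, and only because the indexes live in memory; keeping them in sync with the repository is the hard part and is not shown here.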
The problem with doing things using the Subversion adapter is that you’d only have access to one type of data store: you’d have to scrap the MySQL database and use Subversion only. This could lead to speed decreases (I’d guess that Subversion’s read/write speeds are slower than MySQL’s), as well as a loss in functionality (complicated queries are most likely out the window). On the other hand, the benefits this would give to wiki-based programs would be enormous, as would the ability to see (and possibly measure) how data changes in your program over time.
How does this affect you/change your expectations of what this program is, and/or was? Is this “adapter style” more to what you were expecting in the first place, or less? Does it come with too many side effects?
My inclination is to go along this adapter-route, and see where it leads. Choosing this path would most likely 1) totally change the face of the plugin, and 2) affect what I release now, because #1 would make releasing less worthwhile.
Any thoughts? I don’t claim to be an expert at Subversion, MySQL, or ActiveRecord, so any insight you can give me would be greatly appreciated.
PS: I’m still working things out with RubyForge. Even if things are going to change drastically, I’ll still release the new code as soon as I am able to.
You said –
“but I’m pretty sure I can replicate all the “find_by_X” functions as well as the id-based model structure, in (hopefully) O(1) time”
I don’t know if you remember or not, but I think when you were at the AFRL that summer, I built a little tool to export the structure of an Oracle database to a series of HTML files. The tool used metadata queries to grab a current version of the schemas, tables, fields, and constraints. The first time I did it, I think I managed to do it in O(n^2), but I was querying O(n) times. The program completed its task in about 6 minutes.
After talking with Nick Watts (our database guy) a bit, I found a way to get all the data I needed from a single query, but rebuilding the structure from the query operated in O(n^3) time. This completed in under 5 seconds.
I might be off-base here, since I don’t know Ruby or ActiveRecord, but keep in mind that the big thing with this type of code isn’t really making sure that your online processing time is low; it’s minimizing the number of queries performed on your offline structure (in this case, Subversion). However, if you can keep your query count low and have constant time, then you’re golden.
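To make the query-count point concrete, here is a small sketch that compares one query per table against a single query followed by in-memory grouping. The CountingStore class is a hypothetical stand-in for the slow backing store (Oracle metadata in the story above, Subversion in the plugin); it only counts round trips and does no real I/O.

```ruby
# Hypothetical stand-in for a slow backing store. It counts round trips
# instead of performing any real query.
class CountingStore
  attr_reader :round_trips

  def initialize(rows)
    @rows = rows          # array of [table, column] pairs
    @round_trips = 0
  end

  def tables
    @round_trips += 1
    @rows.map(&:first).uniq
  end

  def columns_for(table)
    @round_trips += 1
    @rows.select { |t, _| t == table }.map(&:last)
  end

  def all_rows
    @round_trips += 1
    @rows.dup
  end
end

# One query for the table list plus one per table:
# round trips grow with the number of tables.
def schema_per_table(store)
  store.tables.each_with_object({}) do |table, schema|
    schema[table] = store.columns_for(table)
  end
end

# One query in total; the grouping work happens in memory.
def schema_single_query(store)
  store.all_rows.each_with_object({}) do |(table, column), schema|
    (schema[table] ||= []) << column
  end
end
```

Both functions build the same structure; the second one simply trades extra in-memory work for fewer trips to the store, which is the trade described above.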
Hopefully my (somewhat) unsolicited advice is useful. If not, feel free to tell me to buzz off.
As for the solicited advice, I think the best way to tell which is faster would be to profile them both. In some ways, removing a middle-man is a good thing, but I think your obvious concern is that MySQL might be a beneficial caching mechanism. Unfortunately, the only way to determine what’s faster is empirical analysis.
I (once again) don’t know ActiveRecord at all, but I think the *best* (yet hardest to code) solution would be to write in a layer that identifies frequently queried data, and caches it based on frequency of access and an upper memory bound. If you wanted to get really fancy, you could do an FFT on your data’s update timestamps, an FFT on the query timestamps, and then cache the data that is at the high end of the query spectrum but the low end of the update spectrum. I think there are some ‘directional correlation’ algorithms that will handle this well. Of course, you’d have to implement a smart pedigree mechanism – but if you’re correlating FFTs on Subversion records, I think you can handle that.
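A much simpler version of the caching layer (no FFTs) might look something like the sketch below. It is only an illustration under stated assumptions: FrequencyCache is a hypothetical class, the “memory bound” is approximated by a maximum number of entries rather than bytes, and update frequency is ignored entirely.

```ruby
# Hypothetical sketch of a frequency-based cache with an upper bound.
# Assumption: the bound is an entry count, not a byte count.
class FrequencyCache
  def initialize(max_entries)
    @max_entries = max_entries
    @values = {}           # key => cached value
    @hits = Hash.new(0)    # key => number of times the key was requested
  end

  # Returns the cached value, or runs the block (the slow lookup) on a miss.
  def fetch(key)
    @hits[key] += 1
    return @values[key] if @values.key?(key)

    value = yield
    store(key, value)
    value
  end

  def cached?(key)
    @values.key?(key)
  end

  private

  # Caches the value only if there is room, or if the key has been
  # requested more often than the least-requested cached key.
  def store(key, value)
    return if @max_entries <= 0

    if @values.size >= @max_entries
      coldest = @values.keys.min_by { |k| @hits[k] }
      return if @hits[coldest] >= @hits[key]

      @values.delete(coldest)
    end
    @values[key] = value
  end
end
```

Usage would be along the lines of `cache.fetch(record_id) { slow_lookup(record_id) }`, where `slow_lookup` is whatever call actually reaches the repository. The hit counters are never pruned in this sketch, so a real layer would need to age or cap them.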
Wow, I babble a lot… I hope this was useful, and congrats again for graduating.
I was just thinking, and I did a bit of research.
For the purpose of what I said above, and this comment, I’m calling the SQL equivalent of ‘insert/update’ an update, and the SQL equivalent of ‘select’ a query.
I’m not sure if this will work or not, but if you call ‘update’ one dimension, and ‘query’ another dimension, you can probably do a multi-dimensional FFT. It just so happens that there’s a nice module written in C to do just this sort of thing from Ruby.
http://ruby.gfd-dennou.org/products/ruby-fftw3/doc/ruby-fftw3.html
I think in order to treat them like orthogonal components, you need to prove them to be statistically independent. However, if this is possible, it removes the directional correlation problem and reduces caching to a multi-dimensional FFT and a knapsack problem (filling your online buffer).
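The knapsack half of that idea can be sketched on its own. The example below is a standard 0/1 knapsack solved by dynamic programming, under the assumption that each record already has an integer size and a precomputed score (however that score is derived; the FFT step is not shown). The function name and the item fields are hypothetical.

```ruby
# Hypothetical sketch: choose which records to keep in a size-limited cache.
# items:    array of hashes { key:, size:, value: } with positive integer sizes
# capacity: integer size budget for the cache
# Returns [best_total_value, chosen_keys].
def choose_cache_set(items, capacity)
  # best[c] holds the best [value, keys] achievable with total size <= c.
  best = Array.new(capacity + 1) { [0, []] }

  items.each do |item|
    # Walk capacities downward so each item is used at most once.
    capacity.downto(item[:size]) do |c|
      without = best[c - item[:size]]
      candidate = without[0] + item[:value]
      best[c] = [candidate, without[1] + [item[:key]]] if candidate > best[c][0]
    end
  end

  best[capacity]
end
```

This runs in time proportional to the number of items times the capacity, so it is only practical when sizes are expressed in coarse units (for example kilobytes rather than bytes).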
We can talk more in person – I don’t want to keep filling up your blog :-p
Bazaar also seems good.