On progress reports

Leif Johnson — 08 Apr 2009, 22:04

This week has been a good mix of hectic, stressful, and exhausting. Today I turned in a progress report for my computational linguistics project, yesterday I saw an RPE, and over the weekend I played disc golf and saw Itzhak Perlman play his violin.

Project progress

I spent almost all of this week working on my progress report for computational linguistics. I'm not entirely happy with how it turned out, but it's a progress report, and I think it does reflect my progress on the project. Most of the dissatisfaction is because I failed to allocate enough time for editing the paper. This came up clearly today when Lucia was asking me questions about tokenization, and I realized that I'd completely missed several important aspects of this process in my report. Ultimately I'm likely to axe the tokenization section anyway, since almost no comp ling papers these days pay attention to such details, but I found this oversight to be representative of my misallocation of editing time.

On the positive side, it's almost impossible to get a boring graph out of the dataset I'm using for the project. I built an overall language model for the dataset, and one language model for each day's slice of the dataset. Then I calculated the symmetric KL divergence between the overall model and each of the daily models and threw it up on a graph. Although the divergence doesn't tell us much about what I want to learn from the dataset, it's fascinating just watching the KL divergence decline as more data appears (the dataset grows superexponentially over time).
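For the curious, the calculation can be sketched roughly like this. This is a minimal illustration, not my actual project code: it assumes smoothed unigram models and takes "symmetric KL" to mean KL(p||q) + KL(q||p) (some people use the average instead). The function names and the toy `daily_tokens` data are made up for the example.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary.

    Smoothing (alpha > 0) keeps every probability nonzero, so the KL
    divergence below is always finite.
    """
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def symmetric_kl(p, q):
    """Symmetrized divergence, assumed here to be KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)


# Hypothetical toy data: one token list per day's slice of the dataset.
daily_tokens = [
    "the cat sat".split(),
    "the dog sat on the mat".split(),
    "the cat and the dog sat on the mat".split(),
]

# The overall model is built from every day's tokens pooled together.
all_tokens = [tok for day in daily_tokens for tok in day]
vocab = sorted(set(all_tokens))
overall = unigram_model(all_tokens, vocab)

# One divergence per day, ready to be plotted against time.
divergences = [
    symmetric_kl(overall, unigram_model(day, vocab)) for day in daily_tokens
]
```

Sharing one vocabulary across all of the models is what makes the comparison well defined; the smoothing constant `alpha` is an arbitrary choice here and would change the exact numbers.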


This weekend I managed to take some time for my own interests. On Saturday I got together with some CS folks and played a round of disc golf at Pease Park, which turned out to be surprisingly high-level for our group of players. Somehow all of our games were on that day, and we battled for the lead through the whole round, which made it pretty fun. I'm growing ever more fond of the Wraith driver that Soren got me for Christmas: it has a good left hook on it, and it's a good flier, but it's really the amazing skip at the end that's gotten me out of many a pickle.

On Sunday I was lucky enough to see Itzhak Perlman in concert on campus. The man is clearly a master many times over, and his show was fantastic, even though I'm woefully ignorant of the pieces he played.


The next month is going to stay really busy as the semester closes out and my project deadlines approach. This coming week I'm going to focus all of my energy on research, with a few breaks here and there for disc golf and cooking. (My housemates got me some baking pans for my birthday!!)

For comp ling, I still need to write and run the code to identify communities in the (gigantic) user graph. I finally got the graph to actually load on the green machine, which I'm happy about, so I can probably do the graph processing by the end of the week. I need to flesh out the probabilities for my event model, which will take a little time (and probably a lot of debugging), and then write it up in R and rjags to do the parameter estimation. It will be exciting to see what sorts of results I get!

For my independent research, I've written up the Ritter bimodal SOM library and need to test it on our toy line segments dataset. This shouldn't take too long, and I'm looking forward to getting some quantitative results. I will probably work on that this weekend. After that's done, I will need to get the robot arm simulator and use it to generate a more realistic multi-angle dataset.

In parallel systems we're homing in on a project to implement; I'm keen on parallelizing Gibbs sampling for LDA, as most of the existing codebases run serially, and it shouldn't be too difficult to convert the estimation process to a parallel environment. (In fact, David Newman at UCI has already done this, but I didn't see any details of his parallel implementation ... just speedup figures.)

One day I'd really like to implement a parallel version of BUGS or JAGS that will run on the TACC supercomputers—this will be massively helpful for people using these tools to do parameter estimation. I'd also like to look into writing a parallel version using MapReduce, but I think it will be less clear how one would go about doing that. Anyway, that's all future work.

I'm going to try to play disc golf again this weekend, too.