DIASIM: Dialogue Similarity - Research in Computational Linguistics
Introduction
I was checking my email one morning when I came across a message from the Department of Psychology at my university. They were looking for research assistants to work on a project involving text mining and computational linguistics. Needless to say, I was intrigued. I promptly sent in my resume and a cover letter explaining my desire to assist with the research.
After the interview process, I found myself hired as their newest research assistant, working in the Speech Perception and Production Lab under Dr. Kevin Munhall and his PhD candidate Nida Latif.
The Job
Through a study, they had gathered many hours of conversations between friends trying to solve a task together, and these conversations had been transcribed into text files. Their goal was to establish whether interlocutors modify their phrasal structures and word choices to align with each other.
This is where DIASIM comes in. DIASIM is a library designed for parsing and analysing dialogue corpora. They wanted to use it to parse their transcripts, generate phrasal structures, and run a moving window across the data to compute the overall alignment between and across the speakers.
My role in all of this was normalizing the transcriptions, importing the data, and running the tests. This was a greater challenge than it seemed: the transcriptions had inconsistent formatting and contained unwanted non-speech annotations (coughing and the like) which confounded the data. Furthermore, the files were not in the format the library required.
The Development Begins
I thought this would be a relatively simple undertaking, until I looked at the source code of DIASIM - tens of thousands of lines of code with little modularity and virtually devoid of documentation. Any insight into the workings of this monolithic library would need to be gleaned from reading the source code directly.
After writing some Python scripts to normalize and reformat the data, I had the transcripts in a format very close to one the library could import. Using a small extension of one of the existing models in the library, I was able to import the transcripts into a serialized .corpus file usable by the library.
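The normalization step looked roughly like this. This is a minimal sketch: the annotation conventions ("[cough]"-style and "(laughs)"-style markers) and the function name are illustrative assumptions, not the lab's actual transcript format or my original scripts.

```python
import re

def normalize_line(line):
    """Normalize one transcript line: strip non-speech annotations
    and collapse extra whitespace. Bracket/paren conventions are
    assumed here for illustration."""
    line = re.sub(r"\[[^\]]*\]", " ", line)  # drop [cough]-style markers
    line = re.sub(r"\([^)]*\)", " ", line)   # drop (laughs)-style markers
    line = re.sub(r"\s+", " ", line)         # collapse runs of whitespace
    return line.strip()

print(normalize_line("uh [cough] I think  so"))  # -> "uh I think so"
```

In practice a script like this would be run over every transcript file before handing the data to the importer.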
Next on the agenda, I needed to parse the phrasal structure of the data. This was accomplished using the Stanford Parser and many, many hours of debugging. At this point, the .corpus file contained all of the experimental data along with its phrase structure, and I was finally able to run tests on the data.
Or so I thought.
Running the Tests
The test-running logic was 800+ lines of code, with conditional statements nested four and five levels deep. Mixed in were statements for outputting results to Excel and, of course, no comments. Fast forward a week, and I had written a wrapper to hide away all that nastiness and provide a simple interface for configuring, queuing, and running sets of tests.
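The shape of that wrapper can be sketched as follows. The real wrapper was written in Java against DIASIM's internals; this Python sketch only illustrates the configure/queue/run pattern, and the class and method names are invented for the example.

```python
class TestQueue:
    """Illustrative sketch of a test-queuing wrapper: each entry pairs
    a label, a configuration, and the function that runs the test."""

    def __init__(self):
        self.queue = []

    def add(self, name, config, run_fn):
        # Queue a test without running it yet.
        self.queue.append((name, config, run_fn))

    def run_all(self):
        # Run every queued test in order, collecting results by name.
        results = {}
        for name, config, run_fn in self.queue:
            results[name] = run_fn(**config)
        return results

q = TestQueue()
q.add("lexical-alignment", {"window": 10}, lambda window: window * 2)
print(q.run_all())  # -> {"lexical-alignment": 20}
```

The point of the design was to keep test configuration separate from the deeply nested execution logic, so a batch of experiments could be queued up and left to run unattended.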
After running the tests, I quickly hit two new issues. First, the library for writing Excel spreadsheets was outdated and, given the volume of data, could not create the required number of columns in the spreadsheet, causing the program to crash. After upgrading the Excel library to the latest version (a few major versions later) and a great deal of modification to the testing class, the issue was solved.
The second major roadblock was memory limitations. My development machine had 8 GB of memory but was running out while the tests ran. Looking at the process manager, I noticed not all of the machine's memory was being utilized; usage was capped at 2 GB. This turned out to be a limitation of running the library on a 32-bit JVM. After installing the 64-bit variant and re-running the tests, things looked good. As the tests progressed, memory usage steadily increased. Thirty minutes in, I went for coffee and lunch, and came back to see the tests had run out of memory again! All 8 GB were being utilized, and it was not enough. I beefed up my machine with a new set of RAM (16 GB) and a 16 GB swapfile. I was not going to run out of memory this time!
When I saw the program had exited without errors, I was ecstatic. I delivered the results of the experiments and ran another suite of experiments my colleagues had requested. The researchers in the lab were very happy to have their results.
Conclusion
I learned a great deal about research during this project, as well as the importance of good code modularity and documentation. All that hard work paid off: the researchers offered me co-authorship of the paper, which will contribute to Latif's thesis. It will be published in early 2017 under the title "Interactive alignment across multiple levels of natural conversation".
The code is available on GitLab.