Sa Xiao

E-mail: sa 'at' saxiao.net

Hi! I'm a high-energy particle physicist turned software engineer. At MIT, I worked with Peter Fisher on the Alpha Magnetic Spectrometer experiment for detecting cosmic rays. Today, I'm building features to make our service desk application smarter. I'm passionate about building infrastructure to collect large amounts of data and extract insights from it to build useful applications. I also like various web technologies and have been writing Android and web applications on the side.

Projects

Similar Questions: Applying Latent Dirichlet Allocation to find similar questions in an online forum.

The purpose of this project was to find documents whose content is similar to a given document. The corpus was the collection of questions asked by customers in a forum of an enterprise software provider. LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [BleiNgJordan2013]. I used LDA to find the topic distribution of each document, and then used the cosine distance between the topic distributions of two documents to measure their similarity. I found that text preprocessing had a big effect on quality. I applied stopword removal, stemming, and removal of words with very low or very high frequencies. Spell correction and synonym handling would probably improve the results further. I only used a unigram model for tokenization.
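The preprocessing and similarity steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the R code used in the project: the stopword list, the `min_df` and `max_df_ratio` thresholds, and all function names are assumptions made for the example, stemming is left out, and it reports cosine similarity (one minus the cosine distance).

```python
import math
from collections import Counter

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"the", "a", "to", "is", "my", "how", "do", "i"}

def preprocess(docs, min_df=2, max_df_ratio=0.9):
    """Lowercase, split into unigrams, drop stopwords, and drop words
    that appear in too few or too many documents (assumed thresholds)."""
    tokenized = [[w for w in doc.lower().split() if w not in STOPWORDS]
                 for doc in docs]
    # Document frequency: the number of documents each word appears in.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    max_df = max_df_ratio * len(docs)
    return [[w for w in tokens if min_df <= df[w] <= max_df]
            for tokens in tokenized]

def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def most_similar(query, topic_dists, top_n=3):
    """Rank the other documents by similarity to document `query`."""
    scores = [(i, cosine_similarity(topic_dists[query], dist))
              for i, dist in enumerate(topic_dists) if i != query]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]
```

Here `topic_dists` stands for the per-document topic weights produced by a fitted LDA model, one list of weights per document.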

This project was done in R, using the R LDA package, which uses Gibbs sampling to fit the model. The top left figure shows that the likelihood of words stabilizes after about 50 iterations of sampling. The model requires the number of topics as an input parameter. I used the perplexity of the held-out data to determine the optimal number of topics. The top right figure shows the perplexity versus the number of topics. I picked 120 topics for the rest of the analysis. The upper table in the bottom figure shows some sample topics, each represented by its top words. The bottom table shows an example of similar documents found by the algorithm. This group of documents has heavy weights on the topics in blue and orange, so they are recognized as similar.
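As a rough illustration of the perplexity criterion, the sketch below computes held-out perplexity as the exponential of the negative average log-likelihood per token. It is written in Python rather than R and does not reproduce the R LDA package. The names `theta` (document-topic weights), `phi` (topic-word probabilities), and both function names are assumptions for this example. It also assumes that topic weights for the held-out documents have already been inferred, and that "optimal" means the lowest perplexity.

```python
import math

def held_out_perplexity(docs, theta, phi):
    """docs: list of documents, each a list of word ids.
    theta[d][k]: weight of topic k in document d.
    phi[k][w]: probability of word w under topic k."""
    log_lik = 0.0
    n_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            # Probability of the word under the document's topic mixture.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

def best_num_topics(perplexity_by_k):
    """Pick the topic count with the lowest held-out perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

A lower perplexity means the model assigns higher probability to unseen text. In practice one would fit a model for each candidate topic count, compute the perplexity on the same held-out set, and pass the resulting dictionary to `best_num_topics`.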

EncryptMe: Store your passwords safely with Android and Dropbox

EncryptMe is an Android application that helps you easily store and retrieve secret account information (e.g. passwords) in a cryptographically safe way. You need a Dropbox account to use it. The application stores an encrypted text file in your Dropbox, under the directory Apps/EncryptMe/. The text is encrypted using AES. With the app, users can decrypt the file and look up information, then edit, encrypt and save it again.

There is a complementary HTML + JavaScript utility, which can encrypt or decrypt the information in the same way as EncryptMe. This utility provides an alternative way to read and edit the encrypted file from a PC.

The app's APK file and the source code can be downloaded from my GitHub repository.

PinYin: Chinese pronunciation guides with Chrome and Kindle

Pin Yin is a Chrome extension that aims to provide a more convenient way for people to learn Chinese. It detects the largest block of Chinese text on a page, gets rid of everything else, and adds pin-yin pronunciation annotations. It also provides a link to convert the more readable HTML to a Kindle-friendly format for learning Chinese on the go. Just e-mail it to your Kindle account. The Kindle-compatible mode works for Kindle 3, Kindle Keyboard, Kindle Touch and Kindle Fire.
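One simple way to pick out the largest block of Chinese text is to count Chinese characters in each candidate block and keep the block with the highest count. The Python sketch below illustrates that idea only. It is not the extension's actual code (which runs in the browser), and both the function names and the choice of the basic CJK Unified Ideographs range are assumptions.

```python
import re

# Basic CJK Unified Ideographs block; an assumption that covers most
# commonly used Chinese characters but not rare extension blocks.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")

def count_chinese(text):
    """Number of Chinese characters in a piece of text."""
    return len(CHINESE_CHAR.findall(text))

def largest_chinese_block(blocks):
    """Return the block with the most Chinese characters,
    or None if no block contains any."""
    best = max(blocks, key=count_chinese, default=None)
    if best is None or count_chinese(best) == 0:
        return None
    return best
```

Here `blocks` would be the text content of candidate page elements, such as paragraphs or content containers. Adding the pin-yin annotations needs a character-to-pronunciation dictionary and is left out of this sketch.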

The extension can be downloaded from the Chrome Web Store.

The source code can be downloaded from my GitHub repository.