About Stack Roboflow
Stack Roboflow uses a recurrent neural network to generate artificial questions about programming. It does this with a language model trained on a subset of the questions asked on Stack Overflow, a community of coders helping other coders.

Why I made it
-------

I'm currently working my way through the [fast.ai](https://fast.ai) *Practical Deep Learning for Coders v3* course. I tend to learn best by working on my own projects, and creating Stack Roboflow was a way to apply my learning to a novel problem.

What makes it cool
-------

I was surprised at how well the `AWD_LSTM` model was able to adapt to technical writing! If you Google sentence fragments you can see that it's not just regurgitating memorized phrases. It appears to actually be creating new questions that have never been seen before.

Obviously, some of them are higher quality than others. But that's the case for the input data as well! I was surprised to learn in my data exploration that the median score of a Stack Overflow question is 0.

How it was created
-------

Fast.ai makes it really easy to do transfer learning. The base `AWD_LSTM` model is trained on a dump of English-language Wikipedia pages. When creating a new language model you start from that base, adapt the vocabulary, and train from there. This means the model doesn't have to learn English from scratch; it can focus on learning what's unique about the specific data you're feeding it.

The language model is trained to predict the next word in a sequence. It took a couple of days on an NVIDIA 2080 Ti graphics card to get the model to 68% accuracy. That means it was predicting the exact right word (out of a vocabulary of 60,000 words) over 2/3 of the time! I now have the trained model running on my computer, making more and more "predictions" (each question is a "prediction" starting from an empty string) and uploading them to this website in real time.
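To make the "predict the next word, starting from an empty string" idea concrete, here's a toy sketch of the same objective using a simple bigram counter instead of the actual `AWD_LSTM` network. The corpus, the `<s>` start marker, and all function names are illustrative inventions for this sketch; the real model is trained with fast.ai on millions of questions.

```python
import random
from collections import Counter, defaultdict

# Tiny stand-in corpus; the real model trains on millions of
# Stack Overflow questions (this text is illustrative only).
corpus = (
    "how do i sort a list in python . "
    "how do i reverse a list in python . "
    "how do i sort a dict by value ."
).split()

# Count bigrams: for each word, how often each next word follows it.
# "<s>" marks the start of a sequence (the "empty string" prompt).
follows = defaultdict(Counter)
prev = "<s>"
for word in corpus:
    follows[prev][word] += 1
    prev = "<s>" if word == "." else word

def predict_next(word):
    """Greedy prediction: the most frequent continuation."""
    return follows[word].most_common(1)[0][0]

# Next-word accuracy over the training sequence -- the same metric
# the post reports as 68% for the real 60,000-word vocabulary.
prev, hits, total = "<s>", 0, 0
for word in corpus:
    hits += predict_next(prev) == word
    total += 1
    prev = "<s>" if word == "." else word
print(f"accuracy: {hits / total:.0%}")

def generate(seed=0, max_words=12):
    """Sample one 'question', starting from the empty prompt."""
    rng = random.Random(seed)
    out, word = [], "<s>"
    while len(out) < max_words:
        choices = follows[word]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        out.append(word)
        if word == ".":
            break
    return " ".join(out)

print(generate())
```

Each generated sequence is a fresh "prediction" sampled token by token, which is the same mechanism (at vastly smaller scale) as the model uploading questions to this site.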
Challenges Faced
-------

Originally, I wanted to predict the number of upvotes and views questions would get (which, intuitively, I thought would be a good proxy for their quality). Unfortunately, after working on this for about a week straight, I've come to the conclusion that there is no correlation between question content and upvotes/views. I tried several different models (including adapting an `AWD_LSTM` classifier, a random forest on a bag of words, and Google's AutoML) and none of them produced anything better than random noise. I also tried using myself as a "human classifier": given two random questions from Stack Overflow, I can't predict which one will be more popular. This was a surprising result for me, and I wasted a lot of time trying to tackle what turned out to be an intractable problem. It's possible that with data beyond the question text a machine learning model could do better (the date posted, for instance, seems like it might be a valuable datapoint), but that's beyond my abilities for right now.

I also had trouble training the model within my 32 GB of system memory. I ended up training on a subset of the data after calculating that it would take about 700 GB to fit everything into memory (and then it would take several weeks to train on my graphics card). That didn't seem tractable, but it did give me an opportunity to play around with cloud GPU instances. I ultimately abandoned the idea of cloud training due to cost considerations.

Reproducing the results
------

* **Training your own language model** You can train your own language model by downloading [the StackOverflow Data Dump via Archive.org](https://archive.org/details/stackexchange). Following the basic steps from [fast.ai Lesson 4](https://course.fast.ai/videos/?lesson=4) should get you on the right track.
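If you go the train-it-yourself route, the first step is pulling question text out of the dump's `Posts.xml`, where each post is a `<row>` element and `PostTypeId="1"` marks questions. The sketch below runs against a tiny inline sample I made up to mirror that schema; double-check the attribute names against your copy of the dump.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for Posts.xml from the Stack Exchange data dump.
# Each post is a <row>; PostTypeId="1" marks questions, "2" answers.
sample = """<posts>
  <row Id="1" PostTypeId="1" Score="5"
       Title="How do I sort a list?" Body="&lt;p&gt;...&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" Score="3" Body="&lt;p&gt;Use sort().&lt;/p&gt;" />
</posts>"""

def iter_questions(xml_text):
    """Yield (title, body, score) for question rows only."""
    # For the real multi-gigabyte file, stream with ET.iterparse()
    # over a file handle rather than parsing it all at once.
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        if row.get("PostTypeId") == "1":
            yield row.get("Title"), row.get("Body"), int(row.get("Score", "0"))

questions = list(iter_questions(sample))
print(questions[0][0])  # the question title
```

From there, the titles and bodies can be fed into the fast.ai text pipeline as described in Lesson 4.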
* **Download the pre-trained model** Alternatively, I've made my trained weights and vocabulary [available for download here](/models/index.html). If you use it for something cool, let me know; I'd love to see it. I'd also appreciate it if you could [pitch in to help offset the costs](https://payhip.com/b/G2Ru) of training the models. I'm hoping to collect enough to pay for the cloud compute time necessary to train on the full dataset (which I will make available for download as well).

Future development goals
-------

* **More Complete Language Model** My local machine wasn't able to handle training on the entire corpus of Stack Overflow questions. In fact, it's only trained on about 1/16 of the data I had available! It performs quite well regardless, but it could be a lot better. Unfortunately, the extremely beefy cloud instances needed to train a model of that size are pricey, so this might not happen.
* **Training on Other Corpora** The Stack Exchange data dump contains a lot of text data on topics besides programming. It could be used to create other domain-specific language models that could be adapted to do other things. For instance, a model trained on [philosophy.stackexchange.com](https://philosophy.stackexchange.com) might "invent" some interesting theories.
* **Teaching it to Code** I was surprised how quickly the model started to pick up a variety of different coding syntaxes. It seems to know the difference between SQL, C#, and JavaScript and is able to switch contexts based on the content of the questions! I think it'd be really neat to try training on some code from GitHub and see if it could start to write code that actually runs.
* **Tagging** In the near term, I'd like to automatically tag and categorize generated questions. It would be really neat to be able to browse all generated questions about "HTML" or "JavaScript" and see what they have in common.
I also want to see how this does in comparison to Stack Overflow's own tags.
* **Custom Prompts** Locally, I can prompt the generator with a subject line and let it fill in the rest. This is pretty fun and enlightening to play with. Unfortunately, since I haven't productionized the model yet, this isn't feasible to offer at the moment. I'd love to get inference working on an AWS Lambda so that it'd be "infinitely scalable" -- but I'm also a bit worried about the potential server bill if I did that.
* **Answering Questions** Right now the model only generates questions. In version 2, I want to train it to *answer* questions. If I could get this working, it'd actually become a useful tool instead of a fun toy.

Stay Up to Date
------

I plan on creating more cool projects soon! [Follow me on twitter @braddwyer](https://twitter.com/braddwyer) for the latest.