Daniel Watson, Master’s candidate
David R. Cheriton School of Computer Science
Public software repositories such as GitHub make the development history of an open source software system transparent. Source code commits, discussions about new features and bugs, and code reviews are stored and carefully attributed to the appropriate developers. However, governments may seek to analyze these repositories to identify citizens who contribute to projects they disapprove of, such as those involving cryptography or social media. While developers who seek anonymity may contribute under assumed identities, their body of public work may be characteristic enough to betray who they really are. The ability to contribute anonymously to public bodies of knowledge is extremely important to the future of technological and intellectual freedoms. Just as in security hacking, the only way to protect vulnerable individuals is to demonstrate the means and strength of available attacks, so that those concerned know of the need to protect themselves and can develop the means to do so.
In this work, we present a method to de-anonymize source code contributors based on the authors’ intrinsic programming style. First, we present a partial replication study in which we attempt to de-anonymize a large number of entries to the Google Code Jam competition. We base our approach on Caliskan-Islam et al., 2015, but modify the feature set and modelling strategy for scalability and feature-selection robustness. We did not reach the 0.98 F1 reported in that prior work, but achieved a still reasonable 0.71 F1 under identical experimental conditions, and 0.88 F1 given more data from the same set.
Second, we present an exploratory study focused on de-anonymizing programmers who have contributed to a repository, using other commits from the same repository as training data. We train random-forest classifiers on programmer data collected from 37 medium to large open-source repositories. Given a choice among the active developers in a project, we were able to correctly determine the authorship of a given function about 75% of the time, without using identifying metadata or comments. We were also able to correctly validate a contributor as the author of a questioned function with 80% recall and 65% precision. This exploratory study provides empirical support for our approach.
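To put the verification result on the same scale as the F1 figures quoted for the Code Jam study, precision and recall can be combined into their harmonic mean. The sketch below is illustrative arithmetic only: the combined value is derived here from the stated 80% recall and 65% precision and is not a figure reported in the study, and `f1_score` is a hypothetical helper name.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated for the author-verification task.
# The resulting F1 is derived here for illustration, not quoted from the study.
verification_f1 = f1_score(precision=0.65, recall=0.80)
print(f"Derived verification F1: {verification_f1:.2f}")
```

Because the verification task and the Code Jam attribution task differ in setup, the derived value should not be read as directly comparable to the 0.71 and 0.88 F1 figures above.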
Finally, we present the results of a similar but more difficult study in which we attempt to de-anonymize a repository in the same manner, but without using the target repository as training data. To do this, we gather as much training data as possible from the repository’s contributors through the GitHub API. We evaluate our technique on three repositories: Bitcoin and Ethereum (cryptocurrencies) and TrinityCore (a game engine). Our results in this experiment contrast starkly with those of the intra-repository study, showing accuracies of $35\%$ for Bitcoin, $22\%$ for Ethereum, and $21\%$ for TrinityCore, which had candidate set sizes of 6, 5, and 7 respectively. These results indicate that we can do somewhat better than random guessing, even under difficult experimental conditions, but they also point to some fundamental issues with the state of the art in code stylometry. In this work we present our methodology and results, along with comments on past empirical studies, the difficulties we faced, and likely hurdles for future work in the area.
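The comparison with random guessing can be made concrete with a little arithmetic. The sketch below assumes a uniform-guess baseline of one over the candidate set size, which ignores any imbalance in how many functions each candidate wrote; that baseline and the names `chance_accuracy` and `lift_over_chance` are assumptions introduced here for illustration, not part of the study.

```python
# Reported accuracy and candidate set size for each target repository.
results = {
    "Bitcoin": (0.35, 6),
    "Ethereum": (0.22, 5),
    "TrinityCore": (0.21, 7),
}


def chance_accuracy(num_candidates: int) -> float:
    """Expected accuracy of guessing uniformly at random (assumed baseline)."""
    return 1.0 / num_candidates


def lift_over_chance(accuracy: float, num_candidates: int) -> float:
    """Ratio of the observed accuracy to the uniform-guess baseline."""
    return accuracy / chance_accuracy(num_candidates)


for name, (accuracy, num_candidates) in results.items():
    baseline = chance_accuracy(num_candidates)
    lift = lift_over_chance(accuracy, num_candidates)
    print(f"{name}: accuracy {accuracy:.0%}, chance {baseline:.1%}, lift {lift:.2f}x")
```

Under this assumption the margin over chance varies considerably between repositories, with Ethereum sitting only slightly above its baseline of 20%.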