Android App Clusters from AnDarwin


This dataset contains the Android app clusters that we computed using AnDarwin and reported in our ESORICS 2013 paper. The clusters were drawn from 265,359 apps that we crawled from 17 Android markets, including Google Play and numerous third-party markets. Please see our paper for details on which markets we crawled, how many apps we crawled from each market, and how the tool computes app similarity.


The dataset consists of two files: one for full app similarity detection and one for partial app similarity detection. Both files have the same format. Each line represents one cluster and consists of a comma-separated list of the SHA1 hashes of the apps in that cluster. In the full app similarity clusters, each app appears in at most one cluster. In the partial app similarity clusters, an app may appear in more than one cluster. In both files, every cluster contains at least two apps.
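
The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of AnDarwin itself: the function names parse_clusters and check_clusters are our own illustrative choices, and the handling of blank lines and of upper-case hex digits is an assumption, since the format description does not mention either.

```python
from collections import Counter
from typing import Iterable, List


def parse_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Parse one cluster per line; each line is a comma-separated list of SHA1 hashes."""
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines, if any, carry no cluster
        # assumption: hex case is not significant, so normalize to lower case
        clusters.append([h.strip().lower() for h in line.split(",") if h.strip()])
    return clusters


def check_clusters(clusters: List[List[str]], disjoint: bool) -> List[str]:
    """Return a list of problems found; an empty list means every check passed.

    Set disjoint=True for the full app similarity file, where each app
    should appear in at most one cluster, and False for the partial file.
    """
    problems = []
    for i, cluster in enumerate(clusters, start=1):
        if len(cluster) < 2:
            problems.append(f"cluster {i}: fewer than 2 apps")
        for h in cluster:
            # a SHA1 digest is 160 bits, i.e. 40 hexadecimal characters
            if len(h) != 40 or any(c not in "0123456789abcdef" for c in h):
                problems.append(f"cluster {i}: {h!r} is not a 40-character hex SHA1")
    if disjoint:
        # count each app once per cluster, then flag apps seen in several clusters
        counts = Counter(h for cluster in clusters for h in set(cluster))
        for h, n in counts.items():
            if n > 1:
                problems.append(f"{h} appears in {n} clusters")
    return problems
```

To use it, open one of the two files and pass the file object to parse_clusters, for example `with open(path) as f: clusters = parse_clusters(f)`, where path is wherever you saved the file. Running check_clusters with disjoint=True on the full app similarity clusters should return an empty list if the file matches the description above.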

The full app similarity clusters are the ones that were used for the ESORICS submission. The partial app similarity clusters were recomputed over the same dataset, using a minimum feature cluster size of 20 features.

Note: We are unable to share the actual apps themselves at this time.


If your papers or articles use our dataset, please cite our ESORICS 2013 paper:

Jonathan Crussell, Clint Gibler, and Hao Chen. "AnDarwin: Scalable Detection of Semantically Similar Android Applications." Computer Security-ESORICS 2013. Springer Berlin Heidelberg, 2013. 182-199.

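In BibTeX format:

@inproceedings{crussell2013andarwin,
      title={AnDarwin: Scalable Detection of Semantically Similar Android Applications},
      author={Crussell, Jonathan and Gibler, Clint and Chen, Hao},
      booktitle={Computer Security--ESORICS 2013},
      pages={182--199},
      year={2013},
      publisher={Springer Berlin Heidelberg}
}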

Contact Us

If you have questions about this dataset or would like to discuss ideas for its applications, please email us.

Last modified: May 14th, 2014.