Master Thesis: Exploration of Similarities between smart contracts on the Ethereum Blockchain using NLP techniques
Smart contracts are a distinguishing feature of the Ethereum blockchain. A smart contract is a deterministic application that runs without the possibility of downtime, censorship, or third-party intervention. Smart contracts power a variety of decentralized applications on the Ethereum blockchain in diverse domains such as finance, decentralized storage, and identity management. They are compiled into bytecode, which is executed by a stack-based, Turing-complete virtual machine called the Ethereum Virtual Machine (EVM). After compilation, the bytecode is stored on the Ethereum blockchain and is immutable. However, it is difficult to tell what type of smart contract a given bytecode corresponds to: a gaming contract, a token contract, or just a Ponzi scheme.
The goal of this thesis is to explore smart contracts by performing static analysis on their bytecode. To streamline smart contract development, a set of rules known as the ERC20 standard was formulated. Within the scope of this thesis, the exploration is limited to determining whether a smart contract follows the ERC20 standard. The thesis comprises a background study, the development of a concept, a prototypical implementation, and an evaluation of the developed framework on a dataset containing transactions in the Ethereum network.
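As a minimal illustration of static analysis at the bytecode level, the following Python sketch decodes a hexadecimal bytecode string into a sequence of opcode mnemonics without executing it. The function name `disassemble` and the deliberately partial `OPCODES` table are assumptions made for this example and are not taken from the thesis; a real disassembler would cover the full EVM instruction set.

```python
from collections import Counter

# Partial mnemonic table (assumption: only a handful of common opcodes are listed).
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x02: "MUL", 0x03: "SUB", 0x04: "DIV",
    0x10: "LT", 0x11: "GT", 0x14: "EQ", 0x15: "ISZERO", 0x16: "AND",
    0x34: "CALLVALUE", 0x35: "CALLDATALOAD", 0x36: "CALLDATASIZE",
    0x50: "POP", 0x51: "MLOAD", 0x52: "MSTORE", 0x54: "SLOAD", 0x55: "SSTORE",
    0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST", 0xF3: "RETURN", 0xFD: "REVERT",
}
# PUSH1..PUSH32, DUP1..DUP16 and SWAP1..SWAP16 occupy contiguous ranges.
OPCODES.update({0x60 + i: f"PUSH{i + 1}" for i in range(32)})
OPCODES.update({0x80 + i: f"DUP{i + 1}" for i in range(16)})
OPCODES.update({0x90 + i: f"SWAP{i + 1}" for i in range(16)})


def disassemble(bytecode_hex):
    """Return the list of opcode mnemonics found in a hex-encoded bytecode string."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)

    tokens = []
    pc = 0  # program counter: index of the byte currently being decoded
    while pc < len(code):
        op = code[pc]
        tokens.append(OPCODES.get(op, f"UNKNOWN_0x{op:02x}"))
        # PUSH1..PUSH32 (0x60..0x7f) are followed by 1..32 immediate data bytes,
        # which are operands rather than instructions, so they are skipped.
        if 0x60 <= op <= 0x7F:
            pc += op - 0x5F
        pc += 1
    return tokens


# Example input: the three-instruction prefix PUSH1 0x80, PUSH1 0x40, MSTORE.
tokens = disassemble("0x6080604052")
opcode_counts = Counter(tokens)  # bag-of-opcodes view of the contract
```

Treating each mnemonic as a word turns a contract into a token sequence, and `opcode_counts` is one simple way to summarize it. Dropping the PUSH operands is a design choice of this sketch, not a claim about how the thesis handles them.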
This thesis presents a framework that translates smart contract bytecode into machine-level operation codes (opcodes) and analyzes them using Natural Language Processing (NLP) techniques. After the analysis, the smart contracts are clustered with the goal of extracting a cluster that contains only ERC20-compliant contracts. Three clustering algorithms are used: k-means, Agglomerative Hierarchical clustering, and Affinity Propagation. The quality of the obtained clusters is estimated using the Silhouette index, an internal cluster validation technique, and Purity, an external cluster validation technique. The average Purity score of the obtained clusters is 11 %, which means that, of all the contracts clustered, an average of 11 % are identified as ERC20 compliant. Purity decreases as the number of clusters increases, and so does the average Silhouette index. This indicates that the ERC20-compliant contracts are not well matched with the other members of their clusters, or that too many clusters are present.
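To make the Purity figure more concrete, the following Python sketch computes, for each cluster, the fraction of its members that are labelled ERC20 compliant, and then averages those fractions over all clusters. This is only one reading of the wording above and is an assumption; the textbook definition of purity counts the majority class in each cluster and may give different numbers. The function names and the toy labels are hypothetical.

```python
def erc20_fraction_per_cluster(cluster_ids, is_erc20):
    """Map each cluster id to the fraction of its members labelled ERC20 compliant."""
    if len(cluster_ids) != len(is_erc20):
        raise ValueError("cluster_ids and is_erc20 must have the same length")

    totals = {}     # cluster id -> number of contracts in the cluster
    compliant = {}  # cluster id -> number of ERC20-compliant contracts in it
    for cluster, label in zip(cluster_ids, is_erc20):
        totals[cluster] = totals.get(cluster, 0) + 1
        compliant[cluster] = compliant.get(cluster, 0) + (1 if label else 0)

    return {cluster: compliant[cluster] / totals[cluster] for cluster in totals}


def average_erc20_fraction(cluster_ids, is_erc20):
    """Average the per-cluster ERC20 fractions, weighting every cluster equally."""
    fractions = erc20_fraction_per_cluster(cluster_ids, is_erc20)
    if not fractions:
        raise ValueError("at least one contract is required")
    return sum(fractions.values()) / len(fractions)


# Toy data (hypothetical): six contracts assigned to three clusters.
cluster_ids = [0, 0, 0, 1, 1, 2]
is_erc20 = [True, False, False, True, True, False]

per_cluster = erc20_fraction_per_cluster(cluster_ids, is_erc20)
score = average_erc20_fraction(cluster_ids, is_erc20)
```

In the toy data, cluster 1 contains only ERC20-compliant contracts, which is the kind of cluster the framework aims to extract, while cluster 2 contains none. Because this is an external measure, it requires a ground-truth ERC20 label for each contract.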
Supervisor: Tobias Eichinger 
Type: Master Thesis
Duration: 6 months