TU Berlin

Master Thesis: Exploration of Similarities between smart contracts on the Ethereum Blockchain using NLP techniques


Smart contracts are a distinctive feature of the Ethereum blockchain. A smart contract is a deterministic application that runs without the possibility of downtime, censorship or third-party intervention. Smart contracts underpin a variety of decentralized applications on the Ethereum blockchain in diverse domains such as finance, decentralized storage and identity management. Smart contracts are compiled into bytecode, which is executed by a stack-based, Turing-complete virtual machine called the Ethereum Virtual Machine (EVM). After compilation, the bytecode is stored on the Ethereum blockchain and is immutable. Moreover, it is difficult to tell what type of smart contract a given bytecode corresponds to, for example whether it is a gaming contract, a token contract, or just a Ponzi scheme.

The goal of the thesis is to explore smart contracts by performing static analysis on their bytecode. To streamline smart contract development, a set of rules known as the ERC20 standard was formulated. Within the scope of this thesis, the exploration is limited to determining whether a smart contract follows the ERC20 standard. The thesis comprises a background study, the development of a concept, a prototypical implementation, and an evaluation of the developed framework on a dataset containing transactions in the Ethereum network.
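One common static heuristic for ERC20 detection, shown here only as an illustration and not as the method used in the thesis, is to look for the six mandatory ERC20 function selectors among the PUSH4 operands of the bytecode. The function names below are hypothetical, and the check ignores the events the standard also requires, so it should be read as a rough sketch.

```python
# Heuristic sketch (an assumption, not the thesis's method): a contract is
# treated as "ERC20-like" if its bytecode pushes all six mandatory ERC20
# function selectors. A selector is the first four bytes of the Keccak-256
# hash of the function signature.
ERC20_SELECTORS = {
    "totalSupply()": "18160ddd",
    "balanceOf(address)": "70a08231",
    "transfer(address,uint256)": "a9059cbb",
    "transferFrom(address,address,uint256)": "23b872dd",
    "approve(address,uint256)": "095ea7b3",
    "allowance(address,address)": "dd62ed3e",
}


def push4_operands(bytecode_hex):
    """Return the operands of all PUSH4 instructions as hex strings."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    found = set()
    pc = 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32 carry 1..32 operand bytes
            size = op - 0x5F
            if op == 0x63:  # PUSH4
                found.add(code[pc + 1:pc + 1 + size].hex())
            pc += 1 + size  # skip the operand so data is not read as code
        else:
            pc += 1
    return found


def missing_erc20_functions(bytecode_hex):
    """List the ERC20 signatures whose selector does not appear as a PUSH4 operand."""
    present = push4_operands(bytecode_hex)
    return [sig for sig, sel in ERC20_SELECTORS.items() if sel not in present]


def looks_like_erc20(bytecode_hex):
    """True if every mandatory ERC20 selector is pushed somewhere in the bytecode."""
    return not missing_erc20_functions(bytecode_hex)
```

A contract that exposes only `transfer` would be reported as missing the other five signatures. Selectors embedded in longer PUSH operands are deliberately not counted.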

This thesis proposes a framework that translates smart contract bytecode into machine-level operation codes (opcodes) and analyzes them using Natural Language Processing (NLP) techniques. After the analysis, the smart contracts are clustered with the goal of extracting a cluster that contains only ERC20-compliant contracts. Three clustering algorithms are used: k-means, Agglomerative Hierarchical clustering and Affinity Propagation. The quality of the obtained clusters is estimated using the Silhouette index, an internal cluster validation technique, and Purity, an external cluster validation technique. The average Purity score of the obtained clusters is 11 %, which means that, on average, 11 % of the clustered contracts are identified as ERC20 compliant. Purity decreases as the number of clusters increases, and so does the average Silhouette index. This indicates that the ERC20-compliant contracts are not well matched with the other members of their clusters, or that there are too many clusters.
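The first and last steps of such a pipeline can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the thesis's implementation: the opcode table is deliberately partial, the function names are hypothetical, and Purity is read as the share of ERC20-compliant contracts within each cluster, following the description above (the textbook definition uses the majority class of each cluster instead). The NLP feature extraction and the clustering itself are left out.

```python
from collections import Counter, defaultdict

# Partial opcode table for illustration; the full table is part of the EVM
# specification. Unlisted opcodes are rendered as UNKNOWN_0x...
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x02: "MUL", 0x03: "SUB", 0x04: "DIV",
    0x10: "LT", 0x11: "GT", 0x14: "EQ", 0x15: "ISZERO", 0x16: "AND",
    0x34: "CALLVALUE", 0x35: "CALLDATALOAD", 0x36: "CALLDATASIZE",
    0x50: "POP", 0x51: "MLOAD", 0x52: "MSTORE", 0x54: "SLOAD",
    0x55: "SSTORE", 0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST",
    0xF3: "RETURN", 0xFD: "REVERT",
}


def disassemble(bytecode_hex):
    """Translate hex bytecode into a list of opcode mnemonics (operands dropped)."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    tokens = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32: skip the operand bytes
            size = op - 0x5F
            tokens.append(f"PUSH{size}")
            pc += 1 + size
            continue
        if 0x80 <= op <= 0x8F:
            tokens.append(f"DUP{op - 0x7F}")
        elif 0x90 <= op <= 0x9F:
            tokens.append(f"SWAP{op - 0x8F}")
        else:
            tokens.append(OPCODES.get(op, f"UNKNOWN_{op:#04x}"))
        pc += 1
    return tokens


def opcode_counts(bytecode_hex):
    """Bag-of-opcodes representation, a simple stand-in for NLP features."""
    return Counter(disassemble(bytecode_hex))


def erc20_share_per_cluster(cluster_ids, is_erc20):
    """Fraction of ERC20-compliant contracts in each cluster.

    cluster_ids: cluster label per contract (from any clustering algorithm).
    is_erc20: ground-truth boolean per contract, in the same order.
    """
    members = defaultdict(list)
    for cluster_id, flag in zip(cluster_ids, is_erc20):
        members[cluster_id].append(bool(flag))
    return {cid: sum(flags) / len(flags) for cid, flags in members.items()}


def average_share(cluster_ids, is_erc20):
    """Unweighted mean of the per-cluster ERC20 shares."""
    shares = erc20_share_per_cluster(cluster_ids, is_erc20)
    return sum(shares.values()) / len(shares)
```

The opcode lists can be treated like sentences and passed to any text vectorizer before clustering. A cluster whose share is 1.0 would be the ERC20-only cluster the thesis aims to extract.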

Supervisor: Tobias Eichinger

Type:  Master Thesis

Duration: 6 months

Service-centric Networking
Telekom Innovation Laboratories
TEL 19
Ernst-Reuter-Platz 7
10587 Berlin, Germany
Phone: +49 30 8353 58811
Fax: +49 30 8353 58409