#Edge Activations
Consider the graph below where edge weights indicate distances.
Lets use a Vietoris Rips complex: denoted as \(Rips_𝛼(X)\), consists of simplices \([x_0,...,x_k]\) satisfying the condition that the distance between any pair of points, \(d_x (x_i, x_j)\), is less than or equal to a for all (i, j).Min scale = 0 (we could start from 0.5 or any other value as well)
Max scale = 100 (we could go longer, but it would not change anything)
Multiple questions:
Question 3: What happens if the graph is directed?
#Node Activation
Consider the graph below where node features indicate ages.We will use sub or super level filtration.
Min scale = 10 (we could start from 15 or any other value as well)
Max scale = 80 (we could go longer, but it would not change anything)
Multiple Questions:
#Simplex vs Clique
Math: Mathematically, a k-dimensional simplex is defined as the convex hull of (k + 1) affinely independent points in k-dimensional space.
CS: In graph theory, a k-clique refers to a subset of k vertices in a graph where every vertex is directly connected to every other vertex in the subset.#K-dimensional holes
A k-dimensional hole requires vertices with at least degree k+1 (Vietoris-Rips here).
Node f will never contribute to a k>0 dimensional hole (See our Neurips’22 CoralTDA article) |
#Power of PH
Persistent homology is powerful because it allows us to track the birth and death of dimensional features at various scales.
Insight from Yulia R. Gel of UTD Math: This PH information is similar to what we get in graph motif analysis, but with finer granularity.
Figure: Three-node graph motifs |
Li, Yitao, et al. “Dissecting Ethereum Blockchain Analytics: What We Learn from Topology and Geometry of the Ethereum Graph?.” Proceedings of the 2020 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2020 |
Account Based blockchains have two types of “transactions”
#Ether transactions
Ether transactions involve one input address and one output address. An address spends coins from a balance, keeps the change. Each transaction of an address has an order (called nonce). The nonce is the number of transactions sent to the network by the address. A later transaction needs to wait for earlier transactions to be mined.
#Internal transactions
A transaction can transfer both currencies and tokens. An internal transaction can create multiple edges, although this is rare on Ethereum
#Trading Tokens - a timeline
#Token Networks
The largest connected component in Storj network on 13-1-2018 |
We model account based blockchains as directed, weighted, multi-graphs
The network of a single token is usually sparse and devoid of community structure. Daily networks may contain many disconnected components.
#Topological Data Analysis of Blockchain – Ethereum Case
Let 𝐺=(𝑉,𝐸,𝜔) be a weighted graph, with the node set 𝑉 and edge set 𝐸 and \(𝜔:𝐸→𝑅^+\) is a function encoding dissimilarity between two nodes connected by an edge.
To account for dissimilarity between two disconnected nodes, we introduce the weight ̃\(𝜔:𝑉×𝑉→𝑅^+\) \[𝜔_{uv}=\left\{ \begin{array}{ll} \omega_{uv} & (u,v)\in \mathcal{E}\\ \infty & (u,v)\not\in \mathcal{E} \end{array} \right.\] In the context of a weighted network, we define \(𝜔_{𝑢𝑣}\) as \[𝜔_{uv}=[1+\frac{(A_{uv}-A_{min})\cdot(a-b)}{(A_{max}-A_{min})}]^{-1}\]
where \(𝐴_{𝑢𝑣}\) is the weight of the edge (total amount of tokens traded) between nodes 𝑢 and 𝑣. Values of a and b create a scale.
\(𝐴_{𝑚𝑖𝑛}\) and \(𝐴_{𝑚𝑎𝑥}\) are the smallest and the largest edge weights, respectively.
#Topological Data Analysis of Ethereum Networks
In this context, we introduce a novel notion of Betti functions which relate these counts to the scale parameter viewed as continuum. The Betti-𝒑 function \(ℬ_𝑝:𝑅^+→\{0,1,2,3,…\}\), 𝑝=0,…,𝑑, associated with \(\{𝒞_𝜖 \}_{𝜖\in𝑅^+ }\) is defined as \[ℬ_𝑝:𝜖→ℬ_𝑝(𝒞_𝜖)\] Sequence of Betti numbers are finite dimensional realizations of Betti functions.The Betti functions can be regarded as a functional summary statistic of the network’s topological structure.
Due to the functional dependency among Betti numbers at different scales, it is important to view \(\{\mathfrak{B}_p(\epsilon_k)\}_{k=1}^n\) as a function as opposed to a vector in \(𝑅^𝑛\). This point of view allows us to utilize methods from functional data analysis such as a concept of functional data depth.Consider Betti functions \(\{ℬ_(𝑝,𝑡) \}_(𝑡=1)^𝑇\) associated with an evolving token transaction network over days 𝑡=1, 2,…,𝑇. Although each day visually looks different, some days present a clear anomaly in terms of their shape.
We use a notion of rolling band depth: \[𝑅𝐷_𝑤 (ℬ_{𝑝,𝑡}):=𝑀𝐵𝐷(ℬ_{𝑝,𝑡}|ℬ_{𝑝,𝑡},ℬ_{𝑝,𝑡−1},ℬ_{𝑝,𝑡−𝑤+1})\]. We introduce a concept of Betti signature which is defined as the deepest or most central Betti function.
#Predictive Models
Problem Definition: Given the transaction network of an Ethereum token and time series of the token price in fiat currency, predict whether the token price will change more than 𝛿 in the next ℎ days. Identify the maximum horizon value ℎ such that the prediction accuracy is at least 𝜌.
A histogram of absolute price returns of 31 tokens \(𝑅_𝑡=(𝑃𝑟𝑖𝑐𝑒_𝑡−𝑃𝑟𝑖𝑐𝑒_{𝑡−1})/(𝑃𝑟𝑖𝑐𝑒_{𝑡−1})\). | Number of (price) anomalous tokens in time. |
Predictive Models - A Fusion of Network Analysis and Time Series
We predict price anomalies in 31 token networks, where a total of 9042 days are predicted as anomalous (Anomaly:true). To predict whether the Ethereum token price will change more than 𝛿 within the next ℎ days horizon, we combine the graph topological features with traditional network summaries. We start by defining the price return of a token on day 𝑡 as \[R_t=\frac{Price_t-Price_{t-1}}{Price_{t-1}}\]
Then, we label a day 𝑡 as anomalous if there is a significant change in token’s price. More specifically, if |\(R_𝑡\)|≥𝛿 where 𝛿 is a user-defined threshold (i.e., magnitude of a price shock), then 𝑡 is considered as an anomalous day. We build one predictive model for each token and examine performance for different prediction horizons ℎ>0.
Our Random Forest models use 2/3 of a token’s lifetime for training, and the remaining 1/3 for test.A Venn diagram for the number of predicted anomalies in all token networks (for h = 1). Intersecting regions indicate agreement on predictions |
\(𝑅𝐷_7 (ℬ_𝑝)\): Rolling modified band depth of Betti-p function in the last 7 days.
Edge: Number of edges in the daily token network.
\(Price: PN_𝑡=\frac{𝑃𝑟𝑖𝑐𝑒_𝑡}{max{𝑃𝑟𝑖𝑐𝑒_1,…,𝑃𝑟𝑖𝑐𝑒_{𝑇_𝑘}}}\)
For up to three-day horizons, all models have accuracy >0.7. The figure shows that compared to other models, the M4 (full) model has the least deteriorating performance as horizon increases from 1 to 7. The accuracy results offer evidence that Betti models are more conservative in making anomalous day predictions, and their accuracy is better than the baseline model M1.
\(Accuracy: \frac{𝑇𝑃+𝑇𝑁}{𝑇𝑃+𝐹𝑃+𝑇𝑁+𝑇𝑃}\) |
Abay, N.C., Akcora, C.G., Gel, Y.R., Kantarcioglu, M., Islambekov, U.D., Tian, Y. and Thuraisingham, B., 2019, November. Chainnet: Learning on blockchain graphs with topological features. In 2019 IEEE international conference on data mining (ICDM) (pp. 946-951). IEEE. |
#Transaction output (TXO) based blockchains
Next, if address b wants to spend its received 2B, it needs to show proof of funds:
“Use the 2B I received from Block 1, transaction 1 and to pay 1.5B to c and 0.3B to d”.#Blockchain Graph – Substructure mining
Rather than individual edges or nodes, we use a substructure as the building block in our Bitcoin analysis. We use the term chainlet to refer to such substructures.Forecasting Bitcoin Price with Graph Chainlets, Akcora et al. PaKDD 2018. |
Definition [K-Chainlets]: Let k-chainlet \(G_k = (V_k, E_k, B)\) be a substructure of G with k nodes of type {Transaction}. If there exists an isomorphism between Gk and G’, G’ ∈ G, we say that there exists an occurrence, or embedding of \(G_k\) in G.
If a Gk occurs more/less frequently than expected by chance, it is called a Blockchain k-chainlet. A k-chainlet signature \(f_G(G_k)\) is the number of occurrences of \(G_k\) in G.
#Blockchain Chainlets
Chainlets have distinct shapes that reflect their role in the network. We aggregate these roles to analyze network dynamics.
Three distinct types of first order chainlets! |
#Graph features – occurrence and amount
Occurrence: How many transactions were created with a given type of chainlet?
Amount: How many bitcoins were transferred with a given type of chainlet?
Let’s take one-to-two chainlets, i.e., c+{1→2}
Occurrence: 2 chainlets occur: \(t_1\) and \(t_3\).
Amount: 7.8Ƀ+6.6Ƀ=14.4 bitcoins are
transferred by them.
Rows and columns indicate number of inputs and outputs, respectively.
Information about one-to-two chainlets, i.e., c_{1→2}, are given in the matrix cell [1,2]
#Topological features
Topological Data Analysis (TDA) provides methods to systematically study the topological and geometric structure underlying data. These structures are commonly analyzed via the multi-scale-based framework of persistent homology. The primary idea is to assess which topological features remain persistent over a larger set of scales and hence are likely to play a significant role in its functionality.
#Persistent homology for blockchains
We propose a novel approach that computes the Betti sequences on a network of N × N nodes where N is the size of the chainlet matrix A. For each of the \(N^2\) unique chainlets (e.g., \(C_{2→3}\)), we create a node in a new network, where edge distance between two nodes is computed with a suitable ‘distance’ d.A graph from N2= 400 chainlet nodes |
In constructing the new network, we use and hence retain the amount information from the Blockchain network. This way, we combine distance (computed from transferred coins) with edge connectedness while restricting the network size.
We describe the main steps as follows:
Given a heterogeneous Blockchain network with transferred bitcoins on edges,
As a result, we obtain the filtration of VR complexes \(VR_1\) ⊆ \(VR_2\) ⊆… ⊆ \(VR_S\) We then compute \(x_t = \{β_0(ɛ_1), . . . , β_0(ɛ_S); β_1(ɛ_1),… , β_1(ɛ_S)\}\).
#Experiment
Problem Statement: Let \(x_t\) ∈ R d be a set of features computed on the Bitcoin blockchain. Let (\(x_1, y_1\)),… ,(\(x_t, y_t\)) be the observed data where Y = {\(y_1,..., y_t\)} are the corresponding Bitcoin prices in dollars. At a time point t, estimate the Bitcoin price \(y_t\)’ where t’ > t.
We downloaded and parsed the entire Bitcoin transaction graph from 2009 January to 2018 December. Using a time interval of 24 hours, we extracted daily transactions on the network and created the Bitcoin graph. Our Bitcoin price (USD) data is downloaded from blockchain.com which aggregates prices from worldwide online exchanges.
In addition to FL and Betti related features: past price, transaction count, mean degree of addresses, number of new addresses, mean and total coin amount transferred in transactions and address network average clustering coefficient.
We used ARIMAx, Random Forest, XGBT, Gaussian Process based Regression, and Elastic Net.
We use a time window-based approach in price prediction.image
#Baseline Experiments
We assess model performance with root mean squared error in the predicted price.Window = 3 | Window = 5 | Window = 7 |
The simplest baseline for ChainNet can be constructed by training models on past price and past total transaction count in a sliding window prediction scheme.
#Experiments
We report the percentage predictive gain, or decrease in RMSE for a specific machine learning model m w.r.t. its baseline model \(m_0\) as \(∆_𝑚(𝑤, ℎ) = 100 × 1 − (𝑅𝑀𝑆𝐸_𝑚 (𝑤, ℎ)/𝑅𝑀𝑆𝐸_{𝑚_0} (𝑤, ℎ))\), where \(𝑅𝑀𝑆𝐸_{𝑚_0}(w, h)\) and \(RMSE_m(w, h)\) are delivered by a baseline model m0 and a competing model m, respectively.Window = 3 | Window = 5 | Window = 7 |
Figure: Gain over the best model is given for three windows, and multiple horizons.
Next Part
>>Part
3