Topological Data Analysis (TDA) serves as the core toolkit utilized by mathematicians in the field of Data Science.
TDA is primarily driven by the belief that topology and geometry offer a robust approach to extracting qualitative and sometimes quantitative information about the underlying structure of data. The objective of TDA is to establish mathematically rigorous, statistical, and algorithmic methods for inferring, analyzing, and leveraging the intricate topological and geometric structures present in data, often represented as point clouds (each dot of the pot) in Euclidean or more general metric spaces.
The Torus shape given as a point cloud | A Pot in the Euclidean (3D) Space |
A metric space (M, ρ) is a set M with a function \(\rho ∶M × M →R_+,\) called a distance, such that for any x, y, z ∈ M:
#1
The input data is considered to be a finite collection of points, and there exists a measure of distance or similarity between these points. This distance can be determined by the metric of the surrounding space, such as the Euclidean metric if the data is embedded in \(ℝ^d\), or it can be defined by a pairwise distance matrix intrinsic to the data.
The specification of the metric for the data is typically provided as an input or guided by the specific application. However, it is crucial to acknowledge that the choice of metric plays a significant role in uncovering noteworthy topological and geometric characteristics of the data
#2a
A continuous shape is constructed to emphasize the underlying topology or geometry of the dataTopology: This looks like a pot! |
#2b
This shape is typically a simplicial complex or a sequence of nested simplicial complexes known as a filtration, which captures the data’s structure at various scales.simplicial complex |
#2c
This shape is typically a simplicial complex or a sequence of nested simplicial complexes known as a filtration, which captures the data’s structure at various scales.Filtration |
#2d
Simplicial complexes can be seen as higher-dimensional extensions of neighborhood graphs commonly used in standard data analysis and learning algorithms.
The task at hand is to define these structures in a way that is mathematically proven to reflect meaningful information about the data’s structure and can be efficiently constructed and manipulated in practical applications.#3
The structures constructed on the data provide a means to extract topological or geometric information.It may involve generating rough summaries or approximations that require specific methods, such as persistent homology, to extract relevant information.What do you see in this filtration? |
#4
The topological and geometric information extracted offers novel sets of features and descriptors for the data. These features can enhance the understanding of the data, especially through visualization techniques. They can also be combined with other types of features to enable more comprehensive analysis and facilitate machine learning tasks. Furthermore, this information can be utilized to design data analysis and machine learning models that are well-suited to the specific characteristics of the data. An essential consideration at this stage is demonstrating the added value and complementarity of the information provided by TDA tools compared to other features.
Understanding how the extracted information complements existing features and showcases its advantages is a key aspect of this process.
#Simplicial complexes - 1
Simplicial complexes can be regarded as an extension of graphs to higher dimensions. They are mathematical entities that possess both topological and combinatorial characteristics, which makes them particularly valuable for Topological Data Analysis.
Two important aspects of these mathematical objects:
#Simplicial Complexes - 2
Given a set \(X = \{x_0,···, x_k\} ⊂ R^d\) consisting of k + 1 affinely independent points, the k-dimensional simplex \(σ = [x_0,···,x_k]\) spanned by X can be defined as the convex hull of X.
Affinely independent points refers to a set of points in a vector space that do not lie on the same affine subspace, except when the set consists of a single point. In other words, a set of points \(\{x_0, x_1, x_2,···, x_n\}\) is said to be affinely independent if there is no non-trivial linear combination of the points that results in the zero vector, except when all the coefficiants are zero.
In layman’s terms can be understood as a set of points that are not in a straight line or a flat plane.
#Simplicial Complexes - 3
Given a set \(X = \{x_0,···, x_k\} ⊂ R^d\) consisting of k + 1 affinely independent points, the k- dimensional simplex \(σ = [x_0,···, x_k]\) spanned by X can be defined as the convex hull of X.
The convex hull of a set of points is the smallest convex shape or region that completely encloses all the points. In simpler terms, it is like finding the “outer boundary” that wraps around a given set of points.
Mathematically, an n-dimensional simplex is defined as the convex hull of (n + 1) affinely independent points in n-dimensional space. In other words, it is the smallest convex shape or region that contains these points. For example, a 1-dimensional simplex is a line segment, a 2-dimensional simplex is a triangle, a 3-dimensional simplex is a tetrahedron, and so on.#Simplicial Complexes - 4
Given a set \(X = \{x_0,···, x_k\} ⊂ R^d\) consisting of k + 1 affinely independent points, the k- dimensional simplex \(σ = [x_0,···, x_k]\) spanned by X can be defined as the convex hull of X.
Mathematically, an n-dimensional simplex is defined as the convex hull of (n + 1) affinely independent points in n-dimensional space. In other words, it is the smallest convex shape or region that contains these points. For example, a 1-dimensional simplex is a line segment, a 2-dimensional simplex is a triangle, a 3-dimensional simplex is a tetrahedron, and so on.#Simplicial Complexes - 5
A geometric simplicial complex K in \(R^d\) is a collection of simplices satisfying the following conditions:
The face of a simplex refers to a lower-dimensional simplex that is a subset of the original simplex. In other words, it is a subset of the vertices of the simplex that defines a smaller simplex within it.
#Simplicial Complexes - 6
For example, in a 3-dimensional simplex (tetrahedron), the faces are triangles, and in a 4-dimensional simplex, the faces are tetrahedra.
The faces of a simplex are simplices themselves, but of lower dimension.#Simplicial Complexes - 7
The collection of simplices in a simplicial complex K forms a subset of \(R^d\) known as the underlying space of K, which inherits the topology of \(R^d\). Consequently, K can be regarded as a topological space based on its underlying space. It’s worth noting that once the vertices of K are determined, the combinatorial description of the collection of simplices, adhering to certain incidence rules, fully characterizes K.
#Building simplicial complexes from data
There exist many ways to build simplicial complexes. We present here a few classical examples that are widely used in practice. An initial example is an immediate extension of the concept of an a-neighboring graph.
Suppose we have a set of points, X, in a metric space (M, p), and a real number α ≥ 0. The Vietoris-Rips complex, denoted as \(Rips_a(X)\), consists of simplices \([x_0,···, X_k]\) satisfying the condition that the distance between any pair of points, \(d_x(x_i, x_j)\), is less than or equal to α for all (i, j).
Each circle (called a closed ball) has a data point at its center, and the circle radius is the distance from the data point
However, it’s important to note that in general, even when X is
a finite subset of \(R^d\), \(Rips_a(X)\) may not have a geometric
realization in \(R^d\). In other words,
it may have a dimension higher than d.Can you
figure out why?
Closely connected to the Vietoris-Rips complex is the Čech complex, denoted as \(Cech_a(X)\). It is defined as the set of simplices \([x_o,···, X_k]\) such that the k + 1 closed balls, \(B(x_i, α)\), have a non-empty intersection. It’s worth noting that these two complexes are related as follows: \(Rips_a(X)\) is a subset of \(Cech_a(X)\), and \(Cech_a(X)\) is a subset of \(Rips_2a(X)\). Additionally, if X is a subset of \(R^d\), then \(Cech_a(X)\) and \(Rips_2a(X)\) share the same 1- dimensional skeleton, meaning they have the same set of vertices and edges.
#Main Directions in TDA - simplified classification
Persistent Homology (PH): study of topological shapes that can fit into your dataTDAMapper: the overall shape of your data
#What is Homology?
Homology, in a nutshell, is a fundamental concept in algebraic topology that provides a powerful tool for formalizing and analyzing topological features of spaces or simplicial complexes in an algebraic manner. It enables us to capture and quantify the presence of different-dimensional “holes” within a given space.
In homology, each dimension k corresponds to a vector space \(Н_k\), where the dimension of \(H_k\) intuitively represents the number of independent features of that dimension. For instance, the O-dimensional homology group \(H_0\) describes the connected components of the complex, the 1-dimensional homology group \(H_1\) represents the 1-dimensional loops or cycles, and the 2-dimensional homology group \(H_2\) captures the 2-dimensional cavities or voids within the space.
Homology provides a rigorous framework for understanding and characterizing topological structures by associating vector spaces to each dimension, giving us valuable information about the presence and properties of various topological features within a space or simplicial complex.
In simple terms, we are defining topological structures (i.e., k-dimensional holes) and counting them in a simplicial complex (SC).
What are these holes?
O-dimensional holes are connected components in a SC:
1-dimensional holes are loops in a SC:
#Homology
To extract summaries of such topological features at a mesoscopic level, we use Betti numbers. Betti-p number of a simplicial complex C of dimension d, denoted by \(ẞ_p(C)\), is defined asBetti numbers at increasing dissimilarity scales |
Anand, D.V., Meng, Z., Xia, K. and Mu, Y., 2020. Weighted persistent homology for osmolyte molecular aggregation and hydrogen-bonding network analysis. Scientific Reports, 10(1), pp.1-17. |
#Persistent Homology
We can increase k as much as we want, however we do not see k>2 holes often in our experiments.
Formal definitions for the holes can be found in works such as
#What is Persistent in PH?
Persistent homology in a nutshell:
#What is filtration?
A filtration of a simplicial complex K is a collection of subcomplexes \((K_r)_{t\in T}\) that are nested, meaning that for any \(r_0\) and rp in the set T, if r is less than or equal to \(r_0\), then \(K_{r0}\) is a subset of Kro.
The entire complex K is obtained by taking the union of all the subcomplexes \(K_r\) for r in T. In a more general sense, a filtration of a topological space M is a collection of subspaces \((M_r)_{r\in T}\) that are nested. Similar to the simplicial complex case, if r is less than or equal to \(r_0\), then \(M_r\) is a subset of \(M_{r0}\). The entire space M is obtained by taking the union of all the subspaces \(M_r\) for r in T.#Persistence Diagrams
In practical situations, the parameter \(r \in T\) can often be interpreted as a scale parameter.
We alternate between scale and distance to refer to r. We mostly use the term distance and use Є or α to refer to the distance threshold in our articles.
Filtration tracks the appearance of shapes along the distance threshold α.
The information gathered through filtration is stored in persistence diagrams (PDs), which consist of a multiset of points supported on the open half-plane \(\{(α_b, α_d) \in R^2: α_b <\)
#PD for 1-dimension holes
#Vectorizations for 0 and 1-dimensional holes
Next Part
>>Part
2