## What is a Graph?

It is important to understand what is a graph, what is not a graph and not to get confused with charts. It is also important to understand graph and graph of a function. Charts represent the *graph* of a function.

The Seven Bridges of Königsberg is a historically notable problem in mathematics. Its negative resolution by Leonhard Euler in 1736 laid the foundations of graph theory and prefigured the idea of topology.

## Mathematical Definition:

A graph is a collection of vertices and edges, which represent ordered pairs of nodes.

- V: a set of vertices
- E: a set of edges

E = {e1, e2, e3, e4}

e: an edge designates a pair of vertices

E = {(v1,v5), (v1,v3), (v6,v3), (v4,v5)}

## How a graph is represented?

A graph is represented as an abstract data type

- that has a data structure to represent the mathematical graph
- and supports operations (on that graph) such as:

• Add_edge

• Add_vertex

• Get_neighbor (and others)

One common way to represent the graph is using matrix:

## Use Cases

- Social Media Analytics
- Gene-Phenotype-Disease Networks
- Human Information Network Analytics
- Analysis and Planning for Smart Cities

## Graph Analytics

Graph Analytics is analytics where the underlying data is natively structured as or can be modeled as a set of graphs.

## Graphs and the V’s of Big Data

- Volume
- Velocity
- Variety
- Valence

### Volume

The volume represents the size in terms of # of nodes and edges. The analysis gets difficult with the increase in volume as it increases algorithmic complexity. Data-to-analysis time is too high if the volume is big. Examples:

- Is there a simple path between “Alzheimer’s Disease” and “Colorectal Cancer”? (Well-known hard decision problem)
- How many simple paths exist between “Alzheimer’s Disease” and “Colorectal Cancer”? The size of the result is exponential in the number of nodes.

### Velocity

- Velocity represents the rate at which a graph increases in size and complexity (Streaming edges in graphs). A stream of updates (stream of edges) makes graph analytics difficult. A continuous stream does not fit in memory.
- The question arises, how can these metrics be computed on the edge-stream with limited memory. This also makes difficult to compute a metric such as 1) shortest distance between two nodes and counting of strongly connected groups, e.g. a Facebook group.

### Variety (Heterogeneity)

- There are differences in the types and sources of data, which are combined to form graphs. This makes graph non-uniform and complex.
- Graph data is often created through integration such as relational, XML/JSON, graph-structured data and document data.
- Different kinds of graphs may have very different meanings which can make things more complex if they are combined. Few examples are social networks, citation networks, interaction networks, semantic web/linked data, ontologies.

### Valence

Valence represents the degree of connectedness or interdependence. Higher valence implies that the data elements are strongly related. This relatedness is significantly exploited in analytics. In many cases, valence increases with time. Parts of the graph become denser. The average distance between arbitrary node pairs decreases.

## Example 1: Social Media Analytics

In Twitter, tweets capture Human Interactions. Many kinds of nodes represents:

- Users
- Tweets
- URLs
- Hashtags
- Media

Many kinds of edges are for example:

- User creates Tweets
- A tweet is in response to another
- A tweet retweets another
- User mentions User
- Tweet contains Hashtag
- User follows User

### Usage:

**Behavioral Psychology:**A branch of psychology that studies people’s behavior**Data Science questions**of a behavioral psychologist: Do players of violent online

games show more confrontational behavior in their tweet conversations? Can we tell how “addicted” a user is to a game?

### Why Graphs?

- Extracting Conversation Threads
- Finding Interacting Groups
- Finding Influencers in a Community

## Example 2: Biology

**Biological Entities interact **

- Networks arising from experimental observations such as:
- Gene-protein relationships
- Gene-gene interactions
- Cell-cell signaling
- Gene-phenotype-disease relationships

- Human Knowledge
- Anatomical knowledge
- Taxonomy of the animal kingdom

- Networks arising from

• Computational Techniques

• Bioinformatics algorithms

• Literature mining

**Data Integration**

- Data sets must be assembled
- More data integration = larger data volume

### Why Graphs?

- Discovering Unknown Relationships
- Connecting the dots
- Indirect associations between diseases
- Exploratory analysis

## Example 3: Human Information Network Analytics

### Combining

- Professional information (e.g., LinkedIn)
- Personal information made public (e.g., FB, Google Plus)
- Calendar information
- Contact/Relationship information
- Public co-activities with other people
- Organized by time and events
- Annotated with events

### Behavioral information

- Financial and business transactions
- Performance in activities
- Fitness habits
- Location information

### Why Graphs?

- Matchmaking problems (Job-candidate pairing)
- Topical influencer analysis (Who would have maximal reach for task X?)
- Situation detection, assessment (Threat detection)

## Example 4: Smart Cities

Cities have multiple interacting networks over the same spatial domain

- Transportation networks (Multiple modalities)
- Water and sewage network
- Power transmission network
- Broadband IP and M2M networks

Networks should be modeled as multiple functionalities of each network such as

- Physical infrastructure
- Commodity flow
- Material
- Energy
- People

It is needed to create “network models”

- Analytical traffic model
- Congestion models

### Why Graphs?

- Planning for “smart hubs”
- Estimating congestion patterns
- Demand Response Analyses
- Energy-optimal routing

## Basic principles and techniques of graph analytics

How to utilize the mathematical properties of data and provide efficient algorithmic solutions for large and complex graph structure problems.

- Graph applications normally have the set of
**node types**and the mapping function that assigns types to nodes. Node type assignment is not mandatory. In addition to types, a node also has**attributes**and**values**. - In the case of Tweet example, the text is the name of an attribute that refers to the textual body of the Tweet whose value is a character string written by the author of the Tweet. For a specific case, as in Tweet example, one has a fixed set of attributes as decided by Twitter. This collection of attributes is called a
**node schema**. For a general graph, a node schema may have an arbitrary number of attribute value pairs. - Similarly, edges have an
**edge type**, also called an edge label. Also just like a node, an edge may have an**edge schema**consisting of attribute value pair. - An attribute called
**interaction type**can have a set of possible values. This set of possible values is called the**domain of the attribute**.

When you think that your application needs graph analytics, you should determine **Informational model** of the graph your application needs. Document the information model, in terms of the elements.

Many applications put different kinds of numeric knowledge into edges of a graph in the form of **edge points**. If we do not put weights in an adjacency metrics, an edge is just represented by placing a one in the appropriate cell. However, if we do use a weight, the weight value can be placed in the adjacency matrix.

### What do the weights mean?

That depends on the application.

- Let’s consider road map example where the nodes are road intersections and the edges represent stretches of the street or highway between these intersections. The edge weight can represent the distance of a particular segment of the road.
- Let’s consider personal communication network example, an email network, we can count the average number of emails per week sent from person A to person B and use it as a proxy for the strength of their connection. So more emails mean a stronger connection.
- Let’s consider biological network, one often has to assess whether an interaction that can occur is actually likely to occur given the concentration of the reactants, the chemical environment at the site of the reaction and so forth. This is represented as a weight that designates the likelihood of interaction.
- Finally, let’s consider a knowledge network where nodes represent entities like people, places, and events. And edges represent relationships like a person visited a place or movie actor A is dating movie actress B. Now this kind of information may be important for some news media. However, if the information does not come from an authentic source, itself, it is more prudent to put a certainty value on it. This certainty value may be treated as a weight on the edge.

The **structure of the graph** often contains valuable insights to a graph. Structural properties of graphs:

**Loop:**A loop, which is an edge from a node to itself.A protein can interact with another protein of the same kind.

People send emails to themselves.

A road segment circles back to the same intersection.

A website has a link to its own URL.The existence of loops and the nodes that have such loops can be very informative in some applications and can be problematic for other applications.**Multi-graphs:**The occurrence of multiple edges between the same node pair. The graphs with this feature are called multi-graphs.Two genes can have five different types of interactions between them. Where each interaction has a different value for the attribute interaction type.A person can be a spouse, a co-performer in music and financial adviser, all at the same time.Many analytics algorithms are not natively designed for multigraphs, and often need a little customization to handle them.

### Categories of Graph Analytics:

**Path Analytics:**How to traverse to the nodes and edges of the graph?**Connectivity Analytics:**How to explore the connectivity pattern of the graphs? The connectivity pattern refers to the structure and organizations of the edges of the graph.**Community Analytics:**How to discover the behavior of communities which are closely interacting entities in a network.**Centrality Analytics:**How to detect and characterize significant nodes of a network with

respect to a specific analysis problem.

## Path Analytics

What is the path? How to find your way as it travels along the nodes and edges of the graph?

- A
**walk**is an arbitrary sequence of nodes and edges that starts from some node and

ends on some node. - In many applications, we do not want to consider arbitrary walks but consider a walk where we do not repeat nodes unless we need to come back to the starting point. Such a constrained walk is called a
**path**. - A network with no cycles is called
**acyclic**. A graph that is both directed and acyclic is called a**directed acyclic graph**, in short, a DAG. - A
**trail**is a concept similar to a path, it is a walk with no repeating edges. - A common notion in directed graphs is called
**reachability**. Reachability is not symmetric. - What is a leaf node of the graph?
**Graph Diameter:**The diameter of a graph measures the maximum number of steps, also called hops, requires us to traverse to go to the most distant node in the graph. That means, if you go from an arbitrary node to any other arbitrary node in the graph, following only the shortest paths roots, what is the maximum number of steps you have to take?

### How to find the best path from node one to node two?

To specify the best path, we need to define when one path is better than another. This is usually expressed in terms of an optimization problem. Where we need to minimize and maximize our subfunction subject to constraints.

- What kind of constraints? Two common criteria for graphs are inclusion and exclusion conditions.
**Inclusion criteria**may specify which nodes we have to include in the path.**Exclusion criteria**specify which nodes and edges should be excluded from the path.- In addition, one specifies a
**preference****criteria**that act as a softer or less strict constraint.

- The
**least weight path problem**is an important problem to solve for the benefit of the commuters. A widely applied algorithm that is applied to shortest path problems is called**Dijkstra’s algorithm**. Originally, Dijkstra considered a variant of the problem where the task is to find the shortest path from a single source node to all other nodes.

## Connectivity Analytics

Another fundamental property of graphs is called Connectivity. Two important kinds of graph analytic questions that are based on connectivity:

**Robustness of the network:**Suppose we have a computer network with many servers and a hacker is trying to bring the system down, you set a small set of servers that the hacker can target to disrupt the network. Disrupt the network means to disconnect a part of the network from the rest.- A similar problem may occur in the power distribution network. In that case, an attacker may be able to attack one or two central components so that large portions of

the network loses power. - Many biological networks have built-in redundancy, so that even if you disrupt one important gene, there are other genes in the network which will support the biological

function, possibly through other routes. - Therefore, a network is robust, if removing one or more edges or nodes still keeps it connected.

- A similar problem may occur in the power distribution network. In that case, an attacker may be able to attack one or two central components so that large portions of
**Network comparison in terms of their overall connectivity:**Suppose there are two graphs which are very different in their structure. What are some parameters by

which we can compare them?**The concept of connectivity:**A graph is connected if we can reach any node from any other node. Even though the graph is not connected, we can identify parts of the graph that are themselves connected. These islands of connected parts are called**components**or**connected components of a graph**.- For directed graphs, we need to be a little more specific about connectivity. The directed graph is
**strongly connected**, if we follow the direction of the edges and still reach every node from every other node. A weaker form of connectivity is if we do not care about the direction of the arrows and can reach every node from every other node.

### Big Graph challenge:

- Given an arbitrarily large graph, can I find the connected components of the graph efficiently?
- Given an arbitrarily large graph, can we find its sub-graphs that are strongly connected?

### Disconnecting a Graph

There are two ways to break a graph. Identifying the smallest node set which if removed will disconnect the graph.

**Network Robustness:**If node B is reachable from node A originally, it should remain reachable, even if the network “attacked”.- Higher degree nodes make the network more vulnerable. An attacker would target such nodes.

### Connectedness: Indegree and Outdegree

The **degree** of the node is equal to the number of edges connected to it. This specifies if a node is more connected than another. Indegree and outdegree are the counts of the incident and the outgoing edges of a node respectively.

The degree versus count table is a degree histogram of the graph. We can compare two graphs by computing the vector distance between them. One simplistic measure is

just the Euclidean distance. The degree histogram based on comparisons of the histograms help find the graphs if they are similar. We can also compute histograms of just the indegree or just the outdegree of the graph.

How to assess the similarity of two histograms? Link

## Community Analytics

How to identify and track groups of interacting entities in a network?

Communities are highly interacting clusters in a graph. That is, they form pockets of denser subgraphs that are more connected to each other, than to members of the other clusters. Communities of humans, or otherwise, are interesting things to study because it gives us an insight into the interaction patterns. And how they change with time. Here, are some analytics questions about communities based on three categories:

**Static Analysis:**Analytics questions that do not depend on time, are called**static**.- What are the communities at time T?
- Who belong to a community?
- How tight-knit members are connected, and so forth.

**Temporal / Evolution Analysis:**This involves the formation and evolution of the community. Communities can form temporal, for example, around an event like a school shooting. Or some communities, despite the comings and goings of members, sustain themselves well. A Facebook group, a political party, fans of a music band, are likely to continue over time and hence are non-transient. One can also be interested in the history of the formation of a community, like a criminal network.- How did this community form?
- Which communities are stable?
- Find strong transient communities – why did they form or dissolve?

**Predictive Analysis:**Analysts would like to predict- Is this community likely to grow?
- Will these nodes continue as a community in the future?
- Whether it’s composition of members might change.
- Or whether there are emerging power shifts within the community.

### Detecting a Community based on local property

How to identify communities in a large network? We need to realize, the fact that there are more connections within the community and fewer connections between two communities.

C – connected subgraph of graph G

We can compute

- The internal and external degree of a vertex
- Internal – within C
- External – outside C

- The internal and external degree of the cluster C
- Sum of internal / external degrees of the vertices of C

- for C to be community
- δint should be high and δext should be low

We can sum up the internal degrees of all nodes in a cluster. And call it the **internal degree of the whole cluster**. And similarly, sum the external degrees of the nodes, to compute the **external degree of the cluster**.

*Intra-cluster density δint* is the ratio of # of internal edges of the cluster divided by the number of possible connections in the cluster(# number of pairwise combination of nodes within the cluster).

Similarly,* inter-cluster density, δext* is the # of inter-cluster edges divided by

the possible pairings between nc, a node in the cluster, to n- nC, the # of nodes outside the cluster.

There are two kinds of methods used for finding communities in the network.

- One of them focuses on local properties, that is, properties for which one only looks at a node and its neighbor. For an ideal community in a network is a subgraph, where every node is connected to every other node in the subgraph. Such a structure is called a
**clique**. To find a perfect community structure as a clique, one can try to find the largest clique within a graph. That is a computationally challenging problem. It’s much simpler to find cliques if we know the value of k. **Near cliques:**In the real world, perfect cliques larger than three or four are harder to find. So there are two types of relaxations, those based on distance, and those based on density. Two distance based definitions are**n-clique**and n-**clan**.**n-clique**is a subgraph, such that the distance between each node pair in that subgraph is n or less.**n-clan**, where, to belong to the n-clan, the shortest path between any members, without involving outsiders, is n or less. n-clique and n-clan are distance based measures for finding cohesive groups of communities.

- K-core is a density-based method for finding communities: Maximal subgraph in which each vertex is adjacent to at least
*k*other vertices of the graph*.*

### Detecting a Community based on global property – Modularity

Modularity is a global measure of cluster quality. It tries to estimate the quality of clusters of communities in the graph.

- The fraction of edges within the given groups minus the expected such fraction if edges were distributed at random.
- the value of the modularity lies in the range [1/2, 1)

The modularity measure thus estimates the quality of the clusters in the graph by evaluating this difference of the actual minus the random edge fraction. So this is the mathematical formulation,

let’s find clusters in the graph so that Q is maximum and then we’re done. Maximizing Q is very hard. So we need to find an approximate solution.

Following YouTube video illustrate a very popular method of finding this modularity based community detection.

There are six large categories of evolution steps that can happen within a community.

- Growth
- Merging
- Birth
- Contraction
- Splitting
- Death

## Centrality Analytics

Are all nodes equally important in the network? What makes some nodes more important/valuable than others?

- Influencers in the social network
- A junction station in a transport network
- A house-keeping gene in a biological network
- A central server in a computer network

### Key player problems

- Given a social network find a set of
*k*nodes- which, if removed, would maximally disrupt communication among the remaining nodes
- Given an infectious disease in a city which subpopulation should be immunized so as to maximally hinder the spread the infection.

- That is maximally connected to all the nodes
- Given a community of 1000 members find members find 5 members who would be able to convince others to vote for a candidate

- which, if removed, would maximally disrupt communication among the remaining nodes

### Centrality and Centralization

- Centrality
- The measure of the importance of a node (or edge) based on its position in the network.
- Different ways to measure centrality

- Centralization
- The measure of a network, not a node.
- The degree of variation in the centrality scores among the nodes.

**Notes:** *Centrality* is for a given node. Essentially the more connected the node, the more central. *Centralization* is for a network. As the number of nodes in a network with high centrality increases, then the centralization of a network decreases – because there is less variation in the centrality values of the nodes in the network.

We can compute the network centralization by considering the sum of the difference between the maximum centrality and the centrality of the node divided by the maximum centrality.

### Different measures of centrality:

- Degree centrality
- Closeness centrality
- Betweenness centrality
- Eigenvector centrality
- Katz centrality
- … and many more …

### Degree Centrality

- Count the # of edges incident upon a given node normalized by the possible number of edges (# of edges) / (N – 1)
- “hub” (maximally-connected) nodes

### Group Degree Centrality

- Consider group as a single entity
- Count of the number of edges incident upon the group normalized by non-group members
- # of edges in the group / # of non-group members

### Closeness Centrality

- Sum of shortest-path distances from all other nodes (normalized)
- Low raw closeness means node has short distance from other nodes

- For an information flow network
- Low closeness nodes receive information sooner than other nodes
- Same for other
*if the flow happens through shortest paths* - How about gossip?

- Same for other
- A low closeness can influence many others, directly or indirectly.

- Low closeness nodes receive information sooner than other nodes

### Betweenness Centrality

- Ratio of pairwise shortest paths that flows through node
*i*and count of all shortest paths in the graph - Measures fraction of shortest-path commodity flow passing through a node
- Not applicable to non-flow situations e.g. rumor, infection.
- Not applicable when shortest path routes are not taken.