Breaking out of a social network cul-de-sac
In this somewhat technical adventure, a small crawler traversing a big social network dataset gets stuck in a dead end and needs to find a way to escape
This article is about how my logic for traversing the relationships in a multi-million-record dataset of hypothetical social network profiles got stuck in what I realized was a social network cul-de-sac, and how I escaped it with a simple change to how the next profile was selected for traversal.
I recently came across this phenomenon while parsing a public dataset of more than 2 million hypothetical social network profiles and filtering for specific criteria: my crawl ran out of new, unique records after initially finding only about a thousand matches.
This prompted me to ask:
TL;DR: I responded later in the thread with an initial stab at a definition that I think still holds up:
Who cares?
At first glance, this discovery and its solution may seem relevant only to web crawler developers, but it opened up several new threads about the deeper implications of social network cul-de-sacs.
Specifically: how people in a social network can serve as a “bridge node” connecting otherwise insular groups or communities, how these insular groups can emerge in the first place, and how we might quantify the potential economic value of those “bridge nodes.”
Admittedly, these implications are fun to tease out, but they are best left to a future follow-on write-up.
What happened?
This unexpected event is probably mundane to professional data scientists, engineers, and even intern analysts, but un(?)fortunately I am none of those things, so it was surprising to me:
After traversing a graph of “people” entities starting from an initial 20 “people” records, filtering on my arbitrary criteria, and accumulating a new list of around 1,000 records, my script ran out of new unique records to add to my list.
I thought this was odd because, as I mentioned, the dataset contains around 2 million items, and the values I was filtering on were common enough that I could imagine running out of new unique entities at the 500,000 or even 250,000 item mark. But running out of new unique entities after only 1,000 items? How could this be?
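For concreteness, here is a minimal sketch of the kind of crawl I was running. The tiny dataset, field names, and filter below are invented for illustration, and this sketch uses a standard breadth-first frontier; as we'll see, my actual script selected the next profile differently, which turned out to matter:

```python
from collections import deque

# Hypothetical stand-in for the real dataset: profile id -> attributes + followings.
profiles = {
    1: {"interest": "gardening", "follows": [2, 3]},
    2: {"interest": "chess",     "follows": [1, 4]},
    3: {"interest": "gardening", "follows": [1]},
    4: {"interest": "gardening", "follows": [2, 5]},
    5: {"interest": "chess",     "follows": []},
}

def crawl(seeds, matches):
    """Traverse follow relationships from seed profiles, accumulating matches."""
    seen = set(seeds)
    frontier = deque(seeds)
    found = []
    while frontier:
        pid = frontier.popleft()
        if matches(profiles[pid]):
            found.append(pid)  # record passes the arbitrary filter criteria
        for followed in profiles[pid]["follows"]:
            if followed not in seen:
                seen.add(followed)
                frontier.append(followed)
    return found

gardeners = crawl([1], lambda p: p["interest"] == "gardening")
```

In this toy form, the crawl visits every reachable profile and keeps the three that match the filter.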
If we visualize the dataset as a simple network graph, each node (fancy data science talk for each circle on the graph) maps to a social network profile, and the connectors are relationships, or in data science talk, “edges.”
Visualizing this made me realize that my script logic was always selecting the most recently added profile in the new list for traversal. In doing so, I was reducing my chances of finding profiles not already in my list: the profiles in that corner of the graph were all following each other, and after looping through that particular group enough times, there were no profiles “external” to the group’s following left to find.
Effectively, my script logic had inadvertently fallen into a social network cul-de-sac, endlessly looping through records whose related entities already existed in my items list.
I visualize it like this:
This simplified graph visualizes how some nodes in the graph are “bridge nodes” connecting otherwise insular groups of nodes; note how the red loop becomes a cul-de-sac, signifying that no new connections are found
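To make the failure mode concrete, here is a toy reproduction. The tiny “follows” graph and names are invented, but the selection rule, always traversing from the most recently added profile, mirrors what my script was doing:

```python
# Invented toy follow graph: alice, bob, and carol all follow each other
# (an insular group), while dave bridges to the rest of the network.
follows = {
    "seed":  ["dave", "alice"],
    "alice": ["bob", "carol"],
    "bob":   ["alice", "carol"],
    "carol": ["alice", "bob"],
    "dave":  ["erin"],
    "erin":  ["frank"],
    "frank": [],
}

def crawl_most_recent(seeds, steps=50):
    """Buggy logic: always traverse from the most recently added profile."""
    found = list(seeds)
    for _ in range(steps):
        current = found[-1]  # always the newest item in the list
        for neighbor in follows[current]:
            if neighbor not in found:
                found.append(neighbor)
    return found

# Once the newest item sits inside the insular group, nothing new is ever
# added: found[-1] never changes again, and the crawl spins in place.
print(crawl_most_recent(["seed"]))
# -> ['seed', 'dave', 'alice', 'bob', 'carol'] (erin and frank never reached)
```

Note that erin and frank are perfectly reachable through the bridge node dave; the crawl just never gets back to him.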
My (primitive) solution
Recall that the business problem driving the creation of this logic in the first place was to identify all items in the dataset with specific attributes and add them to my new list.
I’m not familiar with any of the fancier graph tools, but I found that if I took the seven most recently added items and randomly selected one of those seven to traverse from next, I was far more likely to keep finding new items that met my criteria.1
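A sketch of the fix, under the same assumptions as before (the toy graph and names are invented; the selection rule is the only change): instead of always taking the newest item, pick randomly among the seven most recently added ones.

```python
import random

# Same invented toy graph: alice/bob/carol are insular, dave is the bridge.
follows = {
    "seed":  ["dave", "alice"],
    "alice": ["bob", "carol"],
    "bob":   ["alice", "carol"],
    "carol": ["alice", "bob"],
    "dave":  ["erin"],
    "erin":  ["frank"],
    "frank": [],
}

def crawl_random_recent(seeds, steps=1000, window=7):
    """Fixed logic: traverse from a random pick among the 7 newest items."""
    found = list(seeds)
    for _ in range(steps):
        current = random.choice(found[-window:])  # any of the last 7 items
        for neighbor in follows[current]:
            if neighbor not in found:
                found.append(neighbor)
    return found

# The bridge node "dave" stays inside the 7-item window here, so with enough
# steps the crawl escapes the insular group and reaches erin and frank.
print(sorted(crawl_random_recent(["seed"])))
```

This is a heuristic rather than a guarantee: if all seven newest items ever sat inside a fully explored group, the crawl could still stall. In practice, though, randomizing the pick was enough to keep it moving.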
This solution generalizes to any code or no-code piece of logic, and can be visualized as a process flowchart like this:
So what did we learn? Well, if nothing else, we got some additional supporting evidence that abstract problems can be solved fairly quickly if visualized, and that most problems can be hacked through without an especially fancy approach.
Thanks to & Inspired by
Thanks to Seyi, Katherine, Sean, and Olena for reviewing my initial draft and for their helpful feedback
Inspired by previous network thinkers, including the original three network doers and thinkers, Sarnoff, Metcalfe, and Reed, and this dandy summary of their discoveries by NƒX in their Network Effects bible
In fact, I have not encountered this issue again since implementing this solution.