<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>The Virtuosi (Posts about search engines)</title><link>https://thephysicsvirtuosi.com/</link><description></description><atom:link href="https://thephysicsvirtuosi.com/categories/search-engines.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2019 &lt;a href="mailto:thephysicsvirtuosi@gmail.com"&gt;The Virtuosi&lt;/a&gt; </copyright><lastBuildDate>Thu, 24 Jan 2019 15:05:00 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>The Skeleton Supporting Search Engine Ranking Systems</title><link>https://thephysicsvirtuosi.com/posts/old/the-skeleton-supporting-search-engine-ranking-systems/</link><dc:creator>DTC</dc:creator><description>&lt;div&gt;&lt;p&gt;
&lt;img src="https://thephysicsvirtuosi.com/images/skeleton-search/Skeleton_image_1.jpg" width="100%" alt="octopus google" title="Octosearch!" style="float:center"&gt;
&lt;/p&gt;

&lt;p&gt;A lot of the research I’m interested in relates to networks – measuring
the properties of networks and figuring out what those properties mean.
While doing some background reading, I stumbled upon some discussion of
the algorithm that search engines use to rank search results. The
automatic ranking of the results that come up when you search for
something online is a great example of how understanding networks (in
this case, the World Wide Web) can be used to turn a very complicated
problem into something simple.&lt;/p&gt;
&lt;p&gt;Ranking search results relies on the assumption that there is some
underlying pattern to how information is organized on the WWW- there are
a few core websites containing the bulk of the sought-after information
surrounded by a group of peripheral websites that reference the core.
Recognizing that the WWW is a network representation of how information
is organized and using the properties of the network to detect where
that information is centered are the key components to figuring out what
websites belong at the top of the search page.&lt;/p&gt;
&lt;p&gt;Suppose you look something up on Google (looking for YouTube videos of
your favorite band, &lt;a href="http://thephysicsvirtuosi.com/author/corky.html"&gt;looking for edifying science
writing&lt;/a&gt;, tips on octopus pet care,
etc): the search service returns a whole spate of results. Usually, the
pages that Google recommends first end up being the most useful. How on
earth does the search engine get it right?&lt;/p&gt;
&lt;p&gt;First I’ll tell you exactly how Google does &lt;em&gt;not&lt;/em&gt; work. When you type in
something into the search bar and hit enter, a message is &lt;em&gt;not&lt;/em&gt; sent to
a guy who works for Google about your query. That guy does &lt;em&gt;not&lt;/em&gt; then
look up all of the websites matching your search, does not visit each
website to figure out which ones are most relevant to you, and does
&lt;em&gt;not&lt;/em&gt; rank the pages accordingly before sending a ranked list back to
you. That would be a very silly way to make a search engine work! It
relies on an individual human ranking the search results by hand with
each search that’s made. Maybe we can get around having to hire
thousands of people by finding a clever way to automate this process.&lt;/p&gt;
&lt;p&gt;So here’s how a search engine &lt;em&gt;does&lt;/em&gt; work. Search engines use robots
that crawl around the World Wide Web (sometimes these robots are
referred to as “spiders”) finding websites, cataloguing key words that
appear on those webpages, and keeping track of all the other sites that
link into or away from them. The search engine then stores all of these
websites and lists of their keywords and neighbors in a big database.&lt;/p&gt;
&lt;p&gt;Knowing which websites contain which keywords allows a search engine to
return a list of websites matching a particular search. But simply
knowing which websites contain which keywords is not enough to know how
to order the websites according to their relevance or importance.
Suppose I type “octopus pet care” into Google. The search yields 413,000
results- far too many for me to comb through at random looking for the
web pages that best describe what I’m interested in.&lt;/p&gt;
&lt;p&gt;Knowing the ways that different websites connect to one another through
hyperlinks is the key to how search engine rankings work. Thinking of a
collection of websites as an ordinary list doesn’t say anything about
how those websites relate to one another. It is more useful to think of
the collection of websites as a network, where each website is a node
and each hyperlink between two pages is a directed edge in the network.
In a way, these networks are maps that can show us how to get from one
website to another by clicking through links.&lt;/p&gt;
&lt;p&gt;Here is an example of what a network visualization of a website map of a
large portion of the WWW looks like. (Original full-size image
&lt;a href="http://upload.wikimedia.org/wikipedia/commons/d/d2/Internet_map_1024.jpg"&gt;here&lt;/a&gt;&lt;a href="http://www.blogger.com/"&gt;&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;
&lt;img src="https://thephysicsvirtuosi.com/images/skeleton-search/Internet_map_1024.jpg" width="100%" alt="internet map" style="float:center"&gt;
&lt;/p&gt;

&lt;p&gt;Here is a site map for a group of websites that connect to the main page
of English Wikipedia. (Original image from
&lt;a href="http://en.wikipedia.org/wiki/Site_map"&gt;here&lt;/a&gt;.) This smaller site map is
closer to the type of site map used when making a search using a search
engine.&lt;/p&gt;
&lt;p&gt;
&lt;img src="https://thephysicsvirtuosi.com/images/skeleton-search/Main_Page_Usability.png" width="100%" alt="internet map" style="float:center"&gt;
&lt;/p&gt;

&lt;p&gt;So, how does knowing the underlying network of the search results help
one to find the best website on octopus care (or any other topic)? The
search engine assumes that behind the seemingly random, hodgepodge
collection of files on the WWW, there is some organization in the way
they connect to one another. Specifically, the search engine assumes
that finding the websites most central to the network of search results
is the same as finding the search results with the best information.
Think of a well-known, trusted source of information, like the New York
Times. The NY Times website will have many other websites referencing it
by linking to it. In addition, the NY Times website, being a trusted
news source, is likely to refer to the best references for other sources
that it wants to refer to, such as Reuters. High-quality references will
also probably have many incoming links from websites that cite them. So
not only does a website like the NY Times sit at the center of many
other websites that link to it, but it also frequently connects to other
websites that themselves are at the center of many other websites. It is
these most central websites that are probably the best ones to look at
when searching for information.&lt;/p&gt;
&lt;p&gt;When I search for “octopus pet care” using Google I am necessarily
assuming that the search results are organized according to this
core-periphery structure, with a group of important core websites
central to the network surrounded by many less important peripheral
websites that link to the core nodes. The core websites may also connect
to one another. There may also be websites disconnected from the rest,
but these will probably be less important to the search simply because
of the disconnection. Armed with the knowledge of the connections
between the different relevant websites and the core-periphery network
structure assumption, we may now actually find which of the websites are
most central to the network (in the core), and therefore determine which
websites to rank highly.&lt;/p&gt;
&lt;p&gt;&lt;a id="note1"&gt;&lt;/a&gt;
Let’s begin by assigning a quantitative “centrality” score to each of
the nodes (websites) in the network, initially guessing that all of the
search results are equally important. (This, of course, is probably not
true. It’s just an initial guess.) Each node then transfers all of its 
centrality score to its neighbors, dividing it evenly between 
them&lt;a href="https://thephysicsvirtuosi.com/posts/old/the-skeleton-supporting-search-engine-ranking-systems/#fnote1"&gt;&lt;sup&gt;[1]&lt;/sup&gt;&lt;/a&gt;.
(Starting with a centrality score of 1 with three neighbors, each of
those neighbors receives 1/3.) Each node also receives a some centrality
from each neighbor that links in to it. Following this first step, we
find that nodes with many incoming edges will have higher centrality
than nodes with few incoming edges. We can repeat this process of
dividing and transferring centrality again. Nodes with many incoming
links will have more centrality to share with their neighbors, and nodes
with many incoming links will themselves also receive more centrality.&lt;/p&gt;
&lt;p&gt;After repeating this process many times, we begin to see a difference
between which nodes have the highest centrality scores: nodes with high
centrality are the ones that have many incoming links, or have links to
other central nodes, or both. This algorithm therefore differentiates
between the periphery and the core of the network. Core nodes receive
lots of centrality because they link to one another and because they
have lots of incoming links from the periphery. Peripheral nodes have
fewer incoming links and so receive less centrality than the nodes in
the core. Knowing the centrality scores of search results on the WWW
makes it pretty straightforward for us to quantitatively rank which of
those websites belongs at the top of the list.&lt;/p&gt;
&lt;p&gt;Of course, there are more complex ways that one can add to and improve
this procedure. Google’s algorithm PageRank (named for founder Larry
Page, not because it is used to rank web pages) and the HITS algorithm
developed at Cornell are two examples of more advanced ways of ranking
search engine results. We can go even further: a search engine can keep
track of the links that users follow whenever a particular search is
made. (This is almost the same as the company hiring someone to order
sought-after web pages automatically whenever a search is made, except
all the company lets the user do it for free.) Over time, search engines
can improve their methods for helping us find what we need by learning
directly from the way users themselves prioritize which search results
they pursue. Still, these different search engine ranking systems may
operate using slightly different methods, but all of them depend on
understanding the list of search results within the context of a
network.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a id="fnote1"&gt;&lt;/a&gt;
1. &lt;a href="https://thephysicsvirtuosi.com/posts/old/the-skeleton-supporting-search-engine-ranking-systems/#note1"&gt;^&lt;/a&gt; It's not always all - there are other variations where nodes only
transfer a fraction of their centrality score at each step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources (and further reading)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I wanted to include no mathematics in
this post simply because I cannot explain the mathematics behind these
algorithms and their convergence properties better than my sources can.
For those of you who want to see the mathematical side of the argument
for yourselves (which involves treating the network adjacency matrix as
a Markov process and finding its nontrivial steady state eigenvector),
do consult the following two textbooks:&lt;/p&gt;
&lt;p&gt;Easley, David, and Jon Kleinberg. &lt;em&gt;Networks, Crowds, and Markets:
Reasoning about a Highly Connected World&lt;/em&gt;. Cambridge University Press,
2010 (&lt;a href="http://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch14.pdf"&gt;Chapter 14&lt;/a&gt;
in particular) &lt;/p&gt;
&lt;p&gt;Newman, Mark. &lt;em&gt;Networks: an Introduction&lt;/em&gt;. Oxford
University Press, 2010 (Chapter 7 in particular)&lt;/p&gt;
&lt;p&gt;A popular book on the early development of network science that contains
a lot of information on the structure of the WWW:&lt;/p&gt;
&lt;p&gt;Barabasi, Albert-Laszlo. &lt;em&gt;Linked: How Everything is Connected to
Everything Else and What It Means&lt;/em&gt;. Plume, 2003.&lt;/p&gt;
&lt;p&gt;A book on the history of modern computing that contains an interesting
passage on how search engines learn adaptively from their users (that
deserves a shout-out in this blog post).&lt;/p&gt;
&lt;p&gt;Dyson, George. &lt;em&gt;Turing's Cathedral&lt;/em&gt;. Pantheon, 2012.&lt;/p&gt;&lt;/div&gt;</description><category>centrality measures</category><category>networks</category><category>octopodes</category><category>search engine ranking</category><category>search engines</category><guid>https://thephysicsvirtuosi.com/posts/old/the-skeleton-supporting-search-engine-ranking-systems/</guid><pubDate>Tue, 01 Jan 2013 19:09:00 GMT</pubDate></item></channel></rss>