Sunday, March 18, 2012

The real problem with search engine optimization... or the web?

This evening, I read a news that Google is Planning to Penalize Overly Optimized Sites, and I found it through Slashdot. What really interests me is that the way I found the news is an illustration of a search engine optimization problem itself. It turns out the real "beef" of the news behind the links is 5 degrees to the secondary source and 2 more degrees to the primary source. Each time the story is linked, another person either adds a little bit more insight or just summarizes the linked successor slightly differently without much value added.

Here is the link structure of the story.

The distinction between primary source and secondary source is made that secondary source creates new insight from the primary source, but the primary source is just a recording of facts without any insight. In this case, the primary source included an audio recording of a panel discussion, and the secondary source highlighted pieces of it with its own interpretation.

Note that CNET also posts the link to the primary source on sxsw.com, but without context that it is the primary source.

I think this particular example of the search engine optimization story illustrates the problem very well. In primary school, you were taught that every reference is either a primary source or a secondary source, but the reality is always more complicated. Some secondary sources are primary sources in some aspect, and tertiary sources can still add value (or you can argue that the third category is also a secondary source). The web makes it much more easier to link, and now people also have financial incentive---the more they get cited, the more ads they can show, and that gives them a profit. They can do that without generating any original content, or they can add a little value by adding their own interpretation. The ads revenue they generate, on the other hand, depends on how much effort they put in promoting their post, so that other people will want to link to them.

Let's put the financial motive aside and assume that people will always link without ads revenue. If a person can get away showing tons of ads but actually adds tons of value to some topic, then why penalize him?

The issue is that getting to the "beef" of the story is still a graph search problem. The search engine (e.g. Google Bot) supposedly has all the link graph information, but it does not understand the distinction between primary and secondary sources, and it's up to the reader to investigate by spending time and effort. With the amount of information exploding each day, we really need a search engine that will do better. If it's too much to ask a search engine to understand the difference between primary and secondary sources, I think it is at least plausible to have a tool to highlight the contribution of each page with regard to a particular aspect or topic---like a "diff" tool with fuzzy syntactic matching.

Friday, March 9, 2012

The Tower Hotel Problem

Dynamic storage allocation problem, both as an online problem and an offline problem, is widely studied in literature, such as this article by Jordan Gergov. Here I present a similar problem which I call "The Tower Hotel Problem."

The Tower Hotel is peculiar. It has an infinite number of floors, but each floor has the same fixed number of rooms. The proprietor wanted to keep the number of floors open to the public minimal in order to save costs on maintenance; a closed floor incurs no cost. A floor is closed if none of the rooms are occupied. But as long as there is at least one occupied room, the floor must be open to public. Once a guest arrives and settles in a room, the proprietor must not ask the guest to move to another room.

In a zealous attempt to minimize cost, the proprietor requires all guests to make reservations before the year begins. The guest would indicate upon reservation the arrival and departure dates. Once the year begins, no more reservation can be made. The proprietor then optimizes the placement of guests on floors. This eccentric policy may be one reason the hotel is not very well known.

A number of years passed, and the proprietor passed away. His son knows nothing about hotel management, so the son interviewed three potential managers. In order to make the hotel more popular, the son required that the managers must not assume the reservations are made in advance. In fact, he went a step further and said that a guest could arrive at any time and leave at any time.

Each manager had a different strategy. The first manager would place the guest at random on any of the opened floors with vacancy (and open up a new floor if none of the existing floors are available). The second manager would place the guest on the floor with the most number of occupied rooms (but still has vacancy). The third manager would place the guest on the lowest numbered floor with vacancy. The third manager has an assistant who, due to her desire to be promoted, secretly proposes to the proprietor's son a slightly different strategy than her manager: she would locate the lowest numbered floors with vacancy, fill it entirely first, then find again the lowest numbered floors with vacancy, which may be lower than the previously chosen floor because some guests below may have checked out.

How does the proprietor's son determine which manager has the most cost-effective strategy? How do these managers and the third manager's assistant compare with what the proprietor could do, when the proprietor had full knowledge of all the reservations in the whole year?

This problem describes segregate fits dynamic storage allocation with virtual memory where all objects are of the same size, and a memory page could fit a fixed number of these objects. A virtual memory page has to be backed by a physical memory page if there is at least one object in the page. Once all objects in the page are freed, the physical memory page backing can be removed. This presents a different sort of fragmentation problem where certain long-lived objects can pin a memory page even though the page is mostly unused.