Tuesday, November 27, 2007

MPlayer with libdvdnav2

At some point, MPlayer forked their own libdvdnav and many other libraries. MPlayer continues to compile with many stock libraries, but starting about two weeks ago, MPlayer no longer compiles with stock libdvdnav. It was related to this svn commit in April, and nicodvb finally decided it's okay to assume MPlayer is using its own fork. I don't know if the plan is to eventually link libdvdnav2 directly into their build tree, like what they do with libdvdcss and libav{codec,format,util}, but this is where to find a libdvdnav2 that works:
svn co svn://svn.mplayerhq.hu/dvdnav/trunk/libdvdnav2
For my own information, this is where to find the MPlayer build tree:
svn co svn://svn.mplayerhq.hu/mplayer/trunk mplayer
Make and install libdvdnav2, then make clean and reconfigure mplayer. It should work.

Update: it seems that they now prefer svn co svn://svn.mplayerhq.hu/dvdnav/trunk/libdvdnav (not libdvdnav2) once again.

Friday, October 19, 2007

Observation on a Comcast network

I moved to a new apartment a month ago, and Comcast didn't disconnect the previous resident, so my roommates and I had been using Comcast for free. This is what I found out about their network.

This is pulled from my memory because they disconnected the cable on Monday.
  1. When I uploaded a file, I could send the first few megabytes at 1 MB/s; then it got throttled down to 30 KB/s or less. That means bandwidth throttling does not kick in for the first few seconds of a connection. Afterwards, data is let through in short bursts. I could, however, fire up multiple TCP connections and maintain 30 KB/s across all of them. The throttling appears to be connection-oriented.
  2. Even low-bandwidth, long-lasting TCP connections, such as Skype and SSH, get randomly cut off. There is no "connection reset"; the connection simply stalls. Sometimes that happens every 10 minutes; sometimes it stays up for 10 hours. The median is around 2 hours.
It is difficult to throttle bandwidth on a cable network. They can't let the individual cable modems govern themselves, because users can run their own modems and work around the bandwidth policy. The next possibility is at the gateway. That's difficult too, because all modems in the neighborhood share the same "ether" before reaching the gateway, and doing bandwidth accounting for each IP address would be expensive for the gateway router. That's why Comcast's bandwidth throttling is connection-based.

What I observed is peculiar to how Comcast culls p2p traffic on their network: by making long-lasting TCP connections unreliable. However, it's interesting that Comcast allows the first few outgoing megabytes at full speed. Since SMTP connections are short, this lets spammer botnets go off at full speed.

In other words, Comcast favors spammers over p2p file sharing, Skype, and SSH.

There is a workaround. File-sharing protocols should be redesigned to dynamically create and tear down multiple simultaneous TCP connections, and to distribute traffic across those connections.

Maybe Comcast will eventually move to discourage short, high-bandwidth TCP connections as well, but that would hit spammers, and we would end up with less spam. It's a win-win for the end user either way.

Haskell, not your usual programming language

I am learning Haskell, and these are the points that summarize my initial shock with the language. I'm proficient in other strict, impure functional languages.
  1. Haskell is truly lazy. If there were no need to print anything to the terminal, the computation simply would not happen. In other words, computation is bootstrapped by a side effect. This explains why the main expression in Haskell has type IO (): a program won't run unless it is a monadic action.
  2. An expression with a monadic type simply means it uses call-by-value semantics rather than Haskell's usual call-by-need. It's analogous to the lazy keyword in O'Caml, which provides call-by-need evaluation in a call-by-value language. It might be more comfortable for someone like me to start by writing in the monadic sublanguage of Haskell.
  3. There is no signaling mechanism in Haskell, but you can time out a computation by running it on a separate thread, waiting for the result or for the timeout on the current thread, and killing the computation thread if it runs out of time. This is how the System.Timeout module implements it.
  4. Naturally, threads are also monadic expressions. Even as a monad, a thread won't run if it doesn't cause any side effects. As a result, System.Timeout only works with expressions of a monadic type; a computation cannot be forced simply by lifting the expression into a monad with the return function.
  5. The Haskell concurrency implementation only performs context switches when the program allocates from the heap. As a result, the following expressions do not time out (they run forever):
    • length (repeat 1)
    • let xs = 1 : xs in length xs
    But the following do:
    • length [1..]
    • let xs = make_xs (); make_xs () = 1 : make_xs () in length xs
    Note that length xs does not terminate when xs is infinite. For brevity, the results of all the expressions above are printed using print, which is the easiest way to force evaluation.

Wednesday, September 19, 2007

Some Ubuntu notes

These are some of my own customizations that differ from the default Ubuntu installation.
  • For Dell machines with integrated Intel graphics, install xserver-xorg-video-intel and use the "intel" driver rather than the "i810" one. This should make 915resolution obsolete.

    915resolution is used to patch the video BIOS for the "i810" driver so that additional video modes (such as 1680x1050) are supported for widescreen LCD monitors (mine is an Acer X221W 22", used with an Intel 915G chipset). It turns out that the "i810" driver ignores the ModeLine settings in xorg.conf and instead uses the BIOS to switch video modes. 915resolution is no longer needed with the "intel" driver.

    An additional xrandr invocation might be needed to get the timings straight. Even when the screen resolution is set at 1680x1050, there might be a positioning problem where the display is chopped off on some sides and has blank bars on others. I find that setting it to 1600x1200 first, then switching to 1680x1050, fixes the problem.
  • alpine, ssmtp: use alpine instead of pine, and ssmtp in place of sendmail, postfix, or exim4. You probably want to symlink /usr/bin/alpine to /usr/local/bin/pine.

    Whether using pine or alpine, modify /etc/pine.conf by setting sendmail-path to "/usr/sbin/ssmtp -t".
  • Add medibuntu to the repositories for additional non-free packages. This makes the additional packages libdvdcss, skype, ... available.
Pending issues:
  • Empirical regression of NFS performance compared to BU Linux. I wonder why?
  • The static Skype build no longer works. It gets stuck at the end-user license agreement dialog, and then Skype abruptly quits.

Friday, April 20, 2007

Comparing Javascript Inheritance in Firefox, IE7 and Opera

When I started writing Javascript programs just a few months ago, I was excited by how you could augment an existing class by modifying its prototype. For example, you could add a functional-style map (which is what Google's map-reduce is based on) to Array by modifying its prototype as follows:
Array.prototype.map = function (f)
{
    var len = this.length;
    var arr = new Array(len);
    for (var i = 0; i < len; i++)
        arr[i] = f(this[i]);
    return arr;
};
Now, for any array, you could call the map method, passing in a mapping function as an argument, and the result would be a new array where each element is transformed from the original array by the mapping function. For example:
var xa = [1, 2, 3, 4];
var ya = xa.map( function(n) { return n * n; } );
would result in ya = [1, 4, 9, 16].

I also became fascinated by a technique for dynamically making an existing object inherit from another object by copying the other object's properties over. For example, I don't need to hard-code event handlers in HTML. I can wrap the event handlers in a constructor function along with all local, instance-specific state, look up the HTML element whose behavior I want to override, and inject my event handlers into it. I can also reuse that event-handling class for many elements in the same document without worrying about polluting global variables. This achieves good abstraction and code reuse.

Since all objects inherit from Object, I could write a function Object.prototype.inherit and make that sort of dynamic inheritance available to all objects. Right?

Not so fast. It turns out that in Internet Explorer 6 and 7, not all objects are instances of Object. The most notable ones are document, window, any value representing a DOM node or a node list, and any instance of ActiveXObject. As a result, they do not inherit changes made to Object.prototype. Furthermore, these objects don't even have a "constructor" property, which every other object has.

Are there any exceptions like that in Firefox? There is one. If you obtain a DOM node list, for example by document.getElementsByTagName("A"), you get an object whose constructor is HTMLCollection. Firefox admits such an object to be an instance of the HTMLCollection class, but the object doesn't inherit changes made to HTMLCollection.prototype (which is actually read-only). It does inherit changes to Object.prototype, however.

Such inconsistencies are probably due to the fact that these objects are implemented natively (in C) and wrapped for Javascript.

Some browsers are more consistent about native objects than others. All classes (constructor functions) are instances of Function because that is the way Javascript is designed. In Firefox, IE, and Opera, all primitive classes such as Number, String, RegExp, and even Function itself are instances of Function. The situation is different for DOM classes.
  • Document object is an instance of HTMLDocument (Firefox, Opera), and HTMLDocument is an instance of Function (Opera only).
  • A DOM node list object returned by document.getElementsByTagName is an instance of NodeList (Opera, this is the correct W3C DOM behavior) or HTMLCollection (Firefox, compatible but incorrect), and NodeList is an instance of Function (Opera).
  • XMLHttpRequest is an instance of Function (Opera, IE7).
  • ActiveXObject is an instance of Function (IE6, IE7).
  • IE doesn't define DOM classes such as HTMLDocument, HTMLElement, NodeList, etc.
As one can see, Opera is very consistent in inheritance even for native objects.

Here is what you can count on when it comes to prototype inheritance. In Firefox and Opera, all objects inherit from Object. In IE, objects that interface with the browser or other native components do not inherit from Object.

At the beginning of this article, we saw how prototype inheritance can be used to augment existing classes. If this feature were implemented consistently, it would be a powerful way to extend Javascript. Imagine how we could work around browser incompatibilities by implementing the missing functions in Javascript. For example, DOM2 events can be simulated by setting appropriate HTML element attributes in DOM1. Taken to an extreme, it is possible to emulate the DOM in ancient browsers as long as we have document.write (though it is not practical to do so).

It is not surprising that IE has the most broken Javascript implementation, considering that IE has historically been the dominant browser. Because prototype inheritance is not possible for DOM, browser, and other native objects, IE is the least flexible in terms of language extensions and workarounds. By contrast, both Firefox and Opera have better prototype inheritance, and it is easier to extend or adapt these browsers.

Tuesday, April 17, 2007

Problems with Generics in Java

I was a substitute teacher for the labs of an introductory programming class taught in Java a few months ago, and I thought it would be nice to teach Generics to the students because it illustrates safe, reusable coding practice. I usually program in ML, where polymorphic functions are written and used almost everywhere. However, I soon ran into various dark corners of type variables in Java. Such complications are not present in ML, largely because ML has no subtyping.

The complications of Generics in Java are well documented in this paper, Parametric polymorphism for Java: is there any hope in sight?, and this IBM article, Java theory and practice: Generics gotchas, but I'm summarizing the problems a bit differently, in a way that shows how bad assumptions break the system.

Invariance   It is almost always assumed that if T extends T', then Class<T> extends Class<T'>; this is called covariance. That assumption is fine for read-only (immutable) collections, but not if you're allowed to insert elements: the original collection, which assumes its elements have type T, now contains elements of type T'. For Java, subtyping of generic classes really is invariant.

Static type erasure   In order to maintain backward compatibility, so that the newer generic collection classes can be used by older non-generic code, Java treats a class Class<T extends S> the same as Class with T replaced by S throughout its definition (if T does not extend anything, Java assumes T extends Object). The idea is to reduce parametric polymorphism to subtyping polymorphism. It works if Class<T> extends Class<S> holds, but since we already know subtyping of generic classes should be invariant in Java, this is unsound.

Dynamic type erasure   Again, for the sake of backward compatibility, existing "compiled-once" bytecode must be able to use compiled generic classes, including being able to reflect on them. Reflection is a language device for getting information about meta-properties of an object, such as its class and its method names. Because older code has no concept of generics, Class<T> is just Class at run time. After erasure, Class<T> cannot reflect on T, so it doesn't know T's constructor (it cannot construct a T), nor can it construct an array of T. To work around this, the construction site of T must be given a reflective witness of the class T.

Another consequence is that you lose type information when you serialize objects. After writing a Collection<T> to a disk file, you no longer know what T is when reading it back. If you want to keep track of this information manually, your serialization code must be given a witness of T, in the same manner as when constructing a T or an array of T.

Solution?   What if we provided backward compatibility by keeping the old non-generic collection classes and putting the new generics-based collection classes in a new package? The same goes for the reflection API: the old version reflects only on the type-erased part of a generic class, while a new reflection API can inquire about type variables.

There may still be complications when old code tries to construct an instantiation of a generic class using reflection, but I think this contains the problem much better.

Saturday, April 14, 2007

Presentation Technique

Recently, I attended a regional conference on programming languages, where I gave a talk about my own research. Although it was the first time I gave a talk at a public venue, and I am by no means experienced, I had thought of some techniques that I want to share. The most important issue is actually timing, and many of the concerns revolve around that.

The novelty of my approach really lies in the inverse-pyramid organization of slides, but I feel that common-sense presentation techniques should be considered before that for effectiveness, so I will mention the common sense first.

Pace yourself   The amount of time you're given for a talk limits how much material you can cover. After finishing all the slides, you should rehearse at your own leisure. This allows you to determine the natural pace of each slide. You should not rush, and you might even want to slow down a bit.

Remain focused   It is likely that you will go over time in the first rehearsal, but the solution is not to speed up; you must decide what to prune. Consider what the most important message you want to deliver is, and keep only the slides that support that message. It is also possible that you do not have enough slides to support your thesis, which, in my case, only became apparent after rehearsing in front of someone and hearing their feedback.

If you choose to speed up rather than prioritize, you will find that your talk still goes over time and loses focus.

Inverse pyramid scheme   After pruning, you might be concerned that you pruned too much, which makes you appear unprepared. The inverse pyramid is a well-known journalistic scheme that states the bottom line of a news report first (when, where, what) and continues to the less important facts (why, how). The scheme is devised so that an article can be truncated easily by a newspaper editor to fit the page. It can be applied to a technical talk as well.

I organize my slides as a logical tree, but I move the leaf slides to an overflow section. The official talk ends at a slide that says "thank you for your attention," but the overflow slides follow. The idea is that I would not talk about them but refer to them if people ask questions.

My overflow slides contain the hairy technical details that take some time to digest. If you have detailed slides that you don't want to spend too much time on, they are probably good candidates for the overflow. However, if the slides are integral to your talk, then you should just break them up instead.

Conclusion   I used these techniques for the talk, and in the end I felt well prepared and well timed. The inverse pyramid scheme allowed me to take an "on-demand" approach to showing some slides, and it appeared effective and on point. However, I think this approach works best when applied after the common-sense measures that maintain pace and focus.

Thursday, April 5, 2007

Compiling AbiWord from Source on Mac OS X 10.3

The latest AbiWord at the time of writing is 2.4.6, but for Mac OS X, only the binary for 2.4.5 is available for download. I decided to compile it myself. However, the instructions aren't very clear, and the build system appears to be a hack, so I decided to write about it.

First of all, AbiWord relies on a number of Fink libraries, so make sure they are installed.
fink install \
libgsf glib2 libxml2 gettext libiconv bzip2 libwpd-0.8
Alternatively, you may use apt-get to avoid compiling these packages from scratch if you know what you're doing.

The official way to build AbiWord is to go to ${SRCDIR}/pbx and read the instructions there and proceed. However, before running build_all.sh, I had to adjust a couple of things.

First,
# current working directory is ${SRCDIR}/pbx

ln -s ../{popt,libpng,wv} .
so build_all.sh can find them.

Notice that for fribidi, you cannot use the source code supplied in ${SRCDIR}/fribidi, which is for building on Windows only. Prepare it by the following steps:
  1. Download and unpack a fresh copy of fribidi. You should get a directory called fribidi-${version}, where ${version} is some version of fribidi you downloaded.
  2. Make a symbolic link from fribidi to fribidi-${version} so build_all.sh can find it.
  3. Run configure script there to generate config.h
  4. Make a symbolic link from wcwidth.c and wcwidth.i to fribidi_wcwidth.c and fribidi_wcwidth.i respectively.
  5. Open fribidipbx/fribidipbx2.pbproj, then add fribidi/fribidi_types.i and fribidi/fribidi_char_sets.i to the project. The paths should be "relative to project," and make sure their roles are "public" under Targets/fribidipbx/Headers and the file types are "sourcecode.c.h".
AbiWord requires Enchant, which is not included in the source tar.gz for some reason. Get it using the commands:
# current working directory is ${SRCDIR}/pbx

export CVSROOT=':pserver:anoncvs@anoncvs.abisource.com:/cvsroot'
cvs login    # password is 'anoncvs'
cvs export -renchant-1-3-0 enchant
cvs export -rrelease-1-2-1 enchantpbx
unset CVSROOT
Open enchantpbx/enchantpbx.xcode with Xcode and add enchant/src/pwl.c to the target "enchant framework." Now enchant should build.

The last phase is to prepare the source of abi/, using the commands:
# current working directory is still ${SRCDIR}/pbx

wget http://www.abisource.com/.../link-grammar-4.2.4.tar.gz
tar zxvf link-grammar-4.2.4.tar.gz
ln -s link-grammar-4.2.4 link-grammar
(cd .. && ln -s pbx/link-grammar-4.2.4 link-grammar)
perl -pi -e 's,DICTIONARY_DIR=\\"\\",DICTIONARY_DIR=\\\\\\".\\\\\\",' \
link-grammar/link-grammar.xcode/project.pbxproj

ln -s ../{abi,abipbx,abidistfiles,abiword-plugins} .
(cd abi/src/wp/ap/xp/ToolbarIcons/ && make)

perl -pi -e 's/hdiutil create -megabytes 30/hdiutil create -megabytes 40/' \
build_all.sh
Finally, one should be able to build AbiWord using the command ./build_all.sh all. However, the resulting AbiWord won't run if your glib2 depends on libgettext3 (libintl.3.dylib) instead of gettext (libintl.1.dylib). This, of course, depends on the versions of the Fink packages you use. If this is a problem for you, you can either try to add libintl.3.dylib to abipbx/abipbx2.pbproj (and add it to preparelibs.sh as well), or just copy /sw/lib/libintl.3.dylib to AbiWord.app/Contents/Frameworks.

It is great to have AbiWord on Mac OS X, but users should not jump through so many hoops to compile it from source. The source tree needs a good clean-up, and Xcode project files need to be made consistent with the latest versions of fribidi and enchant. They should bundle fribidi (for building Mac OS X framework), enchant, and link-grammar. They should definitely fix the build_all.sh script, so the symbolic links are no longer necessary.

Furthermore, I think it's a bad idea to depend on compiled Fink binaries. This has at least two problems.
  1. Fink package versions vary wildly from machine to machine. Fink itself ensures only that minimal dependencies are met, and will happily use more recent packages if available. The situation of gettext vs. libgettext3 is just one of the many examples.
  2. One usually expects that Fink packages are compiled from source and run on the same machine. It is common to find packages that assume MACOSX_DEPLOYMENT_TARGET to be the same version as the host machine. If you compile AbiWord on 10.3, you can't run it on 10.2.
I hope they fix these problems. Please let me know if these instructions work for you.

Wednesday, April 4, 2007

Google Apps Appliance

A fellow student sent some e-mail attachments with technical report updates to a professor a few weeks ago, but he is still waiting for the updates to be posted. I suggested that it is too much work for the busy professor to fish the attachments out of his inbox, so he should send them again. I also wondered: wouldn't it be wonderful to be able to search our department's e-mail the way Gmail does?

It looks like Google Search Appliances shouldn't just be for searching an enterprise's Intranet web. What about e-mails and calendar items? Google now offers an array of applications—Gmail, Calendar, Chat, and Docs & Spreadsheets—which are ripe candidates for business solutions. They currently offer these services on a domain level as Google Apps for Domains. What about Google Apps Appliances? It looks like this is not a new idea; many people have speculated about it. Obviously, Apps Appliances are expected to provide search functionality like the Search Appliance.

There are some merits for Apps Appliances over Google Apps for Domains.

Privacy   Companies enjoy greater privacy of data, as their e-mails, calendars, chat logs, and documents never leave the site. This is the same argument for Google Search Appliances.

Migration and Integration   An admin should be able to integrate Google Apps with existing backends: NFS or SMB for storage of documents; Exchange or Notes for e-mail and calendar; an IMAP gateway for e-mail. However, I can already see that Exchange and Notes integration will be hairy.

Google Apps for Domains can currently use an IMAP gateway. Other forms of integration are not possible due to corporate Intranet security.

I think the difficulty of backend integration depends on how much integration is desired. It is difficult to interact with Exchange and Notes in real time because their protocols must be reverse engineered. Efficiency is another consideration; these backends might not be designed to process massive amounts of queries the way the Google File System is, and search performance (a hallmark of all Google Apps) will be severely degraded if that is the case.

It is easier to import and export data, especially if everyone speaks the same language (iCal for calendars, Unix SVR4 mbox for e-mail, and HTTP for document transfers). It is important that import and export not lose fidelity of data, and the whole process should be automated.

Google Calendar allows import and export of calendars in iCal format, but I think each user has to do it one by one. This is clearly not feasible for a whole domain of users.

New Product: Filesystem Appliance   Each Apps or Search Appliance will happily serve its own filesystem, but storage space can run out. When this happens, companies should be able to purchase storage solutions from Google and enjoy the scalability and redundancy that the Google File System provides. This is another business opportunity.

At this point, however, I want to stress that it would be a bad idea if Apps or Search locked themselves into the Filesystem Appliance by omitting support for network filesystems such as NFS or SMB. This is because: (1) lock-in is inherently evil; (2) companies that have invested heavily in their storage solutions won't be happy; and (3) a domain-specific Filesystem Appliance cannot compete with general-purpose network storage systems like Network Appliance products.

For now, without the Apps Appliances, I think the best way for graduate students to help busy professors manage e-mail attachments is by sending them again. It is as if graduate students aren't busy... Right?

Friday, March 30, 2007

Unsafe String Concatenation in C

Many remote exploit vulnerabilities can be classified into two kinds: memory management problems and confusion of language domains. Memory issues are characteristic of programs written in a low-level language like C. Confusion of language domains is more characteristic of programs written in a high-level scripting language like Perl, PHP, or Python. For networked applications that require performance, programmers tend to use a low-level language like C, so memory issues like buffer overruns have been the primary concern there.

Because buffer overruns have been the number-one cause of security issues, courses and programming books teaching C nowadays have (hopefully) delivered good memory management practices. One example is to use string functions that limit buffer size: use snprintf() instead of sprintf(), fgets() instead of gets(), and strncpy() instead of strcpy(). These functions take an argument that imposes a buffer usage limit in order to avoid overruns.

One may be inclined to also recommend strncat() in place of strcat() for string concatenation. However, this deserves closer scrutiny. The function has the following prototype:
char *
strncat(char * restrict s, const char * restrict append, size_t count);
It appends at most count characters from append to the end of s. However, if s has nearly filled its buffer, a count that is too large will still overflow the buffer that holds s. We can illustrate this with the following example:
char buf[16];
strcpy(buf, "hello world!");
strncat(buf, " adam!", sizeof(buf)); // wrong!
This code instructs strncat() to append at most 16 bytes, which is sizeof(buf), from the " adam!" string. Obviously, strncat() ends up appending the whole string, and this overflows the buffer by 3 bytes, because the string "hello world! adam!" is 18 characters long plus the terminating NUL.
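
For comparison, here is a minimal sketch of the corrected call: the count passed to strncat() should be the space remaining in buf (minus one for the terminating NUL), not the total buffer size.
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[16];
    strcpy(buf, "hello world!");
    /* append at most the remaining space, leaving room for the NUL */
    strncat(buf, " adam!", sizeof(buf) - strlen(buf) - 1);
    printf("%s\n", buf); /* prints the truncated "hello world! ad" */
    return 0;
}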

This is apparently a real problem that once plagued CUPS (printing system for Linux and Mac OS X) and Solaris ufsrestore (incremental file system restore).

Another easy-to-make mistake stems from the fact that strncpy() may not NUL-terminate a string. For example:
char buf[4];
strncpy(buf, "love", sizeof(buf));
printf("buffer contains the string '%s'", buf);
The code ends up printing "buffer contains the string 'love...'" with some garbage, because the buffer is no longer properly NUL terminated. You're lucky if the garbage is relatively short. Passing buf to a string processing function that expects a NUL-terminated string may cause that function to read far past the end of the buffer, even for benign functions like strlen().
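
If you do stick with strncpy(), a minimal sketch of the usual remedy (assuming truncation is acceptable, and that <string.h> and <stdio.h> are included) is to copy at most size - 1 characters and terminate explicitly:
char buf[4];
strncpy(buf, "love", sizeof(buf) - 1); /* copies at most 3 characters */
buf[sizeof(buf) - 1] = '\0';           /* guarantee NUL termination */
printf("buffer contains the string '%s'", buf); /* prints 'lov' */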

On the other hand, fgets() reads at most size - 1 characters into the buffer and properly NUL-terminates the string.

All in all, I think these C functions are examples of bad interface design. If someone were reworking the string functions in C, they should focus on making the interface consistent, namely:
  1. The strncat() function, like the others, should accept a destination buffer size limit rather than a limit on the number of characters to append; and
  2. All string functions should produce valid, properly NUL-terminated strings.
Furthermore, I actually encourage using strcat() and strcpy(), despite them being allegedly "unsafe." The rationale is that this forces you into the habit of always checking string lengths with strlen() and making sure the buffer is large enough to hold the resulting string before performing the operation.
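
As a minimal sketch of that habit (dst and src are hypothetical names, and the usual <string.h> is assumed):
char dst[64] = "hello,";
const char *src = " world";

/* check that the concatenated result, plus the NUL, fits */
if (strlen(dst) + strlen(src) + 1 <= sizeof(dst)) {
    strcat(dst, src); /* safe: the result is known to fit */
} else {
    /* doesn't fit: truncate, grow the buffer, or reject the input */
}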

There are additional gripes about strtok()/strsep(): they destroy the original string, which is a problem if the string is shared with another data structure or function. Instead, you should use strchr() or strcspn() to find the next occurrence of the delimiter, and use strncat() to copy the delimited portion into an empty but NUL-terminated string. The reason you shouldn't use strncpy() here is precisely that it may not NUL-terminate your string.
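
Here is a minimal sketch of that approach, splitting a shared read-only string on commas without modifying it (the names, field size, and delimiter are just for illustration; <stdio.h> and <string.h> are assumed):
const char *input = "alpha,beta,gamma"; /* shared string; must not be modified */
const char *p = input;

while (*p != '\0') {
    size_t len = strcspn(p, ",");       /* length of the next delimited field */
    char field[32];

    field[0] = '\0';                    /* empty but NUL-terminated string */
    strncat(field, p, len < sizeof(field) - 1 ? len : sizeof(field) - 1);
    printf("field: '%s'\n", field);

    p += len;
    if (*p == ',')
        p++;                            /* skip the delimiter */
}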

You should consider strncat() a safer substitute for strncpy(), because strncat() at least guarantees a NUL-terminated result (string copying is a special case of concatenation). This also brings up the issue of how to initialize a string buffer: instead of zeroing out every byte in the buffer, you only need to zero the first byte.
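
A minimal sketch of that copying idiom (dst and src are again hypothetical):
const char *src = "some text to copy";
char dst[16];

dst[0] = '\0';                      /* only the first byte needs zeroing */
strncat(dst, src, sizeof(dst) - 1); /* copies at most 15 characters and always NUL-terminates */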

We discussed a number of string processing functions and their interface irregularities, and we showed some workarounds for better security practice. This illustrates the amount of detailed attention one has to pay when writing C programs. I hope that students who are first learning C pick up these good practices from the beginning.

Wednesday, March 28, 2007

PDF editing on Linux (and other open source OS)

I recently received several PDF documents to fill out for a background check (I'm applying for a summer internship). However, the PDFs did not use proper form fields, so I could not use Acrobat Reader to fill them in. I surveyed two PDF editors that serve this particular purpose: adding text to an existing PDF that has no form fields.

Before I began, I did some searching on Google. I found this article (mentioning flpsed) that looks somewhat outdated. I also remembered reading about a Qt-based PDF editor (PDFedit) a while ago. I decided to try both.

I tried PDFedit first because it looks more recent. According to the user documentation, it is quite powerful. It has a Javascript-like scripting language (QSA) that lets you automate many things. It can alter existing text attributes, delete objects, add lines and rectangles, and add text. It also lets you view the PDF object tree and edit it directly. PDFedit supports multi-page documents, which Adobe Illustrator doesn't.

As for the function I was looking for, adding text to a document to fill out forms, PDFedit performs rather poorly. The button to add text does not have an obvious icon. Once you find the button and click on it, your mouse pointer turns into "add text" mode, so you can now click anywhere in the document to add text.

When you click to add text, PDFedit creates a small overlaid text box that lets you type. However, if you type too much, part of your typing scrolls away and becomes temporarily hidden. When you press the Enter key, everything you just typed (including the hidden part) is rendered onto the document, but with a different font and size. The desired font can be chosen using a drop-down box on the toolbar, but this setting is not previewed while you type.

Once you entered the text, you may discover that the placement is slightly off. You must switch the mouse back to selection mode, select the text you just entered, and drag it. Mouse selection does not work well, and you can easily select the wrong object or some mysterious object not visible in the document.

After adding text, dragging, and adding more text a few times, PDFedit becomes very slow. On a fairly recent machine (Pentium 4, 2.4 GHz), it can take 20 seconds from pressing the Enter key until the text appears in the document.

When I finally struggled through, I saved the resulting PDF file and tried to open it in Acrobat Reader. It opened the document only partially and complained about a "q" operator being illegal in text. I tried opening it in GhostView, and it rejected the document completely. I tried loading it in PDFedit again, and it was fine. It seems that after you tinker with a PDF in PDFedit, the only program that can ever open it again is PDFedit.

I decided to give flpsed a try. The reason I didn't try it first is that the article I read claimed it only supports PostScript files. That wasn't a big problem, because PostScript can be converted to and from PDF using GhostScript, though with some loss in font details.

Compiling flpsed from source was a pleasure. It requires FLTK 2.0, which took only a minute to compile; flpsed itself took only a few seconds. By contrast, PDFedit requires some functionality from Boost, and the two together took me tens of minutes to compile.

With flpsed, I first tried converting PDF to PostScript, and it worked fine. The program is simple and serves only one purpose: adding text to an existing document. It also supports multi-page documents. I can choose font sizes, but not the font family. Adding text is simple: you click anywhere in the document and start typing away. You can only edit what you typed; you cannot modify the original document text.

Flpsed is also much more responsive than PDFedit. Typing, selecting, and moving text is instantaneous. You can also move text with your keyboard cursor keys for more precise alignment.

After finishing the forms, I discovered that flpsed also imports directly from PDF and exports back to PDF. This actually produces better results than doing the PDF-to-PostScript conversion. I started over with PDF, and it was a breeze to do.

However, the simplicity of flpsed comes with functionality trade-offs. For example, there is no cut-and-paste function. It also doesn't have input method support (which is required for many non-English languages). And I cannot add images; I wanted this to simulate signing a document, although it may be a bad idea to make your signature infinitely reproducible by putting it inside a printable PDF document.

Here is my recommendation: if you need to fill PDF "forms" that do not have form fields, flpsed does the job beautifully and efficiently. But don't expect it to do much else.

Wednesday, March 21, 2007

Micro-blog Idea

A much-desired function of my calendar is the counterpart of the TODO list, that is, the DONE list. I need to keep a list of things I have done on a day-to-day basis, even if they were never on the TODO list. It would also serve as a memo taker---some sort of a micro-blog, if you will. It would be different from a TODO list, where you cross things out when something is done and then forget about it. I want to remember what I have done and go back to it when I need to.

I need it because that's the only way I can show my professor that I have done reasonable work whenever he throws the "Angus, you're not sticking to your agenda" crap at me.

This micro-blog would be split into different streams, some of which could become public while others would remain private. It could be browsed over the web or pulled in as an RSS/Atom feed.

Now, it strikes me that an SQL database is probably not the best way to implement the back end for any sizable web application. None of the SQL servers I know of can distribute both load and storage across different nodes the way the Google File System does. SQL, the language, abstracts very little from the bare metal, and many higher-level languages like ML that support higher-order functions can do much better in terms of abstraction.

Remember MapReduce from Google? The concept is so prevalent in functional languages that one could almost accuse Google of reinventing the wheel. With more advanced compilers (like Haskell's) that perform deforestation, you can even fuse together a number of MapReduce steps so you do not need to generate intermediate data sets. Deforestation allows you to write easy-to-understand, pipelined programs without compromising efficiency.

I think there are many ways databases can benefit from advanced programming language features. For example, the list that results from traversing a tree index can be implemented as a lazily evaluated stream, where the resulting list is generated on demand. Many stream-based operations are readily available, such as fold, map, and filter. Other fancy stream operations are also possible, such as taking a cross product, distributing a set of lists in round-robin fashion, and merge sorting.

However, a major shortcoming of any functional language today is the lack of a good XML library. People in this community have designed plenty of esoteric, experimental XML parsers and generators, but they are difficult to use in practice. It needs to be as easy to use as Javascript's access to the DOM, without sacrificing the power of functional programming, e.g. pattern matching, fold, map, and combinators---I don't know, maybe these are mutually conflicting goals.

We also lack a good HTTP client library. Ocamlnet is very powerful, but it is way too involved for casual usage. I think XMLHttpRequest is a great API that someone should implement. Its greatness is evident from the success of Web 2.0, also known as AJAX.

I'm afraid I have to leave this post inconclusive. I've drifted from an idea to the critique of possible solutions. In the end, I think the saying that ideas are worth (...insert your favorite object of no value here...) is true. I can only make a difference if I actually implement it.

So little time, so much to do.

DONE: spent 2 hours in the morning writing about a worthless idea.

Monday, March 19, 2007

Misleading Symptoms

Computer systems have become so complicated that it is easy to misdiagnose a problem if you look only at the symptoms and lack an understanding of how the system works.

A friend's old Windows XP computer was getting the error "Windows---Display Driver Stopped Responding: The i81xdnt5 display driver has stopped working normally," and he asked me to look at it. Whenever it happens, the display switches back to 640x480 resolution at 16 colors and messes up the desktop icon arrangement. He could restart the computer, but it got annoying.

A quick search turned up a Microsoft Knowledge Base article explaining that the device driver might be faulty and suggesting that we upgrade the display driver. We tried that, but it didn't help.

He said he had a way to reproduce it---the error happens when he runs a lot of applications. I suspected it had something to do with virtual memory, so I checked:
  1. Whether the disk was running low on available space. It had plenty, with 9 GB free.
  2. How much memory was free, using Task Manager. I discovered that Windows was using only up to 384 MB of page file, and it was nearly full right after the system was rebooted.
  3. The performance settings, where I found that someone had limited the page file size to 384 MB.
I changed the page file size to "system managed," which solved the problem.

The computer originally had only 128 MB of RAM, but I helped my friend upgrade it to 256 MB at some point. I think whoever bought the computer for my friend had followed the rule that the page file should be three times the physical memory size, but the memory upgrade required the size limit to be updated as well. That's why he ran into this problem only after the memory upgrade.

I can see there are many possible misdiagnoses:
  1. As the Microsoft Knowledge Base states, it could be due to a faulty display driver. The cost of misdiagnosis is the time to find and install the display driver; not bad so far.
  2. It could be due to a faulty hard drive failing to provide a reliable page file. We had a hard drive problem on that computer before. It was fixed after a chkdsk, but that could indicate that the drive is dying. The cost of misdiagnosis is ~$80 for a new hard drive.
  3. Since the problem only happened after a memory upgrade, it could be due to the new memory module. The cost of misdiagnosis is ~$100 for a new memory module.
  4. It could be due to a faulty display card "not responding" to the driver. However, the computer uses an integrated graphics chip on the motherboard, so that could imply the motherboard needs replacement. The cost of misdiagnosis is either ~$50 for a new display card or ~$150 for a new motherboard; trying both and finding out that neither helps would cost ~$200.
As you can see, it could easily have cost us $300+, not counting labor, if we had tried everything we could imagine under the sun.

Think about the implication of a medical misdiagnosis. Human bodies are much more complicated than computers.

Upgrade vs. False Alarm

I am having difficulty concluding the moral of a story I'm about to tell you, so please bear with me while I tell the story first.

I recently started upgrading sshd on a multi-user server from the original ssh 2.0.13, circa 1999, to the latest OpenSSH, 4.6p1 at the time of writing. The reason is complicated. The server is a Solaris 2.6 system that has been in use since before 1999, and it is already falling behind on security patches. Its ssh is very old---its protocol 2 implementation had some bugs, and we always had to fall back to ssh protocol 1 instead, which has long been deprecated.

Several people and I have root access to that server, but nobody has assumed the position of system admin. I think that's a case of collaborative irresponsibility. The department has an appointed system admin who is supposed to do the job, but he has been reluctant to keep that server up to date, especially after the department's migration to Linux a few years ago.

Some time ago, I read about yet another telnet remote root exploit and decided to disable all insecure means of login to that particular server: telnet, rsh, rlogin, rexec, and ftp. About a month later, I got an e-mail from the IT backup service telling me that the backup server could not log in using rsh. My professor asserted that I should find a resolution because I was the one who disabled the insecure services. Tell me how it feels to be rewarded with mounting responsibility for a moment of attentiveness.

I told IT to use ssh, but they insisted on using an ssh protocol 2 key, and that was the last straw that prompted me to take the initiative and upgrade sshd.

After installing the software and converting the host keys, I thought it would be nice to run the new sshd on a separate port for testing. This allowed other people to help me iron out a number of problems, like the infamous "Disconnecting: Corrupted check bytes on input" problem when using ssh protocol 1 from an older client (the solution is to make cipher-3des1.c and cipher-bf1.c include openssl-compat.h).

However, the very evening after I started running sshd for testing, the department system admin was alerted to a possible intrusion. Although I had already told IT that I would upgrade sshd, another branch of IT---the intrusion response team---didn't know about it and thought it suspicious that sshd was running on another port. This resulted in the server being taken offline for a few hours until I had a chance to intervene.

Well, what's the big deal? The original idea of running a test sshd on a separate port was to minimize downtime, so that our group members could continue using the old sshd while I prepared the new sshd for production. It ended up causing more downtime. The measure backfired because of the false intrusion alarm.

How did they tell there was another ssh service running on a suspicious port? You can just scan all open ports and see which ones respond with the string "SSH-1.99-OpenSSH_4.5" from a port other than 22. I think this is pretty absurd, though; I could play a prank by sshing into a legitimate host and using port forwarding to tunnel its own ssh port to any other ports I want on that host, e.g. ssh hostname -R10022:hostname:22 -R11022:hostname:22 -R12022:hostname:22 and so on. This would make the system admin sweat without my violating any computing policies.

I think there are a number of morals of this story:
  1. It is always a good idea to let others help you whenever you can. I knew about the "corrupted check bytes on input" problem, but another person helped me identify the fix because I had invited everyone to test the new sshd. There were other times when I could have helped the department system admin if he had been more responsive to suggestions and problem reports.
  2. In an organization where one hand doesn't wash the other, even a thoughtful plan can run into problems. In this case, a well-intended test bed triggered a false intrusion alarm. These problems aren't always fatal, but they take overhead to fix. Optimistically speaking, though, I think experience can help someone avoid such problems.
  3. Don't be so keen on taking care of something unless you want the responsibility to land on you. Alternatively, I think I could have done a better job promoting to my professor the value of the additional "not my jurisdiction" work that I do. Managing your superior is a difficult art to master, but it can be necessary.
Nevertheless, the upgrade is now complete, and several people are satisfied.