Cataloging a Library

Xoc Software


It was time to create an inventory of the books in my library. I have about 1300 books, and the ordering of the books on the shelves was haphazard. There also was no inventory of the books for insurance purposes. Sandi and I started to write software ourselves, but it quickly became apparent that this was a very large project, so I decided to see what was out there.

There are a number of library programs on the market. The first that I downloaded for trial, Collectorz, exactly met my needs. Within minutes, all thoughts of writing software were abandoned, and I had hit the trial program's limit of 100 books. I shelled out the cash and also ordered a cuecat scanner.

Entering a book consists of one of these actions:

  1. Scan an ISBN barcode with the cuecat
  2. Type in the ISBN
  3. Enter a Library of Congress catalog number
  4. Enter an author and/or title
  5. Enter all the information on the book by hand

With all but the last, the software checks against an online database that seems to be derived from one primary source plus several others. For almost all books with ISBNs, the program downloads the information and enters the book. A few books have ISBNs that are apparently invalid, so they are not in the database. A few other books have ISBNs from foreign countries that aren't searchable. The number of books that wouldn't scan was very small. Many books have bar codes that are UPC codes, not ISBNs. Many of the later mass-market paperbacks have the ISBN bar code on the inside of the front cover and a UPC on the back. You can identify an ISBN bar code because its number starts with 978 or 979.
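
The 978/979 rule is easy to check in code. The following is a minimal sketch, not anything Collectorz or the cuecat actually does: it assumes the scanner output has already been decoded to a plain string of digits, and the function names `ean13_check_ok` and `classify_barcode` are my own inventions. It also verifies the standard EAN-13 check digit, which catches most misreads.

```python
def ean13_check_ok(digits: str) -> bool:
    """Return True if a 13-digit string has a valid EAN-13 check digit."""
    if len(digits) != 13 or not digits.isdecimal():
        return False
    # Weights alternate 1, 3, 1, 3, ... from the left; a valid code sums
    # to a multiple of 10 once the check digit is included.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def classify_barcode(digits: str) -> str:
    """Classify an already-decoded bar code number as 'isbn', 'upc', or 'unknown'."""
    digits = digits.strip()
    if ean13_check_ok(digits) and digits[:3] in ("978", "979"):
        return "isbn"
    # Assumption: any other 12-digit number is treated as a UPC without
    # validating its own check digit.
    if len(digits) == 12 and digits.isdecimal():
        return "upc"
    return "unknown"
```

A book whose back cover classifies as `"upc"` is a hint to look inside the front cover for the ISBN bar code.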

The trick to reliably using the cuecat is to:

  1. Start with the cuecat on one side of the bar code
  2. Completely slide across the bar code
  3. Then completely slide it back across
The whole scan should take less than a second.

Many books don't have an ISBN bar code, but do have the number somewhere on the book: on the back cover, on the spine, or, if all else fails, on the copyright page. Some of the older mass-market paperbacks (from the early 70s) only have the number on the spine, and to get the ISBN you need to add a leading zero and then take the next nine digits. Typing in an ISBN is almost as fast as scanning with the cuecat, but slightly more error prone.
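
The leading-zero trick can be sketched in a few lines. This is only an illustration with hypothetical function names; it assumes the spine number is a nine-digit code, possibly followed by extra digits, and it uses the standard ISBN-10 check (weights 10 down to 1, total divisible by 11, with X standing for 10) to catch typing errors.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if a 10-character string passes the ISBN-10 check."""
    if len(isbn) != 10 or not isbn.isascii() or not isbn[:9].isdecimal():
        return False
    if not (isbn[9].isdecimal() or isbn[9] == "X"):
        return False
    values = [int(c) for c in isbn[:9]]
    values.append(10 if isbn[9] == "X" else int(isbn[9]))
    # Weights run 10, 9, ..., 1; a valid ISBN-10 sums to a multiple of 11.
    total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
    return total % 11 == 0


def sbn_to_isbn10(spine_number: str) -> str:
    """Prefix a zero to a spine number and keep the next nine characters."""
    chars = [c for c in spine_number.upper() if c in "0123456789X"]
    candidate = "0" + "".join(chars[:9])
    if not is_valid_isbn10(candidate):
        raise ValueError(f"{spine_number!r} does not yield a valid ISBN-10")
    return candidate
```

For example, `sbn_to_isbn10("441-17271-7")` gives `"0441172717"`, and any digits printed after the first nine are ignored.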

The ISBN standard was adopted as an ISO standard in 1970, so any book published before then won't have an ISBN. Many of those books have a Library of Congress (LoC) catalog number on their copyright page. These numbers appear as something like 65-3472. When you perform a search for the book on the LoC web site (discussed later), you need to extend the number out to eight digits by inserting zeros where the dash appears. So this book would be 65003472.
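
The zero padding is mechanical, so here is a minimal sketch of it. The function name `normalize_lccn` is hypothetical, and the sketch assumes the older form of the number with a two-digit year before the dash, as in the example above.

```python
def normalize_lccn(lccn: str) -> str:
    """Expand a catalog number such as '65-3472' to the eight-digit '65003472'."""
    year, dash, serial = lccn.strip().partition("-")
    if (not dash or len(year) != 2 or not year.isdecimal()
            or not serial.isdecimal() or len(serial) > 6):
        raise ValueError(f"not a two-digit-year catalog number: {lccn!r}")
    # The serial part is left-padded with zeros to six digits.
    return year + serial.zfill(6)
```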

After getting the information for the book, I was almost always unhappy with the way the information was shown. When publishers add information to the ISBN database, they frequently muck up the fields by putting publisher info in the title field. The subtitle is also frequently added to both the title field and the subtitle field. Publishers are not consistent about the name of the publisher: they might use Ace, Ace Books, or Ace Books, Inc. I frequently spent time cleaning up the entries. Very frequently the number of pages and the cover price don't match what is in the book. Sometimes the cover price is encoded in the bar code, as part of the smaller bar code on the right. If the number starts with a 5, what follows is the price in U.S. dollars. If the number is 90000, then no price is encoded, but the price is frequently shown on the book somewhere.
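
Decoding the price from the smaller bar code can be sketched as follows. The function name `addon_price` is hypothetical, and the sketch assumes that the four digits after the leading 5 are the price in dollars and cents (so 50799 would be $7.99), which the text above does not spell out.

```python
from decimal import Decimal
from typing import Optional


def addon_price(addon: str) -> Optional[Decimal]:
    """Return the U.S. dollar price from a five-digit add-on code, or None."""
    if len(addon) != 5 or not addon.isdecimal():
        raise ValueError(f"expected five digits, got {addon!r}")
    # 90000 means no price is encoded; a leading digit other than 5 is
    # not a U.S. dollar price, so it is also reported as None here.
    if addon == "90000" or addon[0] != "5":
        return None
    # Assumption: the remaining four digits are the price in cents.
    return Decimal(addon[1:]) / 100
```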

I needed to choose an ordering system for the books on the shelf. It suddenly became apparent to me that library science and database management systems are really the same thing, except the books are records that you can hold. The books on the shelf are the clustered index. Anyway, there are two major cataloging systems used by libraries: the Dewey Decimal System and the LoC System. Most public and grade school libraries use the Dewey Decimal System, whereas most university libraries use LoC. I chose to use the LoC system.

The LoC system was developed in 1898 to replace the system that Thomas Jefferson had in place when he sold his library to the United States. While other libraries can use the system, it is specifically designed to handle the 38 million books that the LoC catalogs. The basic idea is that all of human knowledge can be broken into 21 major categories, each of which is given a letter. These are then sub-categorized with another letter, so DF is used for the history of Greece. Further sub-categories are done with numbers. Each category has its own system of further breaking down the categories, so there is no consistency across the primary letters for how things get broken down. The system used for categorizing literature, in the P section, is particularly obscure. Also, some very similar books have been put into wildly different categories. For example, I have two tourist guide books on the Alhambra castle in Granada, Spain, that are very similar. One was assigned a Spanish History (DP) call number, whereas the other was assigned an architecture (NA) call number.

So a category is given a number such as DS154.9. These are then followed by what are called cutter numbers. Charles Cutter came up with a system to allow grouping of titles and authors into an alphanumeric hash that allows quick categorization on the shelf. It is important to be able to recognize the parts of a catalog number. For example: DS154.9.P48 .A9213 2000 v.1. In this number, DS154.9 is the category, .P48 is the cutter for the title, .A9213 is the cutter for the author, 2000 is the year of publication, and v.1 is an indication that this is volume 1 of a multi-volume set. When a library invents its own catalog number instead of using one from the LoC, it generally ends the catalog number with a lower case "x". The LoC is inconsistent about whether the last cutter number is preceded by a period. This causes problems in Collectorz because it sorts LoC Call Numbers alphabetically, which mis-orders books whose last cutter lacks the period. I always make sure there is a space and a period before the last cutter number. The year of publication is required for all republications of books, but is frequently not included for the initial publication. Maps, for some reason, seem to have the year before the last cutter number.
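
To show why alphabetical sorting goes wrong, here is a rough sketch of a sort key that splits a call number into its parts. It is not how Collectorz works; the name `call_number_key` and the regular expressions are my own assumptions, and they only cover call numbers shaped like the example above. The class number is compared as a number, cutters are compared as a letter plus a decimal fraction, and the period before a cutter is ignored.

```python
import re

# One to three class letters, a class number with optional decimals, then the rest.
CLASS_PART = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)(.*)$")


def call_number_key(call_number: str):
    """Build a tuple that sorts LoC call numbers in shelf order."""
    match = CLASS_PART.match(call_number)
    if match is None:
        raise ValueError(f"unrecognized call number: {call_number!r}")
    letters, number, rest = match.groups()
    # Cutters are a capital letter plus digits. The digits stay a string
    # because they compare like a decimal fraction (.A9213 before .A93).
    cutters = tuple(re.findall(r"([A-Z])(\d+)", rest))
    year = re.search(r"\b(\d{4})\b", rest)
    volume = re.search(r"\bv\.\s*(\d+)", rest)
    return (
        letters,
        float(number),
        cutters,
        int(year.group(1)) if year else 0,
        int(volume.group(1)) if volume else 0,
    )
```

With this key, `sorted(shelf, key=call_number_key)` puts DS9 before DS154.9, which a plain alphabetical sort gets backwards, and a cutter written as "F6" sorts the same as ".F6".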

Many of the books pulled down by ISBN do not have the LoC Call Number, so getting this number requires some work. The entire LoC catalog is available online. You can search for a book, and if it is in the LoC, you can get its Call Number. However, the LoC doesn't keep every book ever published, and if a book isn't in the permanent collection it doesn't have a number. Furthermore, some of the things in the collection are not given useful numbers. For example, many of the fiction books are stored offsite, and are thus given a box number where they can be found rather than a shelf number. This doesn't help with the correct ordering of books.

Fortunately there are other resources, the most useful of which is Worldcat. Worldcat is an online catalog of thousands of libraries around the world. You can look up a book that isn't in the LoC and find some other library, then see what catalog number their librarian gave that book. I found the most useful library was the Boston Public Library. It has about 15 million volumes and uses the LoC system, unlike most public libraries. The zip code for the Boston Public Library is 02116, and I use that as the start location of my searches. There are many other large university libraries in Boston, so this tends to put other useful libraries at the top of the list.

I wrote a program for doing research on the LoC web site that scrapes the information and puts it into Collectorz, but it is a finicky thing and isn't nearly robust enough to release to other people.

Which brings us to the issue of how the LoC deals with literature, and particularly fiction. Each author is given a unique number that identifies him or her. Unfortunately, this numbering scheme is a little brain-dead. Essentially major sub-categories are divided by the country of the author. Thus PR is used for English authors, whereas PS is used for American authors. This means that Arthur C. Clarke, who was English, is sorted completely differently and before Isaac Asimov, who was American. Furthermore, authors are divided into the half-century that they started writing in. So writers such as Edgar Rice Burroughs from the first half of the 20th Century are given numbers in the range PS3500 to PS3550, whereas writers from the second half of the 20th century are given numbers from PS3551 to PS3599. Writers of the 21st Century are given 3600+ numbers. This causes the sorting to be very strange.

Furthermore, the LoC also uses PZ numbers for many books. Some books by an author might be given a PS number, some a PZ number, and sometimes both. I don't know whose bright idea this was, but they should be taken behind the shed and whipped.

I debated for some time what to do about my fiction books. The vast majority of my fiction books are by American authors from the second half of the 20th century, so they all will sort correctly. But I have some English authors, and a few from other countries that have had their works translated into English. I considered doing what most public and grade school libraries do with fiction, which is to treat it specially and alphabetize it by author's last name, out of the non-fiction sequence. I decided that I could live with how the LoC handles PR and PS numbers, but not the PZ numbers. Since each author is assigned a unique PR or PS number, even if he has PZ books, I decided to "fix" all the PZ numbers back to PS numbers. The Queen's University library in Canada does this, which is where I got the idea. After getting the unique identifier for the author, I generate a cutter number for the title of the book. This is a mechanistic process, so it can easily be done. The only time it has any problems is when the same author has published multiple books with titles that start with the same three letters, such as Frank Herbert's Dune and Dune Messiah. In those cases, the second cutter can have a 5 appended to it so it doesn't conflict. This is very rare. With all the books having PR and PS numbers, I added the fiction into the order on the shelves.
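
The mechanistic process for the title cutter can be sketched like this. It is not my actual program: the digit tables are a simplified version of the LoC Cutter table as I understand it (the special "Qu" rule is left out), the function names are hypothetical, and the result is always the first letter plus up to two digits, with a 5 appended on a conflict as described above.

```python
import bisect

# Each table lists the first letter of a range and the digit it maps to.
# Assumption: these follow the published LoC Cutter table; verify before use.
AFTER_VOWEL = [("b", 2), ("d", 3), ("l", 4), ("n", 5), ("p", 6), ("r", 7), ("s", 8), ("u", 9)]
AFTER_S = [("a", 2), ("c", 3), ("e", 4), ("h", 5), ("m", 6), ("t", 7), ("u", 8), ("w", 9)]
AFTER_CONSONANT = [("a", 3), ("e", 4), ("i", 5), ("o", 6), ("r", 7), ("u", 8), ("y", 9)]
EXPANSION = [("a", 3), ("e", 4), ("i", 5), ("m", 6), ("p", 7), ("t", 8), ("w", 9)]


def _digit(table, letter):
    """Return the digit of the last range that starts at or before `letter`."""
    starts = [start for start, _ in table]
    index = max(bisect.bisect_right(starts, letter) - 1, 0)
    return table[index][1]


def title_cutter(title, taken=()):
    """Make a cutter such as 'D86' from a title, avoiding cutters in `taken`."""
    words = title.lower().split()
    if len(words) > 1 and words[0] in ("a", "an", "the"):
        words = words[1:]  # skip a leading article
    letters = [c for c in "".join(words) if "a" <= c <= "z"]
    if not letters:
        raise ValueError(f"no letters in title: {title!r}")
    first, rest = letters[0], letters[1:3]
    if first in "aeiou":
        table = AFTER_VOWEL
    elif first == "s":
        table = AFTER_S
    else:
        table = AFTER_CONSONANT
    cutter = first.upper()
    if rest:
        cutter += str(_digit(table, rest[0]))
    if len(rest) > 1:
        cutter += str(_digit(EXPANSION, rest[1]))
    # Same first three letters as an earlier book: append 5 until unique.
    while cutter in taken:
        cutter += "5"
    return cutter
```

For example, `title_cutter("Dune")` gives `"D86"`, and `title_cutter("Dune Messiah", taken={"D86"})` gives `"D865"`.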

After getting all the books into Collectorz, I created my own templates to upload my books to a private web site. Collectorz comes with some templates, but I didn't like the layout. Collectorz allows you to use XSLT to put any information you want about a book onto the web site. Since I was already an expert on XSLT, this was fairly trivial. I also created special templates for viewing the information on my web-enabled cell phone.

I also wrote a program to find orphaned images in the Collectorz image directory and allow me to delete them. Collectorz downloads images of books, but if I later replace the downloaded images with my own scans, the previous ones are orphaned, and without my program those images hang around forever.
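
The idea behind the orphan finder is simple enough to sketch. This is not my actual program and it knows nothing about how Collectorz stores its data: it assumes you can already produce a list of the image file names that the database still references, and the function name and the list of image suffixes are hypothetical.

```python
from pathlib import Path

# Assumption: these are the image types worth checking.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def find_orphaned_images(image_dir, referenced_names):
    """List image files in `image_dir` that no database entry references."""
    # Compare names case-insensitively, as Windows file systems do.
    referenced = {name.lower() for name in referenced_names}
    return sorted(
        path
        for path in Path(image_dir).iterdir()
        if path.is_file()
        and path.suffix.lower() in IMAGE_SUFFIXES
        and path.name.lower() not in referenced
    )
```

The function only reports the orphans. Deleting them is left as a separate step so the list can be reviewed first.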

This project has taken a considerable amount of time, but is now nearing completion. I still have a few books that are problematic to assign call numbers to (less than one shelf's worth).