Archive for Digitization

Building legal digital collections on the cheap

Today, while surfing the web, I found this YouTube video of Eric Gilson and John Joergensen of Rutgers Camden Law Library presenting at the 2005 CALI (Computer-Assisted Legal Instruction) conference. In it, Gilson & Joergensen discuss how to digitize congressional documents and build a digital library at low cost using open-source software. Unfortunately, you cannot really see the screen output from the projector, but the concepts are still relevant.

For more specifics about full-text indexing, here is Joergensen at CALI 2010 explaining the Swish-e search engine which Rutgers uses to index the congressional documents:



Some of the processing details have changed since 2005, but the Digital Library is still running today, with over 13K congressional documents processed in its U.S. Congressional Documents collection.

If you are interested in legal informatics, I highly recommend CALI’s YouTube channel.

Why I am Learning to Code

This week I mentioned my involvement in the Codeyear project. I also attended a great introductory workshop in Python programming sponsored by PhillyPUG, Philly Pystar, and Devnuts.

Why learn to code?  Two of the best answers to this question are already on the web:

  1. Shana McDanold, e-resources and serials cataloging librarian at a major US research university, answers this perfectly for librarians and catalogers in her response “Why am I learning to code?”.

  2. Randall Degges, a computer programmer, answers this question wonderfully for programmers as he explains “How I Learned to Program”.

So why am I learning to code?  Here are some of the reasons:

  • Digital Librarianship

Part of working in a digital library or archive means having to manage and manipulate tons of files: image files, born-digital documents, scanned manuscripts, oral histories, video clips, and more.  Being able to code and understand the concept of regular expressions means that you can write scripts to manipulate these files in a systematic way, which can save you lots of time in the long run.  You can rename files, check file integrity with checksums, batch-auto-correct images without an expensive license for Photoshop, and create dynamic HTML pages using CGI scripts.

In my current job, I have needed to review HTML code for problematic divs and runaway lines. I have had to research CGI and JavaScript scripts to run on webpages created by CONTENTdm for our future software upgrade. In my previous job, I used Perl programming to rename large sets of image files, automatically embed EXIF and Dublin Core metadata into image files, and auto-correct images based on a set of algorithms, all using free, open-source software.  (See the Publications tab of this website for links to articles regarding my projects at Rutgers University Camden Law Library.) These types of projects are accessible to beginner coders and can help solve real challenges in a digital library.
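To make the file-renaming and checksum ideas above a little more concrete, here is a minimal Python sketch. It is not the Perl code from my earlier projects. The naming pattern ("scan 12.tif" becoming "doc_0012.tif") and the function names are made up purely for illustration, and you would adapt both to your own collection.

```python
from __future__ import annotations

import hashlib
import re
from pathlib import Path

# Hypothetical scanner naming pattern, e.g. "scan 12.tif" or "Scan_7.TIF".
SCAN_NAME = re.compile(r"^scan[ _-]?(\d+)\.tif$", re.IGNORECASE)


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalized_name(filename: str) -> str | None:
    """Return the new name for a scanner file, or None if it does not match."""
    match = SCAN_NAME.match(filename)
    if match is None:
        return None
    return f"doc_{int(match.group(1)):04d}.tif"


def rename_scans(folder: Path) -> dict[str, str]:
    """Rename matching files in folder and return {new_name: checksum}."""
    manifest = {}
    for path in sorted(folder.iterdir()):
        new_name = normalized_name(path.name)
        if new_name is None or not path.is_file():
            continue  # leave anything that is not a scanner file alone
        target = path.with_name(new_name)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        before = sha256_of(path)
        path.rename(target)
        # A rename should never change file contents; verify anyway.
        if sha256_of(target) != before:
            raise RuntimeError(f"checksum mismatch for {target}")
        manifest[new_name] = before
    return manifest
```

Calling rename_scans(Path("some_folder")) on a copy of your files would return a small manifest of new file names and their checksums, which you can save and compare against later to check file integrity.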

  • Professional Development

Coding in different programming languages does not just let you add acronyms to your resume. It helps my professional development to be able to say that I have an arsenal of tools and transferable skills that I can bring to the table for solving problems at any type of library, archive, museum, or information-related occupation.

  • Getting “Under the Hood”

When working as an information professional, sometimes you will be called upon to troubleshoot computers and computer software on the job. If you have some basic scripting and HTML skills, you will be able to get “under the hood” and see where a web page may not be coded correctly, or at the very least have an idea of what a third-party database does when it performs a search. Now, I don’t know ANYTHING about car repair, but I know the basics of how to check the oil, view the anti-freeze level, check my tire pressure and such. I can read my car’s manual, and even if I cannot fix my car, I can attempt to converse appropriately with the service department at my car dealer.  Which brings me to…

  • Critical Thinking

Being able to code, at least a little bit, brings an information professional into the world of computer programmers. You will learn a logical way of approaching problems. You will learn new vocabularies for coding. Even if you never have to write programs for your job, you will be able to converse more intelligently with vendors and tech support personnel. I have found that this knowledge has greatly increased the sensitivity of my internal “bullshit detection sensor”. This is important when dealing with sales representatives who want to sell you the latest/greatest/shiniest new software or hardware, or with technical support personnel who may not really know the answer to your problem. Thinking through problems in a logical manner is a side benefit of learning how to code that can help you both on and off the job.

  • Maintaining Mental Flexibility

Studies have been done which discuss the benefits of keeping active both physically and mentally as we age. Now, I admit that I am a “data potato” (i.e. a “desk jockey”) but I am NOT letting my brain turn to mashed potato mush. From the Franklin Institute, here is a page about exercising the human brain.

Learning to code feels very similar to me to learning a second or third language. I LOVED my foreign language classes in high school and college, and hope to continue to learn other languages in the future, even just for some basic conversation. (One of my goals is to take a Korean class one day.) Being forced to think critically, follow logical steps, look for patterns in repetitive data, associate symbols and values, and learn complex syntactical combinations all exercise the brain. Coding keeps you young!  (OK, and hanging out with younger nerds doesn’t hurt either.)

Having learned some Perl and SQL programming and now being in the process of learning Python and JavaScript, I am reaping all of the above benefits. I recommend it to anyone, and especially recommend it for librarians, archivists, and information professionals.

Rutgers Law Library (Camden, NJ) Project Featured in Philadelphia Inquirer


John Joergensen of Rutgers Law Library Camden NJ (Image by Kevin Riordan of the Philadelphia Inquirer)


Last week, my supervisor, John Joergensen, was profiled in the Philadelphia Inquirer. Check out the article to see the kinds of projects we do at Rutgers Camden and why we do them. The tall bookshelves featured in the photographs are right next to the desk where I sit in Room 510, aka the “scanning room”, the hub of the Rutgers Law Camden Digital Library.

Way to go, John!

If you are interested in John’s work, check out his blog at A Hacked Librarian, or follow him on Twitter:  @jjoerg42

GSP: Presentation on Newspaper Digitization Projects (8/6/11)

On Saturday, I attended a local presentation on the topic of large-scale historical newspaper digitization projects, sponsored by the Genealogical Society of Pennsylvania. The first speaker was Sue Kellerman of the Penn State University Libraries Digitization and Preservation unit. She discussed her role in the Pennsylvania Newspaper Project from 1983-1990 and how she traveled around the state of Pennsylvania finding and cataloging historical local newspapers held by libraries, archives, historical societies, and individuals. She even rescued some old newspapers by purchasing them from antique dealers. What a fascinating, albeit daunting, job to be tasked with!

The second part of the presentation centered on the PA Newspaper Digitization Project, managed by Karen Morrow of Penn State University. Five historical newspapers from different counties were selected for digitization from microfilm masters. During Phase I, the project received NEH funding to digitize approximately 103,000 pages from papers dating from 1888-1922. In Phase II, they received a second grant to digitize another 100,000 pages dating from 1836-1922. These papers are now included in the Library of Congress searchable Chronicling America database. Because of the extensive metadata provided and full-text keyword search capabilities, these newspapers are wonderful free resources for genealogists. The database contains almost 4 million digitized pages from newspapers dating from 1836-1922 (public domain) covering about 25 US states. In addition to full-text searching, users can also search the U.S. Newspaper Directory, which contains listings, dating back to 1690, of all newspapers available in digital formats (via free or pay sources) from all U.S. states.

Scans were produced using the LC NDNP technical specifications.  Technical information about the database API can be found here.

I was interested to hear that, using the LC technical specifications, scanning from microfilm to digital file cost the project approximately $1.00 per page.

Morrow and her team provided the audience with detailed reference lists of Pennsylvania newspapers which are already digitized and can be accessed for free or via paid subscriptions. Resource lists are available on the Digitized Titles page of the PADNP Phase II Blog.

Both Kellerman and Morrow mentioned that currently, the LC is only able to accept newspapers in English, French, Spanish and Italian due to font issues. In the future, they hope that LC will be able to accept German fonts to be able to digitize the large number of historical German-language newspapers found in Pennsylvania.  The team will be writing a grant for Phase III of the limited three-phase grants from the National Endowment for the Humanities to continue their work beyond 2012. I sincerely hope that they receive their grant, and future ones, to be able to preserve our state’s wonderful historic periodicals.

In addition, there are many more titles that could not be chosen due to time and financial constraints. Many counties in the state of Pennsylvania are still underrepresented in the digitization of historic newspaper content. (See the PADNP map tracking digitization rates across counties in the state.) The field of digital collection building is ripe for greater work in this area, especially now that Google has decided not to add new titles to its newspaper digitization endeavors.

What should small libraries or historical societies do if they want to preserve their historic newspapers but lack the funds to do so? The presenters suggest that small institutions focus their limited resources on microfilming, for long-term preservation, those issues or runs that get the most reference requests or patron usage and are therefore most likely to incur wear and damage. The key is to preserve as much as you can, even in partial runs or in stages over time, instead of waiting for funding and not preserving anything. Small-scale fundraising can be very effective if done in stages.

At the end of the presentation, Kellerman focused on a special collection sponsored by PSU Libraries, the Pennsylvania Civil War Era Newspaper Collection. Did you know that Pennsylvania is currently the only state to have published online a digital collection of newspapers dating specifically from the Civil War era? (I did not. Kudos to PSU!)

Thank you to the Genealogical Society of Pennsylvania for sponsoring this informative presentation.