Record of recent activity on Early Writings sites, place for discussion and new ideas about the same, and all-purpose narcissistic blogging location.

Wednesday, January 28, 2009

Free Latin, Coptic, and Greek Texts

I received in the mail today something I sent away for two weeks ago. It is a comprehensive collection of Latin texts down to 200 CE, along with an extensive collection of Greek papyri and several versions of the Bible. High in quality and delivered on two CD-ROMs, it is the kind of material that would go for over $500 if sold commercially, and that's only if it had the good fortune to be a Logos add-in instead of something aimed at the university library market. Yet I have the invoice right here and, all told, I didn't pay a cent. And if you don't have someone who'll lend you a stamp, it'll cost you less than a dollar.

I'm talking about the Packard Humanities Institute's "PHI CD ROM" #5.3 and #7. The effort that went into producing these discs was fully funded by PHI, a non-profit private foundation, and libraries and independent scholars around the world can benefit from their (non-commercial, educational, and non-transferable) use just by sending in the right form to the right address. It's the kind of paper-pushing arcana that allows astute people to get bigger tax refunds or more governmental assistance; you won't find any link to this form on the official site of the Packard Humanities Institute (as you are expected to email them first). But once you have the tip, it's easy to do.

I downloaded the forms from a site called In Rebus and you can also get them from my site in this download, where I've changed them from MSWord (.doc) to Portable Document Format (.pdf) and checked the box for "Individual" by default (instead of Institution). This infusion of source material to study--for free--comes at just the right time, when many are feeling the effects of the recession. Enjoy!

Saturday, January 24, 2009

On signal to noise and the blogosphere

"Signal to noise" has been a phrase applied to the ratio of quality posts to posts of no merit ever since there has been a medium for the masses on the Internet, starting with the venerable Usenet, and being applied also to the realms of e-mail lists and web boards of all types. The grand unified theory of signal-to-noise, in any medium where all involved have equal post rights, involves two equations:
  • As the number of users increases, without moderation, the number of users with high output of noise increases, because there is a bell curve distribution of users with less than 100% signal on average.
  • As the signal to noise ratio decreases, the number of users with high output of signal decreases (those higher on the bell curve), as they are frustrated that their careful efforts get no more attention than the nonsense, and they usually cannot be bothered to keep up with the nonsense, especially by way of response.
These two equations work together to drive down any unmoderated channel's signal-to-noise ratio. There is also a law of Usenet that any moderated channel will not survive if there is an unmoderated channel with the same function and topic. All of this is why Usenet has been abandoned as a primary means of collaborative communication by most of the people who were originally there, at some point following the commercialization of the Internet in the mid-1990s.
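
For the programmers in the audience, the two rules above can be sketched as a toy simulation in Python. Everything in it is an assumption made for illustration: the `User`, `step`, and `simulate` names, the starting group of four users at 90% signal, the cohort of newcomers that joins each round, and the `patience` threshold that decides when a high-signal user gives up. It is not a measurement of any real channel.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class User:
    signal: float  # fraction of this user's posts that are signal (0.0 to 1.0)


def channel_ratio(users: List[User]) -> float:
    """Share of posts that are signal, assuming every user posts equally often."""
    return sum(u.signal for u in users) / len(users)


def step(users: List[User], newcomers: List[User], patience: float) -> List[User]:
    """One round: newcomers join (rule 1), then frustrated users leave (rule 2)."""
    everyone = users + newcomers
    ratio = channel_ratio(everyone)
    # A user leaves once their own signal exceeds the channel's ratio
    # by more than their patience (a hypothetical threshold).
    return [u for u in everyone if u.signal - ratio <= patience]


def simulate(rounds: int, patience: float = 0.25) -> List[float]:
    """Return the channel's signal ratio at the start and after each round."""
    users = [User(0.9)] * 4  # assumed founding group of careful posters
    # Assumed bell-curve-like cohort that joins every round.
    cohort = [User(0.2), User(0.5), User(0.5), User(0.8)]
    history = [channel_ratio(users)]
    for _ in range(rounds):
        users = step(users, cohort, patience)
        history.append(channel_ratio(users))
    return history


if __name__ == "__main__":
    for round_number, ratio in enumerate(simulate(4)):
        print(round_number, round(ratio, 2))
```

With these made-up numbers the ratio slides from 0.9 to about 0.7, then 0.5, then settles near 0.4 once the best posters have walked away. The point is only the direction of travel, not the particular figures.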

The fine-tuning of permissions and moderation policies possible on e-mail lists and web forums has further shown that excessively selective moderation has a depressive effect on posting quantity, even as it brings up the ratio of signal to noise; there would have been more absolute signal with less moderation, and this is in part because even a bit of nonsense can be redeemed into a productive thread by those who don't write nonsense generally.

How does this apply to the blogosphere? It is an interesting medium because, in order to have a full voice, one needs to cultivate one's own forum in which he or she is the primary speaker (or one of a consortium of speakers), which is entirely unlike the format in which everyone who happens to be reading and logs in can participate equally. Indeed, it feels more like the bloggers are individuals who step up, in turn, to the podium and give presentations or speeches, while the rest of the people logged in (with comments) are merely raising hands in the audience. It's explicitly a system where one does not get an equal vote by right of participation alone, but by how much one invests in that participation by cultivating one's own blog, and in this way it is more like an oligopoly than a democracy, with every blogger plying his own way.

Are you with me? Since every blogger is competing for market share--readership and greater recognition among the rest of the blogosphere--in order to get his or her comments seen, instead of having them automatically seen by all who are paying attention to a certain list subscription, there is an organic moderating effect on all bloggers. Quite simply, if people think you are not making enough signal compared to the noise your blog produces, they can stop following your blog posts as much. It's like the "invisible hand" of the free market, and in this case it very nearly works as Adam Smith promised, since there is very good (if not perfect) information available to the blog consumer and there is a very low barrier to entry for the blogger. The best blogs get read most, where "best" is measured as that which people want to read.

This blog post is entirely theoretical except for a single note that is not meant to be sensational, but which will illustrate the point. Stephen C. Carlson updated his blog a half dozen times in December and only once so far in January, the latter being a terse note about the revival of my Early Christian Writings. I am of course biased because that pretty little note keeps showing up in my gmail list of feeds followed and because I have been in touch with Stephen several times by instant messenger, but it's got to be admitted that when Stephen posts, there's usually something worth reading. Meanwhile, the indefatigable Jim West outshines all other bloggers on the Bible by a mile for consistency of high volume of output over time. For that reason, you either love his blog, or you may (as I quickly did) tire of having it in your list of to-read blogs. Due to my absence from the blogosphere, I'm currently fighting to keep up with what's being written now so that I can delve here and there into what was posted in the past. The last straw for having Jim West in my feed list was this Doofy Darwinist post that draws on The Onion to make some kind of oblique dig at Hershel Shanks's BAR magazine. I do not subscribe to the Biblical Archaeology Review, I do not read The Onion, and I won't be keeping Jim West's blog in my Google Reader. All of these cases relate to the fundamental axiom of optimizing your reading activity in this respect, which is to eliminate those sources with intolerable levels of noise.

Pulling in the reins on my imagination

Over the past week I've been investigating various technologies that could be applied to scholarly literature and primary source texts. These range from techniques of summarization to information extraction, machine translation to computer understanding, and use every possible linguistic resource, from gigantic tagged corpora for statistical study to meticulously-constructed hierarchies of semantic information for more rule-based grammatical analysis.

At the end of this week, I'm starting to realize that this is the kind of stuff in which professionals and graduate students get lost for years, only to produce long monographs that conclude with statements along the lines of, "We haven't gotten any closer to what we want to do, but we've learned so much along the way as we have tried to do it."

So I am cutting back on how far my imagination will stretch for the new site. It will still be at earlywritings.com (as opposed to the sourceb.org name that I selected for something even more grandiose), it will at first cover no more than three distinct domains (Christian, Jewish, and Classical Latin), and it will branch out beyond its ranges of date and provenance only slowly, as supported by careful handmade information that expands the frontier and ensures that the new material is of as high a quality as what is already covered.

So I will be pulling in the information on the Early Latin Writings from Chris Weimer and combining it with the existing information on Early Jewish Writings and Early Christian Writings, then adding a dynamic website engine (possibly built on Drupal or another existing CMS technology) with some features that I think are important, and then gradually adding material and features in response to testing of and feedback on the new site. I will not try to build a masterpiece from the get-go; that's fine for engineering in the material world, but in the space of information science it is usually a formula for pushed deadlines and fundamental mistakes of chasing rabbits down holes, from which it is hard to recover. Rapid prototyping wins in cyberspace.

Enamored of history once more

While I'm posting on the direction my future studies will take (Latin and Greek), I should also mention that I discussed things with people and have redirected my future university study back to the subject of History. Because I have already completed more units in the major, and because the major requires fewer total units, I can obtain a History degree in a third of the time that it would take to get a Chemistry degree (one year versus three). Since the state of California does not require its teachers to hold a Bachelor's degree in any particular subject, I could do high school teaching in any subject with the History BA, just by passing an exam (and I excel in examinations). The only thing I would need, along with the Bachelor's, is a year-long credential program at Cal State Fullerton. Or, if I decide to do so instead, I could go full-time with an online business enterprise or two after the BA, or possibly support myself through a Ph.D. program right after (and Claremont has its appeal). We will see.

What Books I'm Studying Next

I have a large, 5-shelf bookcase stuffed with books on all levels, with stacks on top also, most of them about the interpretation of the Bible, the methods of history, and particular New Testament topics. (I also have smaller sections on science and computer programming.) Yet when I thought about which books I should be cracking open next, none of them were quite appropriate, because none of them did a very good job of what should be foundational to a serious study of ancient history.

That foundation is the study of the languages used in the ancient texts.

For this reason, I purchased four books that arrived from Amazon today: Teach Yourself Latin, Wheelock's Latin, Teach Yourself Greek, and Oxford Grammar of Classical Greek. With the help of these books and other tools that are online, I will progress beyond the bare basic understanding that comes from reading secondary literature based on close reading of the sources and move toward a better understanding that will allow me to make, more rapidly and accurately, my own close reading of those sources.

This is, after all, the same criterion that I have applied to other authors when weighing them as authorities on the interpretation of a text: if they did not have a good grasp of the language, their opinion was of little merit. I should seek in myself the same competencies that I look for in those I rely upon.

Tuesday, January 20, 2009

The Licensing of the Early Writings Site

There are, generally, two types of corpora (in licensing terms) that are being imported into the EarlyWritings.com site, and a third class of corpora that could be integrated only offline.
  1. The first type of corpora is that designated 'public domain'. Whether that is from Project Gutenberg, from modules in existing Bible software, from scans on Google Books, from other sites such as Christian Classics Ethereal Library, or just by deduction from the date of publication (works published in 1922 or earlier are in the public domain in the United States), these works have no restriction in their licensing terms.
  2. The second type of corpora is that designated 'non commercial', or more specifically Creative Commons Attribution - Noncommercial - Share Alike. This is the license type used by the Perseus Project at Tufts and (roughly speaking) the Oxford Text Archive, among others. Works under this type of license cannot be made public domain (as some rights are reserved), but the reverse is possible: public domain works can be subsumed into a work that, in general, has a Creative Commons license.
  3. The third type is that which prohibits redistribution entirely. The most significant sources of this type are the Thesaurus Linguae Graecae (of UCI) and the PHI #5 and #7 CD-ROMs (from the Packard Humanities Institute). The latter are free to individuals who write for them, and the former is available to individuals by subscription (formerly on CD-ROM). One might also include in this category any modules or data for proprietary software packages such as Logos and Accordance, assuming that they could be read (and that there is no legal issue with doing so). As stated, the website could not present such material, but offline software could read it if the user already obtained it on their own.
A fourth category is presented by the input from the creators and users of the site, which could be licensed any way in which the site's terms state. A custom license might, for example, make provision for mirrors so long as there is a prominent notice of the www.earlywritings.com domain as the place where the material was originally collected.

A last, fifth category is the software that runs the site and, if implemented, an offline client. Options here include free / libre / open source software licensing (which makes most sense if other sites would want to take advantage of the software and contribute to its code themselves) or a proprietary, copyright license (which makes sense if it isn't likely to be a reused tool and, rather, is likely to be maintained only by me and maybe one or two others). The latter would make it more likely that offline users of the software pay a reasonable amount for it, allowing funding to go to the site itself and to the pockets of those who build it.

With these five different licensing terms being considered, here is how I see them interacting:
  1. Public Domain materials are clearly marked and demarcated as "public domain" where they appear (for purposes of citation and reuse) and subsumed into the larger site licensing (license #4).
  2. Creative Commons Attribution - Noncommercial - Share Alike materials are clearly marked and demarcated as "cc-by-nc-sa" with a link to the full license (again, for purposes of citation and reuse) and subsumed into the larger site licensing (license #4).
  3. Materials that prohibit redistribution (the third type above) are accessed by software under full copyright (license #5) but are themselves under the copyright of their respective copyright holders.
  4. The site as a whole has a "cc-by-nc-sa" license because it is impossible to adapt a "share-alike" licensed work into a work that is licensed in a different way. (See the Creative Commons FAQ.) But the "by attribution" part does allow the work's creators to specify the manner of attribution, which would be by prominent display (and hyperlink if possible) of the Early Writings URL.
  5. The software doesn't need to be exposed to the web user. It can be licensed for a user when sold (if proprietary), or it can go the open source route. As I've suggested already, I am inclined to the proprietary route due to lack of predicted outside assistance (if it is open source) and because it seems the most natural way for me, the hacker, to profit off of the whole affair (doing the work on compiling and formatting all the public domain and Creative Commons content, more or less, as a "loss leader").
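
For anyone who thinks better in code, the way the three source types feed into the site license can be sketched in a few lines of Python. This is only an illustration of the plan as described above, not code from the site: the `SourceLicense` enum, the `can_publish_on_site` and `display_label` functions, and the `SITE_LICENSE` constant are all hypothetical names, and nothing here is legal advice.

```python
from enum import Enum


class SourceLicense(Enum):
    """The three kinds of source corpora (hypothetical names)."""
    PUBLIC_DOMAIN = "public domain"
    CC_BY_NC_SA = "cc-by-nc-sa"
    NO_REDISTRIBUTION = "no redistribution"


# The site as a whole carries the share-alike license (item 4 above).
SITE_LICENSE = SourceLicense.CC_BY_NC_SA


def can_publish_on_site(source: SourceLicense) -> bool:
    """Public domain and cc-by-nc-sa texts can appear on the website;
    texts that prohibit redistribution are left to offline software."""
    return source is not SourceLicense.NO_REDISTRIBUTION


def display_label(source: SourceLicense) -> str:
    """Label shown next to a text so readers know how it may be cited and reused."""
    if not can_publish_on_site(source):
        raise ValueError("this material cannot be presented on the website")
    return source.value


if __name__ == "__main__":
    for source in SourceLicense:
        print(source.name, can_publish_on_site(source))
```

Each text keeps its own label even though the collection around it is cc-by-nc-sa, which is the whole point of marking and demarcating the sources.
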
There was a time when I was nearly (but never quite fully) enchanted by the mantra that information "wants" to be free and that there is never a good reason for restrictive licensing. Ironically, it is a book by an open source advocate (titled The Cathedral and the Bazaar) that inclines me against such a dogmatic approach to free (as in beer or as in speech) information. The concept (of "free" or "open source" licenses) is not adopted because it is inherently charitable or especially humanistic, but because it works in satisfying the needs of those who are involved with creating and using the information.

In this case, the stuff that has already been and is being produced by online communities is the raw information (the texts, ebooks, and so on). The free or open source licensing (be that public domain or Creative Commons) works for them because it is a powerful motivator. The idea that a work, once digitized and machine-readable, can remain so in perpetuity, strokes the ego in the sense of being an immortality project (something that will outlast you) as well as being, broadly, an exercise in goodwill.

Meanwhile, the Early Writings website requires software that will run on the webserver for providing the information requested by users. That means, from the get-go, that software must be written and a webserver must be paid for. At the same time, the initial legwork of gathering content (primary and secondary) to be accessed through the site is likely to be the work of one man (that is, me). And the reason that I am doing all this, full-time, when I could instead be applying for jobs, is quite honestly because it is not only more rewarding in itself, but it pays better. The Early Christian Writings and Early Jewish Writings CD-ROM sales proved that there are a number of people who are willing to support the efforts of disseminating these materials and who wish to have an offline program.

Moreover, as far as income for the site goes, it is far from a pet project in its conceived form. I have already, a couple of years ago, commissioned Chris Weimer to produce an Early Latin Writings website, which has been essentially completed so far as the secondary material (comments from Chris) is concerned. I also gave a contract to my sister to proofread the scanned editions of Charles's Pseudepigrapha, and now the entire work will be available in machine-readable HTML format. In the future I plan to commission fresh translations of parts of the Nag Hammadi Library and Dead Sea Scrolls that can be displayed online.

I hope readers are as glad as I am to see the websites growing again after years of resting on laurels. However well-earned those laurels may have been, anything that is not getting stronger is dying in the online world. Please be sure to email me at peterkirby@gmail.com if you have any of your own thoughts on the directions of the sites in the future.

Sunday, January 18, 2009

Need Aid to Build Early Writings Site

Copying the plea for help from the Early Christian Writings homepage. The list of possible jobs here is not exhaustive; another kind of job could be programmer. Please just be sure to email me at peterkirby@gmail.com if you want to help build the site.

I need volunteers (with or without special skills) to help with the EarlyWritings websites. Anyone, for example, can volunteer to help convert scanned books in English and their images into machine-readable text (proofreader). People who know any combination of English, French, and German can help translate among the three languages in which input to the sites will be accepted. People who know any of these three languages, plus one of another seven--Mandarin, Japanese, Spanish, Russian, Portuguese, Arabic, Hindi--can help with the translations to these languages. Anyone with an understanding of English, French, or German and a familiarity with scholarship can make contributions. Lastly, those with facility in the original languages (such as Greek, Hebrew, and Latin) can help with making fresh translations of the original texts or better critical editions of the original texts. Please email peterkirby@gmail.com in order to volunteer and mention any areas of interest or skill. Some jobs (those requiring more special skills and dedication, such as original language facility) may even be compensated. There are also more opportunities that have yet to be thought of (e.g., spam scouring and quality of submission checking) that require no special ability. So email peterkirby@gmail.com today to express interest!
