A Reference Model for Web Site Globalization

A generic model illustrating the steps and components required for the automated management of multilingual Web sites.

(Printed in Multilingual Computing, issue #35, Oct/Nov 2000, under the name “Evaluating Technologies for Web Site Globalization”)

Your Web site is going global. Your company has realized that maintaining the company’s Web site in multiple languages is critical; the problem is now yours and it’s basically a can of worms. You already have several contributors providing content in different formats with different tools and now the problem is going to be multiplied by 10 languages. Chaos is just around the corner and it’s not an option. So you start looking for a multilingual process automation and management tool. Your first questions may be: Is it available? How much will it cost? How seamlessly will it integrate with my Web site? Will I have to spend days, weeks or months integrating it? What languages are supported? What content types are supported?

But these questions alone are not sufficient. The major cost of the system will typically be the on-going cost of localization and management over the years, not just the initial costs. The cost of translation can greatly depend on the tools provided to translators: Can the translator work in context? Does the system contain a translation database? If so, how efficient is it? Online translation, for example, may be too slow or too expensive in certain countries. Similarly, the cost (your time) of management will depend on the quality of the management tools and, in particular, on how they help you deal with problems. With localization tasks performed by different people in multiple countries there will often be problems; how quickly the system warns you of these problems and how easily you can solve them will impact not only the release date and cost but also the image of your department.

Fortunately, several technology vendors have emerged and are offering global Web site automation solutions, e.g., in alphabetical order: GlobalSight, Idiom, Trados, WorldPoint. Several CMS (Content Management Systems) vendors are also adding language management features to their solutions, e.g. Glides, Tridion. The domain is new, the technologies are still young and detailed information on these technologies, beyond what can be found on Web sites, is often difficult to obtain.

Different vendors present their technologies in different ways with diverging terminology. Trying to understand what's available is already difficult; comparing them without a reference model is nearly impossible.

This article presents a generic model developed for the purpose of understanding and evaluating Web site globalization technologies. The model encompasses the entire process from the initial assessment to the automated localization cycle that keeps multiple local Web sites synchronized with a source site. We have successfully utilized the model to enable a parallel evaluation of GlobalSight, Uniscape, Tridion and WorldPoint products for internal purposes. More evaluations are underway.

A bird’s eye view

The top-level objective is to keep a series of localized target sites synchronized with a given source site. As shown in the figure below, this happens in two phases:

Set-up

During the initial set-up phase, the Web site content and applications are assessed and the globalization strategy is defined. As with any localization project, the Web applications, databases and scripts must first be properly internationalized to support the target languages. Double-byte and bi-directional languages, in particular, can require a significant amount of work. Finally, a project is defined. The project specifies the parameters that will control the localization cycle: the source and target languages, coded character sets, and physical locations are described as well as the steps, rules and people involved in the localization cycle. In many cases, the set-up phase involves significant consulting work.

Localization Cycle

Once the system is installed it must be maintained. As the source site content changes, translation work is introduced into the localization cycle. Rules define how the work is batched into jobs. A job is the manageable unit of work in the system. It is useful to distinguish an MLjob, which has many target languages, from an SLjob, which has a single target language. For example, Uniscape uses the word order for MLjob and sub-order for SLjob. GlobalSight uses the word job, but all jobs are SLjobs. The workflow engine routes the jobs through the system in an automated fashion. The workflow is specified as a series of processes that define what needs to be done at each step. Some vendors call the set of processes the workflow template. Jobs represent actual work done by the system; processes define how jobs are executed.

The Generic Model

With the basic concepts introduced, the generic model can be presented. The model highlights the major components and steps involved in the process. Although the order of steps presented here is logical, different technologies may use a different order, may group two or more steps into a single step or may not implement certain steps at all. The model is also independent of who actually performs the work. For example, if the customer is using a localization services firm, many steps will be performed by the vendor and others by the customer. When working with a small translation house, there will be a different sharing of responsibilities. The value of the model is that it provides a certain completeness to the description and a vocabulary to discuss such issues.

 


Generic model of the automated process provided by Web site globalization technologies

 

Source Site

The source Web site may be of varying complexity. A simple site may be composed of static HTML files. More complex sites may be dynamic and database driven. Sophisticated Web sites are managed using a CMS; in this case, the files and databases are different content types managed by the CMS. Multimedia files - Flash and Photoshop for example - are of particular interest as their localization is inherently more complex. An important issue here is invasiveness: will the Web site have to move to another platform or database? Must tags be inserted in the files? Do the files have to be converted to a template based approach? Must the database schema be changed? Some technologies require significant structural and operational changes to your site. While this may be desirable to utilize the technology, it introduces a component of work, time and expense that must be appropriately budgeted.

Content Management System

In many cases, the source site already has its own CMS and the globalization technology must connect with it. There are however cases where the customer wishes to purchase a CMS and a globalization solution at the same time; this is called the bundled or embedded CMS. It makes sense especially when the on-going cost of localization will exceed the cost of the CMS. In such a case, the features of the CMS, the features of the globalization technology and how well the two are integrated must all be considered.

Target Sites

The final objective is to create and maintain any number of localized Web sites. These target sites may also have files, databases, and a CMS, all of which must be adequately interfaced. The localized Web sites may reside on a single multilingual server along with the source site or may be on different servers in various countries. Local content development must also be possible while keeping synchronization with the source site; this is a significantly under-appreciated need. Finally, for truly global enterprises, the target site for certain content may also be the source site for other kinds of content.

Central Repositories

All the localization steps shown in the generic model revolve around two major repositories: Workflow and Language Resources. Workflow has to do with process automation, while Language Resources help minimize translation work and ensure better quality. The repositories store basic system objects: processes, jobs, language resources. Each of these objects requires management and maintenance with the help of an appropriate tool.

The repositories in the model are logical objects; different systems may map these repositories to physical databases in different ways. For example, some systems may store workflow and language resources in the same database. Other systems are even more integrated and store workflow and language resources along with source content in the source site database.

Workflow

The workflow engine is ideally a general purpose tool that allows arbitrarily complex processes to be defined: any number of steps, any number of people involved anywhere in the world. Most workflow engines reviewed are Internet based and allow anyone reachable by email to be part of a process. Some workflow engines allow deadlines to be associated with jobs or with single job steps; some allow the definition of escalation or re-routing procedures when a job step goes beyond a certain time limit or when some generic failure occurs. The reliability and scalability of the workflow engine are critical.

Process Management

The workflow engine must provide a tool to define processes. Such a tool will be used extensively during the initial set-up and progressively less as process definitions mature. Processes are basically defined in some scripting language (some directly in Java) but most vendors provide a GUI-based tool for process definition. Some of these tools are mature enough for use by customers; others are designed to be used by the vendor’s consultants. Flexibility is the main criterion.

Job Management

As the system starts processing jobs, it must provide a management console to allow tracking and control over the jobs. Some systems have distinct consoles for the customer and for the localization vendor. The tracking features should make it easy to see all jobs in the system and in particular to see all jobs that are stuck or behind schedule. The control features should make it easy to get jobs unstuck, either by rerouting them or by canceling them.
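The tracking feature described above can be sketched in a few lines of Python. This is only an illustration: the JobStatus fields, the two-day idle limit and the sample jobs are assumptions made for the example, not the behavior of any vendor's console.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class JobStatus:
    """Hypothetical tracking record for one job in the workflow."""
    job_id: str
    step: str
    deadline: datetime
    last_activity: datetime


def needs_attention(jobs: List[JobStatus], now: datetime,
                    idle_limit: timedelta = timedelta(days=2)) -> List[str]:
    """Return ids of jobs that are past their deadline or idle too long."""
    return [
        job.job_id
        for job in jobs
        if now > job.deadline or now - job.last_activity > idle_limit
    ]


# Illustrative data only.
now = datetime(2000, 10, 15, 12, 0)
jobs = [
    JobStatus("J1", "translation", datetime(2000, 10, 20), datetime(2000, 10, 15, 9, 0)),
    JobStatus("J2", "review", datetime(2000, 10, 14), datetime(2000, 10, 15, 8, 0)),
    JobStatus("J3", "translation", datetime(2000, 10, 30), datetime(2000, 10, 10)),
]
flagged = needs_attention(jobs, now)
```

Here J2 is flagged because it is behind schedule and J3 because nothing has happened to it for five days; a real console would also offer the rerouting and canceling controls.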

Language Resources

Language resources include a translation database, also referred to as translation memory or TM, glossaries, dictionaries and possibly automatic translation software. Language resources reduce the cost, increase the quality and increase the consistency of translation work. In the most general sense, language resources store translation knowledge; if the system knows how to do something once, it may be able to do it again automatically (or at least make pertinent suggestions to the human translator).

Language Resource Maintenance

An often neglected point is that language resources must be maintained. The more work that is routed through the system, the more translation knowledge is accumulated. The system has more pre-translated data and, presuming appropriate authoring guidelines (e.g. not changing text unnecessarily), translation costs should decrease as re-use increases. But as more and more data is accumulated, the system will also accumulate different translations for the same text segments. As the knowledge grows, it becomes less precise and contains more “garbage”. Language Resource Maintenance is required to avoid chaotic growth of translation knowledge and ensure that the captured data can be leveraged in a meaningful way.
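One simple maintenance check is to list the source segments that have accumulated more than one translation. The sketch below assumes the translation database can be exported as (source, target) pairs; the function name and sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def find_conflicts(entries: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each source segment to its translations, keeping only conflicts."""
    by_source: Dict[str, Set[str]] = defaultdict(set)
    for source, target in entries:
        by_source[source].add(target)
    return {src: tgts for src, tgts in by_source.items() if len(tgts) > 1}


# Illustrative English-to-German entries.
entries = [
    ("Contact us", "Kontakt"),
    ("Contact us", "Kontaktieren Sie uns"),
    ("Home", "Startseite"),
    ("Home", "Startseite"),
]
conflicts = find_conflicts(entries)
```

The conflicting entries would then go to a linguist who decides which translation to keep; that decision is not something the sketch attempts.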

Set-up

Globalization Services

Most projects require globalization services and in many cases they represent a significant cost. The first and most critical service is assessment: sources and systems are reviewed to evaluate the complexity and cost of internationalization, and a globalization strategy is defined. Other services include internationalization training, consulting and placement. Most internationalization is not technically difficult; managing it efficiently is. The critical issue here is time-to-market and selecting the right strategy.

Project Definition

The project embodies what does not change or changes rarely. The project describes the source content, the target content and the mapping between the two. Content is described by its location, its type, its language and its character set. The mapping may also have attributes such as priority and preferred resources. For example: “File A - in English coded in ASCII - is to be translated to File AG - in German coded in Unicode - and it should be handled as high priority (since this is a product announcement) and I would prefer Klaus X. as translator”. Some systems support character set conversions, others support file type conversions.
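As an illustration only, the example mapping above could be captured in a small data structure. The class names, field names, file names and the choice of UTF-8 as the Unicode encoding are all assumptions made for this sketch; they do not correspond to any vendor's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentRef:
    """One piece of content: where it lives and how it is encoded."""
    location: str
    content_type: str
    language: str
    charset: str


@dataclass(frozen=True)
class Mapping:
    """Links one source item to one target item, with optional attributes."""
    source: ContentRef
    target: ContentRef
    priority: str = "normal"
    preferred_translator: Optional[str] = None


def needs_charset_conversion(mapping: Mapping) -> bool:
    """True when source and target use different character sets."""
    return mapping.source.charset != mapping.target.charset


# The "File A" example from the text; locations and type are illustrative.
mapping = Mapping(
    source=ContentRef("FileA.html", "html", "en", "ascii"),
    target=ContentRef("FileAG.html", "html", "de", "utf-8"),
    priority="high",
    preferred_translator="Klaus X.",
)
```

A project definition would then be little more than a list of such mappings plus the workflow parameters described earlier.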

Localization Cycle

The localization cycle is a sequence of steps that are repeated at regular intervals. The cycle is executed when the source Web site content changes and a job enters the system.

CMS Interface

This module integrates with the existing CMS on the source Web site. Its main purpose is to provide a seamless integration with the CMS content control mechanisms for a variety of content types. Various degrees of integration are possible. At a minimum, the CMS Interface must inform the Change Detection module that some content has changed. Higher levels of integration make the globalization technology appear as an extension of the CMS: the CMS tracking system sees jobs in the localization cycle, or the CMS workflow definition system can be used to define localization workflow.

Change Detection

The Change Detection module monitors the source Web site and is responsible for detecting changes and initiating action. Change monitoring may operate continuously or at regular intervals. When examining files it can detect change by a variety of mechanisms: file modification dates, checksums, etc. When connected with a database or a CMS the change detection module will usually wait for change notification from a database trigger or from the CMS Interface. The changed data, or a pointer to it, is passed to the Job Creation module.
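Checksum-based detection for static files can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, SHA-256 is an arbitrary choice of checksum, and deleted files are not handled.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def checksum(path: Path) -> str:
    """Return a hex digest of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_changes(root: Path, known: Dict[str, str]) -> List[str]:
    """Return relative paths of new or modified files and update `known`."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(root).as_posix()
        digest = checksum(path)
        if known.get(key) != digest:
            changed.append(key)
            known[key] = digest
    return changed


# Small demonstration on a temporary directory standing in for the source site.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    page = site / "index.html"
    page.write_text("<p>Hello</p>", encoding="utf-8")
    known: Dict[str, str] = {}
    first = detect_changes(site, known)    # the new file is reported
    second = detect_changes(site, known)   # nothing has changed
    page.write_text("<p>Hello, world</p>", encoding="utf-8")
    third = detect_changes(site, known)    # the edit is reported
```

The list returned by each scan is what would be handed to the Job Creation module. Comparing modification dates is cheaper than hashing but can miss or over-report changes when files are copied or restored.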

Job Creation

The Job Creation module receives a variety of data content that needs to be translated. Each data item has a list of attributes provided by the project definition; source and target languages, for example. The Job Creation engine groups the data items into one or more jobs for the workflow system. The jobs may be created on the basis of number of files, file types, target languages, priority, etc. This module is also responsible for providing proper context for each changed data item. When dealing with dynamic content from databases, providing the proper context is challenging. Indeed, in such a case, the system must know in what page or pages the database item appears and must provide these to the translator. Making this possible requires custom work during the set-up phase.
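The grouping step might look like the sketch below, which uses one possible rule: one job per target language and priority. The Item fields and the rule itself are assumptions for the example; real systems let the rules be configured in the project definition.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Item:
    """A changed data item with attributes taken from the project definition."""
    path: str
    source_lang: str
    target_lang: str
    priority: str


def create_jobs(items: List[Item]) -> Dict[Tuple[str, str], List[Item]]:
    """Group items into jobs keyed by (target language, priority)."""
    jobs: Dict[Tuple[str, str], List[Item]] = defaultdict(list)
    for item in items:
        jobs[(item.target_lang, item.priority)].append(item)
    return dict(jobs)


# Illustrative changed items.
items = [
    Item("news/launch.html", "en", "de", "high"),
    Item("news/launch.html", "en", "fr", "high"),
    Item("about.html", "en", "de", "normal"),
    Item("faq.html", "en", "de", "normal"),
]
jobs = create_jobs(items)
```

Because each group has a single target language, the jobs produced here are SLjobs in the vocabulary introduced earlier.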

Extraction Filter

The Extraction Filter processes each data item of the current job. According to its data content type, each data item is parsed in order to separate text segments from formatting information. Formatting information may accompany the text segments and allow in-context viewing during translation, also known as a preview function. For each format that can be decoded by the filter, there must also be an encoder that can rebuild the translated file from the translations and the formatting information. The list of supported formats differs from vendor to vendor.
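The decode and encode pair can be illustrated with a deliberately naive filter for HTML-like markup. It assumes well-formed tags and treats every run of text between tags as one segment; production filters segment by sentence and handle inline tags, scripts and attributes. The function names are hypothetical.

```python
import re
from typing import List, Tuple, Union

TAG = re.compile(r"(<[^>]+>)")

Skeleton = List[Union[str, int]]


def extract(markup: str) -> Tuple[Skeleton, List[str]]:
    """Split markup into a skeleton (formatting) and a list of text segments.

    Text segments are replaced in the skeleton by their index.
    """
    skeleton: Skeleton = []
    segments: List[str] = []
    for part in TAG.split(markup):
        if part == "":
            continue
        if TAG.fullmatch(part) or not part.strip():
            skeleton.append(part)           # markup or pure whitespace
        else:
            skeleton.append(len(segments))  # placeholder for a text segment
            segments.append(part)
    return skeleton, segments


def rebuild(skeleton: Skeleton, translations: List[str]) -> str:
    """Encoder: merge translated segments back into the formatting skeleton."""
    return "".join(
        translations[part] if isinstance(part, int) else part
        for part in skeleton
    )


source = "<h1>Welcome</h1><p>Contact <b>us</b> today.</p>"
skeleton, segments = extract(source)
german = rebuild(skeleton, ["Willkommen", "Kontaktieren Sie ", "uns", " noch heute."])
```

Rebuilding with the untranslated segments reproduces the source exactly, which is a useful sanity check for any decoder and encoder pair.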

Leveraging

The Leveraging module tries to translate all extracted text segments using the Language Resources repository. It may find exact and fuzzy matches using a translation database or glossaries; it may also use automatic translation software. The end result, for each text segment, is a list of matches and a classification of these matches into exact matches, fuzzy matches of various precision (90%, 80%, etc.), glossary matches, automatic translation “matches”, etc. The match results will be provided to the translator either by storing them in a job-specific translation database or glossary or by pre-translating the text segments. (It should be noted that the current state of machine translation technology is typically limited to “gisting”, and is never relied upon for translating core content.)
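The classification into exact and fuzzy matches can be sketched with Python's standard difflib module. The character-based similarity ratio and the 80% threshold are assumptions for this example; commercial translation memories use their own scoring, and the tiny in-memory dictionary stands in for a real translation database.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def leverage(segment: str, tm: Dict[str, str],
             threshold: float = 0.8) -> Tuple[str, float, Optional[str]]:
    """Return (match class, score, suggested translation) for one segment."""
    if segment in tm:
        return ("exact", 1.0, tm[segment])
    best_score, best_source = 0.0, None
    for source in tm:
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_score, best_source = score, source
    if best_source is not None and best_score >= threshold:
        return ("fuzzy", best_score, tm[best_source])
    return ("none", best_score, None)


# Illustrative translation memory with a single English-to-German entry.
tm = {"Click here to continue.": "Klicken Sie hier, um fortzufahren."}

exact = leverage("Click here to continue.", tm)
fuzzy = leverage("Click here to continue", tm)    # trailing period removed
no_match = leverage("Our prices have changed.", tm)
```

A fuzzy match is only a suggestion: the translator still has to adapt the stored translation to the changed source text.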

Costing & Approval

All text segments classified by the leveraging step are word-counted. For each text segment, the system now has the word-count, the match class as well as source and target languages, priority, deadline, etc. All these factors can be combined to produce a cost for each data item. This in turn is used to produce a quote for the current job which can be routed to the appropriate authority for approval. Some systems support limited forms of automatic approval. It is important that automatic approval have a budget limit because given the highly dynamic nature of Web content, it is easy to imagine that a large and costly translation job could be accidentally generated.
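A quote along these lines could be computed as in the sketch below. The per-word rates are invented for the example, word counting by whitespace is a simplification that does not work for languages such as Chinese or Japanese, and a real system would also factor in language pair, priority and deadline.

```python
from typing import Dict, List, Tuple

# Hypothetical per-word rates in cents for each match class.
RATE_CENTS: Dict[str, int] = {"exact": 5, "fuzzy": 12, "none": 20}


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def quote_cents(segments: List[Tuple[str, str]]) -> int:
    """Sum the cost of (text, match class) pairs, in cents."""
    return sum(word_count(text) * RATE_CENTS[cls] for text, cls in segments)


def auto_approve(total_cents: int, budget_cents: int) -> bool:
    """Approve automatically only when the quote stays within the budget limit."""
    return total_cents <= budget_cents


# Illustrative job: one exact match (4 words) and one new segment (5 words).
job_segments = [
    ("Click here to continue.", "exact"),
    ("Our new product ships today.", "none"),
]
total = quote_cents(job_segments)
```

With these rates the job costs 4 x 5 + 5 x 20 = 120 cents; anything above the budget limit would be routed to a person for approval instead.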

Work Distribution

Once a job has been approved, the work must be distributed to one or more translators in one or more countries. If it has not happened before, this is where an MLjob is split into several SLjobs. Using word counts and knowing the current load of translators, load balancing is possible. Work notifications are sent to translators by email, preferably in their native language. Some systems support sending the work files as attachments to the notification email, others require subsequent log-in and download, while others support online interaction.
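Splitting an MLjob and balancing the load might be sketched as follows. The SLJob fields, the job id format and the least-loaded-translator rule are assumptions for the example; the translator names and word counts are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SLJob:
    """A job with a single target language."""
    job_id: str
    target_lang: str
    words: int


def split_mljob(job_id: str, words: int, target_langs: List[str]) -> List[SLJob]:
    """Split a multi-language job into one SLjob per target language."""
    return [SLJob(f"{job_id}-{lang}", lang, words) for lang in target_langs]


def assign(job: SLJob, load: Dict[str, Dict[str, int]]) -> str:
    """Give the job to the least-loaded translator for its language.

    `load` maps language -> translator -> words currently assigned,
    and is updated in place.
    """
    candidates = load[job.target_lang]
    translator = min(candidates, key=candidates.get)
    candidates[translator] += job.words
    return translator


# Illustrative current workload, in words.
load = {"de": {"klaus": 1200, "anna": 300}, "fr": {"marie": 0}}
sljobs = split_mljob("J1", 500, ["de", "fr"])
assigned = [assign(job, load) for job in sljobs]
```

A preferred translator recorded in the project definition would normally take precedence over pure load balancing; that rule is left out here for brevity.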

Translation

In this step, the translator actually translates the work received, using the tools provided by the system or their own tools if the system can interface with them. This is likely the most important step, as the main cost of localization is translation and the cost of translation is largely determined by how efficient an environment is provided to the translator. The translation tool may be browser-based or may run on Windows. A browser-based tool is not always feasible for remote translators working in countries where the Internet is slow or expensive. The features a translator needs include: exact matches, fuzzy matches, glossaries, previewing, spell-checking, concordance searches, online dictionaries, etc.

Review

The translation work is then routed for review (editing and proofing). The work is checked for translation accuracy and for overall document correctness. Language quality metrics will be computed on a sample of the work before and after review. For this case and others, these systems should allow for user-defined metrics to be stored in the workflow database. With most systems, review is performed in the same environment as translation. There are however some tools specific to the review process which can be useful: an automated check for common translation errors, for example.

Quality Assurance

The resulting work should then be tested before being published. It is most efficient if the work can be tested by the localization provider that performed the work. This means that the localization vendor must have access to the customer staging site for the various target Web sites. This allows the vendor to test in situ those parts of the site that have changed and others as required. Some systems have features to track bugs and generate bugfix tasks in the workflow.

Work Completion

This step is actually quite simple: a reviewer or tester uploads a tested file or simply hits a “Done” button. The importance of this step is that it represents a milestone at which the translation work performed is deemed correct. It is at this time that the language repository updates its translation database and glossaries. Once the language resources have been updated, it is also the most efficient time to perform Language Resource Maintenance.

Delivery to Target Sites

The final work must be delivered to the target Web sites, by email, FTP, or by going through a CMS interface. Delivery should not interfere with local content development or site operation.

Billing & Collecting

For simplicity, billing in this model occurs after the work has been delivered. In reality, as soon as the job has been approved and resources committed, there can be partial billing. This is a feature that WorldPoint insists on; their system can generate machine-readable invoices and they have interfaced it with some financial systems.

Using the model

This generic model illustrates the major issues involved in building and maintaining a global Web site. It supports the evaluation of these technologies by providing a vendor-neutral terminology and conceptual reference. The model has allowed us to generate an evaluation framework of close to 1,000 questions by elaborating each step of the process and including business-related questions. The evaluation framework has been used to perform a comprehensive evaluation of several vendors including GlobalSight, Tridion, Uniscape and WorldPoint; others are in progress. Web site managers facing the multilingual challenge can use this model to better understand, question and compare the solutions proposed by the various vendors so they can select the appropriate technology to help them manage their global Web site.