
How to Build a Translation Pipeline


Last post I wrote about the business case for a translation pipeline, and why keeping it vendor-agnostic is the ideal scenario.

Now let’s talk about the how. Someday I’ll write more about why you should trust my opinion on this topic, but for now you’ll just have to believe that I have experience in this exact thing.

It Needs Projects and Keys

All TMS Vendors basically have a notion of Projects and Keys. Our translation pipeline will need to have these concepts as well.

Project

This is basically a way to group relevant keys. If you’re a development team, each codebase likely has its own project (but not always).

A project always has a source language (the language your company authors content in) and target languages (the languages into which you’re translating this group of keys).

Keys

This is a single translatable item. There’s really no one-size-fits-all definition of what a key should be.

If you’re a marketing team, a key could be a single blog post, or you could make individual keys out of each paragraph. If you’re a development team translating an interface, then a key could be a button label, a form field placeholder, or really anything with text that will be read by an end user.

What’s most relevant for our purposes is that each key has three important parts:

  • Unique ID – call it a slug, id, or whatever makes sense. It can be words, emoji, or numbers. It can be dynamically created or it can be something specific. What’s important is that it’s unique within your project.
  • Original Value – this is the actual content that will get translated into multiple languages. It can be a lot of text, or a little. It can include HTML, or not. Keep in mind that non-technical translators will be working with this content, so if it has a lot of HTML and variables then it will increase the number of mistakes during the translation process.
  • Description – this is a one-sentence description of where and how this specific key is being used. Providing this to your translator will give them the context they need to provide the most correct translation.
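To make these pieces concrete, here’s a minimal sketch of a Project and a Key as plain data structures. The field names are my own illustrative choices, not any particular TMS’s schema:

```python
from dataclasses import dataclass, field


@dataclass
class Key:
    """A single translatable item within a project."""
    id: str              # unique within the project
    original_value: str  # source-language content to translate
    description: str     # one sentence of context for translators


@dataclass
class Project:
    """A group of related keys sharing source/target languages."""
    name: str
    source_language: str                   # e.g. "en"
    target_languages: list[str] = field(default_factory=list)
    keys: dict[str, Key] = field(default_factory=dict)

    def add_key(self, key: Key) -> None:
        # Enforce the uniqueness rule described above.
        if key.id in self.keys:
            raise ValueError(f"duplicate key id: {key.id}")
        self.keys[key.id] = key
```

Whatever your storage layer ends up being, keeping the model this small makes it easy to map onto any vendor’s Projects-and-Keys equivalents later.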

It Doesn’t Have to be Insanely Performant

You read that correctly. There is never a case where you should be making an API call to your TMS on each page load.

This could be its own blog post someday, but all translations should come from a local file in your repo, or a cache, or, if you must, something like S3.
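As a sketch of what that looks like in practice, here’s a lookup that reads compiled catalogs from a local directory once (not per page load) and falls back to the source language. The `locales/` layout and the `t()` helper are hypothetical names, not a real library:

```python
import json
from functools import lru_cache
from pathlib import Path

# Hypothetical layout: locales/en.json, locales/fr.json, ...
TRANSLATIONS_DIR = Path("locales")


@lru_cache(maxsize=None)
def load_catalog(lang: str) -> dict:
    """Load a compiled translation catalog once per process, never per page load."""
    path = TRANSLATIONS_DIR / f"{lang}.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def t(key: str, lang: str, source_lang: str = "en") -> str:
    """Look up a key, falling back to the source language, then the key itself."""
    return load_catalog(lang).get(key) or load_catalog(source_lang).get(key) or key
```

The fallback-to-key behavior is a judgment call; some teams prefer to fail loudly in development and fall back silently only in production.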

What this means for us is that we could build an incredibly resilient translation pipeline that takes up relatively few resources. It would use SNS + SQS (or similar) to make decisions on changed strings, and it would push compiled translations into S3 or similar. I’ll be discussing this in more technical detail in future posts.

It Needs A Lot of TMS Connectors

This is really the most complicated aspect of a translation pipeline: pushing and pulling translatable content into and out of TMS vendors.

Some vendors have clearly documented APIs; others don’t. Some have very strict rate limiting (for the reasons mentioned in the section above). Your code has to be able to retry intelligently and produce a ton of logging to help developers debug any issues.
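Here’s one hedged sketch of what “retry intelligently” might mean: exponential backoff with jitter, plus logging at every attempt. The helper name and parameters are illustrative; a real connector would also honor HTTP 429 Retry-After headers from the vendor:

```python
import logging
import random
import time

log = logging.getLogger("tms.connector")


def call_with_retries(fn, max_attempts: int = 5, base_delay: float = 0.5):
    """Call a TMS API function, retrying on failure with exponential backoff.

    `fn` is any zero-argument callable that raises on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                log.error("giving up after %d attempts: %s", attempt, exc)
                raise
            # Exponential backoff plus a little jitter to avoid thundering herds.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            log.warning("attempt %d failed (%s); retrying in %.2fs", attempt, exc, delay)
            time.sleep(delay)
```

The jitter matters more than it looks: if dozens of workers hit a vendor’s rate limit simultaneously, synchronized retries will just trip it again.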

Nice-to-Haves

There are plenty more features that would help a company more easily translate their content. Here are a few that come to mind.

Test Projects

When a development team is testing an integration, they don’t want to wait for a key to push through SNS, SQS, into a TMS vendor, and back again. They want a quick cycle to confirm their APIs are working as expected.

This is where it’s useful to have a Test Mode for a Project that skips the TMS and simply applies Pseudo Translation to each key.
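One common pseudo-translation scheme (this exact recipe is just an example, not a standard) swaps vowels for accented look-alikes and pads the string, so untranslated text is immediately obvious and layout bugs from text expansion surface before real translations arrive:

```python
# Accented look-alikes keep pseudo-translated strings readable during testing
# while making any untranslated (plain) string stand out.
ACCENTED = str.maketrans({
    "a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú",
    "A": "Á", "E": "É", "I": "Í", "O": "Ó", "U": "Ú",
})


def pseudo_translate(value: str, expansion: float = 0.3) -> str:
    """Return an immediately recognizable fake translation of `value`.

    The padding simulates the text expansion many target languages cause.
    """
    padded = value + "~" * max(1, int(len(value) * expansion))
    return f"[{padded.translate(ACCENTED)}]"
```

The brackets are deliberate: a truncated string shows up as a missing closing bracket in the UI, which is easy to spot in a screenshot.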

Lots and Lots of SDKs

As you can imagine, there are a lot of small tools and scripts that could simplify the process of a team integrating with a specific Project. The most common is a way to map an incoming language to a supported language.

For example, some devices might send an Accept-Language header of en-US while your project only supports en_GB. There are a lot of small edge cases like this that are easily fixed with a few utilities.
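A minimal sketch of one such utility, assuming a hypothetical list of supported project locales. A production version would also parse quality values and multiple language ranges out of the Accept-Language header rather than a single tag:

```python
from typing import Optional

SUPPORTED = ["en_GB", "fr_FR", "pt_BR"]  # hypothetical project locales


def match_language(tag: str, supported=SUPPORTED) -> Optional[str]:
    """Map an incoming BCP 47 tag (e.g. en-US) to a supported locale (e.g. en_GB).

    Tries an exact (case-insensitive) match first, then falls back to
    matching on the primary language subtag alone.
    """
    incoming = tag.strip().replace("-", "_").lower()
    by_lower = {s.lower(): s for s in supported}
    if incoming in by_lower:
        return by_lower[incoming]
    primary = incoming.split("_")[0]
    for s in supported:
        if s.lower().split("_")[0] == primary:
            return s
    return None
```

Returning None instead of guessing is a deliberate choice here; the caller decides whether to fall back to the project’s source language.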

Shared by AaronPresley · 47 days ago · Portland, OR

Potluck: Dynamic documents as personal software

Gradually enriching text documents into interactive applications
Shared by AaronPresley · 529 days ago · Portland, OR

Announcing ICU4X 1.0


I. Introduction

Hello! Ndeewo! Molweni! Салам! Across the world, people are coming online with smartphones, smart watches, and other small, low-resource devices. The technology industry needs an internationalization solution for these environments that scales to dozens of programming languages and thousands of human languages.


Enter ICU4X. As the name suggests, ICU4X is an offshoot of the industry-standard i18n library published by the Unicode Consortium, ICU (International Components for Unicode), which is embedded in every major device and operating system.


This week, after 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution.


Lightweight: ICU4X is Unicode's first library to support static data slicing and dynamic data loading. With ICU4X, clients can inspect their compiled code to easily build small, optimized locale data packs and then load those data packs on the fly, enabling applications to scale to more languages than ever before. Even when platform i18n is available, ICU4X is suitable as a polyfill to add additional features or languages. It does this while using very little RAM and CPU, helping extend devices' battery life.


Portable: ICU4X supports multiple programming languages out of the box. ICU4X can be used in the Rust programming language natively, with official wrappers in C++ via the foreign function interface (FFI) and JavaScript via WebAssembly. More programming languages can be added by writing plugins, without needing to touch core i18n logic. ICU4X also allows data files to be updated independently of code, making it easier to roll out Unicode updates.


Secure: Rust's type system and ownership model guarantee memory-safety and thread-safety, preventing large classes of bugs and vulnerabilities.


How does ICU4X achieve these goals, and why did the team choose to write ICU4X over any number of alternatives?


II. Why ICU4X?

You may still be wondering: what led the Unicode Consortium to choose a new Rust-based library as the solution to these problems?

II.A. Why a new library?

The Unicode Consortium also publishes ICU4C and ICU4J, i18n libraries written for C/C++ and Java. Why write a new library from scratch? Wouldn’t that increase the ongoing maintenance burden? Why not focus our efforts on improving ICU4C and/or ICU4J instead?


ICU4X solves a different problem for different types of clients. ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written to bring i18n to new programming languages and resource-constrained environments. ICU4X is a product that has long been missing from Unicode's portfolio.


Early on, the team evaluated whether ICU4X's goals could have been achieved by refactoring ICU4C or ICU4J. We found that:


  1. ICU4C has already gone through a period of optimization for tree shaking and data size. Despite these efforts, we continue to have stakeholders saying that ICU4C is too large for their resource-constrained environment. Getting further improvements in ICU4C would amount to rewrites of much of ICU4C's code base, which would need to be done in a way that preserves backwards compatibility. This would be a large engineering effort with an uncertain final result. Furthermore, writing a new library allows us to additionally optimize for modern UTF-8-native environments.

  2. Except for JavaScript via j2cl, Java is not a suitable source language for portability to low-resource environments like wearables. Further, ICU4J has many interdependent parts that would require a large amount of effort to bring to a state where it could be a viable j2cl source.

  3. Some of our stakeholders (Firefox and Fuchsia) are drawn to Rust's memory safety. Like most complex C++ projects, ICU4C has had its share of CVEs, mostly relating to memory safety. Although C++ diagnostic tools are improving, Rust has very strong guarantees that are impossible in other software stacks.


For all these reasons, we decided that a Rust-based library was the best long-term choice.

II.B. Why use ICU4X when there is i18n in the platform?

Many of the same people who work on ICU4X also work to make i18n available in the platform (browser, mobile OS, etc.) through APIs such as the ECMAScript Intl object, android.icu, and other smartphone native libraries. ICU4X complements the platform-based solutions as the ideal polyfill:


  1. Some platform i18n features take 5 or more years to gain wide enough availability to be used in client-side applications. ICU4X can bridge the gap.

  2. ICU4X can enable clients to add more locales than those available in the platform.

  3. Some clients prefer identical behavior of their app across multiple devices. ICU4X can give them this level of consistency.

  4. Eventually, we hope that ICU4X will back platform implementations in ECMAScript and elsewhere, providing a maximal amount of consistency when ICU4X is also used as a polyfill.


II.C. Why pluggable data?

One of the most visible departures that ICU4X makes from ICU4C and ICU4J is an explicit data provider argument on most constructor functions. The ICU4X data provider supports the following use cases:


  1. Data files that are readable by both older and newer versions of the code; for more detail on how this works, see ICU4X Data Versioning Design

  2. Data files that can be swapped in and out at runtime, making it easy to upgrade Unicode, CLDR, or time zone database versions. Swapping in new data can be done at runtime without needing to restart the application or clear internal caches.

  3. Multiple data sources. For example, some data may be baked into the app, some may come from the operating system, and some may come from an HTTP service.

  4. Customizable data caches. We recognize that there is no "one size fits all" approach to caching, so we allow the client to configure their data pipeline with the appropriate type of cache.

  5. Fully configurable data fallbacks and overlays. Individual fields of ICU4X data can be selectively overridden at runtime.
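To illustrate the pattern (this is not ICU4X’s actual API, which is Rust; it’s a language-agnostic toy sketch in Python), here is the shape of use case 3 above: resolving a data request through multiple sources in order:

```python
class DataProvider:
    """Toy version of the pluggable-provider pattern described above."""

    def load(self, key: str, locale: str):
        raise NotImplementedError


class BakedProvider(DataProvider):
    """A provider backed by a fixed in-memory table, standing in for
    data baked into the app, supplied by the OS, or fetched over HTTP."""

    def __init__(self, data: dict):
        self.data = data

    def load(self, key, locale):
        return self.data.get((key, locale))


class ChainedProvider(DataProvider):
    """Try each source in order: e.g. app-baked data, then OS, then HTTP."""

    def __init__(self, *providers: DataProvider):
        self.providers = providers

    def load(self, key, locale):
        for provider in self.providers:
            result = provider.load(key, locale)
            if result is not None:
                return result
        return None
```

The constructor-takes-a-provider design means the caller, not the library, decides where data comes from and how it is cached.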



III. How We Made ICU4X Lightweight

There are three factors that combine to make code lightweight: small binary size, low memory usage, and deliberate performance optimizations. For all three, we have metrics that are continuously measured on GitHub Actions continuous integration (CI).


III.A. Small Binary Size

Internationalization involves a large number of components with many interdependencies. To combat this problem, ICU4X optimizes for "tree shaking" (dead code elimination) by:


  1. Minimizing the number of dependencies of each individual component.

  2. Using static types in ways that scope functions to the pieces of data they need.

  3. Splitting functions and classes that pull in more data than they need into multiple, smaller pieces.


Developers can statically link ICU4X and run a tree-shaking tool like LLVM link-time optimization (LTO) to produce a very small amount of compiled code, and then they can run our static analysis tool to build an optimally small data file for it.


In addition to static analysis, ICU4X supports dynamic data loading out of the box. This is the ultimate solution for supporting hundreds of languages, because new locale data can be downloaded on the fly only when they are needed, similar to message bundles for UI strings.

III.B. Low Memory Usage

At its core, internationalization transforms inputs to human-readable outputs, using locale-specific data. ICU4X introduces novel strategies for runtime loading of data involving zero memory allocations:


  1. Supports Postcard-format resource files for dynamically loaded, zero-copy deserialized data across all architectures.

  2. Supports compile-time linking of required data without deserialization overhead via DataBake.

  3. Data schema is designed so that individual components can use the immutable locale data directly with minimal post-processing, greatly reducing the need for internal caches.

  4. Explicit "data provider" argument to each function that requires data, making it very clear when data is required.


ICU4X team member Manish Goregaokar wrote a blog post series detailing how the zero-copy deserialization works under the covers.
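The core idea, borrowing directly out of the loaded buffer instead of copying into owned structures, can be sketched even outside Rust. Here is a toy analogy using Python’s buffer protocol; the length-prefixed layout is made up for illustration, and ICU4X’s real Rust machinery (serde plus zero-copy frameworks) is far more involved:

```python
import struct


def deserialize_zero_copy(buf: bytes) -> memoryview:
    """Return a view into `buf`'s payload without copying the payload.

    Toy layout: a 4-byte little-endian length prefix, then the payload.
    The returned memoryview borrows from `buf` rather than allocating.
    """
    (length,) = struct.unpack_from("<I", buf, 0)
    return memoryview(buf)[4:4 + length]


# Stand-in for a loaded resource blob (in ICU4X this would be a data file).
blob = struct.pack("<I", 5) + b"Hello, trailing bytes the view never touches"
view = deserialize_zero_copy(blob)
```

The payoff is the same as in ICU4X: the "deserialized" value is just a window onto the mapped file, so loading data costs no allocations proportional to its size.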


III.C. Deliberate Performance Optimizations

Reducing CPU usage improves latency and battery life, important to most clients. ICU4X achieves low CPU usage by:


  1. Writing in Rust, a high-performance language.

  2. Utilizing zero-copy deserialization.

  3. Measuring every change against performance benchmarks.


The ICU4X team uses a benchmark-driven approach to achieve highly competitive performance numbers: newly added components should have benchmarks, and future changes to those components should avoid regressing on those benchmarks.


Although we always seek to improve performance, we do so deliberately. There are often space/time tradeoffs, and the team takes a balanced approach. For example, if improving performance requires increasing or duplicating the data requirements, we tend to favor smaller data, like we've done in the normalizer and collator components. In the segmenter components, we offer two modes: a machine-learning LSTM segmenter with a smaller data size but heavier CPU usage, and a dictionary-based segmenter with a larger data size but faster performance. (There is ongoing work to make the LSTM segmenter require fewer CPU resources.)


IV. How We Made ICU4X Portable

The software ecosystem continually evolves with new programming languages. The "X" in ICU4X is a nod to the second main design goal: portability to many different environments.


ICU4X is Unicode's first internationalization library to have official wrappers in more than one target language. We do this with a tool we designed called Diplomat, which generates idiomatic bindings in many programming languages that encourage i18n best practices. Thanks to Diplomat, these bindings are easy to maintain, and new programming languages can be added without needing i18n expertise.


Under the covers, ICU4X is written in no_std Rust (no system dependencies) wrapped in a stable ABI that Diplomat bindings invoke across foreign function interface (FFI) or WebAssembly (WASM). We have some basic tutorials for using ICU4X from C++ and JavaScript/TypeScript.



V. What’s next?

ICU4X represents an exciting new step in bringing internationalized software to more devices, use cases, and programming languages. A Unicode working group is hard at work on expanding ICU4X’s feature set over time so that it becomes more useful and performant; we are eager to learn about new use cases and have more people contribute to the project.


Have questions?  You can contact us on the ICU4X discussion forum!


Want to try it out? See our tutorials, especially our Intro tutorial!

Interested in getting involved? See our Contribution Guide.


Want to stay posted on future ICU4X updates? Sign up for our low-traffic announcements list, icu4x-announce@unicode.org!




Shared by AaronPresley · 566 days ago · Portland, OR

Rising from the ashes: Stage Manager


Every year I worked on macOS/iOS, I would get attached to a handful of features that would ultimately get axed. Over time, I grew desensitized to it, but sometimes a feature would come along that I would never be able to get over.

While Apple was transitioning to Intel in …


The post Rising from the ashes: Stage Manager appeared first on Tech Reflect.

Shared by AaronPresley · 678 days ago · Portland, OR

1942 Letter to My Grandfather from His Father


My Dad ran across a remarkable letter and shared it with family. I volunteered to share it more widely, and Dad and his siblings agreed.

* * *

This 1942 letter was written by John Simmons to his son Donald as Donald was about to be shipped to Europe (England first). He had enlisted after the Pearl Harbor attack.

Donald Simmons was my grandfather, and I had the fortune of knowing him.

My father and his siblings knew John Simmons, their grandfather, and with this letter they are able to know him a bit better.

John was 63 when he wrote this letter to his youngest son.

* * *

You can read all three pages together as a PDF — or you can click a thumbnail for each page for the full size version.


Page 1


Page 2


Page 3

* * *

Below is the text. The original is written in cursive on Wright Aeronautical Corporation letterhead.

Aug. 20, 1942

Dear Don:

My thoughts are with you tonight so strongly that I shall drop you a note. Of course it is hard for you to go but not much harder than for us to see you go. You see we love you and are now so helpless to aid you in any way. But then you are a man now and will have to make your own way from here on. And we’re sure that you have the stuff to do it. It won’t be an easy job but then, Don, no job worth the doing is. There are bound to be dark days and darker nights but you must always remember that nothing lasts forever and in the morning it will be a new day. And you are better trained than most for the work.

You are too intelligent to be told and believe that war is anything but a tragic mess, however, we are in this not with our consent but because of a treacherous attack that we did not invite. Regardless of the causes the effect is that we simply must win. And we shall win in spite of our petty bickerings among ourselves for in the final analysis we are the greatest nation on earth. We know that this country has reached the highest degree of living for its people ever attained by any nation in the history of the world and I believe enough in God to feel that with His help all the good, the right, and the fine things must survive. Of course we feel that possibly you may not come back — there is always that possibility and, too, we may be gone when you do come back but in the very grim business of war bombs and bullets go where they are sent and we must for our own peace of mind look that fact in face. Naturally the law of averages is greatly on the side of your returning to us and please God that will be our happiest day in a life time. I am confident you will receive very good training in whatever field you are placed and that you will be adequately prepared to protect yourself.

You are entering what will probably be the greatest adventure in your life. You are going to see miserable, filthy, low, mean and degrading sights for men are like that but you will also see fine, good, self sacrificing and even heroic things for men are like that also. That you will fall into the later class we who love you and have every confidence in you have not the slightest doubt. You have the background and the spirit and will to do so. Just keep yourself so you can look yourself in the face and not be ashamed of what you see. You will come through all right.

And now, old Son, I’ll close by wishing you again the best of everything there is in any old world and all the luck that there is. I truly wish I were going with you — it is hard too to stand and wait.

John

Sgt. Donald Simmons did, of course, make it back.

Shared by AaronPresley · 834 days ago · Portland, OR

Stunning photos from the time when oil derricks loomed all over California beaches, 1910-1955

The Golden State got its nickname from the Sierra Nevada gold that lured so many miners and settlers to the West, but California has earned much more wealth from so-called “black gold” than from metallic gold. When Europeans finally arrived in California, petroleum had already been in use by Native Americans for about 13,000 years, […]
Comment from AaronPresley (849 days ago, Portland, OR): They moved them all to West TX I guess