Cultural Affairs Agency to Provide Japanese Texts for Training Domestic LLMs

Yomiuri Shimbun file photo
The Cultural Affairs Agency in Kamigyo Ward, Kyoto.

The Cultural Affairs Agency will provide Japanese text data needed to train large language models (LLMs), the fundamental technology behind generative artificial intelligence.

Amid the rapid spread of LLMs worldwide, the agency will support the development of highly accurate domestic AI by providing reliable linguistic data.

LLMs are trained on massive text datasets and predict the most likely next word in a sequence, which allows them to generate human-like text, summarize documents or answer questions. Inaccurate or biased training data can degrade the quality of an LLM.

The agency will use a database of written language operated by the National Institute for Japanese Language and Linguistics (NINJAL). The database will be expanded from its current 100 million words to 200 million by fiscal 2028.

The agency will then establish a system to provide the data in stages to domestic generative AI operators.

Texts in the database are statistically sampled from books, textbooks and internet message boards, among other materials, so that the collection serves as a microcosm of modern Japanese. These texts have been checked by NINJAL staff and are said to be free from copyright issues.

The Cultural Affairs Agency considers it important, from the standpoint of international competitiveness, for domestic operators to develop LLMs, and it hopes that “reliable language resources” will improve the accuracy of the technology.

The agency will also create a database of spoken and written dialects, along with their translations into standard Japanese, to promote the development of voice recognition AI technology specializing in dialects. Because older people often speak in a dialect, such a database could facilitate smooth communication with elderly people when they are receiving medical care or during recovery from disasters.