Encoding Scheme Based on Words (以词为本的编码方案的探讨)

Journal of Jianghan University (Natural Science Edition), 2013, Vol. 41, Issue 2: 47-52.


CHENG Yuan-bin (程元斌)

School of Mathematics and Computer Science, Jianghan University, Wuhan 430056, Hubei, China

Publication date: 2013-04-12 Release date: 2014-01-07
About the author: CHENG Yuan-bin (b. 1954), male, associate professor; research interests: information security and artificial intelligence.

Abstract

Abstract: Language is the principal tool of human thought, and the word is the basic unit of language processing. In computer information processing, however, encodings are currently designed character by character. As information-processing technology has developed, the shortcomings of purely character-based encoding have become increasingly apparent. Starting from the basic requirements of information processing and the basic properties of words, this paper proposes a unified encoding scheme that considers characters and words together and takes the word as its foundation. The scheme builds on UTF-16, a major current encoding standard: it keeps the existing character codes and adds word codes. Word codes are organized logically as a concept space tree that carries some semantic information and semantic relations, and they are laid out in the code space according to two principles: suitability for cluster retrieval and suitability for code conversion between languages. Finally, several difficult problems that require further study are pointed out.

Key words: word encoding, UTF-16, cluster retrieval, concept space tree, natural language processing
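
The abstract does not give the concrete code layout, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that word codes are allocated by a depth-first walk of a concept tree, so that each concept's words occupy one contiguous range, and that they are placed in Unicode's Supplementary Private Use Area-A (starting at U+F0000) so that existing character codes are untouched. The names `Concept`, `assign_codes`, `in_cluster`, `encode`, and `WORD_BASE`, as well as the toy vocabulary, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical base for word codes: start of Supplementary Private Use Area-A.
# The paper's actual code-space allocation is not stated in the abstract.
WORD_BASE = 0xF0000


@dataclass
class Concept:
    name: str
    words: list = field(default_factory=list)      # words attached directly to this concept
    children: list = field(default_factory=list)   # narrower concepts
    lo: int = 0   # first word code in this subtree (inclusive)
    hi: int = 0   # one past the last word code in this subtree


def assign_codes(node, next_code, table):
    """Depth-first numbering: every subtree receives one contiguous code range."""
    node.lo = next_code
    for word in node.words:
        table[word] = next_code
        next_code += 1
    for child in node.children:
        next_code = assign_codes(child, next_code, table)
    node.hi = next_code
    return next_code


def in_cluster(code, concept):
    """Cluster retrieval reduces to a range test on the code."""
    return concept.lo <= code < concept.hi


def encode(tokens, table):
    """Words in the table become one code point; other text keeps its character codes."""
    text = "".join(chr(table[t]) if t in table else t for t in tokens)
    return text.encode("utf-16-be")


# Toy concept tree (illustrative data only).
fruit = Concept("fruit", words=["apple", "pear"])
animal = Concept("animal", words=["cat", "dog"])
root = Concept("entity", children=[fruit, animal])

table = {}
assign_codes(root, WORD_BASE, table)
```

Under these assumptions, every word below `fruit` falls in a single contiguous range, so a cluster query is one range comparison on the code. A word code above U+FFFF occupies one UTF-16 surrogate pair, while ordinary characters are encoded exactly as before. Code conversion between languages, the second layout principle named in the abstract, is not modeled here.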

CLC number: TP391.11

CHENG Yuan-bin. Encoding Scheme Based on Words[J]. Journal of Jianghan University (Natural Science Edition), 2013, 41(2): 47-52.