Overview of the TREC 2005 Question Answering Track: Foreign-Language Translation Material


Overview of the TREC 2005 Question Answering Track

Ellen M. Voorhees

Hoa Trang Dang

National Institute of Standards and Technology

Gaithersburg, MD 20899

Abstract

The TREC 2005 Question Answering (QA) track contained three tasks: the main question answering task, the document ranking task, and the relationship task. In the main task, question series were used to define a set of targets. Each series was about a single target and contained factoid and list questions. The final question in the series was an "Other" question that asked for additional information about the target that was not covered by previous questions in the series. The main task was the same as the single TREC 2004 QA task, except that targets could also be events; the addition of events and dependencies between questions in a series made the task more difficult and resulted in lower evaluation scores than in 2004. The document ranking task was to return a ranked list of documents for each question from a subset of the questions in the main task, where the documents were thought to contain an answer to the question. In the relationship task, systems were given TREC-like topic statements that ended with a question asking for evidence for a particular relationship.

The goal of the TREC question answering (QA) track is to foster research on systems that return answers themselves, rather than documents containing answers, in response to a question. The track started in TREC-8 (1999), with the first several editions of the track focused on factoid questions. A factoid question is a fact-based, short-answer question such as How many calories are there in a Big Mac?. The task in the TREC 2003 QA track contained list and definition questions in addition to factoid questions [1]. A list question asks for different instances of a particular kind of information to be returned, such as List the names of chewing gums. Answering such questions requires a system to assemble an answer from information located in multiple documents. A definition question asks for interesting information about a particular person or thing such as Who is Vlad the Impaler? or What is a golden parachute?. Definition questions also require systems to locate information in multiple documents, but in this case the information of interest is much less crisply delineated.

In TREC 2004 [2], factoid and list questions were grouped into different series, where each series was associated with a target (a person, organization, or thing) and the questions in the series asked for some information about the target. In addition, the final question in each series was an explicit "Other" question, which was to be interpreted as "Tell me other interesting things about this target I don't know enough to ask directly". This last question was roughly equivalent to the definition questions in the TREC 2003 task.

The TREC 2005 QA track contained three tasks: the main question answering task, the document ranking task, and the relationship task. The document collection from which answers were to be drawn was the AQUAINT Corpus of English News Text (LDC catalog number LDC2002T31). The main task was the same as the TREC 2004 task, with one significant change: in addition to persons, organizations, and things, the target could also be an event. Events were added in response to suggestions that the question series include answers that could not be readily found by simply looking up the target in Wikipedia or other pre-compiled Web resources. The runs were evaluated using the same methodology as in TREC 2004, except that the primary measure was the per-series score instead of the combined component score.

The document ranking task was added to build infrastructure that would allow a closer examination of the role document retrieval techniques play in supporting QA technology. The task was to submit, for a subset of 50 of the questions in the main task, a ranked list of up to 1000 documents for each question. The purpose of the lists was to create document pools both to get a better understanding of the number of instances of correct answers in the collection and to support research on whether some document retrieval techniques are better than others in support of QA. NIST pooled the document lists for each question, and assessors judged each document in the pool as relevant ("contains an answer") or not relevant ("does not contain an answer"). Document lists were then evaluated using trec_eval measures.

Finally, the relationship task was added. The task was the same as was performed in the AQUAINT 2004 relationship pilot. Systems were given TREC-like topic statements that ended with a question asking for evidence for a particular relationship. The initial part of the topic statement set the context for the question. The question was either a yes/no question, which was understood to be a request for evidence supporting the answer, or an explicit request for the evidence itself. The system response was a set of information nuggets that were evaluated using the same scheme as definition and Other questions.

The remainder of this paper describes each of the three tasks in the TREC 2005 QA track in more detail. Section 1 describes the question series that formed the basis of the main and document ranking tasks; section 2 describes the evaluation method and resulting scores for the runs for the main task, while section 3 describes the evaluation and results of the document ranking task. The questions and results for the relationship task are described in section 4. Section 5 summarizes the approaches the systems used to answer the questions, and the final section looks to the future of the track.



Overview of the TREC 2005 Question Answering Track (Translation)

Abstract

The TREC 2005 Question Answering (QA) track contained three tasks: the question answering task, which was the main task, the document ranking task, and the relationship task. In the main task, question series were used to define a set of targets. Each series concerned a single target and contained factoid and list questions; the final question of each series was an "Other" question, which asked for additional information about the target that had not been covered by the earlier questions in the series. The main task was the same as the TREC 2004 QA task except that targets could also be events; the addition of events and of dependencies between the questions in a series made the task more difficult and led to lower evaluation scores than in 2004. The document ranking task was to return, for each question in a subset of the main-task questions, a ranked list of documents believed to contain an answer to that question. In the relationship task, systems were given TREC-like topic statements that ended with a question asking for evidence of a particular relationship.

The goal of the TREC question answering (QA) track is to foster research on systems that return answers themselves, rather than documents containing the answers, in response to a question. The track began with TREC-8 (1999), and its first several editions focused on factoid questions. A factoid question is a fact-based question with a short answer, such as "How many calories are there in a Big Mac?". The TREC 2003 QA track contained list and definition questions in addition to factoid questions [1]. A list question asks for different instances of a particular kind of information to be returned, such as "List the names of chewing gums"; answering such a question requires a system to assemble the answer from information found in multiple documents. A definition question asks for interesting information about a particular person or thing, such as "Who is Vlad the Impaler?" or "What is a golden parachute?". Definition questions also require systems to locate information in multiple documents, but in this case the information of interest is much less crisply delineated.

In TREC 2004 [2], factoid and list questions were grouped into series, where each series was associated with a target (a person, an organization, or a thing) and the questions in the series asked for information about that target. In addition, the final question of each series was an explicit "Other" question, to be understood as "Tell me other interesting things about this target that I don't know enough to ask about directly". This last question was roughly equivalent to the definition questions of the TREC 2003 task.

The TREC 2005 QA track contained three tasks: the question answering task, which was the main task, the document ranking task, and the relationship task. The document collection from which answers were to be drawn was the AQUAINT Corpus of English News Text (LDC catalog number LDC2002T31). The main task was essentially the same as the TREC 2004 task, with one significant change: in addition to persons, organizations, and things, a target could also be an event. Events were added in response to suggestions that the question series should include answers that could not readily be found by simply looking up the target in Wikipedia or other pre-compiled Web resources.

The runs were evaluated using the same methodology as in TREC 2004, except that the primary measure was the per-series score rather than the combined component score.

The document ranking task was added to build infrastructure that would allow a closer examination of the role document retrieval techniques play in supporting QA technology. The task was to submit, for a subset of 50 of the questions in the main task, a ranked list of up to 1000 documents per question. The purpose of these lists was to create document pools, both to gain a better understanding of the number of instances of correct answers present in the collection and to support research on whether some document retrieval techniques serve QA better than others. NIST pooled the document lists for each question, and assessors judged each document in the pool as relevant ("contains an answer") or not relevant ("does not contain an answer"); the document lists were then evaluated using trec_eval measures.
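This excerpt does not say which trec_eval measures were reported for the document ranking runs. As one illustrative example, the following is a minimal sketch of average precision (a standard trec_eval measure) computed for a single ranked document list against the pooled "contains an answer" judgments; the docids and the in-memory representation of the run and judgments are made up for illustration.

    def average_precision(ranked_docids, relevant_docids):
        """Mean of precision@k over the ranks k at which a relevant
        (answer-bearing) document appears, divided by the total number
        of relevant documents for the question."""
        relevant = set(relevant_docids)
        if not relevant:
            return 0.0  # no judged-relevant documents for this question
        hits = 0
        precision_sum = 0.0
        for rank, docid in enumerate(ranked_docids, start=1):
            if docid in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant)

    # Toy usage: a ranked list (up to 1000 docids per question) and the set of
    # docids judged "contains an answer" for the same question.
    run = ["APW19990903.0001", "NYT20000112.0042", "XIE19981201.0300"]
    judged_relevant = {"NYT20000112.0042", "XIE19981201.0300"}
    print(average_precision(run, judged_relevant))  # 0.5833...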

Finally, the relationship task was added; it was the same task as was carried out in the AQUAINT 2004 relationship pilot. Systems were given TREC-like topic statements that ended with a question asking for evidence of a particular relationship. The initial part of the topic statement set the context for the question. The question was either a yes/no question, understood as a request for evidence supporting the answer, or an explicit request for the evidence itself. The system response was a set of information nuggets, evaluated with the same scheme used for definition and Other questions.
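The nugget-based scoring scheme itself is not described in this excerpt. The sketch below shows roughly how nugget scoring worked in the TREC definition/Other evaluations of this period: recall is computed over the vital nuggets, precision is approximated through a length allowance, and the F measure weights recall more heavily than precision. The beta = 3 and 100-character-per-nugget allowance values are assumptions carried over from the TREC 2004 setup, not values stated in the text above.

    def nugget_f(num_vital_returned, num_vital_total, num_nuggets_matched,
                 response_length, beta=3.0, allowance_per_nugget=100):
        """Nugget-style F score: recall over vital nuggets; precision from a
        length allowance of allowance_per_nugget non-whitespace characters per
        matched nugget (vital or okay); F weights recall by beta."""
        if num_vital_total == 0:
            return 0.0
        recall = num_vital_returned / num_vital_total
        allowance = allowance_per_nugget * num_nuggets_matched
        if response_length <= allowance:
            precision = 1.0
        else:
            precision = 1.0 - (response_length - allowance) / response_length
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (b2 + 1) * precision * recall / (b2 * precision + recall)

    # Toy usage: 3 of 5 vital nuggets returned, 4 nuggets matched in total,
    # 600 non-whitespace characters of response text.
    print(nugget_f(3, 5, 4, 600))  # ~0.61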

The remainder of this paper describes each of the three tasks of the TREC 2005 QA track in more detail. Section 1 describes the question series that formed the basis of the main and document ranking tasks; section 2 describes the evaluation method and the resulting scores for the main-task runs; section 3 describes the evaluation and results of the document ranking task; section 4 describes the questions and results of the relationship task; section 5 summarizes the approaches the systems used to answer the questions; and the final section looks to the future of the track.

1. Question Series

The main task of the TREC 2005 QA track required answering each question in a set of question series. A series consisted of several factoid questions, one or two list questions, and a final "Other" question. Associated with each series was a defined target. The series a question belonged to, the position of the question within its series, and the type of each question (factoid, list, or Other) were all specified in the XML format used to describe the test set. Sample question series (with the XML markup removed) are shown in Figure 1.

95 return of Hong Kong to Chinese sovereignty
    95.1 FACTOID What is Hong Kong's population?
    95.2 FACTOID When was Hong Kong returned to Chinese sovereignty?
    95.3 FACTOID Who was the Chinese President at the time of the return?
    95.4 FACTOID Who was the British Foreign Secretary at the time?
    95.5 LIST What other countries formally congratulated China on the return?
    95.6 OTHER
111 AMWAY
    111.1 FACTOID When was AMWAY founded?
    111.2 FACTOID Where is it headquartered?
    111.3 FACTOID Who is the president of the company?
    111.4 LIST Name the officials of the company.
    111.5 FACTOID What is the name "AMWAY" short for?
    111.6 OTHER
136 Shiite
    136.1 FACTOID Who was the first Imam of the Shiite sect of Islam?
    136.2 FACTOID Where is his tomb?
    136.3 FACTOID What was this person's relationship to the Prophet Mohammad?
    136.4 FACTOID Who was the third Imam of Shiite Muslims?
    136.5 FACTOID When did he die?
    136.6 FACTOID What portion of Muslims are Shiite?
    136.7 LIST What Shiite leaders were killed in Pakistan?
    136.8 OTHER

Figure 1: Sample question series from the test set. Series 95 has an event as its target, series 111 has an organization as its target, and series 136 has a thing as its target.
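The exact XML format used to describe the test set is not reproduced in this extract. Purely to illustrate how a series, its target, and the type and order of each question might be encoded and read, the sketch below assumes a hypothetical markup; the tag and attribute names are invented for illustration and are not the official test-set format.

    import xml.etree.ElementTree as ET

    # Hypothetical markup loosely mirroring series 111 in Figure 1; the real
    # test-set format may use different tag and attribute names.
    SAMPLE = """
    <questions>
      <target id="111" text="AMWAY">
        <q id="111.1" type="FACTOID">When was AMWAY founded?</q>
        <q id="111.2" type="FACTOID">Where is it headquartered?</q>
        <q id="111.4" type="LIST">Name the officials of the company.</q>
        <q id="111.6" type="OTHER"></q>
      </target>
    </questions>
    """

    root = ET.fromstring(SAMPLE)
    for target in root.findall("target"):
        print("Series", target.get("id"), "target:", target.get("text"))
        for q in target.findall("q"):
            print("  ", q.get("id"), q.get("type"), (q.text or "").strip())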

The scenario for the main task is an adult, native speaker of English looking for more information about a target of interest, where the target may be a person, an organization, a thing, or an event; the user is assumed to be an "average" reader of US newspapers. NIST assessors acted as surrogates for this user, posing the questions and judging the answers returned by the systems.

In TREC 2004, the question series were written almost entirely before the assessors searched the AQUAINT document collection; as a result, many series were unusable because the collection lacked sufficient information to answer the questions. To avoid this problem, the TREC 2005 questions were developed only after the assessors had searched the AQUAINT collection and confirmed that it contained enough information about the target. The assessors wrote factoid and list questions whose answers could be found in the collection, trying to make them questions they would have asked even if they had not seen the documents. The assessors also recorded other useful information that was not the answer to a factoid or list question (either because the information was not factoid in nature, or because the question would obviously be a back-formation of its answer); this information could be used to answer the final "Other" question of the series.

Context processing is an important component of question answering systems, so a question in a series could use a pronoun, a definite noun phrase, or another referring expression to refer to the target or to a previous answer, as illustrated in Figure 1. Each series is an abstraction of an information dialogue in which the user is trying to learn about the target, but the abstraction is limited: unlike a true dialogue, the questions were not posed in response to the answers a system actually returned. In addition, because every usable series had to contain a list question whose answers are named entities, the assessors sometimes asked list questions they were not actually interested in. A series is therefore not necessarily a faithful reflection of the assessor's interest in the target.

The final test set contained 75 series; their targets are given in Table 1. Of the 75 targets, 19 are persons, 19 are organizations, 19 are things, and 18 are events. The series comprised 362 factoid questions, 93 list questions, and 75 "Other" questions (one per target). Each series contained 6-8 questions (counting the Other question), and most series contained 7 questions.

Participants had to submit their results within one week of receiving the test set. All processing of the questions had to be fully automatic. Systems were required to process the series independently of one another, and to process the questions within a single series in order; that is, a system was allowed to use earlier questions and answers in a series when answering later questions in the same series, but it could not "look ahead" and use later questions to help answer earlier ones. To facilitate the evaluation, NIST provided, for each target, a ranking of the top 1000 documents produced by the PRISE document retrieval system using the target as the query. A total of 71 runs from 30 participants were submitted to the main task.

2. Main Task Evaluation

The evaluation of each run comprised an evaluation of each question type plus a final average per-series score. Each of the three question types had its own response format and evaluation method, and the component evaluations in 2005 were the same as in the TREC 2004 QA track. The score for a series was computed as a weighted average of the scores of the questions in that series, and the final score of a run was computed as the mean of its per-series scores.
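The weights used in the weighted average are not given in this extract. The sketch below assumes the weighting used for the TREC 2004 combined score, one half for the factoid component and one quarter each for the list and Other components, simply to show how a per-series score and the final run score would be combined.

    def series_score(factoid_acc, list_f, other_f,
                     w_factoid=0.5, w_list=0.25, w_other=0.25):
        """Weighted average of one series' component scores. The weights are
        an assumption (TREC 2004-style), not taken from this text."""
        return w_factoid * factoid_acc + w_list * list_f + w_other * other_f

    def run_score(per_series_components):
        """Final score of a run: the mean of its per-series scores."""
        scores = [series_score(f, l, o) for f, l, o in per_series_components]
        return sum(scores) / len(scores)

    # Toy usage: three series with (factoid accuracy, list F, Other F).
    print(run_score([(0.40, 0.20, 0.25), (0.60, 0.10, 0.30), (0.25, 0.00, 0.15)]))  # ~0.29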

2.1 Factoid Questions

A system's response to a factoid question was either a [docid, answer-string] pair or the string "NIL". Because there was no guarantee that the collection contained an answer to a question, NIL was returned when the system believed there was no answer; otherwise, the answer string was a string containing an answer to the question, and the docid was the id of a document in the collection that supported that answer string.

Each response was judged independently by two human assessors. When the two assessors disagreed, a third assessor made the final decision. Each response was assigned exactly one of the following four judgments:

① incorrect: the answer string does not contain a correct answer, or the answer is not responsive;
② unsupported: the answer string contains a correct answer, but the document returned does not support that answer;
③ inexact: the answer string contains a correct answer and the document supports it, but the string contains more than just the answer or is missing part of the answer;
④ correct: the answer string contains a correct answer and the returned document supports it.

To be judged correct, an answer string must contain the correct unit of measurement and must refer to the correct "famous" entity (for example, the Taj Mahal casino is not accepted when the question asks about the Taj Mahal). Questions also had to be interpreted within the time frame of the question series; for example, if the target is "France wins World Cup in soccer", then for the question "Who was the coach of the French team?" the correct answer must be "Aime Jacquet" (the coach when the French team won the World Cup in 1998), and not simply the name of any former or current coach of the French team.

NIL is correct only when the document collection contains no known answer to the question, so that every other response would be incorrect. Of the 362 factoid questions in the test set, 17 had NIL as the correct response (no correct answer was returned for 18 questions, but for one of them a correct answer had been found by the assessors).

The main evaluation measure for factoid questions is accuracy. NIL recall and NIL precision are also reported, reflecting a system's handling of questions for which the collection contains no answer. NIL precision is the ratio of the number of times NIL was returned and correct to the total number of times NIL was returned, while NIL recall is the ratio of the number of times NIL was returned and correct to the total number of questions whose correct response is NIL (17). NIL precision is undefined if NIL is never returned, in which case NIL recall is 0.0. Table 2 gives the evaluation results for factoid questions: for each of the top 10 groups it shows the run with the best factoid accuracy, together with the factoid accuracy over the whole question set and the NIL precision and recall.
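A minimal sketch of these factoid measures, computed from per-question judgments, follows; the in-memory representation of responses and the judgment labels are illustrative, and only responses judged correct count toward accuracy.

    def factoid_measures(responses, num_nil_questions=17):
        """responses: one (answer, judgment) pair per factoid question, where
        answer is either the string "NIL" or a [docid, answer-string] pair and
        judgment is "correct", "incorrect", "unsupported", or "inexact".
        Returns (accuracy, NIL precision, NIL recall)."""
        total = len(responses)
        correct = sum(1 for _, j in responses if j == "correct")
        nil_returned = [j for a, j in responses if a == "NIL"]
        nil_correct = sum(1 for j in nil_returned if j == "correct")

        accuracy = correct / total if total else 0.0
        # NIL precision is undefined when NIL is never returned.
        nil_precision = nil_correct / len(nil_returned) if nil_returned else None
        nil_recall = nil_correct / num_nil_questions
        return accuracy, nil_precision, nil_recall

    # Toy usage with three questions (the docid is made up).
    responses = [
        (["APW19990903.0001", "6.7 million"], "correct"),
        ("NIL", "correct"),
        ("NIL", "incorrect"),
    ]
    print(factoid_measures(responses))  # (0.666..., 0.5, 0.0588...)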

Table 1: Targets of the 75 question series

66 Russian submarine Kursk sinks
67 Miss Universe 2000 crowned
68 Port Arthur Massacre
69 France wins World Cup in soccer
70 Plane clips cable wires in Italia
104 1999 North American International Auto Show
105 1980 Mount St. Helens eruption
106 1998 Baseball World Series
107 Chunnel
