A Method for Relating Multiple Newspaper Articles by Using Graphs, and Its Application to Webcasting

Naohiko Uramoto and Koichi Takeda
IBM Research, Tokyo Research Laboratory
1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242, Japan
{uramoto,takeda}@trl.ibm.co.jp

Abstract

This paper describes methods for relating (threading) multiple newspaper articles, and for visualizing various characteristics of them by using a directed graph. A set of articles is represented by a set of word vectors, and the similarity between the vectors is then calculated. The graph is constructed from the similarity matrix. By applying some constraints on the chronological ordering of articles, an efficient threading algorithm that runs in O(n) time (where n is the number of articles) is obtained. The constructed graph is visualized with words that represent the topics of the threads, and words that represent new information in each article. The threading technique is suitable for Webcasting (push) applications. A threading server determines relationships among articles from various news sources, and creates files containing their threading information. This information is represented in Extensible Markup Language (XML), and can be visualized on most Web browsers. The XML-based representation and a current prototype are described in this paper.

1 Introduction

The vast quantity of information available today makes it difficult to search for and understand the information that we want. If there are many related documents about a topic, it is important to capture their relationships so that we can obtain a clearer overview. However, most information resources, including newspaper articles, do not have explicit relationships. For example, although documents on the Web are connected by hyperlinks, relationships cannot be specified.

Webcasting ("push") applications such as Pointcast [1] constitute a promising solution to the problem of information overload, but the articles they provide do not have links, or else must be manually linked at a high cost in terms of time and effort.

[1] http://www.pointcast.com

This paper describes methods for relating newspaper articles automatically, and their application to Webcasting. A set of articles on a particular topic is ordered chronologically, and the results are represented as a directed graph. There are various ways of relating documents and visualizing their structure. For example, USENET articles can be accessed by means of newsreader software. In the system, a label (title) is attached to each posted message, specifying whether it deals with a new topic or is a reply to a previous message. A chain of articles on a topic is called a thread. In this case, the relationships between the articles are explicitly defined. This post/reply-based approach makes it possible for a reader to group all the messages on a particular topic. However, it is difficult to capture the story of the thread from its thread structure, since appropriate titles are not added to the messages.

This paper aims to provide ways of relating multiple news articles and representing their structure in a way that is easy to understand and computationally inexpensive. A set of relationships is defined here as a directed graph. A node indicates an article, and an arc from node X to Y indicates that the article X is followed by Y (or that X is adjacent to Y). An article contains both known and unknown (new) information. Known information consists of words shared by the beginning and ending points of an arc. When node X is adjacent to Y, the words are represented by (X ∩ Y). The known information is called genus words in this paper. Even if an article follows another one, it generally contains some new information. This information can be represented by subtraction (Y - X) (Damashek, 1995), and is called differentia words, by analogy with definition sentences in dictionaries, which contain genus words and differentia. In this paper, genus and differentia words are used to calculate the similarities between two articles, and to visualize topics in a set of articles.

Since articles are ordered chronologically, there are some time constraints on the connectivity of nodes. A graph is created by constructing an adjacency matrix for nodes, which in turn is created from a similarity matrix for nodes.

Some potential features of articles in a set can be determined by analyzing some formal aspects of the
corresponding graph. For example, the paths in the graph show the stories of the nodes they contain. Multiple paths for a node (article) show that there are multiple stories associated with it. Furthermore, if the node has a long path, it is in the "main stream" of the topic represented by the graph. An efficient algorithm for finding such paths is described later in the paper.

Application of the threading method to documents on the Web would be very useful because, although such documents are connected by hyperlinks, their relationships cannot be specified. In this paper, threads generated by this method are represented in Extensible Markup Language (XML) (XML, 1997), which is the proposed standard for exchange of information on the Web. XML-based threads can be used by webcasting or push services, since various tools for parsing and visualizing threads are available.

In Section 2, a directed graph structure for articles is defined, and the procedure for constructing a directed graph is described in Section 3. In Section 4, some features of the created graph are discussed. Section 5 introduces a webcasting application that uses the threading technique, and Section 6 concludes the paper.

2 Definition of a Graph Structure

A set of articles is represented as an ordered set V:

V = {d_1, d_2, ..., d_n}.

The suffix sequence 1, 2, ..., n represents the passage of time. Article d_i is older than d_{i+1}. The order is obtained from the publication dates of the articles. Different time points are assigned arbitrarily to articles published on the same day.

Related articles are represented as a directed graph (V, A). V is a set of nodes. A is a set of ordered pairs (d_i, d_j), where d_i and d_j are members of V. Figure 1 shows an example of a directed graph. In this case, the graph is represented as follows:

V = {d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8},
A = {(d_1, d_2), (d_2, d_3), (d_1, d_4), (d_5, d_6), (d_2, d_7), (d_3, d_8), (d_7, d_8)}

Figure 1: Example of a Directed Graph G

The nodes are ordered chronologically. The following constraint is introduced into the graph:

Constraint 1
For (d_i, d_j) ∈ A, i < j

The constraint simply shows that an old article cannot follow a new one.

3 Creating a Graph Structure for Articles

This section describes how to construct a directed graph structure from a set of articles. Any directed graph can be represented by a matrix. Figure 3 shows the adjacency matrix M_G of the graph G in Figure 1.

             d1  d2  d3  d4  d5  d6  d7  d8
        d1    0   1   0   1   0   0   0   0
        d2    0   0   1   0   0   0   1   0
        d3    0   0   0   0   0   0   0   1
M_G =   d4    0   0   0   0   0   0   0   0
        d5    0   0   0   0   0   1   0   0
        d6    0   0   0   0   0   0   0   0
        d7    0   0   0   0   0   0   0   1
        d8    0   0   0   0   0   0   0   0

Figure 3: Adjacency Matrix M_G of G

For example, a value of "1" for the (1, 2) element in M_G indicates that d_1 is adjacent to d_2. Since an article cannot follow itself, the value of every (i, i) element is "0". From the time constraint defined in Section 2 (Constraint 1), M_G is an upper triangular matrix.

The following is a procedure for constructing a directed graph for related articles:

1. Calculate the similarity and difference between articles.
2. Construct a similarity matrix.
3. Convert the matrix into an adjacency matrix.

In the following subsections, each step is illustrated by using the set of articles V in Figure 2 on the subject of nuclear testing taken from the Nikkei Shinbun [2].

[2] The articles were originally written in Japanese.

3.1 Calculating the similarities and differences between articles

The function sim(d_i, d_j) calculates the word-based similarity between two articles. It is defined on the basis of Salton's Vector Space Model (Salton, 1968). Words are extracted from an article by using a morphological analyzer. Next, nouns and verbs are extracted as keywords.

\[ \mathrm{sim}(d_i, d_j) = \frac{\sum_{kw} w_{kw}^{d_i} \, w_{kw}^{d_j}}{\cdots} \]

(The normalizing denominator of this formula is illegible in the source; only the sum of weight products in the numerator is recoverable.)
d1: The prime minister of France says that it is necessary to restart nuclear testing.
d2: The Defense Minister suggests restarting nuclear testing.
d3: At a summit conference, the Prime Minister will adopt a policy of requesting the French Government to halt nuclear testing.
d4: China's latest nuclear test will hold up negotiations on a treaty to abolish such testing.
d5: The Minister of Foreign Affairs, Mr. Youhei Kohno, takes a critical attitude toward China, and asks France to understand Japan's position.
d6: The prime minister of New Zealand asks the French Government not to restart nuclear testing.
d7: The President of France states that nuclear testing will restart in September, and that France will conduct eight tests between now and next May.
d8: France states that it will restart nuclear testing. This will hamper nuclear disarmament.
d9: France states that it will restart nuclear testing. Australia halts defense cooperation with France.
d10: France states that it will restart nuclear testing. The U.S. expresses regret at the decision.

Figure 2: V: Articles about nuclear testing
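
The graph definition of Section 2 and the matrix of Figure 3 can be checked with a small sketch. The Python fragment below builds the adjacency matrix of the example graph G from its arc set A and verifies that Constraint 1 makes it upper triangular. The function names (`adjacency_matrix`, `is_strictly_upper_triangular`) are illustrative assumptions and do not come from the paper.

```python
# Arcs of the example graph G (Figure 1), written as 1-based article indices.
ARCS = [(1, 2), (2, 3), (1, 4), (5, 6), (2, 7), (3, 8), (7, 8)]


def adjacency_matrix(n, arcs):
    """Return an n x n 0/1 matrix with m[i-1][j-1] = 1 for each arc (d_i, d_j)."""
    m = [[0] * n for _ in range(n)]
    for i, j in arcs:
        # Constraint 1: an older article cannot follow a newer one.
        if not i < j:
            raise ValueError(f"arc ({i}, {j}) violates Constraint 1 (i < j)")
        m[i - 1][j - 1] = 1
    return m


def is_strictly_upper_triangular(m):
    """True if every entry on or below the main diagonal is zero."""
    return all(m[r][c] == 0 for r in range(len(m)) for c in range(r + 1))


M_G = adjacency_matrix(8, ARCS)
for row in M_G:
    print(row)
print(is_strictly_upper_triangular(M_G))
```

Because every arc satisfies i < j, all non-zero entries lie above the diagonal, which is the property stated for M_G in Section 3.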
Here, w_{kw}^{d_i} is the weight given to the keyword kw in article d_i. A modification of the TF.IDF value (Robertson et al., 1976) is used for the weighting. g_{kw}^{d_i} is the weight assigned to the keyword kw, which is a differentia word for d_i.

\[ w_{kw}^{d_i} = \frac{C_{d_i}(kw)}{C_{d_i}} \cdot \log\frac{k}{N_k(kw)} \cdot g_{kw}^{d_i} \]

\[ g_{kw}^{d_i} = \begin{cases} 1.5 & kw \in \mathrm{differentia}(d_i) \\ 1 & \text{otherwise} \end{cases} \]

Other parameters are defined as follows:

k: constant value
C_{d_i}(kw): frequency of word kw in d_i
C_{d_i}: number of words in d_i
N_k(kw): number of articles that contain the word kw among the articles d_{i-k}, ..., d_i

The function differentia(d_i) returns the set of keywords that appear in d_i but do not appear in the last k articles.

\[ \mathrm{differentia}(d_i) = \{\, kw \mid C_{d_i}(kw) > 0, \text{ and for all } d_l \text{ where } i - k \le l < i,\ C_{d_l}(kw) = 0 \,\} \]

3.2 Constructing a similarity matrix

A similarity matrix for a set of articles is constructed by using the sim function. In a conventional hierarchical clustering algorithm, a similarity for any combination of two articles is required in order to construct a hierarchical tree of the set of articles. This causes n(n-1)/2 calculations of the similarity function for n articles, with a consequent complexity of O(n^2). This is very expensive when n is large. In our algorithm for constructing a similarity matrix, shown in Figure 4, the complexity of constructing a graph structure for an article set by using a constraint is O(n).

    procedure MakeDistanceMatrix
    for i = 2 to n begin
        if i - k < 1 then s ← 1 else s ← i - k
        for j = s to i - 1 begin
            a(i, j) ← sim(d_i, d_j)
        end
        i ← i + 1
    end

Figure 4: Procedure for Constructing Similarity Matrix

The following constraint, which includes Constraint 1, is used in the threading algorithm.

Constraint 2
For (d_i, d_j) ∈ A, j - (k + 1) < i < j

This constraint means that an article can only follow the last k articles. As a result, the number of times the similarity function needs to be calculated is reduced to at most kn, giving a complexity of O(n) for constant k.

By using the algorithm, the similarity between nodes is calculated; Figure 5 shows the resulting similarity matrix S of V. In this case, keywords are extracted from title sentences, and k is set to five.

3.3 Conversion into an adjacency matrix

From the similarity matrix, an adjacency matrix is constructed. An element s(i, j) in the similarity matrix corresponds to the element ss(i, j) in the adjacency matrix SS. There are various strategies for the conversion. In this paper, ss(i, j) is set to 1 when s(i, j) > 0.18, and any node can follow at most k/2 nodes (rounded down), in this case two nodes. Figure 6 shows a result of the conversion. Finally, a directed graph for V is created (Figure 7). Figure 8 shows a graph that visualizes the content of the articles in our example.
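
Steps 2 and 3 of the procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. All names are hypothetical. `jaccard` is a simple stand-in for sim (the paper uses weighted word vectors). The text does not say which predecessors are kept when more than k/2 of them exceed the threshold, so the sketch keeps the most similar ones.

```python
def jaccard(a, b):
    """Stand-in similarity (assumption): keyword-set overlap, not the paper's sim."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(docs, sim, k):
    """Fill s[i][j] only for the k articles preceding j (Constraint 2, 0-based)."""
    n = len(docs)
    s = [[0.0] * n for _ in range(n)]
    calls = 0
    for j in range(1, n):
        for i in range(max(0, j - k), j):
            s[i][j] = sim(docs[i], docs[j])
            calls += 1
    return s, calls


def to_adjacency(s, threshold=0.18, max_preds=2):
    """Set ss[i][j] = 1 when s[i][j] > threshold, keeping at most max_preds
    predecessors per node (choosing the most similar ones is an assumption)."""
    n = len(s)
    ss = [[0] * n for _ in range(n)]
    for j in range(n):
        candidates = [i for i in range(j) if s[i][j] > threshold]
        candidates.sort(key=lambda i: s[i][j], reverse=True)
        for i in candidates[:max_preds]:
            ss[i][j] = 1
    return ss


# Toy keyword lists (illustrative only, loosely modelled on Figure 2).
docs = [
    ["france", "restart", "nuclear", "testing"],
    ["minister", "restart", "nuclear", "testing"],
    ["china", "nuclear", "test", "treaty"],
    ["france", "nuclear", "testing", "september"],
]
k = 2
s, calls = similarity_matrix(docs, jaccard, k)
ss = to_adjacency(s, threshold=0.18, max_preds=k // 2)
print(calls)
for row in ss:
    print(row)
```

With n articles and a window of k, the similarity function is called at most kn times, which is where the O(n) bound for constant k comes from.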
