Lexical Link Analysis
Lexical Link Analysis (LLA) and MMOWGLI
Dr. Ying Zhao, Dr. Douglas Mackinnon, and Dr. Don Brutzman
- News, 9 July 2013: NPS Acquisition Research Program - MMOWGLI, LLA Papers Published
1. Executive Summary
Lexical Link Analysis (LLA) is a form of text mining in which word meanings, represented as lexical terms (e.g., word pairs), are treated as members of a community in a word network. LLA "discovers" and displays these networks of word pairs from large-scale unstructured data. It can be installed as a search and knowledge management tool for scoring and ranking interesting information and for visualizing and reporting correlations among categories or layers of information, including lexical, semantic, and social links. This effort presents the decision maker with previously unavailable or emerging patterns and themes, as well as unprecedented levels of analysis, thus reducing workload, overcoming the blind spots of human analysts, and opening the door to automation. For example, in the recent MMOWGLI games, platforms used to develop and identify new ideas about stated subject matters such as "green" technologies, LLA was leveraged to identify potentially interesting information from "idea cards," link the cards, and then recommend them to matched action plans for the Game Masters. This is discussed in further detail below.
2. Analysis of the Piracy MMOWGLI Game
Figure 1 shows how the game's content and attributes were processed into the LLA inputs: meta_data.txt and a directory of text files containing the card contents.
Figure 1: Idea cards transformed into LLA inputs: a directory of files with the card contents, plus the attributes file meta_data.txt
Figure 2(a) shows semantic networks of cards generated by LLA from the card contents. Cards in the center are more "central" according to various centrality measures. However, the "super interesting" cards may not be central, as shown in Figure 2(b), where the super interesting cards, shown in green, were tagged by human analysts (the Game Masters).
Figure 2: Semantic networks of cards in (a); "super interesting" cards (green in (b)) tend not to be "central"
3. Analysis of the Energy MMOWGLI Game (5/24/12)
3.1 Discover Themes
1st iteration (Figure 3(a)): Compute word-pair clusters using the Newman community-finding algorithm, which groups words as if they belong to a community (Girvan & Newman, 2002).
2nd iteration (Figure 3(b)): Select the lexical terms linked to the most central nodes.
Figure 3: A theme includes a collection of word pairs selected using degree centrality
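The two iterations above can be sketched in code. This is a minimal stand-in with invented sample cards: it groups the word-pair graph by connected components rather than the betweenness-based Girvan–Newman algorithm used in the paper, but the two-step shape (group pairs into themes, then pick central terms by degree) is the same.

```python
from collections import defaultdict

def word_pairs(text):
    """Extract adjacent word pairs (bigrams) from a card's text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def group_into_themes(pairs):
    """1st iteration (simplified): connected components of the
    word-pair graph stand in for Girvan-Newman communities."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, themes = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            w = stack.pop()
            if w in comp:
                continue
            comp.add(w)
            stack.extend(adj[w] - comp)
        seen |= comp
        themes.append(comp)
    return themes

def central_terms(theme, pairs, k=3):
    """2nd iteration: pick the theme's terms with highest degree."""
    degree = defaultdict(int)
    for a, b in pairs:
        if a in theme and b in theme:
            degree[a] += 1
            degree[b] += 1
    return sorted(degree, key=degree.get, reverse=True)[:k]

# Invented sample idea cards
cards = ["solar panel farms power remote bases",
         "algae biofuel production needs solar input"]
pairs = [p for card in cards for p in word_pairs(card)]
themes = group_into_themes(pairs)
```

Because both sample cards mention "solar," their words fall into a single connected theme, and "solar" emerges as the most central term.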
Figure 4: Many centrality measures can be used to evaluate the importance of the terms
Figure 4 shows the relations among the many centrality measures in the social network analysis (SNA) literature; these measures can be used to evaluate the importance of lexical terms. The PageRank algorithm used by Google is based on one such measure.
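As an illustration of two such measures, the sketch below computes degree centrality and a power-iteration PageRank over a small hypothetical word graph (the edge list is invented):

```python
from collections import defaultdict

def degree_centrality(edges):
    """Degree centrality: fraction of the other nodes each node touches."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def pagerank(edges, damping=0.85, iters=50):
    """PageRank by power iteration on an undirected word graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        rank = {v: (1 - damping) / n + damping * sum(
                    rank[u] / len(adj[u]) for u in adj[v])
                for v in nodes}
    return rank

# Invented word-pair edges; "energy" is the hub
edges = [("energy", "solar"), ("energy", "wind"),
         ("solar", "panel"), ("energy", "policy")]
dc = degree_centrality(edges)
pr = pagerank(edges)
```

Both measures agree that the hub word "energy" is the most important term in this toy graph, though in general the measures can disagree, which is why Figure 4 compares them.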
3.2 Recommend Interesting Cards Linked to Action Plans
Figure 5: Cards 1650, 1826, 120, 478, and 341 are interesting
LLA could be used to identify interesting cards during the game that might be recommended to proceed to action plans, thereby reducing the workload of the Game Masters. Figure 5 shows that Cards 1650, 1826, 120, 478, and 341 are interesting, and Table 1 shows the associated concepts for those cards.
Table 1: Associated Concepts for the Cards in Figure 5
Figure 6: Why might Card 2015 be interesting? Associated concepts are LUCITE CONTAINERS, CONTAINERS ACTING, BIOLUMINESCENT ALGAE. No Game Masters were involved here, but this card might be moved to action plans.
Figure 6, Figure 7, Figure 8, and Figure 9 show a few more examples of cards found interesting using LLA.
Figure 7: Why might Card 1138 be interesting? Associated concepts are BATTLE GROUPS, UAV SQUADRONS, CARRIER BATTLE. Card 1138 could possibly be linked to action plans 4, 26, and 23.
Figure 8: Why might Card 283 be interesting? Associated concepts are REDESIGNING CARRIER, BATTLE GROUPS, CARRIER BATTLE.
Figure 9: Why might Card 810 be interesting? Associated concepts are MODIFIED ALGAE, BIOFUEL PRODUCTION, ALGAE BIOFUEL; possibly linked to Action Plan 12.
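One plausible way to automate such card-to-plan recommendations, sketched here with invented concept sets: score each card against each action plan by concept overlap (Jaccard similarity) and keep pairs above a threshold. The actual LLA matching is richer than this; the sketch shows only the overlap idea.

```python
def jaccard(a, b):
    """Overlap of two concept sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(cards, plans, threshold=0.2):
    """Return (card, plan, score) triples whose concept overlap
    meets the threshold, strongest matches first."""
    out = []
    for card_id, card_concepts in cards.items():
        for plan_id, plan_concepts in plans.items():
            s = jaccard(card_concepts, plan_concepts)
            if s >= threshold:
                out.append((card_id, plan_id, round(s, 2)))
    return sorted(out, key=lambda t: -t[2])

# Hypothetical concept sets echoing the figures above
cards = {810: {"modified algae", "biofuel production", "algae biofuel"},
         283: {"redesigning carrier", "battle groups", "carrier battle"}}
plans = {12: {"algae biofuel", "biofuel production", "scale up"},
         4: {"carrier battle", "battle groups", "fleet design"}}
recs = recommend(cards, plans)
```

With these invented sets, Card 810 matches Action Plan 12 and Card 283 matches Action Plan 4, mirroring the figure captions.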
Figure 10: Cards sorted by measures
Figure 10 shows all the cards sorted by various measures with the definitions below:
• Measures for a theme
– Ratio = (# of lexical terms in a theme) / (# of lexical terms in the largest theme)
– Popularity: Ratio >= 2/3
– Anomalousness: Ratio < 1/3
– Emergence: 1/3 <= Ratio < 2/3
• Measures for a card
– Popularity: # of terms in the popular themes
– Anomalousness: # of terms in the anomalous themes
– Emergence: # of terms in the emerging themes
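These definitions transcribe directly into code; the theme sizes below are hypothetical:

```python
def classify_theme(n_terms, largest):
    """Classify a theme by the ratio of its size to the largest theme."""
    ratio = n_terms / largest
    if ratio >= 2 / 3:
        return "popular"
    if ratio < 1 / 3:
        return "anomalous"
    return "emerging"   # 1/3 <= ratio < 2/3

def score_card(card_terms, theme_labels):
    """Count a card's terms falling in each theme class.
    theme_labels maps term -> class of the theme containing it."""
    scores = {"popular": 0, "anomalous": 0, "emerging": 0}
    for t in card_terms:
        if t in theme_labels:
            scores[theme_labels[t]] += 1
    return scores

# Hypothetical theme sizes: largest has 30 lexical terms
sizes = {"theme_a": 30, "theme_b": 25, "theme_c": 12, "theme_d": 4}
largest = max(sizes.values())
labels = {name: classify_theme(n, largest) for name, n in sizes.items()}
```

Sorting cards by these per-class counts reproduces the kind of ranking shown in Figure 10.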
Appendix: Overview of Lexical Link Analysis (LLA)
As in military operations, where the term situational awareness was coined, we note that our efforts can inform awareness of analyzed data in a unique way that helps improve a decision maker's understanding or awareness of its content. We therefore define awareness as the cognitive interface between decision makers and a complex system, expressed in a range of terms or "features," or a specific vocabulary or "lexicon," that describes the attributes and surrounding environment of the system. Specifically, LLA is a form of text mining in which word meanings, represented as lexical terms (e.g., word pairs), are treated as members of a community in a word network. Link analysis "discovers" and displays a network of word pairs. These word-pair networks are characterized by one-, two-, or three-word themes. The weight of each theme is determined by its frequency of occurrence.
Figure 1: Comparing Two Systems Using LLA
Figure 1 shows a visualization of lexical links for two systems, Systems 1 and 2; their shared features appear in the red box. Unlinked, outer vectors (outside the red box) indicate unique system features.
Figure 2: Comparing Three Categories Using LLA
For example, Figure 2 shows how the information from three categories can be compared.
Figure 3: Comparing Two Time Periods
Figure 3 shows how the information from two time periods can be compared.
Figure 4: QAP Correlation via UCINET
The closeness of the systems being compared can be examined visually, or quantitatively using the Quadratic Assignment Procedure (QAP; Hubert & Schultz, 1976), e.g., as implemented in UCINET (Borgatti et al., 2002), to compute the correlation and analyze the structural differences between the two systems, as shown in Figure 4.
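A minimal sketch of the QAP idea, not of the UCINET implementation: correlate the off-diagonal entries of two adjacency matrices, then compare the observed correlation against correlations obtained under random node permutations of one matrix. The matrices below are invented.

```python
import random
from statistics import mean

def offdiag(m):
    """Flatten the off-diagonal entries of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(n) if i != j]

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def qap(a, b, trials=500, seed=0):
    """Observed correlation and the fraction of node-permuted
    correlations that reach it (a permutation p-value)."""
    rng = random.Random(seed)
    n = len(a)
    observed = pearson(offdiag(a), offdiag(b))
    hits = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        permuted = [[b[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(offdiag(a), offdiag(permuted)) >= observed:
            hits += 1
    return observed, hits / trials

# Two small hypothetical adjacency matrices with similar structure
a = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
b = [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
r, p = qap(a, b)
```

A high observed correlation with a small permutation p-value indicates the two systems share structure beyond what random relabeling would produce.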
Figure 5: Word and Term of Themes Discovered and Shown in Colored Groups
Figure 5 shows that each node, or word hub, represents a system feature, and each color refers to the collection of lexical terms (features) that describes a concept or theme. The nodes in the overlapping areas are lexical links. What is unique here is that LLA constructs these linkages via intelligent agent technology using social network grouping methods.
Figure 5 also shows a visualization of LLA with connected keywords or concepts as groups or themes. Words are linked as word pairs that appear next to each other in the original documents. Different colors indicate different clusters of word groups. These clusters were produced using a link analysis method -- a social network grouping method (Girvan & Newman, 2002) in which words are connected, shown in a single color, as if they are in a social community. A "hub" is formed around a word that is centered on, or connected with, a list of other ("fan-out") words, which are themselves centered on other hub words.
Figure 6 shows a detailed view of a theme or word group in Figure 5: the words "analysis, research, approach" are connected and centered around other related words. Here, we use three words, such as "analysis, research, approach," to label a group.
Figure 6: A Detailed View of a Theme or Word Group in Figure 5.
The detailed steps of LLA processing include applying collaborative learning agents (CLA) and generating visualizations, including a lexical network visualization via AutoMap (2009), radar visualization, and matrix visualization (Zhao et al. 2010). The following are the steps for performing an LLA:
· Read each set of documents.
· Select feature-like word pairs.
· Apply a social network community-finding algorithm (e.g., the Newman grouping method; Girvan & Newman, 2002) to group the word pairs into themes. A theme includes a collection of lexical word pairs connected to each other.
· Compute a "weight" for each theme for the information of a time period, that is, count how many word pairs belong to the theme for that time period and across all time periods.
· Sort theme weights by time, and study the distributions of the themes by time.
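The steps above can be sketched end to end; documents and time periods are invented, and the community-finding step is simplified to connected components rather than the Newman algorithm:

```python
from collections import Counter, defaultdict

def lla_pipeline(docs_by_period):
    """docs_by_period: {period: [document strings]}.
    Returns (theme_id, per-period weights) pairs, heaviest first."""
    # Steps 1-2: read each set of documents, select adjacent word pairs.
    pairs_by_period = {}
    for period, docs in docs_by_period.items():
        pairs = []
        for doc in docs:
            words = doc.lower().split()
            pairs.extend(zip(words, words[1:]))
        pairs_by_period[period] = pairs

    # Step 3: group word pairs into themes (simplified to components).
    adj = defaultdict(set)
    for pairs in pairs_by_period.values():
        for a, b in pairs:
            adj[a].add(b)
            adj[b].add(a)
    theme_of, n_themes = {}, 0
    for start in adj:
        if start in theme_of:
            continue
        stack = [start]
        while stack:
            w = stack.pop()
            if w in theme_of:
                continue
            theme_of[w] = n_themes
            stack.extend(adj[w])
        n_themes += 1

    # Step 4: weight = number of word pairs per theme per period.
    weights = defaultdict(Counter)
    for period, pairs in pairs_by_period.items():
        for a, b in pairs:
            weights[theme_of[a]][period] += 1

    # Step 5: sort themes by total weight across all periods.
    return sorted(weights.items(), key=lambda kv: -sum(kv[1].values()))

docs = {"2010": ["wind power grid", "solar power cost"],
        "2011": ["solar power storage", "algae fuel cells"]}
ranked = lla_pipeline(docs)
```

The per-period counters in `ranked` are what one would plot to study theme distributions over time.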
Business Problems that LLA Addresses
General questions that LLA usually answers are as follows:
· Discover themes and topics in unstructured documents and sort the themes by importance.
· Discover the social and semantic networks of the organizations involved, and compare the two networks to obtain insights into the following questions:
· What organizations were involved in the important themes?
· How do semantic networks suggest more potential collaboration when compared to social networks?
Social and Semantic Networks Analysis
Current research in social network analysis mostly focuses on direct associations among people or organizations, regardless of the content linked. The study of so-called centrality (Girvan & Newman, 2002; Feldman, 2007) has been a focal point of social network structure research. Finding the centrality of a network lends insight into the various roles and groupings, such as the connectors (e.g., mavens, leaders, bridges, isolated nodes), the clusters (and who is in them), the network core, and its periphery. We have been working toward the following innovations in network analysis:
· Extract social networks based on entity extraction.
· Extract semantic networks based on the contents and word pairs using LLA.
· Apply characteristics and centrality measures from the semantic and social networks to predict latent properties of the social networks, such as emerging leadership, for example, emerging techniques that might come to dominate. The characteristics are further categorized into themes and time-lined trends for prediction of future events.
Figure 7: Collaborative Learning Agents
In the past few years, we began at the Naval Postgraduate School (NPS) by using Collaborative Learning Agents (CLA; QI, 2009; Figure 7) and expanded to other tools, including AutoMap (CASOS, 2009), for improved visualizations. Results from these efforts arose from leveraging intelligent agent technology via an educational license with Quantum Intelligence, Inc. CLA is a computer-based learning agent, or agent collaboration, capable of ingesting and processing data sources.
We have been generating visualizations, including a lexical network visualization, using various open-source tools. We began by using the Organizational Risk Assessment (ORA; Center for Computational Analysis of Social and Organizational Systems [CASOS], 2009) tool and expanded to other tools. For example, in the past year, we developed 3-D network views using Pajek (Pajek, 2011) and X3D (X3D, 2011). We also developed our own visualizations, the Radar view and Match view (Zhao et al., 2010).
LLA uses the computer-based Collaborative Learning Agents (CLA; QI, 2009) to employ an unsupervised learning process that separates patterns from anomalies. The unsupervised agent learning is implemented by indexing each set of documents separately and in parallel, using multiple learning agents that work collaboratively. To handle large-scale data, we set up a cluster of Linux servers in the NPS High Performance Computing Center (HPC), along with a secure environment in the NPS Secure Technology Battle Laboratory (STBL).
Relations to Other Methods
The LLA approach is closely related to Latent Semantic Analysis (LSA; Dumais, Furnas, Landauer, Deerwester, & Harshman, 1988) and Probabilistic Latent Semantic Analysis (PLSA). In the LSA approach, a term-document matrix is the starting point for analysis. The elements of the term-document or feature-object (term as feature, document as object) matrix are the occurrences of each word in a particular document, i.e., A = (a_ij), where a_ij denotes the frequency with which term j occurs in document i. The term-document matrix is usually sparse. LSA uses singular value decomposition (SVD) to reduce the dimensionality of the term-document matrix. SVD becomes impractical when the vocabulary (the number of unique terms) in the document collection is large. LSA has been widely used to improve information indexing, search/retrieval, and text categorization.
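A minimal LSA sketch on a tiny hypothetical term-document matrix, assuming NumPy is available (rows here are documents, columns are terms):

```python
import numpy as np

# a_ij = count of term j in document i. Invented corpus:
# docs 0-1 share the first two terms, doc 2 uses the other two.
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2]], dtype=float)

# Truncated SVD: keep the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

# Each document becomes a k-dimensional "concept" vector.
doc_vecs = U[:, :k] * s[:k]

def cos(u, v):
    """Cosine similarity between two concept vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the reduced concept space, documents 0 and 1 are nearly identical while document 2 is orthogonal to them, which is the dimensionality-reduction effect LSA exploits for retrieval.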
A recent development related to this method is Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003), a generative probabilistic model of a corpus. In LDA, a document is considered to be composed of a collection of words -- a "bag of words," where word order and grammar are not considered important. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a statistical distribution (a Dirichlet distribution) over the corpus. Theme generation in LLA is different from LDA: in LLA, a collection of lexical terms is connected semantically, as if in a social community, and social network grouping methods are used to group the words. Our method scales easily to a large vocabulary and generalizes to any sequential data.
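The bag-of-words contrast can be made concrete: LDA's representation is unchanged when a document's words are reordered, whereas LLA's adjacent word pairs do change. A small stdlib illustration with invented sentences:

```python
from collections import Counter

def bag_of_words(text):
    """LDA-style representation: word counts only, order ignored."""
    return Counter(text.lower().split())

def adjacent_pairs(text):
    """LLA-style representation: pairs of words that appear next
    to each other, so word order matters."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

d1 = "carrier battle groups need uav squadrons"
d2 = "uav squadrons need carrier battle groups"   # same words, reordered

assert bag_of_words(d1) == bag_of_words(d2)       # bag of words: identical
assert adjacent_pairs(d1) != adjacent_pairs(d2)   # word pairs: different
```

This is why LLA generalizes to sequential data: the pair representation retains local ordering that the bag-of-words representation discards.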
The method provides solutions that meet critical needs of acquisition research; its key advantage is an innovative, near real-time self-awareness system that transforms diversified data services into strategic decision-making knowledge, detailed as follows:
· Automation: The high correlation of LLA results with link analysis done by human analysts makes automation possible, saving human effort and improving responsiveness. Automation is achieved via a computer program or software agent(s) that perform LLA frequently, and in near real-time: agent learning makes real-time operation possible; visualization connects lexical links to core measures; and features and patterns are discovered over time for the system as a whole. We can take advantage of data in motion (e.g., Twitter and other social media sites) and RSS feed data to build a better picture of real-time program awareness.
· Discovery: LLA "discovers" and displays a network of word pairs. These word-pair networks are characterized by one-, two-, or three-word themes. The weight of each theme is determined by its frequency of occurrence. LLA may also discover blind spots in human analysis caused by the overwhelming volume of data that human analysts must go through.
· Validation: As we continue validating LLA by direct correlation with human analysts' results, new dimensions of using LLA to validate human analysis also add to the advantages of our methodology. For instance, LLA may provide different perspectives on links. In the acquisition context, links discovered by human analysts may emphasize component/part connections and do not necessarily reflect content overlaps; moreover, interdependencies identified by human analysts, e.g., program managers, might help programs stay funded from year to year by building up their own importance rather than reducing cost for the government. LLA instead looks for overlap in content, with the aim of improving affordability and meeting the requirements of warfighters. Consequently, it provides better results in terms of trust and quality of association discovery, breaks through the "taxonomy of ignorance" (Denby & Gammack, 1999) and organizational boundaries, and improves organizational reach.
The US DoD acquisition process is extremely complex. Three key processes must work in concert to deliver the capabilities required by the warfighters: the requirements process; the acquisition process; and the planning, programming, budget, and execution (PPBE) process. Each process produces a large amount of data in an unstructured manner. There has been a critical need for automation, validation, and discovery to help acquisition professionals, decision makers, and researchers reveal the interrelationships among the data elements and business processes. We applied LLA to extract the links, compare the trends, and discover previously unknown patterns (Zhao et al., 2010, 2011a) in data from the three services (Army, Navy, and Air Force) over the past ten years.
Multi-agency Radiological Responses Plan and Exercise
Every year, the US DHS spends large amounts of money to conduct training, exercises, and simulations to prepare for emergency responses. These exercises often involve processes such as planning, organizing, directing, and monitoring the activities and collaborations of multiple agencies, which generate large amounts of unstructured data for sensemaking. We performed a case study involving a multi-agency radiological response plan and exercise. The responders were asked to follow field manuals, which included Standard Operating Procedures (SOPs), playbooks, and job aids. The recorded audio communications were transcribed into text data that could be used for analysis. Lexical link analysis was used to summarize themes and concepts and to discover the order of importance of the events.
Facebook, Twitter, and many other social networking sites offer virtual environments for meeting possible candidates that could fit service entry profiles. Sponsored by the Navy Recruiting Command, the goal of this project was to collect and match large-scale Facebook public fan and group profiles with Navy-enlisted and officer-rating documents to improve future Navy Recruiting and advertising efforts. Collecting samples of Facebook data was critical for this project. We collected and analyzed the public “footprints” of Facebook users using LLA, which resulted in a list of selected individuals who could become strong officer candidates for the U.S. Navy.
Navy Chief of Information (CHINFO)
The discursive space is where discourse takes place, and it can indicate how the discourse is shaped by the influence of the stakeholders in that space. We leveraged LLA to determine how CHINFO's strategic communications proliferate through various open sources. The case study analyzed the 2006 U.S. Coast Guard Live Fire case, in which the Coast Guard planned a live-fire training program in the Great Lakes area of Michigan. The program failed in the end because of public opposition. We applied LLA to a four-month record of public discourse, about 980 public comments and 200 pages of public meeting transcripts, linking all associated comments and then generating semantic networks over time by stakeholder group.
APAN Network and Haiti Operation Data Analysis
In the aftermath of the Haiti earthquake, U.S. military and civil organizations provided rapid and extensive relief operations. The challenge is to sift through the data collected in real-life events to create an overall picture of how various organizations (military and civil) actually collaborated. SOUTHCOM and USAID used Twitter and the HAITI HA/DR Community of Interest (COI) on the All Partners Access Network (APAN) to handle real-time information gathering and dissemination during the crisis. We analyzed approximately 10,000 documents collected from these platforms, including Twitter, Facebook, news-feed Web sites, official PDF briefing documents, situation reports, forums, and blogs. The sensemaking goal was to use LLA to develop utilities and measures for analyzing trends in interagency synergy (Zhao et al., 2011a).
To collect data in the area of intelligence analysis and make sense of data sources such as HUMINT (human intelligence), we performed a feasibility study using a few months of data comprising approximately 1,500 reports, each representing a separate event. Improvised explosive device (IED)-specific data, for example, include post-blast information and after-action reports from the Combined Explosives Exploitation Cell (CEXC; Phillips, 2003), as well as data from other reporting tools used in Iraq and Afghanistan for war activities such as target development, civil affairs, psychological operations, engagement, or indirect fires. Our efforts demonstrated the capability to reconstruct social networks of people, places, and events, as well as to reveal trends and perhaps predict future events.
Identification of NATO Capability Requirements
We apply LLA to analyze the documents that support the current process for identifying NATO capability and force requirements. Who are the stakeholders, i.e., the US and Allied organizations involved in the current process? What are the current social networks, i.e., who talks to whom to identify capability requirements? We further determine who inside DoD is involved in the process, and what their communication strategies, i.e., rhetoric and themes, are. We apply the findings to improve understanding of the current process, to improve EUCOM visibility, and to facilitate, encourage, and recommend new collaborations toward "Smart Defense."
References
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Retrieved from http://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf
Borgatti, S. P., Everett, M. G., & Freeman, L. C. (2002). Ucinet for Windows: Software for social network analysis. Harvard, MA: Analytic Technologies.
Center for Computational Analysis of Social and Organizational Systems (CASOS). (2009). AutoMap: Extract, analyze and represent relational data from texts. Retrieved from http://www.casos.cs.cmu.edu
Dumais, S. T., Furnas, G. W., Landauer, T. K., & Deerwester, S. (1988). Using latent semantic analysis to improve information retrieval. In Proceedings of CHI '88: Conference on Human Factors in Computing (pp. 281–285).
Denby, E., & Gammack, J. (1999). Modelling ignorance levels in knowledge-based decision support. Retrieved from http://wawisr01.uwa.edu.au/1999/DenbyGammack.pdf
Gallup, S. P., MacKinnon, D. J., Zhao, Y., Robey, J., & Odell, C. (2009, October 6–8). Facilitating decision making, re-use and collaboration: A knowledge management approach for system self-awareness. In Proceedings of the International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management (IC3K), Madeira, Portugal.
Girvan, M., & Newman, M. E. J. (2002, June). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, USA, 99(12), 7821–7826.
Hubert, L., & Schultz, J. (1976). Quadratic assignment as a general data-analysis strategy. British Journal of Mathematical and Statistical Psychology, 29, 190–241.
Pajek. (2011). Retrieved from http://vlado.fmf.uni-lj.si/pub/networks/pajek/
X3D. (2011). Retrieved from http://www.web3d.org
Quantum Intelligence (QI). (2009). Collaborative learning agents (CLA). Retrieved from http://www.quantumii.com/qi/cla.html
Zhao, Y., Gallup, S. P., & MacKinnon, D. J. (2010). Towards real-time program-awareness via lexical link analysis. In Proceedings of the 7th Annual Acquisition Research Symposium, Monterey, CA, May 11–13, 2010.
Zhao, Y., Gallup, S. P., & MacKinnon, D. J. (2011). Towards real-time program awareness via lexical link analysis (Acquisition Research Sponsored Report Series, NPS-AM-10-174). Monterey, CA: Naval Postgraduate School.
Zhao, Y., Gallup, S. P., & MacKinnon, D. J. (2011). A web service implementation for large-scale automation, visualization and real-time program-awareness via lexical link analysis. In Proceedings of the Eighth Annual Acquisition Research Symposium, Monterey, CA, May 11–12, 2011.
Zhao, Y., Gallup, S. P., & MacKinnon, D. J. (2011). Lexical link analysis for the Haiti earthquake relief operation using open data sources. In Proceedings of the 6th International Command and Control Research and Technology Symposium (ICCRTS), Québec City, Canada, June 21–23, 2011.
Zhao, Y., Gallup, S. P., & MacKinnon, D. J. (2011). System self-awareness and related methods for improving the use and understanding of data within DoD. Software Quality Professional Magazine, American Society for Quality, June 2011.