
Information Extraction

Tags
text-processing

The [see page 6, task] of creating semantically annotated text from unstructured information.

Information extraction is essentially structuring (in a database, with XML tagging, etc.) the annotated data we parse from unstructured sources; this can then be [see page 7, applied] for summarisation, index creation, or searching/analysis.

Kinds of information can [see page 8, include]:

Type           Meaning
Entities       Persons, organisations, locations, times, etc.
Relationships  Links between entities
Events         Succession events, occurrences, etc.
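
For instance (the sentence, types, and field names below are illustrative assumptions, not from the slides), a sentence such as "Jane Smith joined Acme Corp in 2010." could yield structured records like:

    # Hypothetical structured output for "Jane Smith joined Acme Corp in 2010."
    entities = [
        {"text": "Jane Smith", "type": "PERSON"},
        {"text": "Acme Corp", "type": "ORGANISATION"},
        {"text": "2010", "type": "TIME"},
    ]
    relations = [("Jane Smith", "employee_of", "Acme Corp")]
    events = [{"type": "hiring", "person": "Jane Smith", "org": "Acme Corp", "time": "2010"}]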

See [see page 19, applications] of IE.

Tasks in IE broadly correspond to the kinds of information above: recognising entities, extracting relations between them, and detecting events.

[see page 18, Advantages vs Disadvantages]

  • Facts produced can be fed into other powerful applications (semantic indexing, data-mining)

  • Systems are generally domain-specific and porting them to other domains can be time-consuming

  • Accuracy is limited

  • Computationally demanding

Approaches

Knowledge Engineering

Divided into:

Type     Description
Deep     Understand how language is constructed and get information out of that.
Shallow  Define patterns and actions to run on pattern matches.

The Shallow approach is like the [see page 34, rule-networks] we encountered in year 1: we define patterns (containing wild-cards as slots for words) and actions to execute when a match is found.
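
A minimal sketch of the pattern-and-action idea, using a regular expression as the pattern; the pattern, relation name, and example sentence are illustrative assumptions:

    import re

    # Hypothetical pattern: "<PERSON>, CEO of <ORG>" with wild-card slots for the names.
    PATTERN = re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), CEO of (?P<org>[A-Z][A-Za-z&. ]+)")

    def extract_ceo_relation(sentence):
        """Action: when the pattern matches, emit a structured (person, relation, org) fact."""
        match = PATTERN.search(sentence)
        if match is None:
            return None
        return {"entity1": match.group("person"),
                "relation": "CEO_of",
                "entity2": match.group("org").strip()}

    print(extract_ceo_relation("Jane Smith, CEO of Acme Corp, announced record profits."))
    # {'entity1': 'Jane Smith', 'relation': 'CEO_of', 'entity2': 'Acme Corp'}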

The [see page 16, deep] method parses the input to identify grammatical relations and then applies rules to the parser output (parse trees) to extract relations and events.
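
A sketch of the deep approach, using spaCy's dependency parser as a stand-in for whichever parser the course uses; the model name and the subject-verb-object rule are assumptions:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def extract_svo(sentence):
        """Rule over the parse: a verb with a nominal subject and a direct object
        yields a (subject, verb, object) relation triple."""
        doc = nlp(sentence)
        triples = []
        for token in doc:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ == "nsubj"]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
                for subj in subjects:
                    for obj in objects:
                        triples.append((subj.text, token.lemma_, obj.text))
        return triples

    print(extract_svo("Google acquired YouTube in 2006."))
    # expected: [('Google', 'acquire', 'YouTube')]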

See [see page 17, advantages vs. disadvantages].

Supervised Learning

[see page 35, Train] a program by having it use the k words on either side of an entity mention, or the words to the left of entity 1 and to the right of entity 2 (plus the words in between). We teach the system to [see page 19, learn]:

  • rules: patterns that match extraction targets and their arguments
  • classify: tokens as beginning/inside/outside (B/I/O) of a tag type, or whole sentences, e.g. does this sentence bear a relation between 2 entity types? (see the sketch below)

ML techniques used include covering algorithms, HMMs, SVMs, etc.
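
A minimal sketch of the B/I/O token-classification framing; the example sentence and gold labels are illustrative assumptions:

    # Tokens of an example sentence paired with gold B/I/O labels for PERSON and ORG tags.
    tokens = ["Jane", "Smith", "joined", "Acme", "Corp", "in", "2010", "."]
    labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O", "O"]

    # Training data for a token classifier is simply (token-in-context, label) pairs.
    training_examples = list(zip(tokens, labels))
    print(training_examples)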

[see page 20, Examples] of features that can be used in this approach include:

Feature Type         Description
Entity features      e.g. entity 1 type, entity 1 name, etc.
Word-based features  Number of words between the 2 entities, before the 1st entity, or after the 2nd
Syntactic features   Syntax path, e.g. Noun-Phrase, Noun-Phrase, Subject
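
A sketch of how these feature types could be computed for one candidate relation mention; the sentence, entity spans, and feature names are illustrative assumptions (syntactic features would need a parser and are omitted):

    sentence = ["Jane", "Smith", ",", "CEO", "of", "Acme", "Corp", ",", "resigned", "."]
    entity1 = {"type": "PERSON", "span": (0, 2)}  # "Jane Smith"
    entity2 = {"type": "ORG", "span": (5, 7)}     # "Acme Corp"

    features = {
        # Entity features
        "entity1_type": entity1["type"],
        "entity2_type": entity2["type"],
        # Word-based features
        "words_between": entity2["span"][0] - entity1["span"][1],
        "words_before_entity1": entity1["span"][0],
        "words_after_entity2": len(sentence) - entity2["span"][1],
        "words_between_bag": sentence[entity1["span"][1]:entity2["span"][0]],
    }
    print(features)  # fed to the classifier alongside the relation label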

See [see page 22, advantages vs. disadvantages].

Bootstrapping Method

A [see page 36, minimum-supervision] approach which [see page 24, uses]:

Term                 Meaning
Document collection  Containing text corpora to drive information discovery.
Seed tuples          Entity (or location) pairs that have some relation
Seed patterns        Patterns in which tuples can be placed to extract meaning (see Knowledge Engineering).

We essentially keep matching tuples against patterns, creating new patterns and tuples along the way, until convergence. We supply an initial seed and then let the system accumulate information as required.
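
A minimal sketch of the bootstrapping loop; the helpers find_sentences_with_tuple(), induce_pattern(), and apply_pattern() are hypothetical names standing in for the corpus search, pattern generalisation, and pattern matching steps:

    def bootstrap(corpus, seed_tuples, seed_patterns, max_iterations=10):
        tuples, patterns = set(seed_tuples), set(seed_patterns)
        for _ in range(max_iterations):
            new_tuples, new_patterns = set(), set()
            # 1. Use known tuples to induce new patterns from sentences that mention them.
            for t in tuples:
                for sentence in find_sentences_with_tuple(corpus, t):
                    new_patterns.add(induce_pattern(sentence, t))
            # 2. Use known (and newly induced) patterns to extract new tuples from the corpus.
            for p in patterns | new_patterns:
                new_tuples.update(apply_pattern(corpus, p))
            if new_tuples <= tuples and new_patterns <= patterns:
                break  # convergence: nothing new was found
            tuples |= new_tuples
            patterns |= new_patterns
        return tuples, patterns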

See [see page 30, example] tuple format.

See [see page 40, advantages vs. disadvantages]; note in particular semantic drift.

Distant Supervision Approach

See [see page 37, here].

Use an [see page 42, existing] structured data source (e.g. a database) to find sentences in a document collection that express known relations, and use these sentences as training data for a supervised relation extractor (ML algorithm).

The main [see page 45, assumption] of this approach is:

Any sentence containing two entities with a known relation might express that relation, so tag each such occurrence as a mention of the relation.

Note: this approach requires a [see page 46, negative instance] to be able to say that a relation doesn't exist.
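
A minimal sketch of the distant-supervision labelling step; the knowledge base, relation name, and naive string matching are illustrative assumptions:

    knowledge_base = {("Google", "YouTube"): "acquired"}

    def label_sentences(sentences, kb):
        """Tag each sentence containing a known entity pair as a mention of that pair's relation."""
        labelled = []
        for sentence in sentences:
            for (e1, e2), relation in kb.items():
                if e1 in sentence and e2 in sentence:
                    labelled.append((sentence, e1, e2, relation))
        # Entity pairs that co-occur in a sentence but have no relation in the KB can be
        # kept as negative ("no relation") instances, as the note above requires.
        return labelled

    sentences = ["Google acquired YouTube in 2006.", "Google opened a new office in London."]
    print(label_sentences(sentences, knowledge_base))
    # [('Google acquired YouTube in 2006.', 'Google', 'YouTube', 'acquired')]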

See [see page 47, advantages vs. disadvantages].

Evaluation

Using a manually produced annotated corpus (test cases with known results), we compare the correct answers (AKA keys, references, ground truth) with our system results (AKA responses, hypotheses) using some principles of evaluation.

The principal evaluation metrics include:

Term       Description
Precision  How much of what the system returns is correct?
Recall     How much of what is correct does the system return?
F-Measure  Weighted combination of precision and recall (see [see page 40, here])
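
In the usual formulation (standard definitions, not taken from the slides), with "correct responses" meaning system responses that match a key:

\[ P = \frac{\text{correct responses}}{\text{total responses}}, \qquad R = \frac{\text{correct responses}}{\text{total keys}}, \qquad F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R} \]

With \(\beta = 1\) this gives the common \(F_1 = \frac{2PR}{P+R}\).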

We often [see page 41, combine] precision and recall into micro and macro averages (Note: The variables \(R_i\) or \(H_i\) accumulate the sums of occurrences in the respective row/column).

  • Micro precision - The sum over all classes of correctly classified samples, divided by the sum over all classes of samples the system classified as each class.
  • Micro recall - The sum over all classes of correctly classified samples, divided by the sum over all classes of samples that truly belong to each class.

Note: Micro measures are often used when the data set is balanced (each class in the corpus is seen an equal number of times). If used on an unbalanced set, the metric will be biased in favour of the more frequent classes.

  • Macro precision - Sum the precision of each individual class then divide by the number of classes.
  • Macro recall - Same as macro precision but for recall.
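
Written out (using per-class true positives \(TP_i\), false positives \(FP_i\), false negatives \(FN_i\), and \(N\) classes; this notation is an assumption, the slides' \(R_i\)/\(H_i\) row/column sums play the same role):

\[ P_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad R_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}, \qquad P_{macro} = \frac{1}{N}\sum_i P_i, \qquad R_{macro} = \frac{1}{N}\sum_i R_i \]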
