Text Classification :: Character, Letter, Word, Line, Sentence, Paragraph, Page

Classification of Text  
When designing documents electronically, either in a word processor (like MS Word) or for the web (a web page), text can be classified into characters, words, lines, sentences, paragraphs, pages and sections.

• Classification for the purpose of Formatting Content

For the purpose of formatting, content is identified as characters and paragraphs only. Character formatting options can be applied to one or more characters. Paragraph formatting options, which are distinct from the character formatting options, can be applied to one or more paragraphs.

Characters, Letters & keyboard keys  
Letters, numbers and other symbols used in a language form the Character Set for the language. The English language character set consists of upper case letters, lower case letters, digits from 0 to 9, and other special symbols like #, $, @ etc.

English Character Set

Upper Case Letters » A B C ..... Z
Lower Case Letters » a b c ..... z
Digits » 0 1 2 ..... 9
!  Exclamation mark                           "  Quotation mark
#  Cross hatch (number sign)                  $  Dollar sign
%  Percent sign                               &  Ampersand
`  Opening single quote                       '  Closing single quote (apostrophe)
(  Opening parenthesis                        )  Closing parenthesis
+  Plus                                       ,  Comma
-  Hyphen, dash, minus                        .  Period
/  Slant (forward slash, divide)              \  Reverse slant (backslash)
=  Equals sign                                ;  Semicolon
<  Opening angle bracket (less than sign)     >  Closing angle bracket (greater than sign)
?  Question mark                              @  At-sign
[  Opening square bracket                     ]  Closing square bracket
:  Colon                                      ^  Caret (circumflex)
_  Underscore                                 *  Asterisk (star, multiply)
{  Opening curly brace                        }  Closing curly brace
|  Vertical line (pipe)                       ~  Tilde (approximate)

You can enter characters into the document using your keyboard. Some characters are entered by pressing a single key, and others by pressing two or more keys together (to enter the "+" symbol, we hold down the Shift key and press the relevant key).

Each character is identified by a distinct numeric code, called its ASCII code. When you press a key or a combination of keys, the computer works out the code that corresponds to the keystrokes and, based on this, displays or writes the relevant character to the document.
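The link between a character and its code can be seen with a few lines of Python. This is only an illustrative sketch: the function names ascii_code and character_for are made up for this example, while ord and chr are Python built-ins.

```python
def ascii_code(character: str) -> int:
    """Return the numeric code of a single character."""
    return ord(character)


def character_for(code: int) -> str:
    """Return the character that a numeric code stands for."""
    return chr(code)


# Print the code for a few characters from the English character set.
# The space bar also produces a character with its own code.
for ch in ["A", "a", "0", "+", " "]:
    print(repr(ch), "->", ascii_code(ch))
```

Note that the upper case and lower case forms of a letter have different codes, which is why the computer treats "A" and "a" as two distinct characters.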

• Toggle Keys

Some of the keys on the keyboard work as toggle keys. They have two states, either on or off. Pressing the key once switches it on and pressing it again switches it off. The Caps Lock key, the Num Lock key on the numerical keypad and the Scroll Lock key on the navigational keypad are the three toggle keys on your keyboard. The toggle keys have indicator lights just above the numerical keypad that show whether they are on or off. If an indicator light is glowing, the key is on; otherwise it is off.

• Entering Upper Case Letters and Symbols

To enter capital letters we either hold down the Shift key while pressing the relevant letter key or keep the Caps Lock key on.

Where there are two or more characters on a key, the character on the upper part of the key can be entered into the document by holding down the Shift key while pressing the relevant key. The Caps Lock key is not useful for this.

• Entering Numbers

The keys on the numerical keypad can be used to enter numbers only if the Num Lock key is switched on. When the Num Lock key is off the numerical keypad can be used as a navigational key pad. Those keys on the numerical key pad with only one character marked on them work in both states.

• Complex Language Scripts

In languages other than English whose scripts are complex, as with most Arabic and Asian scripts, a letter is in most cases formed from two or more characters. Each keystroke may then insert only a part of the letter, and that part is itself called a character.


One or more letters (characters) without any space in between are recognised as a word by the computer. Two words are separated by white space. Normally this space is created using the space bar (the longest key, located in the bottom row of the standard keypad), but it need not be so. Any white space between two characters is treated as a word separator.

Whenever a key is pressed, a character is placed in the document. This is true even of the space bar and other white space keys (the characters in their case being invisible). That is why we see white space being entered when the space bar is pressed.
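The idea that any white space separates words can be sketched in Python. The function name split_into_words is made up for this example; it relies on the built-in str.split, which, when called without arguments, splits on any run of white space (spaces, tabs, line breaks).

```python
def split_into_words(text: str) -> list[str]:
    """Split text into words, treating any run of white space as a separator."""
    return text.split()


# A space, a tab and a line break all act as word separators here.
print(split_into_words("one two\tthree\nfour"))
```

Text made up of white space alone contains no words, so the same function returns an empty list for it.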

One or more words terminated by a period (full stop) and followed by two white spaces created using the space bar form a sentence. All the words, the period and the two white spaces together are treated as a sentence by the computer.

We know that the first letter of a sentence is to be capitalised. In a word processing program (like Microsoft Word) when we complete entering a sentence and start typing the next sentence, we would notice the first letter getting capitalised automatically.

The fact that you type the letter after a period followed by two space bar white spaces is what lets the computer identify it as the starting letter of a sentence.

A paragraph mark would also terminate a sentence. This is for the reason that every new paragraph starts with a new sentence.
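The sentence rule described above (a period followed by two space bar spaces, or a paragraph mark) can be sketched as follows. This is a simplified illustration, not how any particular word processor is implemented: the function name split_into_sentences is made up, and the paragraph mark is assumed to be represented by a newline character.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Split text at a period followed by two spaces, or at a paragraph mark."""
    # (?<=\.) keeps the period with the sentence it ends;
    # "\n" stands in for the paragraph mark (an assumption of this sketch).
    pieces = re.split(r"(?<=\.) {2}|\n", text)
    # Drop pieces that contain no visible characters.
    return [piece for piece in pieces if piece.strip()]


print(split_into_sentences("Mr. Smith left.  He ran home.\nThe end"))
```

Under this rule a period followed by a single space, as in "Mr. Smith", does not end a sentence.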

One or more sentences together form a paragraph. Electronic documents use a special invisible character to recognise a paragraph. It is placed at the end of the paragraph and is called a paragraph terminator or paragraph mark.

The computer recognises the presence of a paragraph only if the paragraph mark is preceded by at least one visible character. Say, if there are many space bar spaces followed by a paragraph mark, it would not be recognised as a paragraph. In other words, a set of characters terminated by a paragraph mark would be recognised as a paragraph only if there is at least one visible character in the set.
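The paragraph rule can be sketched in the same way. Again this is only an illustration under stated assumptions: the paragraph mark is represented by a newline character, and the function name split_into_paragraphs is made up for this example.

```python
def split_into_paragraphs(text: str) -> list[str]:
    """Return the runs of text that end in a paragraph mark and hold a visible character."""
    # "\n" stands in for the paragraph mark (an assumption of this sketch).
    candidates = text.split("\n")
    # A candidate made up only of white space is not counted as a paragraph.
    return [candidate for candidate in candidates if candidate.strip()]


document = "First paragraph.\n     \nSecond paragraph.\n"
print(len(split_into_paragraphs(document)))
```

The run of space bar spaces between the two paragraph marks is skipped because it holds no visible character.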

Lines » Word Wrapping  
A paragraph of text is a unit in electronic documents. All the characters in a paragraph form a single stream of characters (visible and invisible). The words in a paragraph are separated by space bar spaces or line breaks.

If the text within a paragraph does not fit into a single line, it would automatically flow into the next line. This process of text flowing into the next line automatically when it comes to the end of a line is called "Word Wrapping".

A line has no distinct measurement in electronic documents. It is a virtual unit that arises when the words within a paragraph are wrapped into lines. A line in a document or text area is as long as the width of the document or the text area reduced by the left and right margin spaces, if any (which is the width of the paragraph). Whenever the width of the paragraph changes, the width of the lines in the paragraph changes with it.

The number of lines that the paragraph occupies is not a static figure unless the width of the paragraph or any container holding the paragraph (table, frame etc) is fixed. If the width of the document is changed, the length of the paragraph as well as the lines within it would change. The words would rearrange themselves automatically to fit the new line length or paragraph width. This rearrangement would take place every time the dimension is changed.
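Word wrapping can be tried out with Python's standard textwrap module. This is a rough sketch: width is measured here in characters rather than in the physical units a word processor would use, and the function name wrap_paragraph is made up for this example.

```python
import textwrap

paragraph = ("If the text within a paragraph does not fit into a single line, "
             "it automatically flows into the next line.")


def wrap_paragraph(text: str, width: int) -> list[str]:
    """Wrap one paragraph into lines no longer than width characters."""
    return textwrap.wrap(text, width=width)


# The same stream of characters occupies a different number of lines
# each time the paragraph width changes.
for width in (30, 60):
    lines = wrap_paragraph(paragraph, width)
    print(f"width {width}: {len(lines)} lines")
    for line in lines:
        print("  " + line)
```

The words themselves never change; only the points at which the stream is broken into lines move as the width changes.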

The number of lines that make up a paragraph and the number of sentences within a paragraph are two different ideas.

Spell Checking » Dictionary  

• Spelling

In any language, one or more letters together form a word. Not every word form that can be built from the characters and symbols of a language has a specific meaning attached to it. Only those word forms that have a meaning are what we call words within that language.

• Dictionary

All the words within a language that have a meaning are listed in a file which we call a dictionary. Though in general we use the word dictionary to mean something that contains the meanings of words within the language, in electronic documents (word processing) it also means this plain list of words.

A dictionary file is nothing but a simple text file containing a list of words, each on a distinct row/line. Each row consists of nothing but the characters of the word. Words are separated by a paragraph mark [this is created by placing the cursor at the end of the word and pressing the "ENTER" key].

• How is Spelling Mistake Identified?

To identify whether a word form is correctly spelled or not, the program would compare the word that we enter with the list of words in the dictionary file. If it finds the word form in the file it is assumed to be correct, otherwise the word form is assumed to be wrongly spelled.
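The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function names and the tiny word list are made up, and the sketch additionally ignores letter case and surrounding punctuation, which the text does not discuss.

```python
def load_dictionary(lines) -> set[str]:
    """Build a word list from dictionary-file rows, one word per row."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_misspelled(text: str, dictionary: set[str]) -> list[str]:
    """Return the word forms in text that are not found in the dictionary."""
    # Strip surrounding punctuation and ignore case (assumptions of this sketch).
    cleaned = [word.strip(".,;:!?\"'()").lower() for word in text.split()]
    return [word for word in cleaned if word and word not in dictionary]


# Stand-in for the rows of a dictionary file; each row ends with a line break.
dictionary_file_lines = ["the\n", "cat\n", "sat\n", "on\n", "mat\n"]
dictionary = load_dictionary(dictionary_file_lines)

print(find_misspelled("The cat szat on the mat.", dictionary))  # prints ['szat']
```

A correctly spelled word that is missing from the file would also be reported, which is why spell checkers usually let you add words to the dictionary.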

Conducting spell check on post text  
The blogger program has a dictionary file which contains a list of words (say, the list of all the words found in a dictionary). When you ask the program to check the spelling of the post you are composing, it scans all the text in the text area where you entered the post body and compares each word in it with the list of words in its file.

The word forms that it cannot find in its dictionary file are marked as errors.

• No Undo and Redo

A spelling correction cannot be undone, and so there is no redo option either for steps involving spelling corrections.

Web Page » Word Wrapping  
A web page is also a document. However, unlike a document used in a word processor, it does not have a fixed width. The width of the web page is the width of the window in which it is being displayed. Whenever you increase or decrease the size of the window, the width of the web page changes with it.

A paragraph is an element of a web page. Its width is equal to the width of the web page (i.e. the browser window) except when the width of the paragraph or the width of any element holding the paragraph is defined. Therefore, the number of lines into which a paragraph is wrapped is dependent on its width, which in effect is dependent on the width of the browser window. Thus whenever the browser window is resized, the number of lines within a paragraph would change.

• Forcing Width for a Web Page

The width of a web page can be set to a fixed dimension indirectly. This is done by giving an absolute width to any element in the web document. Say there is a table, division, paragraph, horizontal line etc. with a specified width (not a width in %, which would be interpreted as a percentage of the window width): resizing the browser window does not change its size. Instead, when the window becomes narrower than the element, a horizontal scroll bar is added to the browser window.
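The difference between a percentage width and an absolute width can be modelled with a small calculation. This is a deliberately simplified sketch and not how a browser computes layout (margins, padding and nesting are ignored); the function names and the "50%" / "600px" notation for width values are assumptions made for this example.

```python
def rendered_width(width_spec: str, window_width: int) -> int:
    """Width in pixels of an element whose width is given as '50%' or '600px'."""
    if width_spec.endswith("%"):
        # A percentage is interpreted relative to the window width.
        return window_width * int(width_spec[:-1]) // 100
    if width_spec.endswith("px"):
        # An absolute width ignores the window width altogether.
        return int(width_spec[:-2])
    raise ValueError(f"unsupported width: {width_spec!r}")


def needs_horizontal_scroll(width_spec: str, window_width: int) -> bool:
    """True when the element is wider than the window that displays it."""
    return rendered_width(width_spec, window_width) > window_width


for window_width in (800, 400):
    print(window_width,
          rendered_width("50%", window_width),
          rendered_width("600px", window_width),
          needs_horizontal_scroll("600px", window_width))
```

In this model the percentage width shrinks with the window, while the absolute width stays the same and forces a scroll bar once the window is narrower than the element.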

Web Page Length  
In word processing documents a page has a certain length and width. However, in a web document, there is no length or height for a page. Each file is a single document as well as a page on the web. The length of a web page would be the length of the browser window that is needed to display the total content of the web page.

The width of a web page as seen above is not a static figure except when it is indirectly specified by specifying the width of any element within the web page. When the width of the web page is changed on account of the browser window being resized, the length of the document needed to display the total content also changes.

As far as web pages are concerned, we do not think in terms of a fixed page length.

Author Credit : The Edifier

