Parsing

Regular Expression Tutorial. This tutorial teaches you all you need to know to be able to craft powerful time-saving regular expressions. It starts with the most basic concepts, so that you can follow this tutorial even if you know nothing at all about regular expressions yet. The tutorial doesn't stop there. It also explains how a regular expression engine works on the inside, and alerts you to the consequences. This helps you to quickly understand why a particular regex does not do what you initially expected.

What Regular Expressions Are Exactly - Terminology. Basically, a regular expression is a pattern describing a certain amount of text. This first example is actually a perfectly valid regex. \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\. With the above regular expression pattern, you can search through a text file to find email addresses, or verify if a given string looks like an email address.

Different Regular Expression Engines. Give Regexes a First Try. As a quick test, copy and paste the text of this page into EditPad Pro.
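As a rough illustration of the two uses mentioned for the email pattern (searching a text and checking a single string), here is a minimal Python sketch. The pattern quoted in the excerpt is cut off after `\.`; the `[A-Z]{2,}\b` ending used below is an assumption added only to make the example runnable, as is the case-insensitive flag. The names `find_emails` and `looks_like_email` are hypothetical.

```python
import re

# The excerpt's pattern stops after "\."; the trailing "[A-Z]{2,}\b" is an
# assumed completion, not taken from the tutorial text. Case-insensitive
# matching is also an assumption, since the pattern only lists A-Z.
EMAIL_RE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


def looks_like_email(candidate):
    """Check whether the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    sample = "Contact alice@example.com or bob.smith@mail.example.org today."
    print(find_emails(sample))
    print(looks_like_email("alice@example.com"))
```

A pattern like this only checks that a string looks like an address; it does not confirm that the address exists.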

PCRE - Perl Compatible Regular Expressions.

SLRE - Super Light Regular Expression library.

How lexers (e.g. from LEX) and parsers (e.g. from YACC or JAVACC) work. Lexers - Finite State Automata (FSA). Lexers are also known as scanners. LEX converts each set of regular expressions into a Deterministic FSA (DFSA), e.g. for a(b|c)d*e+, which has states 0 to 3, where state 0 is the initial state and state 3 is an accept state that indicates a possible end of the pattern. LEX implements this by creating a C program that consists of a general algorithm and a decision table specific to the DFSA. This table is indexed by the current state and input character, and is used by the algorithm to decide which of the alternative actions to choose, and the new state after a match (given in brackets in each table entry).
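The split between a general algorithm and a pattern-specific decision table can be sketched in a few lines of Python. This is only an illustration for a(b|c)d*e+, not the C code that LEX generates. States 0 and 3 follow the description above; the numbering of the intermediate states 1 and 2, and the names `TRANSITIONS` and `longest_match`, are assumptions.

```python
# Decision table for a(b|c)d*e+, indexed by (current state, input character).
# State 0 is the initial state and state 3 is the accept state, as in the
# text; the roles of states 1 and 2 are assumed for this sketch.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (1, "c"): 2,
    (2, "d"): 2,
    (2, "e"): 3,
    (3, "e"): 3,
}
ACCEPT_STATES = {3}


def longest_match(text):
    """Return the length of the longest prefix of text that matches.

    Returns 0 when no prefix matches; this is unambiguous here because
    the pattern cannot match the empty string.
    """
    state = 0
    last_accept = 0
    for pos, ch in enumerate(text):
        next_state = TRANSITIONS.get((state, ch))
        if next_state is None:
            break  # no transition for this character: stop scanning
        state = next_state
        if state in ACCEPT_STATES:
            last_accept = pos + 1  # remember the latest possible end
    return last_accept


if __name__ == "__main__":
    for word in ["abde", "acdddeee", "abdeex", "ab"]:
        print(word, longest_match(word))
```

The loop is the same for any pattern; only the table changes. Recording the last accept position reflects that state 3 marks a possible end of the pattern, not a definite one.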

FLEX makes its decision table visible if we use the -T flag, but in a very verbose format. Parsers - Deterministic Push Down Automata (PDA) to process their input. Either kind can be table-driven (e.g. . = shift,

Formal language - Grammar

Argtable - ANSI C command line parser.

Parsing - lexers vs parsers.

Practical Parsing for ANSI C. By Daniele Paolo Scarpazza, December 12, 2006. Daniele discusses the design of an ANSI C parser front-end, identifying the pitfalls that make design tricky. Front-ends are present in all applications that process source code: compilers, interpreters, linters, and the like.

Common beliefs about the design of front-ends are that they are made up of phases (lexical, syntactic, and semantic analysis), that these phases are largely decoupled, and that lexical analysis consists of regular-expression matching. When these assumptions are true, the design of a parser is facilitated.

Myth #1: Front-Ends Are Made Up of Independent Phases. A front-end is a program that accepts text input and produces a representation of that text in a desired form, such as an abstract syntax tree. Anyone who has taken a compiler course recognizes the classic flow diagram of a front-end in Figure 1.

Figure 1: The classical structure of a front-end.

ulint_t my_ul_int;

Mini-XML. About Mini-XML: Mini-XML is a small XML library that you can use to read and write XML and XML-like data files in your application without requiring large non-standard libraries.

Mini-XML only requires an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program. Mini-XML supports reading of UTF-8 and UTF-16 and writing of UTF-8 encoded XML files and strings. Data is stored in a linked-list tree structure, preserving the XML data hierarchy, and arbitrary element names, attributes, and attribute values are supported with no preset limits, just available memory.

Mini-XML 2.8 (Jan 5, 2014). Mini-XML 2.8 is now available for download. Mini-XML 2.8 fixes some minor platform and XML issues.

Now call docsetutil using xcrun on OS X (Bug #458). mxmldoc did not escape special HTML characters inside @code foo@ comments.

Mini-XML 2.7 (Dec 21, 2011). Mini-XML 2.7 is now available for download. Enjoy!

Parsing Html The Cthulhu Way. Among programmers of any experience, it is generally regarded as A Bad Idea™ to attempt to parse HTML with regular expressions.

How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness: "You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML." (The Unicode action in the post, not shown here, is the best part of the gag.) That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's … er … code. This is all good fun, but the warning here is only partially tongue in cheek, and it is born of a very real frustration. I have heard this argument before. Like I said, this is a well understood phenomenon in most programming circles. It's generally a bad idea.
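One concrete way the warning plays out is nesting. The sketch below is an illustration added here, not an example from the post. It compares a naive regex with Python's standard `html.parser` module on a nested `<div>`; the sample markup and the names `DivTextCollector`, `outer_div_text`, and `NAIVE_DIV_RE` are hypothetical.

```python
import re
from html.parser import HTMLParser

SAMPLE = "<div>outer <div>inner</div> tail</div>"

# Naive attempt: grab whatever sits between <div> and the next </div>.
NAIVE_DIV_RE = re.compile(r"<div>(.*?)</div>")


class DivTextCollector(HTMLParser):
    """Collect the text of each outermost <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.current = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <div> just closed: emit its text.
                self.results.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def outer_div_text(html):
    """Return the text content of each top-level <div> in html."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    print(NAIVE_DIV_RE.findall(SAMPLE))  # stops at the first </div>
    print(outer_div_text(SAMPLE))        # follows the nesting
```

The regex pairs the first `<div>` with the first `</div>` it meets, so it cuts the outer element short. The parser keeps a depth counter, which is exactly the state a plain regular expression lacks.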