man PPI () - BETA: Analyze and manipulate Perl code without using perl itself

NAME

PPI - BETA: Analyze and manipulate Perl code without using perl itself

SYNOPSIS

  use PPI;

  # Load a Document from a file
  my $Document = PPI::Document->load('Module.pm');

  # Does it contain any POD?
  if ( $Document->find_any('PPI::Token::Pod') ) {
      print "Module contains POD\n";
  }

  # Get the name of the main package
  my $pkg = $Document->find_first('PPI::Statement::Package')->namespace;

  # Remove all that nasty documentation
  $Document->prune('PPI::Token::Pod');
  $Document->prune('PPI::Token::Comment');

  # Save the file
  $Document->save('Module.pm.stripped');

STATUS

As of version 0.900, PPI is officially feature-frozen and in beta.

The core PPI feature-set is now implemented, and the API now supports all of the major language structures, and should be able to handle the entire perl syntax.

Source filters are not supported, will not be supported, and cannot be supported.

The class structure of the PDOM (Perl Document Object Model) is complete and frozen. All of the analysis methods within the PDOM that are documented can also be considered frozen.

Most of the non-core distributions have also been brought up to date.

The following packages are also considered up to date:

PPI::Tester - Wx-based Interactive Testing Application
PPI::HTML - HTML Syntax Highlighting
PPI::XS - XS Acceleration for PPI (negligible speed-up at this point, but should improve over time)
PPI::Processor - Framework for bulk-analysis of Perl documents

The following packages are stale and are currently being updated or killed:

PPI::Format::Apache - This module is redundant and will be replaced with an alternative module.
Perl::Compare - This module is currently being updated
Perl::Signature - This module is currently being updated
Perl::SAX - This module was a proof of concept only, and requires re-implementation.

DESCRIPTION

About this Document

This is the PPI manual. It describes PPI, its reason for existing, its structure, its use, an overview of the API, and provides implementation samples.

Background

The ability to read and manipulate perl programmatically, other than with the perl executable itself, is one that has caused difficulty for a long time.

The root cause of this problem is perl's dynamic grammar. Although there are typically not huge differences in the grammar of most code, some things cause large problems.

One example is function signatures, as demonstrated by the following.

  @result = (dothis $foo, $bar);

  # Which of the following is it equivalent to?
  @result = (dothis($foo), $bar);
  @result = dothis($foo, $bar);

This code can be interpreted in two different ways, depending on whether the &dothis function is expecting one argument, or two, or several.
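The ambiguity can be reproduced with nothing but core perl. The following is a minimal sketch, not taken from PPI itself: takes_one and takes_list are hypothetical subs that differ only in their prototype, and each simply returns the number of arguments it received.

```perl
use strict;
use warnings;

# Two hypothetical subs with identical bodies but different prototypes.
# A ($) prototype makes perl parse the sub like a named unary operator.
sub takes_one  ($) { return scalar @_ }
sub takes_list (@) { return scalar @_ }

my ( $foo, $bar ) = ( 'a', 'b' );

# Parsed as ( takes_one($foo), $bar )
my @unary = ( takes_one $foo, $bar );

# Parsed as takes_list( $foo, $bar )
my @list = ( takes_list $foo, $bar );

print "unary: @unary\n";    # unary: 1 b
print "list:  @list\n";     # list:  2
```

The two call sites are textually identical apart from the sub name, yet they produce different parse trees. A parser that has not seen the prototype has no way of choosing between them.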

To restate, a true or real parser needs information that cannot be found in the immediate vicinity. In fact, this information might not even be in the same file. The parser might also be unable to determine it without the prior execution of a BEGIN {} block. In other words, to parse perl you must also execute it, or if not it, everything that it depends on for its grammar.

This, while possibly feasible in some circumstances, is not a valid solution (at least, so far as this module is concerned). Imagine trying to parse some code that had a dependency on the Win32::* modules from a Unix machine, or trying to parse some code with a dependency on another module that had not even been written yet...

For more information on why it is impossible to parse perl, see:

<http://www.perlmonks.org/index.pl?node_id=44722>

Originally, PPI was short for Parse::Perl::Isolated. In acknowledgement that someone may some day come up with a valid solution for the grammar problem, it was decided to leave the Parse::Perl namespace free.

The purpose of this parser is not to parse Perl code, but to parse Perl documents. In most cases, a single file is valid as both. By treating the problem this way, we can parse a single file containing Perl source isolated from any other resources, such as the libraries upon which the code may depend, and without needing to run an instance of perl alongside or inside the parser (a possible solution for Parse::Perl that is investigated from time to time).

Why do we want to parse?

Once we accept that we will probably never be able to parse perl well enough to execute it, it is worth re-examining WHY we wanted to parse perl in the first place. What uses would we put such a parser to?

Documentation
Analyze the contents of a Perl document to automatically generate documentation, in parallel to, or as a replacement for, POD documentation.
Structural and Quality Analysis
Determine quality or other metrics across a body of code, and identify situations relating to particular phrases, techniques or locations.
Refactoring
Make structural, syntax, or other changes to code in an automated manner, independently, or in assistance to an editor. This list includes backporting, forward porting, partial evaluation, improving code, or whatever.
Layout
Change the layout of code without changing its meaning. This includes techniques such as tidying (like perltidy), obfuscation, compression, or to implement formatting preferences or policies.
Presentation
Improve the presentation of a Perl document without changing the text of the code, for example by syntax colouring.

With these goals identified, as long as the above tasks can be achieved, with some sort of reasonable guarantee that the code will not be damaged in the process, then PPI can be considered to be a success.

Good Enough(TM)

With the above tasks in mind, PPI seeks to be good enough to achieve them, or to provide a sufficiently good API on which others can implement modules in these and related areas.

However, there are going to be limits to this process. Because PPI cannot adapt to changing grammars, any code written using source filters should not be assumed to be parsable. At one extreme, this includes anything munged by Acme::Bleach, as well as (arguably) more common cases like Switch.pm and Exception.pm. We do not pretend to be able to parse code using these modules, although someone may be able to extend PPI to handle them.

UPDATE: The ability to extend PPI to handle lexical additions to the language, which means handling filters that LOOK like they should be perl but aren't, is on the drawing board to be done some time post-1.0.

The goal for success is thus to be able to successfully parse 99% of all Perl documents contained in CPAN. This means the entire file in each case.

IMPLEMENTATION

General Layout

PPI is built upon two primary parsing components, PPI::Tokenizer and PPI::Lexer, and a large tree of nearly 50 classes which implement the various objects within the Perl Document Object Model (PDOM).

The Perl Document Object Model is somewhat similar in style and intent to the regular DOM, but contains many differences to handle perl-specific cases.

On top of the Tokenizer and Lexer, and the classes of the PDOM, sit a number of classes intended to make life a little easier when dealing with PDOM object trees.

Both the major parsing components were implemented from scratch with just plain Perl code. There are no grammar rules, no YACC or LEX style tools, just code. This is primarily because of the sheer volume of accumulated cruft that exists in perl. Not even perl itself is capable of parsing perl documents (remember, it just parses and executes it as code) so PPI needs to be even cruftier than perl itself. Yes, eewww...

The Tokenizer

The Tokenizer is considered complete and of release candidate quality. Not quite fully stable, but close.

The Tokenizer takes source code and converts it into a series of tokens. It does this using a slow but thorough character by character manual process, rather than using complex regexes. Well, that's actually a lie, it has a lot of support regexes throughout, and it's not truly character by character. The Tokenizer is increasingly skipping ahead when it can find shortcuts, so the current character cursor tends to jump a bit wildly. Remember that cruft I was mentioning? Right, well the tokenizer is full of it. In reality, the number of times the Tokenizer will ACTUALLY move the character cursor itself is only about 5% - 10% higher than the number of tokens in the file.

Currently, these speed issues mean that PPI is not of great use for highly interactive tasks, such as an editor which checks and formats code on the fly. This situation is improving somewhat with multi-gigahertz processors, but can still be painful at times.

How slow? As an example, tokenizing CPAN.pm, a 7112 line, 40,000 token file, takes about 5 seconds on my little Duron 800 test server. Five seconds at 800 MHz is 4 gigacycles, and 7112 / 4 is a little under 1800, so you should expect the tokenizer to work at a rate of somewhere around 1700 to 1800 lines of code per gigacycle. The code gets tweaked and improved all the time, and there is a fair amount of scope left for speed improvements, but it is painstaking work, and fairly slow going.

The target rate is about 5000 lines per gigacycle.

The main avenue for making it to this speed has now become PPI::XS, a drop-in XS accelerator for miscellaneous parts of PPI.

Since PPI::XS has only just gotten off the ground and is currently only at proof-of-concept stage, this may take a little while.

The Lexer

The Lexer is considered complete, but subject to minor changes. Early beta quality.

The Lexer takes a token stream, and converts it to a lexical tree. Again, remember we are parsing Perl documents here, not code, so this includes whitespace, comments, and all manner of weird things that have no relevance when code is actually executed.

An instantiated PPI::Lexer object consumes PPI::Tokenizer objects, or things that can be converted into one, and produces PPI::Document objects.
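The two stages can be driven by hand. The sketch below assumes a PPI release in which PPI::Tokenizer->new accepts a reference to a source string and PPI::Lexer provides lex_source. The sample source string is made up for illustration.

```perl
use strict;
use warnings;
use PPI;

my $source = qq{print( "Hello World!" );\n};

# Stage 1: the Tokenizer turns raw source into a flat stream of tokens.
# get_token returns a false value once the stream is exhausted.
my $Tokenizer = PPI::Tokenizer->new( \$source );
my @tokens;
while ( my $Token = $Tokenizer->get_token ) {
    push @tokens, $Token;
}
print ref($_), "\n" for @tokens;

# Stage 2: the Lexer builds a PDOM tree from a token stream.
# lex_source creates its own Tokenizer from the string, so the
# Tokenizer consumed above does not need to be reused.
my $Lexer    = PPI::Lexer->new;
my $Document = $Lexer->lex_source( $source );
print ref($Document), "\n";
```

In everyday use neither class needs to be touched directly, since the Document constructor runs both stages for you.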

Overview of the Perl Document Object Model

The PDOM is a structured collection of data classes that together provide a correct and scalable model for documents that follow the standard Perl syntax.

Although this is a basic overview and doesn't cover the PDOM classes in order or in detail, the following is a rough inheritance layout of the main core classes.

  PPI::Element
      PPI::Token
          PPI::Token::*
      PPI::Node
          PPI::Statement
              PPI::Statement::*
          PPI::Structure
              PPI::Structure::*
          PPI::Document

To summarize the above layout, all PDOM objects inherit from the PPI::Element class.

Under this are PPI::Token, strings of content with a known type, and PPI::Node, containers that hold other Elements.

The first PDOM element you are likely to encounter is the PPI::Document object.

The Document

At the top of all complete PDOM trees is a PPI::Document object. Each Document will contain a number of Statements, Structures and Tokens.

A PPI::Statement is any series of Tokens and Structures that are treated as a single contiguous statement by perl itself. You should note that a Statement is as close as PPI can get to parsing the code in the sense that perl itself parses Perl code when it is building the op-tree. PPI cannot tell you, for example, which tokens are subroutine names, or arguments to a sub call, or what have you.

At a fundamental level, it only knows that this series of elements represents a single Statement. For specific Statement types however, the PDOM is able to derive additional useful information.

A PPI::Structure is any series of tokens contained within matching braces. This includes things like code blocks, conditions, function argument braces, anonymous array constructors, lists, scoping braces et al. Each Structure contains zero, one, or many Tokens and Structures (the rules for which vary for the different Structure subclasses).

The PDOM at Work

To demonstrate, let's start with an example showing how the PDOM tree might look for the following chunk of simple Perl code.

  #!/usr/bin/perl

  print( "Hello World!" );

  exit();

This is not all that complicated. Very very simple in fact. Translated into a PDOM tree it would have the following structure.

  PPI::Document
    PPI::Token::Comment                '#!/usr/bin/perl\n'
    PPI::Token::Whitespace             '\n'
    PPI::Statement
      PPI::Token::Word                 'print'
      PPI::Structure::List             ( ... )
        PPI::Token::Whitespace         ' '
        PPI::Statement::Expression
          PPI::Token::Quote::Double    '"Hello World!"'
        PPI::Token::Whitespace         ' '
      PPI::Token::Structure            ';'
    PPI::Token::Whitespace             '\n'
    PPI::Token::Whitespace             '\n'
    PPI::Statement
      PPI::Token::Word                 'exit'
      PPI::Structure::List             ( ... )
      PPI::Token::Structure            ';'
    PPI::Token::Whitespace             '\n'

Please note that in this example, strings are only listed for the ACTUAL element that contains the string. Also, Structures are listed with the brace characters noted.

The PPI::Dumper module can be used to generate similar trees yourself.

Notice how PPI builds EVERYTHING into the model, including whitespace. This is needed in order to make the Document fully round trip compliant. That is, if you stringify the Document you get the same file you started with.

The one exception is that if the newlines for your file are wrong, PPI will probably have localised them for you.
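The round trip property is easy to check for yourself. This is a minimal sketch, assuming a PPI release whose PPI::Document->new accepts a reference to a source string. The sample source already uses plain "\n" newlines, so the newline exception above does not come into play.

```perl
use strict;
use warnings;
use PPI;

my $source = qq{#!/usr/bin/perl\n\nprint( "Hello World!" );\n\nexit();\n};

# Parse the source into a PDOM tree, then turn the tree back into text
my $Document = PPI::Document->new( \$source );
my $output   = $Document->serialize;

print $output eq $source ? "round trip ok\n" : "round trip differs\n";
```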

We can make that PDOM dump a little easier to read if we strip out all the whitespace. Here it is again, sans the distracting whitespace tokens.

  PPI::Document
    PPI::Token::Comment                '#!/usr/bin/perl\n'
    PPI::Statement
      PPI::Token::Word                 'print'
      PPI::Structure::List             ( ... )
        PPI::Statement::Expression
          PPI::Token::Quote::Double    '"Hello World!"'
      PPI::Token::Structure            ';'
    PPI::Statement
      PPI::Token::Word                 'exit'
      PPI::Structure::List             ( ... )
      PPI::Token::Structure            ';'
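Dumps like the two above can be regenerated with PPI::Dumper. The sketch below assumes that PPI::Document->new accepts a reference to a source string and that the Dumper constructor accepts a whitespace option. The exact layout and class names printed will depend on the PPI version installed.

```perl
use strict;
use warnings;
use PPI;
use PPI::Dumper;

my $source = <<'END_PERL';
#!/usr/bin/perl

print( "Hello World!" );

exit();
END_PERL

my $Document = PPI::Document->new( \$source );

# Full dump, with every whitespace token included
PPI::Dumper->new( $Document )->print;

# The same tree with the whitespace tokens left out of the dump
my $Dumper = PPI::Dumper->new( $Document, whitespace => 0 );
my $text   = $Dumper->string;
print $text;
```

Note that the whitespace option only changes what the Dumper shows. The Document itself still holds every whitespace token.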

As you can see, the tree can get fairly deep at times, especially when every isolated token in a bracket becomes its own statement. This is needed to allow anything inside the tree the ability to grow. It also makes the search and analysis algorithms much more flexible.

Because of the depth and complexity of PDOM trees, a vast number of very easy to use methods have been added wherever possible to help people working with PDOM trees do normal tasks relatively quickly and efficiently.
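As a taste of those methods, here is a short sketch using find and schildren on the Hello World document. It assumes a PPI release whose PPI::Document->new accepts a reference to a source string, and in which barewords are represented by PPI::Token::Word as in the class tree below.

```perl
use strict;
use warnings;
use PPI;

my $source = qq{#!/usr/bin/perl\n\nprint( "Hello World!" );\n\nexit();\n};
my $Document = PPI::Document->new( \$source );

# find searches the whole tree, however deep. It returns a reference
# to an array of matches, or a false value if nothing matched.
my $quotes = $Document->find('PPI::Token::Quote') || [];
print "quote: ", $_->content, "\n" for @$quotes;

# A code reference can express arbitrary conditions. It is passed the
# node being searched and the element currently under test.
my $words = $Document->find( sub { $_[1]->isa('PPI::Token::Word') } ) || [];
my @names = map { $_->content } @$words;
print "words: @names\n";

# schildren returns only the significant children of a node,
# skipping whitespace and comments.
my @top = $Document->schildren;
print "top-level statements: ", scalar(@top), "\n";
```

Searching by class name matches subclasses as well, which is why asking for PPI::Token::Quote finds the double-quoted string.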

CLASSES

This section has two parts.

First, a large tree of all the classes contained in the PPI core. They are listed only by name, with no description.

Second, a shorter list with descriptions for the primary classes in the core PPI distribution. Anything with its own POD documentation can be considered stable, as the POD is only written after the API is largely finalised and frozen. Still, don't rely on anything here until after PPI officially becomes a beta, in the 0.9xx versions.

Perl Document Object Model Classes

   PPI::Element
      PPI::Node
         PPI::Document
            PPI::Document::Fragment
         PPI::Statement
            PPI::Statement::Scheduled
            PPI::Statement::Package
            PPI::Statement::Include
            PPI::Statement::Sub
            PPI::Statement::Variable
            PPI::Statement::Compound
            PPI::Statement::Break
            PPI::Statement::Data
            PPI::Statement::End
            PPI::Statement::Expression
            PPI::Statement::Null
            PPI::Statement::UnmatchedBrace
            PPI::Statement::Unknown
         PPI::Structure
            PPI::Structure::Block
            PPI::Structure::Subscript
            PPI::Structure::Constructor
            PPI::Structure::Condition
            PPI::Structure::List
            PPI::Structure::ForLoop
            PPI::Structure::Unknown
      PPI::Token
         PPI::Token::Whitespace
         PPI::Token::Comment
         PPI::Token::Pod
         PPI::Token::Number
         PPI::Token::Word
         PPI::Token::DashedWord
         PPI::Token::Symbol
            PPI::Token::Magic
         PPI::Token::ArrayIndex
         PPI::Token::Operator
         PPI::Token::Quote
            PPI::Token::Quote::Single
            PPI::Token::Quote::Double
            PPI::Token::Quote::Literal
            PPI::Token::Quote::Interpolate
         PPI::Token::QuoteLike
            PPI::Token::QuoteLike::Backtick
            PPI::Token::QuoteLike::Command
            PPI::Token::QuoteLike::Regexp
            PPI::Token::QuoteLike::Words
            PPI::Token::QuoteLike::Readline
         PPI::Token::Regexp
            PPI::Token::Regexp::Match
            PPI::Token::Regexp::Substitute
            PPI::Token::Regexp::Transliterate
         PPI::Token::HereDoc
         PPI::Token::Cast
         PPI::Token::Structure
         PPI::Token::Label
         PPI::Token::Separator
         PPI::Token::Data
         PPI::Token::End
         PPI::Token::Prototype
         PPI::Token::Attribute
         PPI::Token::Unknown

Class Summary

PPI::Tokenizer
The PPI Tokenizer consumes chunks of text and provides access to a stream of PPI::Token objects. The Tokenizer is really nastily complicated, to the point where even the author treads a bit carefully when working with it. Most of the complication is the result of optimizations which have tripled the tokenization speed, at the expense of maintainability. Yeah, I know... Because the Tokenizer holds the array of Tokens internally, providing cursor-based access to it, an instantiated Tokenizer object can only be used once, unlike the Lexer which just spits out a single PPI::Document object and can be reused as needed.
PPI::Lexer
The PPI Lexer. Converts Token streams into PDOM trees.
PPI::Dumper
A simple class for dumping readable debugging versions of PDOM structures.
PPI::Token::_QuoteEngine
The PPI::Token::Quote and PPI::Token::QuoteLike classes provide abstract base classes for the many and varied types of quote and quote-like things in perl. However, much of the actual quote logic is implemented in a separate quote engine, based at PPI::Token::_QuoteEngine. Classes that inherit from PPI::Token::Quote, PPI::Token::QuoteLike and the base Regexp class PPI::Token::Regexp are generally parsed only by the Quote Engine.
PPI::Document
The Document object, the top of the PDOM
PPI::Document::Fragment
A cohesive fragment of a larger Document. Currently Incomplete. Will be used later on for cut/paste/insert etc. Very similar to PPI::Document, but has some additional methods, and does not represent a lexical scope boundary.
PPI::Element
The Element class is the abstract base class for all objects within the PDOM
PPI::Node
The Node object, the abstract base class for all PDOM objects that can contain other Elements, such as the Document, Statement and Structure objects.
PPI::Statement
The base class for all Perl statements. Generic "evaluate for side-effects" statements are of this actual type. Other more interesting statement types belong to one of its children. See the PPI::Statement documentation for a longer description and list of all of the different statement types and subclasses.
PPI::Structure
The abstract base class for all structures. A Structure is a language construct consisting of matching braces containing a set of other elements. See the PPI::Structure documentation (not yet written) for a description and list of all of the different structure types/classes.
PPI::Token
A token is the basic unit of content. At its most basic, a Token is just a string tagged with metadata (its class, and some additional flags in some cases). See the PPI::Token documentation (not yet written) for a description and list of all of the different Token types/classes.

INSTALLING

The core PPI distribution is pure perl and has been kept as tight as possible and with as few dependencies as possible.

It should download and install normally on any platform from within the CPAN and CPANPLUS applications, or directly using the distribution tarball.

There are no special install instructions for PPI.

EXTENDING

For the time being, the PPI namespace is to be reserved for the sole use of the Parse::Perl project and its modules.

<http://sf.net/parseperl>

You are encouraged to use the PPIx:: namespace for PPI-specific modifications, or Perl:: for modules which provide general Perl language-related functions.

TO DO

- Complete documentation for the remaining PPI classes

- More analysis methods for PDOM classes

- Creation of a PPI tutorial

- Expansion of the unit test suite

- More documentation

SUPPORT

Anything documented is considered to be loosely frozen, and bugs should always be reported at:

<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=PPI>

For other issues, or commercial enhancement or support, contact the author.

AUTHOR

Adam Kennedy (Maintainer), <http://ali.as/>, cpan@ali.as

ACKNOWLEDGMENTS

Thank you to Phase N (<http://phase-n.com/>) for permitting the original open sourcing and release of this distribution.

Completion funding provided by The Perl Foundation (<http://www.perlfoundation.org/>)

COPYRIGHT

Copyright (c) 2004 - 2005 Adam Kennedy. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.