Wednesday, January 3, 2018

Software Project Documentation

I want to motivate the use of pure text for the central technical documents of a software project.

Many companies use DOCX, but this is no good for the technical developer folks. I have investigated and tested a way to use RST and still have a final DOCX, in addition to HTML and PDF, in case the client/company insists on it. See dcx.py.

Summary

  • Minimize resource usage.
  • Simplify system development to short design-testing cycles.
  • Planning = Simulation. Simulate design-testing cycles.
  • Requirements are interfaces. Make them minimal.
  • Requirements external to the team are in the SRS, all others in the SDD.
  • Phrase requirements as tests, or code them as unit tests.
  • Generate final documents from smaller units.
  • Mix human language with computer language to reduce redundancy and increase cohesion.
  • The raw format of documentation should be pure text. Use RST.
  • Use HTML (or PDF) as final format, not DOCX or ODT.

The Principles

Documentation in a project is written communication, which is to be preferred over oral communication.

Documentation is not the goal per se but the means to the project's goal.

These general principles apply also to documentation:

  • A system develops by mutation and selection or creation and selection (= evolution). The selection is the verification or testing by the environment.
  • Maximize cohesion and minimize coupling. The amount of coupling is a metric of locality and should be reflected in the documentation artefacts. Coupling and cohesion are the same relation seen at different levels: cohesion is the coupling inside a unit.
  • A system is developed by a divide and conquer approach. Units are characterized by a separate evolution. There is a hierarchy of units.
    • Units have requirements (purpose, coupling, interface) to the system they are part of.
    • Units can be and must be tested separately (unit tests).
    • Units have an internal complexity larger than the interface.
  • Minimize redundancy, i.e. don't repeat yourself. To this purpose:

    • Develop concepts (units) and words for these concepts, i.e. a language specific to the system.
    • Make paragraphs (units) that have only one aspect.
    • Give every paragraph a URI.
    • Link concepts and paragraphs by using the words or URIs.

    URIs, for efficiency, must be accessible by a direct jump (via mouse click or a keyboard shortcut).

The latter three follow from minimizing resource usage: memory and time of the computer and of the developers.

Evolution

Evolution governs all dynamical systems: our brain, teams, ...

The basic step in evolution is to select an element from a set, which translates to fixing the value of a variable. The set or variable needs to be created first. So there is creation and selection.

The values of a variable are its type. Type, in a certain aspect, is synonymous with variable. But due to the (often unmotivated) reuse of the same set of values for different variables, one needs a way to reference these values by a different name, which is the type.

Variables are the elements of information processing, which mathematics also deals with in the foundational frameworks of set theory, type theory and category theory.

Variables and types are the essential elements of human languages and of programming languages, be they procedural, object-oriented or functional.

modelling

A hierarchy of variables of the actual system is mapped to a hierarchy of variables in a (programming) language, normally with simplifications.

Of the system development lifecycle, I subsume planning, analysis and implementation under the design process. Testing (stabilization) is an important separate part. Design and testing are the cycles of evolution. One can also call it trial and error.

design

= creation. create variables, units, concerns, interactions, ...

test

= selection. select (yes/no) and improve.

Design and testing repeat in a layered way until the software is stable and everybody is content with it.

Design and testing are parallelized across team members as soon as the major units have been identified. There is no individual designer or architect, because that would serialize and thus slow down the evolution.

There can be a quality manager to streamline methodology and communication.

It is important that everyone can make proposals. The quality manager needs to set up channels to facilitate that.

Planning

Planning is evolution in the mind. Planning (far ahead) is simulation.

Constructing variables in the mind is less effort than constructing the real variables like hardware units. Simulation becomes more important the more likely it is that a design will be dumped when testing fails, and the larger the difference in effort between real construction and simulation. Physical constructions should be well planned and simulated. The real design and testing is best delayed as long as possible.

The mind is not powerful enough to simulate complex systems. But it works with the help of computers.

With software, simulation is not necessarily so important, because software elements often follow directly from concepts of the mind without extra effort like procurement and physical treatment. But if other OS components (executables, libraries) and communication with other programmers and teams are involved, the effort immediately rises. To minimize this effort, one simulates such interface components (mock, stub). The mocks are made at an early stage and are preferably made a part of the requirements (Test-driven development = TDD).

A single developer should also mock the bigger ones of their own interface components,

  • to have a consistent TDD approach, where developer assignment is transparent
  • to allow parallelizing through reassignment of subunits
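A minimal sketch of mocking an interface component in Python. The unit `report_temperature` and the sensor method `read_celsius` are hypothetical names chosen for illustration; `unittest.mock.Mock` from the standard library stands in for the component that another developer or team will deliver later.

```python
from unittest.mock import Mock


def report_temperature(sensor):
    """Unit under development; it only knows the sensor's interface."""
    celsius = sensor.read_celsius()
    return f"{celsius:.1f} C"


def test_report_temperature():
    # The mock simulates the interface component that does not exist yet.
    sensor = Mock()
    sensor.read_celsius.return_value = 21.456
    assert report_temperature(sensor) == "21.5 C"
    # The test also fixes how the interface is used: exactly one call.
    sensor.read_celsius.assert_called_once_with()


test_report_temperature()
```

Because the test only depends on the interface, the real sensor component can be assigned to anyone without changing the test.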

The Documents

The raw format of documentation should be pure text.

Documentation that accompanies and coordinates the development of software (or, more generally, of a system) from requirements to testing is intricately interwoven with the software, not just in one direction.

  • There should be no tool barrier between the source code and the documentation.
  • Formatting has no importance, and especially during development you should not be concerned with it (Separation of Concerns = SoC).

Note

Software/System

Software can be replaced by System, because non-software systems can also be described by source code in a domain specific language and it is a good approach to do so, instead of using a GUI-only software package.

A coarse hierarchy of units is mapped to teams, a finer one to individual developers, who then produce the finest granularity of units.

Documentation is especially important as a means of coordination between teams and individuals. These are the

  • Software Requirement Specifications (SRS).

The requirements come from the environment (= context), i.e. the bigger system.

A system is described by the

  • Software Design Description (SDD). The result of the design is an architecture.

So SRS and SDD are linked, but

  • only requirements external to the team are in the SRS.
  • internal requirements are part of the SDD.

Tests accompany all units throughout the hierarchy.

For cohesion,

  • unit tests are part of the SDD,
  • the requirement tests are part of the SRS.

Actually it is convenient to immediately phrase requirements as tests or code them as unit tests (TDD).
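As a sketch of what this could look like in Python: each requirement is a test function whose docstring carries the requirement wording, so the human-language text and the executable check live in one place. The unit `validate_name`, the limit of 32 characters and the requirement IDs are assumptions made up for this example.

```python
MAX_NAME_LENGTH = 32  # assumed limit, for illustration only


def validate_name(name):
    """Hypothetical unit under test."""
    return 0 < len(name) <= MAX_NAME_LENGTH


def test_r_name_accepted():
    """r_name_accepted: Names of 1 to 32 characters are accepted."""
    assert validate_name("a")
    assert validate_name("a" * MAX_NAME_LENGTH)


def test_r_name_rejected():
    """r_name_rejected: Empty names and longer names are rejected."""
    assert not validate_name("")
    assert not validate_name("a" * (MAX_NAME_LENGTH + 1))


def requirement_texts(tests):
    """Collect the requirement wording from the test docstrings."""
    return [test.__doc__.strip() for test in tests]


TESTS = [test_r_name_accepted, test_r_name_rejected]
for test in TESTS:
    test()  # raises AssertionError if a requirement is not met
print("\n".join(requirement_texts(TESTS)))
```

The list returned by `requirement_texts` could be pasted into the SRS or SDD by a script, which avoids writing the requirement twice.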

Since there are more aspects of testing, a test team will have a

  • Test Plan

that can reference SRS and SDD tests and thus avoid redundancy, but will have additional tests, too.

We see that the hierarchy of units is not mapped to a hierarchy of documents, at least not in the same granularity. But this actually breaks the cohesion.

  • Separating documentation from source code produces more links, i.e. more URIs are necessary. The natural ordering is that of the units.

By generating final documents from smaller units, one can have both, cohesion by aspect (SRS, SDD, Test Plan) and cohesion by units.
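One way to sketch such a generation step in Python, assuming a hypothetical layout in which every unit directory holds one text file per aspect (`srs.txt`, `sdd.txt`, `test_plan.txt`). The file names and the functions `build_document` and `build_all` are assumptions for illustration, not an existing tool.

```python
from pathlib import Path

ASPECTS = ("srs", "sdd", "test_plan")  # assumed file stems, one per aspect


def build_document(root, aspect):
    """Concatenate the <aspect>.txt files of all units below root, in path order."""
    parts = []
    for unit_file in sorted(Path(root).rglob(f"{aspect}.txt")):
        parts.append(unit_file.read_text(encoding="utf-8").rstrip() + "\n")
    return "\n".join(parts)


def build_all(root, out_dir):
    """Write one final .rst document per aspect into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for aspect in ASPECTS:
        text = build_document(root, aspect)
        (out_dir / f"{aspect}.rst").write_text(text, encoding="utf-8")
```

The source stays ordered by unit, next to the code, while the generated files are ordered by aspect.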

Initiating the Development Process

The SRS is an interface document between teams:

  • the development team
  • an outside entity

It is essential input for the development. It should therefore be an integral part of the documentation of the development process. Therefore

  • the final version of the SRS should be written by a member of the development team.
  • Both parties will negotiate and will agree on a final version and will also agree on future reviews.

The content of the SRS

  • does not describe the system to be developed, but its context, the way of its operation, its usage in the bigger system. In case documentation about the bigger system is available (e.g. if internal), a link will do.
  • The requirement paragraphs must
    • address a well confined unit (of the interface to the bigger system)
    • be verifiable
    • be referenceable via a hyperlink URI

      The URI is important, because during development decisions will need to be justified by a reference to the SRS entry.

Design

In the SRS the context (bigger system) must be analyzed, i.e. split into suitable parts and concerns and written in paragraphs of a single concern (implemented).

Test

Verify that they can be tested and easily maintained. Phrase requirements as tests.

  • As an interface document the SRS must satisfy the minimal coupling principle, i.e. it must be minimal.

    Requirement paragraphs should link to other documents that support the requirement, as a measure to ensure that no unnecessary requirements are specified.

    Optional things are not in requirement paragraphs but in surrounding text.

    Every small requirement change late in the development can entail a huge effort, if it means changing a complex system, since that will need a long time to stabilize again.

    It is often easier to change requirements in case of obstacles encountered during development. But if the SRS is kept minimal, one can normally do without.

  • The SRS must be open for changes if the system is at an early stage and thus not yet thought through in detail. In this case the SRS will be modified during the development process. The coordinating development team member will agree on the changes with the outside entity in SRS reviews.

Development

Often the approach is to model in a human language first and then "translate" that to a computer language, i.e. implement it.

But specifying variables and relations is what computer languages are good at and made for. So many parts can be coded in a computer language right away at the component level. Higher language generations can be made very readable.

It is a good idea to mix human language with computer language to

  • minimize redundancy
  • maximize cohesion, independent of whether human or computer language is used

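There are two approaches to do this:

  • embed computer language in human-language documents
  • embed human language in computer-language files, as comments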

The second approach is more popular. The reason is that computer languages are quite good at separating conceptual things from implementation details. Files that contain variables and values of the model can also contain additional human language in comments. Such files can be

  • used directly by the compiler
  • parsed for parts to be incorporated in the documentation

To incorporate code comments in the documentation in a selective fashion, it is good

  • to use a light markup text, like RST
  • to script one's own way to extract those parts from the source code files

    The available documentation generators tend to be targeted at specific applications, like creating an API reference, and thus are not flexible enough to exploit the full potential of this approach.
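A minimal sketch of such a self-made extraction script in Python. It assumes a team convention, invented here for illustration, that documentation lines in source files start with the comment marker `#|`. The function name `extract_rst` and the marker are not part of any standard tool.

```python
def extract_rst(source, marker="#|"):
    """Return the documentation lines hidden behind `marker` in source code.

    The marker is an assumed team convention, not a standard.
    """
    doc_lines = []
    for line in source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(marker):
            # Drop the marker and at most one following space,
            # so that the RST indentation behind it is kept.
            text = stripped[len(marker):]
            doc_lines.append(text[1:] if text.startswith(" ") else text)
    if not doc_lines:
        return ""
    return "\n".join(doc_lines) + "\n"


EXAMPLE = '''\
#| .. _`r_timeout`:
#|
#| :r_timeout:
#|
#|   The connection times out after TIMEOUT_S seconds.
TIMEOUT_S = 30  # ordinary comment, not extracted
'''

print(extract_rst(EXAMPLE), end="")
```

The extracted text is plain RST and can be written to a .txt file that a top level document includes.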

Light Markup

The advantages of light markup formats:

  • It allows mixing source code with documentation for better cohesion and less redundancy.
  • It can be easily learned, because it restricts itself to essential elements.
  • The elements are of a conceptual nature (header, list item, ...), not actual formatting. The formatting is done when creating the final document. This makes it easier to keep the formatting consistent when several people work on the documentation.
  • As text, it is perfect for version control systems. One can commit documentation changes together with the corresponding source code changes. It is easy to review documentation changes. Outdated information stays in the history instead of lying around and cluttering the current version.
  • It is easier to generate parts of the documentation with scripts from source code or source code comments.
  • It is easier to extract data from the documentation, like which items link to which other ones, especially if the team agrees on facilitating conventions.
  • It can be edited with a text editor, i.e. the same tool developers work with all the time.
  • It is accessible to grep.
  • Ctags can be used to jump around while editing.
  • It is very readable as source and can be translated to several final formats, e.g.
    • HTML, most importantly
    • PDF (pandoc, sphinx)
    • ODT, DOCX (pandoc)
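To illustrate how easily data can be extracted from plain text, here is a sketch that builds a map of which paragraph links to which other ones. It assumes a convention in which every paragraph is preceded by an explicit RST target (``.. _`id`:``) and links are written as substitutions (`|id|`); the function name `link_graph` is made up for this example.

```python
import re

TARGET = re.compile(r"^\.\. _`?([^`:]+)`?:\s*$")
REFERENCE = re.compile(r"\|([^|\s]+)\|")


def link_graph(rst_text):
    """Map every target ID to the sorted IDs referenced below it."""
    graph = {}
    current = None
    for line in rst_text.splitlines():
        match = TARGET.match(line.strip())
        if match:
            # A new target starts a new paragraph of the graph.
            current = match.group(1).lower()
            graph.setdefault(current, set())
        elif current is not None:
            graph[current].update(ref.lower() for ref in REFERENCE.findall(line))
    return {target: sorted(refs) for target, refs in graph.items()}


DOC = """\
.. _`r_login`:

:r_login: Users log in with a password, see |r_password|.

.. _`r_password`:

:r_password: Passwords have at least 12 characters.
"""

print(link_graph(DOC))
```

This is a sketch only: substitution definitions and tables written without spaces around `|` would also be picked up by the simple reference pattern.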

RST

There are many light markup formats. But especially reStructuredText (RST)

  • has rich format support (e.g. table formats)
  • is extensible
  • is best for Python scripting
  • has a very good tooling support
    • Pandoc (to HTML, PDF, DOCX, ...)
    • Sphinx (to HTML, PDF)
    • Docutils. This is used by Sphinx, but also allows for one's own scripts and has a separate rst2html script.
    • Ctags support to jump around while editing

http://rst.ninjs.org can be used to play with RST. Here a cheatsheet.

Dealing with Company Tradition

Unfortunately companies often bury information

  • by not using pure text
  • or by using text that needs special tools (like over-formatted XML).

At three of the companies I worked for, they used MS Office for documentation, at one they used Lotus Notes.

The problems I see are the following:

  • They are too detached from the other text project artefacts.
  • They don't have the idea of a URI for every resource.
  • Thus they cannot be linked well via hyperlinks.
  • They are not suitable for a version control system: diffs do not work well.
  • A proprietary format is no good for the company's precious information. The information gets locked in. Even standardization of the format does not change that, because adoption by the developer community is slow and thus independent tools are rare.
  • DOCX or ODT is not easily accessible to scripting, because libraries are rare and also because DOCX is XML that mixes formatting with content. HTML with CSS and XML with XSLT (e.g. DocBook) separate the two better and are more accessible to scripts, but are less suitable for direct editing, because they are more formal and take longer to learn.

Thanks to Pandoc and Sphinx it is possible to use text as the documentation source and still fulfill the company's requirement for DOCX.

Providing DOCX from RST

HTML is the normal target of light markup formats. It is also best for the URI principle. Nevertheless it is also possible to generate DOCX.

For the conversion from RST to DOCX the best tool currently is Pandoc. Pandoc only takes pure RST and does not know about the Sphinx role extensions like :ref:. For that, one would need a sphinx-docxbuilder.

I have investigated and tested a way to use RST and still have a final DOCX output, in addition to HTML and PDF. dcx.py is used as a support script.

  • Don't use Sphinx specific roles, like :ref:.
  • :math: is supported well. It is not a Sphinx extension.
  • Top level files use extension .rst. Included files use extension .txt (.. include:: somefile.txt)
  • Make paragraphs with target ID this way:

    .. _`targetid`:
    
    :targetid:
    
      Text follows here.

    targetid is lower case, because docutils converts targets to lower case.

  • As the link differs between HTML and DOCX, use replacement substitutions (|targetid|) as links.

    A links_docx.txt file with entries:

    .. |targetid| replace:: `targetid <file.docx#targetid>`_

    and a links_sphinx.txt with:

    .. |targetid| replace:: :ref:`targetid <file.html#targetid>`

    will define the substitutions separately.

  • Above headers there can be some unique target ID:

    .. _`secondminutedate`:

    Links using |secondminutedate| will replace the header for the target.

  • Substitutions cannot be in included files, until the Pandoc include bug is corrected. For the links_docx.txt this helps:

    cat file.rst links_docx.txt | sed -e's/.. include:: links_sphinx.txt//g' | pandoc -f rst -t docx -o file.docx

    For image substitutions to work, place the .. image:: xxx.jpg into the main rst files, before the .. include:: links_sphinx.txt.
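The guidelines above can be sketched in Python. This is not the code of dcx.py, only an illustration under assumptions: the function names are made up, the link file content follows the links_docx.txt pattern shown above, and the Pandoc call uses the same options as the shell pipeline (`-f rst -t docx -o file.docx`).

```python
import re
import subprocess

TARGET = re.compile(r"^\.\. _`?([^`:\n]+)`?:[ \t]*$", re.MULTILINE)


def target_ids(rst_text):
    """Return the explicit target IDs of an RST text, lower-cased."""
    return [name.lower() for name in TARGET.findall(rst_text)]


def docx_links(rst_text, docx_name):
    """Build the content of links_docx.txt for the targets in rst_text."""
    return "".join(
        f".. |{tid}| replace:: `{tid} <{docx_name}#{tid}>`_\n"
        for tid in target_ids(rst_text)
    )


def pandoc_input(rst_text, links_text,
                 include_line=".. include:: links_sphinx.txt"):
    """Append the DOCX links and remove the include of the Sphinx links."""
    combined = rst_text.rstrip("\n") + "\n\n" + links_text
    return combined.replace(include_line, "")


def to_docx(rst_text, docx_name):
    """Feed the prepared text to Pandoc on stdin (needs pandoc installed)."""
    text = pandoc_input(rst_text, docx_links(rst_text, docx_name))
    subprocess.run(
        ["pandoc", "-f", "rst", "-t", "docx", "-o", docx_name],
        input=text.encode("utf-8"),
        check=True,
    )


MAIN = """\
.. include:: links_sphinx.txt

.. _`targetid`:

:targetid:

  Text follows here, see |targetid|.
"""

print(pandoc_input(MAIN, docx_links(MAIN, "file.docx")))
```

Generating the link file from the targets keeps the substitutions in sync with the document, and `to_docx(MAIN, "file.docx")` would then produce the DOCX file if Pandoc is available.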

For an illustrative implementation of these guidelines see dcx.py.