Any implementation MUST support the syntax as described below, except for the noted exceptions. Any reading implementation MUST be able to read at least the syntax defined in this section (it may accept various extensions), and any writing implementation MUST generate this format when a user creates data that can comply with it.

This syntax is defined using the BNF notation of XML version 1.1, as described at Note that each syntax rule is defined using "::=". Text is interspersed to explain these syntax rules and their implications.

Note that formulas are typically embedded inside an XML document. When this occurs, various characters (such as "<", ">", '"', and "&") must be escaped, as described in the XML specification. In particular, "<" becomes <, '"' becomes ", and "&" becomes &.

Basic Expressions

FormulaContent ::= Namespace Formula
Namespace ::= Namespace_in_XML ':'
Namespace_in_XML ::=

The namespace tells the reading application how to treat formula content written by a specific application or an application conforming to a certain dialect of this formula language.

*Note:* For OpenDocument files written by versions prior to this specification (including versions 2.0, 2.0.1, 2.0.2, and 2.0.3), the namespace is xmlns:oooc="".

TODO: The namespace for this specification is ???.

Formula ::= '=' ForceRecalc? Expression
ForceRecalc ::= '='

If the second '=' is present, this is a "forced recalculate". Typical implementations optimize what is recalculated, and typically only recalculate what is displayed and the values they depend on (transitively). If the formula is a "forced recalculation" formula, then it must be recalculated whenever one of its predecessors it depends on changed, even if it isn't apparantly needed.

*Note:* Spreadsheet implementations decide what to recalculate, and may decide to consider other formulas as if they were "forced recalculate" formulas based on the functions used in them.

Expression ::= Number |
               String |
               Array |
               PrefixOp Expression |
               Expression PostfixOp |
               Expression InfixOp Expression |
               '(' Expression ')' |
               FunctionName '(' ParameterList ')' |
               Error |
               NamedExpression |

Number ::= WrittenNumber |
           '.' [0-9]+ ([eE] [-+]? [0-9]+)?

WrittenNumber ::= [0-9]+ ('.' [0-9]+)? ([eE] [-+]? [0-9]+)?

Numbers are written according to the "C" or en-US locale; the '.' dot is used as the decimal separator and the group (AKA thousand) separators are not written. Optional "E" or "e" denote scientific notation. Syntactically, negative numbers are simply numbers with a prefix "-" operator; implementations MAY and typically do optimize this operator by performing the negation and storing the negative number directly.

Readers SHOULD be able to read a fraction that starts with '.' (without a leading zero). Writers MUST write numbers with a leading digit (the WrittenNumber format), and MAY NOT write numbers with a leading '.'.

String ::= '"' ([^"#x00] | '""')* '"'

A literal double-quote character (") as string content is escaped by duplicating it. Note that since a formula is normally stored as an XML attribute, all double-quotes are written as their entity ".

NOTE: Changed to allow all control characters except ASCII 0.

Array ::= '{' Matrix ( '_' Matrix )* '}'
Matrix ::= ( MatrixRow ( ColumnSeparator MatrixRow )* ) ?
MatrixRow ::= Expression ( ';' Expression )*
ColumnSeparator ::= '|'

Note: Some implementations that support arrays don't accept full expressions - only constants.

COMMENT: is there any implementation that allows expressions in inline arrays? If not, shouldn't we better restrict this to not allow expressions for claritie's sake?

PrefixOp ::= '+' | '-'

These are the unary prefix operators, prefix plus (a no-op) and prefix minus (negation).

PostfixOp ::= '%'

This is the unary percentage operator, dividing the preceding expression by 100.

InfixOp ::= ArithmeticOp | ComparisonOp | '&' | ReferenceOp

The '&' ampersand is the string concatenation operator. Note that since a formula is typically stored as an XML attribute, an '&' ampersand is written as the entity &

ArithmeticOp ::= '+' | '-' | '*' | '/' | '^'

These are Addition, Subtraction, Multiplication, Division, and Exponentiation (power).

ComparisonOp ::= '=' | '<>' | '<' | '>' | '<=' | '>='

These are EqualTo, UnequalTo, LessThan, GreaterThan, LessThanOrEqualTo, GreaterThanOrEqualTo.

ReferenceOp ::= IntersectionOp | CellConcatenationOp | RangeOp
IntersectionOp ::= '!'
CellConcatentationOp ::= '~'
RangeOp ::= ':'

These are operations on references: Intersection (which some user interfaces display as a space), cell concatenation (which some user interfaces display as the function parameter seperator), and range extension (given two ends, creates a range that covers both ends). The result may be a 3 dimensional range, with front-upper-left and back-lower-right corners.

Note that since they are defined as general operators, either or both sides may be the output of a function. The operations are defined on their own lines, in case this is helpful in describing "alternative" syntaxes.

AutomaticIntersection ::= ColumnLabel AutomaticIntersectionOp RowLabel |
                          RowLabel AutomaticIntersectionOp ColumnLabel
AutomaticIntersectionOp ::= '!!'
RowLabel ::= QuotedLabel
ColumnLabel ::= QuotedLabel
QuotedLabel ::= "'" StringCellContent "'"

An AutomaticIntersection may be used to identify the intersection of a ColumnLabel and a RowLabel. The difference between the AutomaticIntersectionOp and the usual IntersectionOp is that the AutomaticIntersectionOp may be represented different in UI, for example using a space. This resembles natural language, for example =Sales Hamburg with Sales being a ColumnLabel and Hamburg being a RowLabel, which results in displaying the result of the cell at the intersection.

Automatic lookup of labels can be enabled or disabled in the document settings. Additionally, a list of defined labels has to be taken into consideration, see OpenDocument 1.0 section 8.5.4 Label Ranges. Defined label ranges take precedence over cell content that could be looked up.

TODO: define algorithm of automatic lookup (least square roots clockwise starting upper left).

TODO: Excel has a weird feature called stacked labels, where a label may be the result of a concatenation of cell contents of stacked adjacent cells in one column, which confusing enough in the UI are separated with a space as well. It's hard to predict what an expression like =SUM(one two three four) actually refers.. even more if you have B2:'one two', B2:'three', C1:'one', C2:'two three' and A3:'four'.

Functions and Function Parameters

FunctionName ::= Identifier
Identifier ::= LetterXML (LetterXML | DigitXML |
               '_' | '.' | CombiningCharXML)*
LetterXML ::=
DigitXML ::=
CombiningCharXML ::=

Note that the function names defined in this specification use English text, not a translated UI representation. Thus predefined function names actually meet the stricter definition as follows:

[A-Za-z] [A-Za-z0-9_.]*

However, letter characters as defined for an Identifier are allowed, since user-defined functions may use such characters.

Function names are case-insensitive, so SUM and Sum are the same function. However, traditionally spreadsheet function names are written in all upper case, so this tradition is followed in this specification. Implementations SHOULD write function names in all upper case.

*Rationale:* Function names are often displayed in all upper case by both implementations, showed this way in documentation, and saved in uppercase, owing to the influence of the first spreadsheet program (VisiCalc). By saying implementations SHOULD save in all uppercase, this increases the likelihood that implementations can trivially write back formulas and, if the formulas are unchanged, produce an identical result. This can be helpful to systems that depend on detecting differences, by minimizing the number of unnecessary differences. Some functions always produce the same value, and are thus constants; PI() and TRUE() are examples.

*Note:* Microsoft Excel accepts empty parameters in any position; OpenOffice 1.1.3 does not, so the current syntax above doesn't either. See below (level 3) for how this rule can be relaxed. Typical implementations will have many built-in functions, and most implementations also support one or more ways to create user-defined functions.

*Rationale:* Excel uses the "," as the function parameter separator, but OpenFormula uses ";" instead. Many locales use "," as the decimal separator; using the semicolon as the parameter separator eliminates confusion and the risk of incorrect implementation.

ParameterList ::= /* empty */ |
                  Parameter ( Separator EmptyOrParameter )* |
                  Separator EmptyOrParameter /* First param empty */
                  ( Separator EmptyOrParameter )*
EmptyOrParameter ::= /* empty */ | Parameter
Parameter ::= Expression
Separator ::= ';'

Functions may be given a parameter list. Note that given this definition, SIN() has a parameter list with 0 parameters, NOT one parameter that happens to be empty.

TODO: If there is a need for passing 1 parameter that is empty, could define a function like EMPTY() for representing an empty parameter.


Reference ::= '[' Source? RangeAddress ']'
RangeAddress ::= CellAddress ( ':' CellAddress )? |
                 SheetName "." Column ':' SheetName "." Column |
                 SheetName "." Row ':' SheetName "." Row
CellAddress ::= SheetName "." Column Row
SheetName ::= /* empty */ | '$' ? ( [^\. #$']+ | "'" ([^'] | "''")+ "'" ) | Error
Column ::= '$'? [A-Z]+
Row ::= '$'? [1-9] [0-9]*

References always begin with '[' in the storage format; this immediately disambiguates cell addresses from function names and named expressions. User interfaces need not (and often do not) display the surrounding [...]. SheetNames include ' characters by doubling them. Column labels MUST be in uppercase; this makes it easier to distinguish between column labels and many named expressions. The syntax above supports whole-row and whole-column refernces, but they must always have a range marker. The optional '$' in front of named sheetnames, columns, and rows, is used when copying the reference; it indicates that the particular value is absolute (not relative).

Note: Referencing a sheet that got deleted turns into an error, typically #REF!.

NOTE: As a SheetName may be quoted, if given ['... it may be ambiguous if that is a start of a Source or a SheetName. It is clarified when parsing on that a Source ends in "'#", and a SheetName does not.

Note: See OpenDocument 1.0 section 8.3.1's definition of cellRangeAddress and cellAddress. The "$" markings note that the following component is absolute, otherwise it is relative; these markings have no effect on calculation (they are intended for use when copying or moving formulas).

*Rationale:* Cell addresses in OpenFormula begin with "[" and end with a "]"; this makes parsing simpler, faster, and more reliable. Cell addresses are specified in A1 notation, not an R1C1 notation, as required by OpenDocument 1.0 section 8.1.3 subsection "Formula". Not using R1C1 notation can in some cases cause an increase in compressed file size (because copied formulas are shown differently and thus do not compress as well). However, A1 notation is much easier for humans to understand, so using A1 format is likely to increase reliability (because it is more likely to be correctly generated and interpreted). If this is a serious problem, Array formulas can usually be used to achieve the same results. Both R1C1 and A1 could be allowed, but then different spreadsheets are likely to generate different characters for the same cell, and thus create "differences" that do not exist even in simple spreadsheets. Only uppercase characters are allowed, for the same reason.

*Note:* Typical spreadsheet displays will often not display or require input of the square brackets. OpenDocument 1.0 section 8.3.1 also defines a cellRangeAddressList (space-separated ranges) which is used by several OpenDocument constructs. OpenFormula does not use cellRangeAddressList; cell concatenation can do the same thing when this is necessary.

*Rationale:* Supporting whole-rows and whole-columns is useful; Excel 2003 supports this. OpenOffice 2.0 does not directly support it, but it fakes it very well by translating a "whole row" or "whole column" entry into a specific range. As long as spreadsheets only go from A1 to IV65536, there is no difference, and 2.0's approach is probably fine for many people. However, spreadsheet implementations of the future are likely to support larger limits, at which point these spreadsheets will quietly fail. It would be wiser to directly support these selectors so that large spreadsheets of the future would not silently fail.

TODO: clarify whether ODF specification could be changed to allow constructs like A:A or 2:2, which the current cellRangeAddress definition does not.

NamedExpression ::= Source? Identifier
Source ::= "'" IRI "'" "#"

A NamedExpression references another expression, possibly in a completely different spreadsheet or any other document type that can be imported into a spreadsheet. Also other protocols than file/http/ftp may be supported.

TODO: In Excel, NamedExpressions can be sheet-specific or global, but there's no syntax to reference one or the other (a potential problem for external references in particular). Should we do better?

COMMENT: You can't decide that on a formula-attribute basis. Current ODF doesn't provide means to distinguish between global and sheet-local expressions.

Source location's IRI is described in RFC3987, Internationalized Resource Identifiers (IRIs), based on RFC3986, Uniform Resource Identifier (URI): Genral Syntax. Implementations MUST support absolute IRIs (URLs are IRIs too). Implementations SHOULD support relative IRIs, which can be distinguished because they do not begin with [A-Za-z]+ ":". Relative IRIs are formed according to section 6.5 of RFC3987, respectively section 4.2 of RFC3986. Implementations MUST support the file scheme (file:// prefix); they MAY support others, but beware of the security ramifications (an attacker could use this ability to track someone when they download a file to extract data from it, or use this to send malicious data). Implementations SHOULD ask the user before loading data from any external data sources.

TODO: exceptions?


Error ::= '#' [^\. #']+

Specific error values are indicated by an identifier. The error value "NA" MUST be represented as "#NA". It is RECOMMENDED that the following identifier names be used (in upper case) where the particular error value is intended:




Attempt to divide by zero, including division by an empty cell. Often ERROR.TYPE of 2


Unrecognized/deleted name. Often ERROR.TYPE of 5.


Not available. Lookup functions which failed and NA() return this value. Often ERROR.TYPE of 7.


Intersection of ranges produced zero cells. Often ERROR.TYPE of 1.


Failed to meet domain constraints (e.g., input was too large or too small). Often ERROR.TYPE of 6.


Reference to invalid cell. Often ERROR.TYPE of 4.


Parameter is wrong type. Often ERROR.TYPE of 3.

An unknown Error value MUST be mapped into an error value supported by the implementation. It is desirable to preserve the original specific error value when writing back out, but it is not required.

*Rationale*: Since spreadsheet implementations may have additional or more specific types of errors, the notion of error has been generalized. Many implementations share these error values, however, so these representations are recommended so sharing is possible where appropriate.

*Note:* Simple lexers, given an initial '#', will not be able to disambiguate between an error name and an unquoted Sheetname preceded by a source, because simple lexers often don't have grammar state information. The syntax has been carefully defined so that they are the same, and thus they do not need to be disambiguated. This means that these two cases can be handled by the identical token in a lexer, resulting in cleaner designs and faster read processing. A higher-level parser can easily distinguish between the cases.


Whitespace (space, tab, newline, and carriage return) is ignored in the default formulas syntax, except in the contents of string constants. A line break is normally represented by a single ASCII character 10 (\n); when embedded in an XML document this is typically represented as " ". Implementations SHOULD retain whitespace entered by the original formula creator, and SHOULD NOT add additional whitespace unless directed to do so during the process of editing a formula.

*Note:* In Microsoft Excel's display format, the space is used as the intersection operator. This is very confusing, so the OpenFormula exchange format uses "!" instead for intersection.


The operator have the following precedence, from lowest to highest:





<, =, >, <=, >=, <>, !=, ==

Less than, equal to, greater than, less than or equal to, greater than or equal to, not equal to, equal to



Binary operation string concatenation. Note that unary (prefix) + and - has a different priority. Note that "&" must be escaped when included in an XML document.



Binary operations add, subtract. Note that unary (prefix) + and "#147; have a different priority.



Multiply, divide. Division does not truncate, so 1/2 is equal to 0.5.



Power (2 ^ 3 is 8).



Postfix unary operator % (divide by 100). Note that this is legal with expressions (e.g., [.B1]%) and it can be duplicated (1), and it does not change the meaning of other operations such as "+".



Prefix unary operators, e.g., -5 or -[.A1]. Note that these have a difference precedence than add and subtract.



Cell address intersection ([.A1:.C4]![.B1:.B5] is [.B1:.B4]).



Reference union.




TODO: Check that precedence of ":" (when not inside [...]) is appropriate.

Precedence can be overridden by using parentheses, so "2+3*4" computes 14 while "(2+3)*4" computes 20. Implementations' user interfaces may display these operators differently or with a different precedence, but when exchanging formulas they MUST use the precedence rules here. Implementations SHOULD retain and regenerate "unnecessary" parentheses and white space, since these are added by people to improve readability.

*Note:* The precedence of unary prefix "-" compared to "^" has initiated more than a little research. In this specification prefix "-" has higher precedence than "", because "-22" is "4" in Microsoft Excel,, Gnumeric, and probably many others. This is not universal; prefix "-" has a lower precedence on Lotus 1-2-3, Quattro Pro, and Excel's own Visual Basic (Walkenbach, 2004, pg. 579), so these products will need to insert and remove parentheses when reading/writing expressions in OpenFormula where this matters. However, the vast majority of spreadsheet implementations used today treat "-" as higher precedence, so as usual we identified and documented common conventions. Using a different precedence would create a high likelihood of error and additional work in implementations, for no good reason. Ideally OpenFormula could also be used for attribute draw:formula as defined in OpenDocument 1.0 section 9.5.5 and attribute anim:formula as defined in 13.3.2; neither of these attributes includes an exponentiation operator "^", so this issue of precedence causes no problem. (Note that they require "," as the function separator, and have other requirements, as discussed later in this document).

*Rationale:* Prefix "-" and "+" are right-associative, not non-associative, because "=--[.B3]" is legal (and it converts B3 to a number, so it can have an effect). Cell address intersection has an even higher precedence, so that a unary minus in front an intersection will work correctly.

*Rationale:* Walkenbach gives Microsoft Excel 2003's precedence levels as (lowest to highest, note that the book gives the reverse order) comparison (such as "="), "&", "+" and "-", "*" and "/", "^", "%", and unary "-" (negation). (Walkenbach, 2004, pg 38). By having the precedence be the same as Excel, it simplifies transition.

*Note:* this format is intentionally similar to traditional presentations of spreadsheet formulas, which reduces the likelihood of error or misunderstanding. Although the user interface is not specified here, the representation is intentionally chosen so that formulas can "round trip" to this format and back without loss using typical formula representations. It would be possible to replace some or all uses of "%" with "/100", and to replace "^" with POWER(), but this would cause the display format to change once it was saved and reloaded.


Below is miscellaneous text that will eventually be thrown away once it is dealt with.

Future versions could handle OpenDocument's anim:formula and draw:formula. They will need to add variable syntax such as $0, $a, and $. In addition, the issue of using "," instead of ";" as a function parameter separator will need to be resolved to accept those formulas. Future versions may also accept constant array specifiers, e.g., "{" list "}" - and in that list, there must be a way to separate different values inside a row as well as a way to separato different rows (Excel uses "," to separate entries within a row, and ";" to separate entries in different rows). 2.0 always generates absolute Sheetnames if it's not the current sheet (i.e., '$' always prefixes a Sheetname). However, the OpenDocument 1.0 specification 8.3.1 clearly makes the "$" optional in front of a sheetname.

*Discussion:* ??? Named values are actually named expressions in Excel, and can have references ("the cell row-2 and column+1 from here") and be reused in many different cells. Thus, there appears to be a real need to support an R[]C[]-style reference format, possibly at level 3. Probably should also support an argument format, so that they can be used as user-defined formulas, using the sytax compatible with the other formula-like constructs in OpenDocument (e.g., $1). However, once you start changing the fundamentals of how cell references are made, should we abandon the OOo storage format? After all, the current syntax already doesn't well-support arbitrary range operations. The Gnumeric folks argue for using their syntax instead; details of this proposal have not yet been seen. See mailing list.

*TBD:* Need to specify how to save relative sheet names, and how to handle ' embedded in sheet names.

*TBD:* Need to specify how to save "sheet deleted" ("#REF") sheet references. [edit </index.php?title=Expression_Syntax&action=edit&section=8>]

An implementation MUST write a cell range address ":" separator, and MUST NOT write the cell range extension operator, where it is possible to do so. That is, an implementation MUST write the expression "[.A1:.A3]" instead of "[.A1]:[.A3]" even though they are semantically identical. Where there is a choice of which cells to join inside a cell range address, an implmentation MUST choose the leftmost one.

*Rationale:* This specification does not mandate a cell concatenation operator at level 2, since 1.1.3 and 2.0 do not include one. However, Microsoft Excel includes this operation. In Excel this is represented using the comma (",") character, the same symbol used as the parameter separator for function calls. This is a very poor choice with a number of unfortunate ramifications. One problem is that concatentating cells in a function parameter requires surrounding the cells with additional parentheses in Excel display syntax. For example, AREAS(A1:A3,B2:B4) is a function call with two parameters, while AREAS((A1:A3,B2:B4)) is a function call with one parameter. Another problem is that the comma interferes with the use of "," as a decimal separator (as it is used in many locales) when using traditional entry formats (which do not mark cell addresses with ".."). Gnumeric uses "+" as the cell concatenation operator in its display, but this has its own problems: it interferes with the use of "+" as a matrix addition operator. There are many alternatives, e.g., other characters (such as "~", "|", and "\"), or requiring a function syntax for this purpose. The character "_" would be a poor choice because formula variables can also include this character in their name (complicating parsing when "[..]" are not used "#147; is B3_B2 a formula variable, or are B2 and B3 concatenated?). Because of these issues, this document proposes using "~" as the cell concatenation symbol.

A special range-extension operator is needed to deal with cases such as more than one range extension, where one or more sides is a name, or where one or more sides is calculated as the result of an expression. But where possible it shouldn't be used, for maximum portability, since level 2 systems don't necessarily implement it. The leftmost rule is noted so that implementations are likely to generate the same text for the same formula, eliminating the appearance of change when none occurs.

Issue: Subtables. "Level 3 implmentations MUST support addressing subtables (tables in tables), using this syntax:

cellRangeAddress ::= cellRangeAddress.level2 |
"[A-Z]+" ":" "[A-Z]+" | /* whole column(s) */
"[0-9]+" ":" "[0-9]+" /* whole row(s) */
cellAddress ::= "$"? sheet_name ( "." in_sheet_address )+

*Rationale:* OpenDocument 1.0 section 8.3.1 discusses subtables, and requires this syntax when addressing subtables. However, it doesn't require that subtables actually be supported in its syntax for cellAddress. [edit </index.php?title=Expression_Syntax&action=edit&section=14>]

Extension: R1C1 Notation

Level 3 implementations MUST support R1C1 notation.

cellRangeAddress ::= cellRangeAddress.level2 |
"[A-Z]+" ":" "[A-Z]+" | /* whole column(s) */
"[0-9]+" ":" "[0-9]+" /* whole row(s) */
cellAddress ::= "$"? sheet_name ( "." in_sheet_address )+
in_sheet_address ::= ( "$"? [A-Z]+ "$"? [0-9]+ ) | ( [-+$] [0-9]+ [-+$] [0-9]+ )

The symbols + and - indicate a relative reference; "$" indicates an absolute reference. The initial "r" MUST be in lowercase letters.

*Rationale:* R1C1 notation tends to produce much better compression results, it supports large spreadsheets (larger than IV65536) more easily, and better supports textual XML diffs. In earlier versions this had a lowercase 'r' and 'c', but this was unnecessary.

*TBD:* It's not clear this particular syntax is a good idea; discussion welcome.

*Rationale:* 2.0 supports source locations for Cell addresses like A1, but not for named values, so this was moved to level 3. [edit </index.php?title=Expression_Syntax&action=edit&section=16>]

Extension: In-line arrays

The syntax above does not include a specification for arrays specified in-line. Microsoft Excel includes a way to write arrays of constants without having to create a cell range for them. 1.1.3 does not support this. Excel's display format uses a comma as the separator between values in a row, and a semicolon for the separator between rows. Here semicolon is used as the separator between values in a row (again, so different locales will have a simpler time entering data when entering data), and the pipe symbol "|" is proposed as the symbol separating rows (with absolutely no precedent).

array_spec ::= "{" matrix ( '_' matrix )* "}"
matrix ::= ( matrix_row ( column_separator matrix_row )* ) ?
matrix_row ::= expression ( ";" expression )*
column_separator ::= "|"

*TBD:* This is a proposed syntax, which in addition permits non-constant values in the array (something Excel does not permit) and concatenation (which Excel may not permits either).

Gnumeric is adding the ability to handle matrixes and NOT forcing them into a fixed cell array; these are called dynamic matrices. Does this require syntactic extensions? Need to discuss in types and functions.

Need to re-investigate in-line arrays and other array function syntax issues.

Should we add arguments (e.g., "$1") so that anim:formula, etc. are easier to support AND so that named expressions are much more effective (they'd then allow the easy definition of new functions)?

Expression_Syntax (last edited 2009-08-12 18:04:23 by localhost)