* [RFC PATCH olang v1] docs: create zero programming language specification @ 2024-03-09 0:05 Johnny Richard 2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht ` (4 more replies) 0 siblings, 5 replies; 11+ messages in thread From: Johnny Richard @ 2024-03-09 0:05 UTC (permalink / raw) To: ~johnnyrichard/olang-devel; +Cc: Johnny Richard This document specifies the semantics and behavior of the Zero Programming Language for compiler programmers be informed how the language is designed. This document will help newcomers to understand how the language looks like and as a DRAFT guide on the language design discussions. The grammar was made by using a EBNF evaluator tool[1]. [1]: https://mdkrajnak.github.io/ebnftest/ Signed-off-by: Johnny Richard <johnny@johnnyrichard.com> --- This grammar adds the token SEMICOLON (';') for every statement. I know we agreed make it optional, but the SEMICOLON makes the parser much more convenient to implement. And this is the first topic I would like to discuss. Let me know if you agree otherwise I can adapt the grammar to make SEMICOLON optional. docs/pages/language_specification.md | 41 ++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) create mode 100644 docs/pages/language_specification.md diff --git a/docs/pages/language_specification.md b/docs/pages/language_specification.md new file mode 100644 index 0000000..9d27eda --- /dev/null +++ b/docs/pages/language_specification.md @@ -0,0 +1,41 @@ +zero programming language specification +======================================= + +ABSTRACT +-------- + +This document specifies the semantics and behavior of the Zero Programming +Language for compiler programmers be informed how the language is designed. + +This specification is on DRAFT and will evolve through discussions on olang-dev +mailing list. 
+
+Language Syntax
+---------------
+
+This is the Zero Programming Language EBNF grammar specification
+
+NOTE: This grammar spec is a DRAFT and it covers only a small portion of the
+language.
+
+```
+<program>             ::= <function-definition>
+<function-definition> ::= <fn_keyword> <space>+ <identifier> <space>* <f-args> <space>* <colon> <space>* <type> <space>* <block>
+<identifier>          ::= <alpha>+
+                        | <alpha>+ <number>*
+                        ;
+<f-args>              ::= '(' <space>* ')'
+<block>               ::= <ocurly> <space>* <statement>* <space>* <ccurly>
+<statement>           ::= <return-statement>
+<return-statement>    ::= <return_keyword> <space>* <number>* <space>* <semicolon>
+<semicolon>           ::= ';'
+<ocurly>              ::= '{'
+<ccurly>              ::= '}'
+<type>                ::= 'u32 '
+<colon>               ::= ':'
+<alpha>               ::= #'[a-zA-Z_]'
+<number>              ::= #'[0-9]'
+<fn_keyword>          ::= 'fn'
+<return_keyword>      ::= 'return'
+<space>               ::= #'[ \t\r\n]'
+```
--
2.44.0

^ permalink raw reply [flat|nested] 11+ messages in thread
* [olang/patches/.build.yml] build success 2024-03-09 0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard @ 2024-03-08 23:09 ` builds.sr.ht 2024-03-14 4:29 ` Ricardo Kagawa 2024-03-09 0:36 ` [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard ` (3 subsequent siblings) 4 siblings, 1 reply; 11+ messages in thread From: builds.sr.ht @ 2024-03-08 23:09 UTC (permalink / raw) To: Johnny Richard; +Cc: ~johnnyrichard/olang-devel olang/patches/.build.yml: SUCCESS in 33s [docs: create zero programming language specification][0] from [Johnny Richard][1] [0]: https://lists.sr.ht/~johnnyrichard/olang-devel/patches/50092 [1]: mailto:johnny@johnnyrichard.com ✓ #1164786 SUCCESS olang/patches/.build.yml https://builds.sr.ht/~johnnyrichard/job/1164786 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [olang/patches/.build.yml] build success 2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht @ 2024-03-14 4:29 ` Ricardo Kagawa 2024-03-14 22:43 ` Johnny Richard 0 siblings, 1 reply; 11+ messages in thread From: Ricardo Kagawa @ 2024-03-14 4:29 UTC (permalink / raw) To: ~johnnyrichard/olang-devel; +Cc: builds.sr.ht >> This grammar adds the token SEMICOLON (';') for every statement. I know we >> agreed make it optional, but the SEMICOLON makes the parser much more >> convenient to implement. >> >> And this is the first topic I would like to discuss. Let me know if you >> agree otherwise I can adapt the grammar to make SEMICOLON optional. > > (...) Therefore, I'm curious about your statement that using a > semicolon makes the parser much more convenient to implement. Could you > elaborate on this? Have you encountered any new considerations that might > complicate the implementation? My limited understanding is that the semicolon would indeed be more convenient, as it would be a definitive end-of-statement symbol, requiring no lookahead to resolve as such. The LF token could be ambiguous on its own (between end-of-statement and white space), so some lookahead would be required to resolve it. But it should be alright, as long as the language remains context-free. Even if it becomes ambiguous, non-deterministic, or requires a long lookahead. Ideally it should be determinitstic for linear time performance, but it seems there are parsers that can run close to it in the average case, as long as the language remains close to deterministic. And I don't have a strong opinion on the semicolon issue, except that it must be an option. But whatever we do, we must avoid the following pitfall from JavaScript: ```javascript example ;(x) ``` The semicolon is mandatory here, because otherwise `(x)` is handled as an argument list, and `example` would be called as a function. That is, it would be a multi-line statement, instead of two separate statements. 
And why would anyone do this?

```javascript
const x = y.example
;(() => {
  console.log(x)
})()
```

Immediately invoked function expressions are a thing in JavaScript, and
it would not be uncommon to have some expression ending with an
identifier right before them.

>> The grammar was made by using a EBNF evaluator tool[1].
>>
>> [1]: https://mdkrajnak.github.io/ebnftest/
>
> I would add this link at the markdown, so then people can play with it.

I would make an even stronger argument for including the link in the
docs. A good language specification also specifies which grammar
notation is used for the specification itself. And EBNF in particular
is not properly standardized, so you really need to specify which EBNF
variant you are using.

The link should thus be good enough to refer to the EBNF implementation
used in this specification, although a permanent (version locked) link
would be better.

----

As for my revision of the grammar:

- Separated rules into sections.
- Added optional white space around the program.
- You don't actually need non-terminal symbols for keywords. Especially
  if you are including the keyword in the symbol name.
- You don't need non-terminal symbols for symbols either, unless you
  have a more "semantic" name for it. There should not be another
  "semicolon" besides `;`, for example.
- In Johnny's version the function name is a single identifier. I don't
  know why Carlos's version made it multiple. I have made it single
  again.
- In Johnny's version the space before the return type is optional. I
  don't know why Carlos's version made it mandatory. I have made it
  optional again.
- Replaced `<identifier>` in `<function-definition>` with
  `<function-name>` to express that this identifier is the name of the
  declared function. Then, `<function-name>` is just `<identifier>`.
- Renamed `<fn-args>` to `<function-parameters>`, since parameters are
  the variables in a function declaration, while arguments are the
  values bound to those variables during function calls.
- Replaced `<type>` with `<return-type>` in `<function-definition>` to
  express that this type identifier is the return type of the function.
  Then, `<return-type>` is just `<type>`.
- Replaced `<block>` in `<function-definition>` with `<function-body>`
  to express that this block is the body of the declared function.
- Reworked `<block>`, `<statement>` and `<end-of-statement>` to allow
  for:
  - Single statement followed by optional end-of-statement;
  - Statement list with mandatory end-of-statement between statements;
  - But the statements could be made optional, yet I did not in this
    version, as there is no `void` return type, currently.
- Replaced `<number>` in `<return-statement>` with `<expression>` to
  prepare for expressions in the future. The only allowed expression is
  still an integer literal, though.
- Renamed `<number>` to `<integer>`, and reworked it to actually
  represent decimal integer literals. Sequences of zero digits are now
  forbidden at the left side, but a lone zero digit is still allowed.
- Reworked `<identifier>` to better express that it starts with
  `<alpha>` or underline, followed by zero or more `<alpha>`, `<digit>`
  or underline.
- Removed `_` from `<alpha>` to better reflect the name (as underline is
  not an alphabetic character).
- Renamed `<space>` to `<ws>` to avoid ambiguity with the character
  U+0020 Space, and made it a one-or-more list. Also introduced `<ows>`
  for "optional white space". Shorter names were preferred here due to
  these symbols in particular being used very frequently.
- Also introduced `<line-break>` as either LF, CR or CRLF. Otherwise the
  CRLF sequence would be parsed as two separate line breaks. Not that it
  would matter that much, except maybe for mapping line numbers.
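To make the last point about `<line-break>` and line numbers concrete, here is a minimal C sketch. The helper `count_line_breaks` is hypothetical (it is not taken from the olang sources) and assumes the source text is available as a byte buffer with a known length. It counts LF, CR and CRLF as one line break each, so a CRLF pair does not advance the line counter twice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the olang sources: counts line breaks in
 * `text`, treating LF, CR and CRLF each as a single break, in the spirit of
 * the <line-break> rule. */
static size_t
count_line_breaks(const char *text, size_t length)
{
    assert(text != NULL);

    size_t breaks = 0;
    for (size_t i = 0; i < length; i++) {
        if (text[i] == '\r') {
            /* Skip the LF of a CRLF pair so the pair counts only once. */
            if (i + 1 < length && text[i + 1] == '\n') {
                i++;
            }
            breaks++;
        } else if (text[i] == '\n') {
            breaks++;
        }
    }
    return breaks;
}
```

A lexer that tracks line numbers could apply the same rule incrementally while scanning, instead of running over the whole buffer.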
```
(* Entry Point *)
<program>             ::= <ows> <function-definition> <ows>

(* Functions *)
<function-definition> ::= 'fn' <ws> <function-name> <ows>
                          <function-parameters> <ows>
                          ':' <ows> <return-type> <ows>
                          <function-body>
<function-name>       ::= <identifier>
<function-parameters> ::= '(' <ows> ')'
<return-type>         ::= <type>
<function-body>       ::= <block>

(* Statements *)
<block>               ::= '{' <ows> <statement> <ows>
                          (<end-of-statement> <ows> <statement> <ows>)*
                          <end-of-statement>? <ows> '}'
<end-of-statement>    ::= ';' | <line-break>
<statement>           ::= <return-statement>
<return-statement>    ::= 'return' <ws> <expression>

(* Expressions *)
<expression>          ::= <integer>

(* Identifiers *)
<type>                ::= 'u32'
<identifier>          ::= (<alpha> | '_') (<alpha> | <digit> | '_')*

(* Literals *)
<integer>             ::= <integer-base10>
<integer-base10>      ::= #'[1-9]' <digit>* | '0'

(* Utilities *)
<ws>                  ::= <white-space>+
<ows>                 ::= <white-space>*
<white-space>         ::= <linear-space> | <line-break>
<line-break>          ::= '\n' | '\r' | '\r\n'
<linear-space>        ::= #'[ \t]'
<alpha>               ::= #'[a-zA-Z]'
<digit>               ::= #'[0-9]'
```

Further discussion:

- Is the language going to support Unicode? If so, `<alpha>` could use
  the _L:Letter_ Unicode category instead of being limited to
  `[a-zA-Z]`. But the EBNF tool does not support Unicode categories in
  its regular expressions (it does not support flags). Also don't
  forget to rename it to `<letter>` in that case.
  - It would help developers in non-English speaking countries, but it
    could be difficult to work with multi-byte characters and Unicode
    normalization.
- There are more linear space and line break characters than the ones
  included here, even within ASCII, although they are not all that
  important. Even more in Unicode (some under _Cc:Other/control_,
  others under _Z:Separator_). Should we support them?
- The function definition could accept a single expression as an
  alternative to its `<block>`, similar to Kotlin.
- The integer literal could include optional underline separators for
  readability.
Just need to be careful not to start with underline, to avoid ambiguity with identifiers. - I guess we don't have to support the full set of Unicode digits, since we don't know if these digits would even be decimal in the first place. The numbering system could be very different from our own, so it is likely not feasible to support them. - I have not checked if this syntax would avoid that edge case with JavaScript I mentioned in the beginning. I might check that next time (I'm still not sure of how). - It might seem strange that I included semantic non-terminals here, despite having removed non-terminals for symbols and keywords. I can't say for sure, since this is my first time trying this style, but I suspect that besides making the language specification easier to understand, the important bits to hook into in the parser will be around these symbols. That is, it could simplify some work on the parser. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [olang/patches/.build.yml] build success 2024-03-14 4:29 ` Ricardo Kagawa @ 2024-03-14 22:43 ` Johnny Richard 0 siblings, 0 replies; 11+ messages in thread From: Johnny Richard @ 2024-03-14 22:43 UTC (permalink / raw) To: Ricardo Kagawa; +Cc: ~johnnyrichard/olang-devel Thank you very much for you contribution, I love it <3. nitpick for next replies: 1) You've replied to the CI build reply. Next time try to reply to the right thread. 2) Your message has few weird line breaks. I don't know why it's happening (noticed you are using Thunderbird). However, you can make sure your setup is correctly configured to send plain text emails by visiting this website https://useplaintext.email/ On Thu, Mar 14, 2024 at 01:29:09AM -0300, Ricardo Kagawa wrote: > >> This grammar adds the token SEMICOLON (';') for every statement. I know > we > >> agreed make it optional, but the SEMICOLON makes the parser much more > >> convenient to implement. > >> > >> And this is the first topic I would like to discuss. Let me know if you > >> agree otherwise I can adapt the grammar to make SEMICOLON optional. > > > > (...) Therefore, I'm curious about your statement that using a > > semicolon makes the parser much more convenient to implement. Could you > > elaborate on this? Have you encountered any new considerations that might > > complicate the implementation? > > My limited understanding is that the semicolon would indeed be more > convenient, as it would be a definitive end-of-statement symbol, > requiring no lookahead to resolve as such. The LF token could be > ambiguous on its own (between end-of-statement and white space), so > some lookahead would be required to resolve it. You are right about it. I had to implement the lookahead capability in order to skip LF tokens. > But it should be alright, as long as the language remains context-free. > Even if it becomes ambiguous, non-deterministic, or requires a long > lookahead. 
Ideally it should be determinitstic for linear time > performance, but it seems there are parsers that can run close to it in > the average case, as long as the language remains close to > deterministic. > > And I don't have a strong opinion on the semicolon issue, except that > it must be an option. But whatever we do, we must avoid the following > pitfall from JavaScript: > > ```javascript > example > ;(x) > ``` > > The semicolon is mandatory here, because otherwise `(x)` is handled as > an argument list, and `example` would be called as a function. That is, > it would be a multi-line statement, instead of two separate statements. > > And why anyone would do this? > > ```javascript > const x = y.example > ;(() => { > console.log(x) > })() > ``` I strong agree on avoid those odd JavaScript design. I think we can continue with optional SEMICOLON. I also think it makes a better programmer experience. > >> The grammar was made by using a EBNF evaluator tool[1]. > >> > >> [1]: https://mdkrajnak.github.io/ebnftest/ > > > > I would add this link at the markdown, so then people can play with it. > > I would make an even stronger argument for including the link in the > docs. A good language specification also specifies which language > specification grammar is used for the specification itself. And the > EBNF in particular is not properly standardized, so you really need to > specify which EBNF variant you are using. > > The link should thus be good enough to refer to the EBNF implementation > used in this specification, although a permanent (version locked) link > would be better. Sure, I can add it to the document. I'm not sure how you want to version lock this variant. Should I add a specific github/git tag version to the document? > As for my revision of the grammar: I liked all comments and definitely it seems to be better version. In my option we can start with you changes and keep this document alive for future discussion. Not sure about Carlos. 
Let see his thoughts on that as well. > Further discussion: > > - Is the language going to support Unicode? If so, `<alpha>` could use > the _L:Letter_ Unicode category instead of being limited to > `[a-zA-Z]`. But the EBNF tool does not support Unicode categories in > its regular expressions (it does not support flags). Also don't > forget to rename it to `<letter>` in that case. > > - It would help developers in non-English speaking countries, but it > could be difficult to work with multi-byte characters and Unicode > normalization. I lack knowledge to answer this question right know. I would say to keep it simple as much as we can on this earlier stage (ASCII only) unless you have a big concern. > - There are more linear space and line break characters than the ones > included here, even within ASCII, although they are not all that > important. Even more in Unicode (some under _Cc:Other/control_, > others under _Z:Separator_). Should we support them? Let's add the remaining ASCII ones meanwhile. > - The function definition could accept a single expression as an > alternative to its `<block>`, similar to Kotlin. Scala also has this capability. But I think it doesn't fit well in our current function declaration: fn f(): u32 <statement> ^ If we don't add a token in here like **=** it will be very weird. No strong options here to be honest. > - The integer literal could include optional underline separators for > readability. Just need to be careful not to start with underline, to > avoid ambiguity with identifiers. I like that. We can have it as well. > - I guess we don't have to support the full set of Unicode digits, since > we don't know if these digits would even be decimal in the first > place. The numbering system could be very different from our own, so > it is likely not feasible to support them. Perhaps we could postpone the Unicode support? > - I have not checked if this syntax would avoid that edge case with > JavaScript I mentioned in the beginning. 
I might check that next
> time (I'm still not sure of how).

Maybe we are going to discover it during the implementation process.

> - It might seem strange that I included semantic non-terminals here,
> despite having removed non-terminals for symbols and keywords. I can't
> say for sure, since this is my first time trying this style, but I
> suspect that besides making the language specification easier to
> understand, the important bits to hook into in the parser will be
> around these symbols. That is, it could simplify some work on the
> parser.

ACK

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH olang v1] docs: create zero programming language specification 2024-03-09 0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard 2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht @ 2024-03-09 0:36 ` Johnny Richard 2024-03-09 5:09 ` Carlos Maniero ` (2 subsequent siblings) 4 siblings, 0 replies; 11+ messages in thread From: Johnny Richard @ 2024-03-09 0:36 UTC (permalink / raw) To: ~johnnyrichard/olang-devel I messed with the grammar, here I have few adjusts. On Sat, Mar 09, 2024 at 01:05:39AM +0100, Johnny Richard wrote: > diff --git a/docs/pages/language_specification.md b/docs/pages/language_specification.md > new file mode 100644 > index 0000000..9d27eda > --- /dev/null > +++ b/docs/pages/language_specification.md > @@ -0,0 +1,41 @@ > +zero programming language specification > +======================================= > + > +ABSTRACT > +-------- > + > +This document specifies the semantics and behavior of the Zero Programming > +Language for compiler programmers be informed how the language is designed. s/be informed how/be informed on how/ > +This specification is on DRAFT and will evolve through discussions on olang-dev s/is on DRAFT/is a DRAFT/ > +<identifier> ::= <alpha>+ > + | <alpha>+ <number>* I missed the recusion here, it should be: <identifier> ::= <alpha>+ | <alpha>+ <number>* <identifier> ; > +<type> ::= 'u32 ' s/u32 /u32/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH olang v1] docs: create zero programming language specification 2024-03-09 0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard 2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht 2024-03-09 0:36 ` [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard @ 2024-03-09 5:09 ` Carlos Maniero 2024-03-19 20:21 ` Johnny Richard 2024-03-23 23:31 ` Carlos Maniero 4 siblings, 0 replies; 11+ messages in thread From: Carlos Maniero @ 2024-03-09 5:09 UTC (permalink / raw) To: Johnny Richard, ~johnnyrichard/olang-devel > This grammar adds the token SEMICOLON (';') for every statement. I know we > agreed make it optional, but the SEMICOLON makes the parser much more > convenient to implement. > > And this is the first topic I would like to discuss. Let me know if you > agree otherwise I can adapt the grammar to make SEMICOLON optional. I understand that using line feed (LF) as an end of statement (EOS) marker can simplify the parsing process. However, at the top of my mind I see two main downsides in using LF as EOS: 1. It could increase the number of tokens that the lexer produces, requiring the parser to ignore some of them. 2. It requires an strategy to skip LF for enable multi-line statements. While the second point does add some complexity, I believe the first one only slightly alters the implementation without significantly increasing its complexity. Therefore, I'm curious about your statement that using a semicolon makes the parser much more convenient to implement. Could you elaborate on this? Have you encountered any new considerations that might complicate the implementation? > The grammar was made by using a EBNF evaluator tool[1]. > > [1]: https://mdkrajnak.github.io/ebnftest/ I would add this link at the markdown, so then people can play with it. 
Furthermore, I was playing with the grammar you sent and I made a few
adjustments:

- I introduced the <end-of-statement> to make SEMICOLON optional;
- It seems there is a pattern in where you used DASH or UNDERLINE, but
  IMO it reads better when standardized.
- I made a simplification on <identifier>.
- I reordered the non-terminals, keeping the single chars and strings at
  the bottom.

    <program>             ::= <function-definition>
    <function-definition> ::= <fn-keyword> <space>+ <identifier>+ <space>* <fn-args> <space>* <colon> <space>+ <type> <space>* <block>
    <identifier>          ::= <alpha>+ <number>* <identifier>*
    <fn-args>             ::= '(' <space>* ')'
    <block>               ::= <ocurly> <space>* <statement> <space>* <ccurly>
    <statement>           ::= <return-statement> <end-of-statement>+
    <return-statement>    ::= <return-keyword> <space>* <number>+ <space>*
    <end-of-statement>    ::= <semicolon>* <space>* '\n'

    <fn-keyword>          ::= 'fn'
    <return-keyword>      ::= 'return'
    <type>                ::= 'u32'
    <alpha>               ::= #'[a-zA-Z_]'
    <number>              ::= #'[0-9]'
    <space>               ::= #'[ \t\r\n]'
    <semicolon>           ::= ';'
    <ocurly>              ::= '{'
    <ccurly>              ::= '}'
    <colon>               ::= ':'

Other than that, I would like to say thank you for always looking
forward and helping us to improve our process.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH olang v1] docs: create zero programming language specification 2024-03-09 0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard ` (2 preceding siblings ...) 2024-03-09 5:09 ` Carlos Maniero @ 2024-03-19 20:21 ` Johnny Richard 2024-03-23 23:31 ` Carlos Maniero 4 siblings, 0 replies; 11+ messages in thread From: Johnny Richard @ 2024-03-19 20:21 UTC (permalink / raw) To: ~johnnyrichard/olang-devel A patch v2 has been sent, this patch is SUPERSEDED. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH olang v1] docs: create zero programming language specification 2024-03-09 0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard ` (3 preceding siblings ...) 2024-03-19 20:21 ` Johnny Richard @ 2024-03-23 23:31 ` Carlos Maniero 4 siblings, 0 replies; 11+ messages in thread From: Carlos Maniero @ 2024-03-23 23:31 UTC (permalink / raw) To: Johnny Richard, ~johnnyrichard/olang-devel ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH olang v1] docs: create zero programming language specification @ 2024-03-15 20:54 Ricardo Kagawa 2024-03-17 15:41 ` Carlos Maniero 2024-03-18 9:58 ` Johnny Richard 0 siblings, 2 replies; 11+ messages in thread From: Ricardo Kagawa @ 2024-03-15 20:54 UTC (permalink / raw) To: ~johnnyrichard/olang-devel > You've replied to the CI build reply. Next time try to reply to the > right thread. I just opened the "Reply to thread" link from sourcehut's web interface. It automatically filled the TO, CC, Subject and some thread ID header, which values I just trusted. It seems the UI is not to be trusted, but then, which values should I use? The address in TO seems to have bounced my reply, so I thought it didn't even make it to sourcehut. > Your message has few weird line breaks. I am not myself sure, but I suspect it is an issue with the file format generated by `vim`. My default `email` file type seems to be forcing the `dos` format (CRLF line breaks), which might be being interpreted as two separate line breaks somewhere between Thunderird, Gmail and sourcehut. I'll see if I can force it to use the `unix` format (LF only) and if that fixes things at all. > I'm not sure how you want to version lock this variant. Should I add > a specific github/git tag version to the document? Yes, sort of. The web tool itself cannot be version locked, since it simply does not have that option, but it does link to its GitHub project. The project does not itself contain the description of its EBNF syntax, but it does have a link to what it uses to implement its EBNF parser, which in turn describes its syntax. You could include a version locked link to [that][1]. [1]: https://github.com/Engelberg/instaparse/tree/v1.4.12 > > Is the language going to support Unicode? > > I would say to keep it simple as much as we can on this earlier stage > (ASCII only) unless you have a big concern. I guess that would be OK. I don't think it would be too difficult to migrate later. 
Maybe tricky, but not difficult, since Unicode is a superset of ASCII.
Just need to be careful not to depend too much on the fact that ASCII
characters fit in 8-bit variables, as Unicode text is commonly stored
in a variable-length encoding such as UTF-8 (the length varies within a
string, but each character still takes a multiple of 8 bits).

> If we don't add a token in here like **=** it will be very weird.

Actually, I mentioned Kotlin also to imply that there would be an equals
sign before the expression.

> > - I have not checked if this syntax would avoid that edge case with
> > JavaScript I mentioned in the beginning. I might check that next
> > time (I'm still not sure of how).
>
> Maybe we are going to discovery it on the implementation process.

I _suspect_ it would be enough to give precedence to interpreting line
breaks as end-of-statement, and if so, there might be a way to represent
that precedence in the EBNF grammar (by convention). I would still need
to mull over it for a while to be sure.

Another revision:

- Function body now accepts a single expression.
  - This introduced the `<end-of-file>` token, which is not an actual
    sequence of characters. It allows the function body expression at
    the end of the program without a following line break or semicolon.
    Earlier declarations must include a line break or semicolon.
- `\v` (vertical tab) and `\f` (form feed) included as line breaks for
  completeness over ASCII (based on `\s` regex class, which agrees with
  Unicode properties over the ASCII range).
- Integer literals can now include underlines as separators.
  - The literal is allowed to terminate with an arbitrarily long
    sequence of separators, though.
  - It would be possible to restrict the last character to be a digit,
    but maybe it is not worth the trouble?
- Introducing hexadecimal integer literals.
  - Hexadecimal literals are allowed to have an arbitrarily long
    sequence of zeroes to the left, after the `0x` prefix. This is
    intentional, and the parser should ignore excess zeroes.
  - The alphabetic characters in the literal should be parsed without
    case-sensitivity. There is no difference between the `0x` and `0X`
    prefixes in literals. If mixed case is not desirable, let the linter
    or formatter restrict that.

```
(* Entry Point *)
<program>             ::= <ows> <function-definition> <ows> <end-of-file>

(* Functions *)
<function-definition> ::= 'fn' <ws> <function-name> <ows>
                          <function-parameters> <ows>
                          ':' <ows> <return-type> <ows>
                          <function-body>
<function-name>       ::= <identifier>
<function-parameters> ::= '(' <ows> ')'
<return-type>         ::= <type>
<function-body>       ::= <block>
                        | '=' <ows> <expression> <ows>
                          (<end-of-statement> | <end-of-file>)

(* Statements *)
<block>               ::= '{' <ows> <statement> <ows>
                          (<end-of-statement> <ows> <statement> <ows>)*
                          <end-of-statement>? <ows> '}'
<end-of-statement>    ::= ';' | <line-break>
<statement>           ::= <return-statement>
<return-statement>    ::= 'return' <ws> <expression>

(* Expressions *)
<expression>          ::= <integer>

(* Identifiers *)
<type>                ::= 'u32'
<identifier>          ::= (<alpha> | '_') (<alpha> | <digit> | '_')*

(* Literals *)
<integer>             ::= <integer-base10> | <integer-base16>
<integer-base10>      ::= #'[1-9]' (<digit> | '_')* | '0'
<integer-base16>      ::= #'0[Xx]' <hex-digit> (<hex-digit> | '_')*

(* Utilities *)
<ws>                  ::= <white-space>+
<ows>                 ::= <white-space>*
<white-space>         ::= <linear-space> | <line-break>
<line-break>          ::= #'[\n\v\f\r]' | '\r\n'
<linear-space>        ::= #'[ \t]'
<alpha>               ::= #'[a-zA-Z]'
<digit>               ::= #'[0-9]'
<hex-digit>           ::= <digit> | #'[a-fA-F]'
<end-of-file>         ::= #'$'
```

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH olang v1] docs: create zero programming language specification 2024-03-15 20:54 Ricardo Kagawa @ 2024-03-17 15:41 ` Carlos Maniero 2024-03-18 9:58 ` Johnny Richard 1 sibling, 0 replies; 11+ messages in thread From: Carlos Maniero @ 2024-03-17 15:41 UTC (permalink / raw) To: Ricardo Kagawa, ~johnnyrichard/olang-devel I'd like to begin by echoing Johnny's words of thanks. It's truly fantastic to see more people getting involved in making olang a remarkable language. It's also quite refreshing to have someone on board who is well-versed in the theory behind creating a programming language. I'm certain that I, along with others, will learn a great deal from you. I'll start with a question from your first reply: > My limited understanding is that the semicolon would indeed be more > convenient, as it would be a definitive end-of-statement symbol, > requiring no lookahead to resolve as such. The LF token could be > ambiguous on its own (between end-of-statement and white space), so > some lookahead would be required to resolve it. I had a hard time trying to understand why the "LF token could be ambiguous on its own", but now I got it. Before, I was just thinking about the function body where, to me, a blank line in the function's body is just an empty statement, same as a sequence of semicolons in C. But I wasn't considering that there are a lot of other places where the programmer could add a blank space, like in function declarations: fn main() : u32 { return 0 } Even though the code above is ugly AF, it's still syntactically correct. And that made me realize why it's more convenient to have an end-of-statement token. I'm just putting this out here in case someone else has the same doubt. Although, it seems like we are all on the same page about wanting to make the language more user-friendly, even if it means giving the parser a bit of a hard time. 

About "some lookahead would be required to resolve it": we definitely
added some of these in the parser, but on a closer look, we could easily
get rid of them by replacing the function:

    static void
    skip_line_feeds(lexer_t *lexer)
    {
        token_t token;
        lexer_peek_next(lexer, &token);

        while (token.kind == TOKEN_LF) {
            lexer_next_token(lexer, &token);
            lexer_peek_next(lexer, &token);
        }
    }

With:

    static void
    next_non_lf_token(lexer_t *lexer, token_t *token)
    {
        do {
            lexer_next_token(lexer, token);
        } while (token->kind == TOKEN_LF);
    }

I'm sure there may be some corner cases where it cannot be applied, but
I think it reduces backtracking in most cases.

> - Function body now accepts a single expression.
> ...
> - `\v` (vertical tab) and `\f` (form feed) included as line breaks for
>   completeness over ASCII (based on `\s` regex class, which agrees with
>   Unicode properties over the ASCII range).
> - Integer literals can now include underlines as separators.
> ...
> - Introducing hexadecimal integer literals.
> ...

From your last email, I'm totally up for discussing these topics. But I
reckon we might want to split them into new threads to avoid trying to
hash out the entire language in one go.

What do you all think about us trying to:

1. Nail down the current state of the language, leaving new features for
   later.
2. Figure out how we're gonna document new features.
3. Kick off a new thread for each feature we're thinking of adding.

Sound good?

^ permalink raw reply [flat|nested] 11+ messages in thread
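To make the lookahead-free helper from the message above concrete, here is a self-contained sketch. The `token_t`, `lexer_t` and `lexer_next_token` definitions below are simplified stand-ins assumed for illustration and do not reflect the real olang lexer; only `next_non_lf_token` follows the function quoted in the message.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical token kinds: a line feed, anything else, end of input. */
typedef enum { TOKEN_LF, TOKEN_OTHER, TOKEN_EOF } token_kind_t;

typedef struct {
    token_kind_t kind;
    size_t offset; /* position of the token in the source */
} token_t;

typedef struct {
    const char *source; /* NUL-terminated input */
    size_t offset;      /* next character to read */
} lexer_t;

/* Stand-in tokenizer: '\n' becomes TOKEN_LF, every other character becomes
 * a one-byte TOKEN_OTHER, and the terminator becomes TOKEN_EOF. */
static void lexer_next_token(lexer_t *lexer, token_t *token)
{
    assert(lexer != NULL && token != NULL);

    char c = lexer->source[lexer->offset];
    token->offset = lexer->offset;

    if (c == '\0') {
        token->kind = TOKEN_EOF; /* stay on the terminator */
        return;
    }

    token->kind = (c == '\n') ? TOKEN_LF : TOKEN_OTHER;
    lexer->offset++;
}

/* Same shape as the function proposed in the message: keep consuming
 * tokens until one that is not a line feed comes out. No peeking needed. */
static void next_non_lf_token(lexer_t *lexer, token_t *token)
{
    do {
        lexer_next_token(lexer, token);
    } while (token->kind == TOKEN_LF);
}
```

One difference worth keeping in mind, assuming `lexer_peek_next` does not consume input: `skip_line_feeds` leaves the first non-LF token unconsumed, while `next_non_lf_token` consumes and returns it, so call sites would need to be adapted accordingly.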
* Re: [RFC PATCH olang v1] docs: create zero programming language specification
  2024-03-15 20:54 Ricardo Kagawa
  2024-03-17 15:41 ` Carlos Maniero
@ 2024-03-18 9:58 ` Johnny Richard
  1 sibling, 0 replies; 11+ messages in thread
From: Johnny Richard @ 2024-03-18 9:58 UTC (permalink / raw)
  To: Ricardo Kagawa; +Cc: ~johnnyrichard/olang-devel

Amazing job. Wonderful changes and thank you so much for your
contribution.

On Fri, Mar 15, 2024 at 05:54:10PM -0300, Ricardo Kagawa wrote:
> > You've replied to the CI build reply. Next time try to reply to the
> > right thread.
>
> I just opened the "Reply to thread" link from sourcehut's web interface.
> It automatically filled the TO, CC, Subject and some thread ID header,
> which values I just trusted. It seems the UI is not to be trusted, but
> then, which values should I use? The address in TO seems to have bounced
> my reply, so I thought it didn't even make it to sourcehut.

Sad to hear the Sourcehut reply is misbehaving. Could you please report
the problem to them?

The good practice is to always **reply all** (so that someone who is not
subscribed to the mailing list can still follow the thread discussion).

Since I wrote the email you replied to, you should reply to me, **CC**
the mailing list, and carry over the **CC** from the previous email
(reply already behaves this way).

You might want to have a copy (optional); in that case you can **BCC**
yourself.

To attach your reply to the right thread, you have to set the
**In-Reply-To** header to the **Message-ID** of the message you are
replying to. It sounds hard to achieve, but your email client should do
the work.

You can download and import the email thread from the archive right into
your email client and **reply all**.

There is another mailing list archive[1] which has instructions on how
to reply to emails using **git-send-email**, which you might be
interested in.

[1]: http://lists.johnnyrichard.com/olang/

For example, you could follow the steps below to reply to the right
thread using **git-send-email**:

Reply instructions:

You may reply publicly to this message[2] via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox[3]

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=y2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c \
    --to=johnny@johnnyrichard.com \
    --cc=ricardo.kagawa@gmail.com \
    --cc=~johnnyrichard/olang-devel@lists.sr.ht \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link[4]

  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.

[2]: http://lists.johnnyrichard.com/olang/y2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c/#t
[3]: http://lists.johnnyrichard.com/olang/y2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c/raw
[4]: mailto:johnny@johnnyrichard.com?In-Reply-To=%3Cy2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c%3E&Cc=ricardo.kagawa%40gmail.com%2C~johnnyrichard%2Folang-devel%40lists.sr.ht&Subject=Re%3A%20%5Bolang%2Fpatches%2F.build.yml%5D%20build%20success

> > Your message has few weird line breaks.
>
> I am not myself sure, but I suspect it is an issue with the file format
> generated by `vim`. My default `email` file type seems to be forcing the
> `dos` format (CRLF line breaks), which might be being interpreted as two
> separate line breaks somewhere between Thunderbird, Gmail and sourcehut.
> I'll see if I can force it to use the `unix` format (LF only) and if
> that fixes things at all.

I think you made it, this email looks good. Perhaps you can set vim as
your default editor in Thunderbird (I know you can do it but I'm not sure
how). Gmail should not mess up your message.

If you like to live in the terminal, I suggest the **mutt** or **aerc**
clients, which work out of the box with plain-text emails and are very
flexible.

> > I'm not sure how you want to version lock this variant. Should I add
> > a specific github/git tag version to the document?
>
> Yes, sort of. The web tool itself cannot be version locked, since it
> simply does not have that option, but it does link to its GitHub
> project. The project does not itself contain the description of its
> EBNF syntax, but it does have a link to what it uses to implement its
> EBNF parser, which in turn describes its syntax. You could include a
> version locked link to [that][1].
>
> [1]: https://github.com/Engelberg/instaparse/tree/v1.4.12

Sounds good.

> > > Is the language going to support Unicode?
> >
> > I would say to keep it simple as much as we can on this earlier stage
> > (ASCII only) unless you have a big concern.
>
> I guess that would be OK. I don't think it would be too difficult to
> migrate later. Maybe tricky, but not difficult, since Unicode is a
> superset of ASCII. Just need to be careful not to depend too much on
> the fact that ASCII characters are stored in 8-bit variables, as
> Unicode uses variable-length characters (variable within a string, but
> characters are multiples of 8 bits).

Nice, let us know whenever we introduce such an ASCII dependency.

> > If we don't add a token in here like **=** it will be very weird.
>
> Actually, I mentioned Kotlin also to imply that there would be an
> equals sign before the expression.
>
> > > - I have not checked if this syntax would avoid that edge case with
> > >   JavaScript I mentioned in the beginning. I might check that next
> > >   time (I'm still not sure of how).
> >
> > Maybe we are going to discover it during the implementation process.
>
> I _suspect_ it would be enough to give precedence to interpreting line
> breaks as end-of-statement, and if so, there might be a way to represent
> that precedence in the EBNF grammar (by convention). I would still need
> to mull over it for a while to be sure.

I can test the precedence solution later as well.

> Another revision:
>
> - Function body now accepts a single expression.

Hmm... I think we still haven't agreed on having it. We are trying to
make a very simple language, and avoiding multiple ways of achieving the
same thing may keep it simpler.

> - Introducing hexadecimal integer literals.
>   - Hexadecimal literals are allowed to have an arbitrarily long
>     sequence of zeroes to the left, after the `0x` prefix. This is
>     intentional, and the parser should ignore excess zeroes.
>   - The alphabetic characters in the literal should be parsed without
>     case-sensitivity. There is no difference between the `0x` and `0X`
>     prefixes in literals. If mixed case is not desirable, let the
>     linter or formatter restrict that.

Nice that you added support for hex. We will need octal in the future as
well (starting with 0). But let's craft more features in new patches;
this thread is getting out of control.

Perhaps we can merge an initial version and propose small changes to it
(in new discussion threads). I can set up a PATCH with most of your
changes (except the single-line function definition, which we can
discuss in a new PATCH thread).

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-03-23 23:31 UTC | newest]

Thread overview: 11+ messages
2024-03-09 0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht
2024-03-14 4:29 ` Ricardo Kagawa
2024-03-14 22:43 ` Johnny Richard
2024-03-09 0:36 ` [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
2024-03-09 5:09 ` Carlos Maniero
2024-03-19 20:21 ` Johnny Richard
2024-03-23 23:31 ` Carlos Maniero
2024-03-15 20:54 Ricardo Kagawa
2024-03-17 15:41 ` Carlos Maniero
2024-03-18 9:58 ` Johnny Richard

Code repositories for project(s) associated with this public inbox

	https://git.johnnyrichard.com/olang.git