Commit 4388c69

ruflin authored and andrewkroh committed
Change structure of URL (#7)

* Change structure of URL

  So far the URL structure was heavily inspired by whatwg/url#337. I initially only wanted to make some tweaks to it to improve querying, but I realised I never felt fully comfortable with the field names used here. So I looked at the URL parsers of different languages such as Go, Ruby, and Python; the output they provide is surprisingly similar across languages but not consistent with whatwg. The change made here brings the field names closer to what most URL parsers output.

* fix description of query field
* switch it to integer
* Add dot to host.name to make it consistent

1 parent 396be4e commit 4388c69
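
As a rough illustration of the parser output the commit message refers to, the sketch below runs Python's standard `urllib.parse.urlsplit` on the `url.href` example from this commit. Python is picked only because the message names it; the comparison with the new field names is an illustration and not part of the commit.

```python
from urllib.parse import urlsplit

# Example URL taken from the `url.href` example in this commit.
parts = urlsplit("https://elastic.co:443/search?q=elasticsearch#top")

# Attribute names exposed by the parser, next to the values they hold.
print(parts.scheme)    # https            (no trailing ":")
print(parts.hostname)  # elastic.co
print(parts.port)      # 443              (an int, not a string)
print(parts.path)      # /search
print(parts.query)     # q=elasticsearch  (no leading "?")
print(parts.fragment)  # top              (no leading "#")
```

The attribute names `scheme`, `path`, `query`, and `fragment` match the renamed fields directly, while `hostname` corresponds to `url.host.name`.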

5 files changed, +59 -62 lines changed

CHANGELOG.md

+1
@@ -5,6 +5,7 @@ All notable changes to this project will be documented in this file based on the
 ## [Unreleased](https://github.com/elastic/ecs/compare/0.1.0...master)
 
 ### Breaking changes
+* Change structure of URL. #7
 
 ### Bugfixes

README.md

+10 -13
@@ -347,26 +347,23 @@ A complete URL, with scheme, host, and path.
 
 The URL object can be reused in other prefixes like `host.url.*` for example. It is important that whenever URL is used that the same structure is used.
 
-`url.href` is a [multi field](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/multi-fields.html#_multi_fields_with_multiple_analyzers) which means the data is stored as keyword `url.href` and test `url.href.analyzed`. The advantage of this is that for running a query against only a part of the url still works without having to split up the URL in all its part on ingest time.
-
-Based on whatwg URL definition: https://github.com/whatwg/url/issues/337
+`url.href` is a [multi field](https://www.elastic.co/guide/en/ elasticsearch/reference/6.2/ multi-fields.html#_multi_fields_with_multiple_analyzers) which means the data is stored as keyword `url.href` and test `url.href.analyzed`. The advantage of this is that for running a query against only a part of the url still works without having to split up the URL in all its part on ingest time.
 
 
 | Field | Description | Type | Multi Field | Example |
 |---|---|---|---|---|
-| <a name="url.href"></a>`url.href` | href contains the full url. The field is stored as keyword.<br/>`href` is an analyzed field so the parsed information can be accessed through `href.analyzed` in queries. | keyword | | `https://elastic.co:443/search?q=elasticsearch#top` |
+| <a name="url.href"></a>`url.href` | href contains the full url. The field is stored as keyword.<br/>`href` is an analyzed field so the parsed information can be accessed through `href.analyzed` in quries. | keyword | | `https://elastic.co:443/search?q=elasticsearch#top` |
 | <a name="url.href.analyzed"></a>`url.href.analyzed` | | text | 1 | |
-| <a name="url.protocol"></a>`url.protocol` | The protocol of the request, e.g. "https:". | keyword | | |
-| <a name="url.hostname"></a>`url.hostname` | The hostname of the request, e.g. "example.com".<br/>For correlation the this field can be copied into the `host.name` field. | keyword | | |
-| <a name="url.port"></a>`url.port` | The port of the request, e.g. 443. | keyword | | |
-| <a name="url.pathname"></a>`url.pathname` | The path of the request, e.g. "/search". | text | | |
-| <a name="url.pathname.raw"></a>`url.pathname.raw` | The url path. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
-| <a name="url.search"></a>`url.search` | The search describes the query string of the request, e.g. "q=elasticsearch". | text | | |
-| <a name="url.search.raw"></a>`url.search.raw` | The url search part. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
-| <a name="url.hash"></a>`url.hash` | The hash of the request URL, e.g. "top". | keyword | | |
+| <a name="url.scheme"></a>`url.scheme` | The scheme of the request, e.g. "https".<br/>Note: The `:` is not part of the scheme. | keyword | | `https` |
+| <a name="url.host.name"></a>`url.host.name` | The hostname of the request, e.g. "example.com".<br/>For correlation the this field can be copied into the `host.name` field. | keyword | | `elastic.co` |
+| <a name="url.port"></a>`url.port` | The port of the request, e.g. 443. | integer | | `443` |
+| <a name="url.path"></a>`url.path` | The path of the request, e.g. "/search". | text | | |
+| <a name="url.path.raw"></a>`url.path.raw` | The url path. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
+| <a name="url.query"></a>`url.query` | The query field describes the query string of the request, e.g. "q=elasticsearch".<br/>The `?` is excluded from the query string. In case an URL contains no `?` it is expected that the query field is left out. In case there is a `?` but no query, the query field is expected to exist with an empty string. Like this the `exists` query can be used to differentiate between the two cases. | text | | |
+| <a name="url.query.raw"></a>`url.query.raw` | The url query part. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
+| <a name="url.fragment"></a>`url.fragment` | The part of the url after the `#`, e.g. "top".<br/>The `#` is not part of the fragment. | keyword | | |
 | <a name="url.username"></a>`url.username` | The username of the request. | keyword | | |
 | <a name="url.password"></a>`url.password` | The password of the request. | keyword | | |
-| <a name="url.extension"></a>`url.extension` | The url extension field contains the extension of the file associated with the url.<br/>A simple example is `http://localhost/logo.png` where the extension would be `png`. There can also be more complex cases like `http://localhost/content?asset=logo.png&token=XYZ` where the extension could also be `png` but depends on the implementation.<br/>The `extension` field should be left out if the extension is not defined. | keyword | | `png` |
 
 
 ## <a name="user"></a> User fields
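
The rule in the `url.query` description (leave the field out when there is no `?`, keep it as an empty string when there is a `?` but no query) can be sketched as follows. This is a minimal illustration that assumes Python's `urllib.parse.urlsplit`; the helper name `url_fields` and the flat dict layout are hypothetical and not part of the schema or of any Elastic tooling.

```python
from urllib.parse import urlsplit


def url_fields(href):
    """Split href into a flat dict keyed by url.* field names (hypothetical helper)."""
    parts = urlsplit(href)
    fields = {"url.href": href}
    if parts.scheme:
        fields["url.scheme"] = parts.scheme        # no trailing ":"
    if parts.hostname:
        fields["url.host.name"] = parts.hostname
    if parts.port is not None:
        fields["url.port"] = parts.port            # int, matching the integer type
    if parts.path:
        fields["url.path"] = parts.path
    # urlsplit returns "" both when there is no "?" and when nothing follows
    # the "?", so look for the "?" ourselves (ignoring anything after "#").
    if "?" in href.split("#", 1)[0]:
        fields["url.query"] = parts.query          # may be the empty string
    if parts.fragment:
        fields["url.fragment"] = parts.fragment    # no leading "#"
    return fields


print(url_fields("https://elastic.co/search"))   # no "url.query" key
print(url_fields("https://elastic.co/search?"))  # "url.query" is ""
```

Keeping the empty string for a bare `?` is what lets a consumer tell the two cases apart by checking whether the key is present.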

schema.csv

+6 -7
@@ -113,15 +113,14 @@ source.ip,ip,0,
 source.mac,keyword,1,
 source.port,long,1,
 source.subdomain,keyword,1,
-url.extension,keyword,0,png
-url.hash,keyword,0,
-url.hostname,keyword,0,
+url.fragment,keyword,0,
+url.host.name,keyword,0,elastic.co
 url.href,keyword,0,https://elastic.co:443/search?q=elasticsearch#top
 url.password,keyword,0,
-url.pathname,text,0,
-url.port,keyword,0,
-url.protocol,keyword,0,
-url.search,text,0,
+url.path,text,0,
+url.port,integer,0,443
+url.query,text,0,
+url.scheme,keyword,0,https
 url.username,keyword,0,
 user.email,keyword,1,
 user.hash,keyword,1,
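
The commit does not say why `url.port` moves from `keyword` to `integer`. One practical difference between string-typed and numeric values can be seen in plain Python; the sample port values below are made up, and the snippet only illustrates the ordering behaviour, not Elasticsearch itself.

```python
# Hypothetical sample values, first as strings (keyword-like), then as numbers.
ports_as_keyword = ["8080", "443", "80"]
ports_as_integer = [8080, 443, 80]

# Strings compare character by character, so "443" sorts before "80".
print(sorted(ports_as_keyword))  # ['443', '80', '8080']

# Numbers compare by value, which also makes range checks meaningful.
print(sorted(ports_as_integer))  # [80, 443, 8080]
print([p for p in ports_as_integer if p < 1024])  # [443, 80]
```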

schemas/url.yml

+27 -26
@@ -8,41 +8,46 @@
 example. It is important that whenever URL is used that the same structure
 is used.
 
-`url.href` is a [multi field](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/multi-fields.html#_multi_fields_with_multiple_analyzers)
+`url.href` is a [multi field](https://www.elastic.co/guide/en/
+elasticsearch/reference/6.2/
+multi-fields.html#_multi_fields_with_multiple_analyzers)
 which means the data is stored as keyword `url.href` and test
 `url.href.analyzed`. The advantage of this is that for running a query
 against only a part of the url still works without having to split up the
 URL in all its part on ingest time.
-
-Based on whatwg URL definition: https://github.com/whatwg/url/issues/337
 fields:
 - name: href
 type: keyword
 description: >
 href contains the full url. The field is stored as keyword.
 
 `href` is an analyzed field so the parsed information can be accessed
-through `href.analyzed` in queries.
+through `href.analyzed` in quries.
 multi_fields:
 - name: analyzed
 type: text
 example: https://elastic.co:443/search?q=elasticsearch#top
-- name: protocol
+- name: scheme
 type: keyword
 description: >
-The protocol of the request, e.g. "https:".
-- name: hostname
+The scheme of the request, e.g. "https".
+
+Note: The `:` is not part of the scheme.
+example: https
+- name: host.name
 type: keyword
 description: >
 The hostname of the request, e.g. "example.com".
 
 For correlation the this field can be copied into the `host.name`
 field.
+example: elastic.co
 - name: port
-type: keyword
+type: integer
 description: >
 The port of the request, e.g. 443.
-- name: pathname
+example: 443
+- name: path
 type: text
 description: >
 The path of the request, e.g. "/search".
@@ -52,21 +57,29 @@
 description: >
 The url path. This is a non-analyzed field that is useful
 for aggregations.
-- name: search
+- name: query
 type: text
 description: >
-The search describes the query string of the request,
+The query field describes the query string of the request,
 e.g. "q=elasticsearch".
+
+The `?` is excluded from the query string. In case an URL
+contains no `?` it is expected that the query field is left out.
+In case there is a `?` but no query, the query field is expected
+to exist with an empty string. Like this the `exists` query can be
+used to differentiate between the two cases.
 multi_fields:
 - name: raw
 type: keyword
 description: >
-The url search part. This is a non-analyzed field that is useful
+The url query part. This is a non-analyzed field that is useful
 for aggregations.
-- name: hash
+- name: fragment
 type: keyword
 description: >
-The hash of the request URL, e.g. "top".
+The part of the url after the `#`, e.g. "top".
+
+The `#` is not part of the fragment.
 - name: username
 type: keyword
 description: >
@@ -75,15 +88,3 @@
 type: keyword
 description: >
 The password of the request.
-- name: extension
-type: keyword
-description: >
-The url extension field contains the extension of the file associated with
-the url.
-
-A simple example is `http://localhost/logo.png` where the extension would be `png`.
-There can also be more complex cases like `http://localhost/content?asset=logo.png&token=XYZ`
-where the extension could also be `png` but depends on the implementation.
-
-The `extension` field should be left out if the extension is not defined.
-example: png

template.json

+15 -16
@@ -580,17 +580,17 @@
 },
 "url": {
 "properties": {
-"extension": {
+"fragment": {
 "ignore_above": 1024,
 "type": "keyword"
 },
-"hash": {
-"ignore_above": 1024,
-"type": "keyword"
-},
-"hostname": {
-"ignore_above": 1024,
-"type": "keyword"
+"host": {
+"properties": {
+"name": {
+"ignore_above": 1024,
+"type": "keyword"
+}
+}
 },
 "href": {
 "fields": {
@@ -606,7 +606,7 @@
 "ignore_above": 1024,
 "type": "keyword"
 },
-"pathname": {
+"path": {
 "fields": {
 "raw": {
 "ignore_above": 1024,
@@ -617,14 +617,9 @@
 "type": "text"
 },
 "port": {
-"ignore_above": 1024,
-"type": "keyword"
-},
-"protocol": {
-"ignore_above": 1024,
-"type": "keyword"
+"type": "long"
 },
-"search": {
+"query": {
 "fields": {
 "raw": {
 "ignore_above": 1024,
@@ -634,6 +629,10 @@
 "norms": false,
 "type": "text"
 },
+"scheme": {
+"ignore_above": 1024,
+"type": "keyword"
+},
 "username": {
 "ignore_above": 1024,
 "type": "keyword"
