Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

1624 index large files #3337

Merged
merged 15 commits into from
Apr 18, 2019
Merged

1624 index large files #3337

merged 15 commits into from
Apr 18, 2019

Conversation

ijsnow
Copy link
Contributor

@ijsnow ijsnow commented Apr 10, 2019

This PR is part of the solution for #1624. We are adding a configuration option called "search.largeFiles" which is a list of glob patterns. Any files matching the patterns in the list will be indexed and searched regardless of the size.

This PR:

  • adds the setting
  • adds an internal endpoint that serves this setting for zoekt to consume
  • uses the setting in searcher

Test plan:

Testing searcher implementation:

  • Run sourcegraph
  • Make sure basic text search is ran by disabling indexing "search.index.enabled": false,
  • Add a large file to a repo. (e.g. ijsnow/tmp)
  • Make sure we don't search it
  • Add "search.largeFiles": ["bigfile.txt"] to config
  • Make sure it is in search results

@ijsnow ijsnow requested a review from keegancsmith April 10, 2019 01:39
@ijsnow ijsnow force-pushed the 1624-index-large-files branch from e1947d1 to 59b6a1d Compare April 11, 2019 00:50
@ijsnow ijsnow requested a review from ijt April 11, 2019 21:42
@ijsnow ijsnow marked this pull request as ready for review April 15, 2019 19:39
LargeFiles []string
}

// serveSearchConfiguration is _only_ used by the zoekt index server. Zoekt does
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go doc comments aren't supposed to include markdownish things like foo or bar.
See golang/go#16666 for some discussion about this.

Copy link
Contributor Author

@ijsnow ijsnow Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm optimizing for reading in sourcegraph. Sourcegraph tooltips render as markdown. See this comment in sourcegraph here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably a bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's a sourcegraph feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah lots of people use markdown in godoc even though it isn't supported. Luckily godoc has so little features it doesn't ever conflict.

@ijsnow ijsnow force-pushed the 1624-index-large-files branch from 136aa24 to cc184b8 Compare April 16, 2019 22:25
@ijt
Copy link
Contributor

ijt commented Apr 16, 2019

LGTM once the tests pass.

@ijsnow ijsnow force-pushed the 1624-index-large-files branch from ffb2138 to a8e613d Compare April 17, 2019 22:57
@ijsnow ijsnow force-pushed the 1624-index-large-files branch from a8e613d to 87da254 Compare April 18, 2019 00:31
@codecov
Copy link

codecov bot commented Apr 18, 2019

Codecov Report

Merging #3337 into master will increase coverage by <.01%.
The diff coverage is 50%.

Impacted Files Coverage Δ
cmd/frontend/internal/httpapi/internal.go 0% <0%> (ø) ⬆️
cmd/frontend/internal/httpapi/httpapi.go 19.04% <0%> (-0.23%) ⬇️
cmd/searcher/search/store.go 73.78% <92.3%> (+1.16%) ⬆️
web/src/user/settings/backend.tsx

@ijsnow ijsnow requested a review from keegancsmith April 18, 2019 00:32
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have inline suggestions, but otherwise LGTM.

My only possible concern with this implementation is its a global setting across all repositories. However, that is actually good! We can always add a way to narrow it down to a specific repo in the future.

LargeFiles []string
}

// serveSearchConfiguration is _only_ used by the zoekt index server. Zoekt does
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah lots of people use markdown in godoc even though it isn't supported. Luckily godoc has so little features it doesn't ever conflict.

// key is a sha256 hash since we want to use it for the disk name
h := sha256.Sum256([]byte(string(repo.Name) + " " + string(commit)))
h := sha256.Sum256([]byte(string(repo.Name) + " " + string(commit) + " " + string(lfpBytes)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the marshalling and have a simpler implementation:

Suggested change
h := sha256.Sum256([]byte(string(repo.Name) + " " + string(commit) + " " + string(lfpBytes)))
h := sha256.Sum256([]byte(fmt.Sprintf("%q %q %q", repo.Name, commit, conf.Get().SearchLargeFiles)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooooh fancy. Didn't know about %q

if hdr.Size > maxFileSize {
// We do not search the content of large files unless they are
// whitelisted.
if hdr.Size > maxFileSize && !ignoreSizeMax(hdr.Name, conf.Get().SearchLargeFiles) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want to use a consistent SearchLargeFiles here, while this can change. Additionally there is an overhead of getting the config, and this is done for each file here.

Good enough suggestion: outside of the loop do largeFiles := conf.Get() and use largeFiles here.
More consistent suggestion: Pass down the largeFiles used in prepareZip.

"items": {
"type": "string"
},
"group": "Search"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add examples. JSON schema supports the "examples" key.

@@ -22,6 +22,14 @@
"!go": { "pointer": true },
"group": "Search"
},
"search.largeFiles": {
"description": "A list of file glob patterns where matching files will be indexed and searched regardless of their size.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the description of the glob format (godoc link is good enough)

@ijsnow ijsnow merged commit 17ac46c into master Apr 18, 2019
@ijsnow ijsnow deleted the 1624-index-large-files branch April 18, 2019 16:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants