[BUG] HTTP down #299

eskey0 · 2024-04-08T14:59:45Z

Faikin hardware
Faikin-S3-MINI-N4-R2: 91c1bc5 2024-03-31T10:59:15 S21 from Amazon

Daikin hardware
FTXP35N5V1B via s403

Describe the bug
The web interface goes down. I can still control the unit via MQTT and ping it, but get no HTTP response whatsoever.

To Reproduce
No idea, happened out of the blue, I waited to see if it comes back but no dice.

Expected behavior
The web service should keep working. I searched for a way to reboot via MQTT to see if that fixes it, but found none.

Additional context
I have 3 of them, all configured and set up the same day; only one of them failed.

revk · 2024-04-08T15:02:20Z

Hmm, odd, we had this ages ago on older code with an app using the legacy URLs, but fixed long ago.

Try just power cycling or sending a restart command over MQTT and see if it comes back.

Try the web interface via IP address rather than URL/domain, in case it is an mDNS issue.

eskey0 · 2024-04-08T15:15:59Z

Sorry, I didn't specify: yes, I do use the direct IP address to connect to the device.
After the restart command the website is up again. I don't know if you want to dig more into this, or let it be for now.

revk · 2024-04-08T15:24:12Z

OK, not sure. As I say, this has only been seen with some very specific (and now fixed) legacy IP polling. See if it happens again.

eskey0 · 2024-04-08T15:25:48Z

Sure, I'll keep an eye on this and keep you updated. Thanks sir, you're awesome!

eskey0 · 2024-04-15T07:38:47Z

Hello there again, just a heads up: my second device also ended up in the "HTTP fail" state, and I again fixed it with an MQTT restart. Now my 3rd device is in that state too.

EDIT: Just wanted to share the status. If no one else is experiencing this, maybe it's something in my setup.

revk · 2024-04-15T07:52:25Z

Are you using the legacy URLs / polling them?

eskey0 · 2024-04-15T07:59:12Z

I just navigate to http://ipaddress in the browser, it usually just works.

revk · 2024-04-15T08:00:05Z

OK but no tools, HA plug-ins, or something, that may be accessing the legacy URLs for data?

eskey0 · 2024-04-15T08:03:07Z

Not that I am aware of, just HA through MQTT. Nothing uses HTTP besides my browser, which I rarely use.

revk · 2024-04-15T08:05:10Z

OK, I ask because I know some HA plug-ins use the old URLs, but if you are using MQTT, that should be fine. Which leaves me rather puzzled at the issue, to be honest.

eskey0 · 2024-04-15T08:17:01Z

It also looks timed: one failed, I rebooted it, about 3 to 5 days passed, then the next one failed, and so on. Now it is the 3rd one (of 3). I can just reboot it via MQTT too and see whether the cycle starts again from the first one that failed.

To give you more insight: I have a more-than-average network. The Faikins are in a restricted network, with some cameras in the same segment, and only have access to HA through MQTT, web access from my computer, and access to your update server.

I do have some plug-ins in HA, but those were for the "official" modules, which were assigned different IP addresses, and I disconnected them from the units, so I don't think that could be an issue.

eskey0 · 2024-04-17T14:33:31Z

I have more information to share: it happened again, this time to 2 of the 3 devices I have. It happened just after I changed the wifi band on my AP; does that ring a bell? Again, after sending an MQTT reboot the website came back online.

I must add that I live in an apartment that is very noisy wifi-wise.

antwin · 2024-08-18T01:22:19Z

This has just happened to me. The device is online - responds to pings, nmap can see it but not analyse it, it works on mqtt, but the webserver times out. Addressed by ip address. Webserver is up again after an mqtt restart. Uptime was a few days.
Just before the web server stopped, I was looking at the page. It loaded the first time ok. Then just gave the blue screen with no buttons. On a reload it loaded it all, then timed out.
Faikin-S3-MINI-N4-R2: b16bfc4 2024-08-12T14:02:04 S21
Would any more info help - wireshark capture, status output ... ?
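The symptom described here (pings and MQTT fine, web server timing out) can also be checked from a script rather than a browser. The following is only a minimal sketch using the Python standard library; the function name `http_responds` and the timeout values are illustrative and not part of Faikin.

```python
import http.client
import urllib.error
import urllib.request


def http_responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server at `url` sends back any HTTP response in time."""
    # An empty ProxyHandler bypasses any system proxy, since the device is on the LAN.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read(1)  # make sure body data actually flows, not just headers
        return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status, so it is alive.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, timed out, or an unparsable reply.
        return False
```

Calling it periodically with the device's own address, e.g. `http_responds("http://<ipaddress>/")`, and logging the result alongside the uptime would show roughly when the web server stops answering.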

revk · 2024-08-18T06:43:50Z

Just to check, are you using the legacy URLs? We think, somehow, there is a memory leak, possibly in the ESP IDF.

antwin · 2024-08-18T09:00:22Z

I'm not sure what you mean by legacy URLs. I'm using the IP address (192.168.0.150) directly.

revk · 2024-08-18T09:08:09Z

I.e. a monitoring app that talks http to Faikin to get/set data. The way the old Daikin wifi modules used to work.

antwin · 2024-08-18T09:14:22Z

I'm using Firefox to read from http://192.168.0.150 (the Faikin) on one computer. The page appears to be refreshed at intervals. I have not disconnected the original Daikin wifi module, but that has never been used, and the Daikin app is not available here.

revk · 2024-08-18T09:21:40Z

OK, it sounds like you are not using the legacy HTTP API then. The web page on the Faikin is not "refreshed"; it uses a web socket. It should have no problem working indefinitely. I'm puzzled if you think it is being refreshed.

When we have seen issues with the web server stopping, it has always been down to someone using some app (not the Daikin app, usually some Home Assistant plug-in that is not using MQTT) that polls the legacy HTTP APIs constantly. We think there is some memory leak issue from that, but we are not 100% sure.

If you are not doing that, it is the first case of a problem like this.

Can you check the settings / basic page occasionally and see if the memory figures on that page are going down over time?

antwin · 2024-08-18T09:33:44Z

First off, thanks for the prompt replies - I'm very impressed!
My terminology was off. The page is updated, which is why I assumed it was refreshed. I must get the hang of websockets some day.
I'm not using HA. I intend to be using MQTT sometime.
I'll check the memory figures on the settings page, but it's a cold wet night here (NZ) and I'm off to bed, so there will be a pause of a day or two.

revk · 2024-08-18T09:38:36Z

Have a good night. The fact this is not using legacy HTTP APIs is interesting, and so may give us clues.

antwin · 2024-08-20T23:08:49Z

Here are some preliminary results from status/faikin - are these what you need to see?:
{"ts":"2024-08-20T05:25:26Z","id":"DC5475EF52FC","up":true,"uptime":3690,"mqtt-up":3686,"mem":119504,"spi":2090296}
{"ts":"2024-08-20T08:28:48Z","id":"DC5475EF52FC","up":true,"uptime":14692,"mqtt-up":14688,"mem":119324,"spi":2090196}
{"ts":"2024-08-20T22:40:31Z","id":"DC5475EF52FC","up":true,"uptime":65794,"mqtt-up":65790,"mem":119120,"spi":2090196}

revk · 2024-08-21T06:19:12Z

Ah perfect, yes: mem and SPI, over time.

antwin · 2024-08-25T05:58:46Z

No http hangs for several days!
More results:
{"ts":"2024-08-21T09:38:24Z","id":"DC5475EF52FC","up":true,"uptime":105267,"mqtt-up":21698,"mem":119324,"spi":2090196}
{"ts":"2024-08-21T23:28:38Z","id":"DC5475EF52FC","up":true,"uptime":155080,"mqtt-up":71511,"mem":118760,"spi":2090196}
{"ts":"2024-08-23T23:37:50Z","id":"DC5475EF52FC","up":true,"uptime":328430,"mqtt-up":244861,"mem":118676,"spi":2090040}
{"ts":"2024-08-25T05:06:19Z","id":"DC5475EF52FC","up":true,"uptime":434538,"mqtt-up":350969}
mem 113600+2090108 (for some reason, it's not now reporting "mem" in status.)

antwin · 2024-08-27T23:18:27Z

MQTT is working fine. BUT although HTTP is working on one device I cannot connect on a second device. Current status:
{"ts":"2024-08-27T22:42:04Z","id":"DC5475EF52FC","up":true,"uptime":670681,"mqtt-up":587112,"mem":109792,"spi":2089848}
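One way to turn the status lines posted so far into a "trend over days" figure is a least-squares fit of "mem" against "uptime". This is just a sketch under assumptions: the lines are pasted verbatim from this thread, the sample without a "mem" field is skipped, and the function names are made up for illustration. It says nothing about the cause of the HTTP problem.

```python
import json

# Status lines copied from this thread (all from the same boot: uptime keeps rising).
STATUS_LINES = """
{"ts":"2024-08-20T05:25:26Z","id":"DC5475EF52FC","up":true,"uptime":3690,"mqtt-up":3686,"mem":119504,"spi":2090296}
{"ts":"2024-08-20T08:28:48Z","id":"DC5475EF52FC","up":true,"uptime":14692,"mqtt-up":14688,"mem":119324,"spi":2090196}
{"ts":"2024-08-20T22:40:31Z","id":"DC5475EF52FC","up":true,"uptime":65794,"mqtt-up":65790,"mem":119120,"spi":2090196}
{"ts":"2024-08-21T09:38:24Z","id":"DC5475EF52FC","up":true,"uptime":105267,"mqtt-up":21698,"mem":119324,"spi":2090196}
{"ts":"2024-08-21T23:28:38Z","id":"DC5475EF52FC","up":true,"uptime":155080,"mqtt-up":71511,"mem":118760,"spi":2090196}
{"ts":"2024-08-23T23:37:50Z","id":"DC5475EF52FC","up":true,"uptime":328430,"mqtt-up":244861,"mem":118676,"spi":2090040}
{"ts":"2024-08-25T05:06:19Z","id":"DC5475EF52FC","up":true,"uptime":434538,"mqtt-up":350969}
{"ts":"2024-08-27T22:42:04Z","id":"DC5475EF52FC","up":true,"uptime":670681,"mqtt-up":587112,"mem":109792,"spi":2089848}
"""


def mem_samples(lines: str) -> list[tuple[int, int]]:
    """Parse one JSON object per line and return (uptime, mem) pairs."""
    samples = []
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "mem" in record:  # skip samples where "mem" was not reported
            samples.append((record["uptime"], record["mem"]))
    return samples


def slope_bytes_per_day(samples: list[tuple[int, int]]) -> float:
    """Least-squares slope of mem over uptime, scaled from per-second to per-day."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two samples with different uptimes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx * 86400  # 86400 seconds per day


if __name__ == "__main__":
    trend = slope_bytes_per_day(mem_samples(STATUS_LINES))
    print(f"mem trend: {trend:.0f} bytes/day")
```

A negative result means free memory is falling over the sampled period; with only a handful of points the number is a rough indication at best.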

revk · 2024-08-28T08:01:39Z

OK, that means it is not a memory leak. I'll have to look at number of TCP sockets or something.

Does it eventually recover, or does it need a restart?

antwin · 2024-08-28T09:39:55Z

The working one worked for some hours. But it has also just stopped. It stopped with just the blue background page and 'settings....' at the bottom left, so no updating. So now no http connection on either, but pings and mqtt work fine.

revk · 2024-08-28T09:57:32Z

This sounds a lot like a TCP related issue. I'll have to have a play with the options.

PianSom · 2025-03-10T08:34:30Z

There's no need for any special HA work, the data is available.

E.g. one of my Faikins has the name faikinliving, under which it publishes data to MQTT. The topic needed is state/faikinliving and the memory field is (correct me if I'm wrong) the JSON field "mem".

So in my HA configuration.yaml I have the line

mqtt: !include mqtt.yaml

and in my mqtt.yaml I can put

sensor:
  - name: "FaikinLiving mem"
    unique_id: faikinliving-mem
    device_class: data_size
    # Assumption: "mem" is a byte count; data_size sensors expect a unit.
    unit_of_measurement: "B"
    state_topic: "state/faikinliving"
    value_template: "{{value_json.mem}}"

and then I have a sensor, sensor.faikinliving_mem (HA derives the entity ID from the name), which can be graphed.
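For anyone who wants the same figure outside HA, the value_template above simply picks the "mem" key out of the JSON payload. A rough Python equivalent, assuming the payload looks like the status lines posted earlier in this thread (the function name `extract_mem` is made up for illustration):

```python
import json


def extract_mem(payload: bytes):
    """Return the "mem" field of a JSON state payload, or None if it is absent."""
    try:
        data = json.loads(payload)
    except ValueError:
        # Not valid JSON (or not valid UTF-8): nothing to report.
        return None
    if isinstance(data, dict):
        # .get() returns None when "mem" is missing, as in one sample above.
        return data.get("mem")
    return None
```

Returning None for a missing field matters here because, as reported earlier in the thread, the status message does not always include "mem".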

macmpi · 2025-03-10T09:03:01Z

@PianSom good tip, thanks.
However, it required quite a bit of involvement and know-how on your side to figure this out, so thanks for sharing it here.
It would definitely be helpful to have it (and other meaningful dynamic debug info) better exposed and accessible by default. Hopefully this example will help develop that.

PS: unique id might be in the form sensor.<hostname>-mem to follow HA syntax

revk · 2025-03-10T09:11:26Z

OK, there will be a simple setting in the next beta to turn this on in HA.

macmpi · 2025-03-10T09:16:11Z

Please consider a restart button in the same way (#657) 😉

gregrob · 2025-03-10T09:34:14Z

@macmpi, have you noticed the HTTP failure as frequently with the newer ca0777 2025-02-19 version? I encountered the HTTP failure (which required a restart via MQTT) with the version prior to this many times. Sometimes it happened within a day, and other times after 2 or 3 days. Since updating to the newer version, I haven’t experienced the full HTTP lockup that required a restart via MQTT in about two weeks. There have been occasional delays (around 10-15 seconds, with the error "Uncaught DOMException: An attempt was made to use an object that is not, or is no longer, usable"), but no complete lockups. I have 6 Faikin units across Alira X models, ranging from 3.5kW to 8.5kW; the smaller units use the S403 port while the bigger units use S21.

revk · 2025-03-10T09:36:51Z

Bear in mind I do update the ESP IDF periodically as well, I think I usually mention in release notes, so bugs or leaks in underlying http server may get fixed.

macmpi · 2025-03-10T09:57:25Z

@gregrob thanks for your observation.
Indeed it seems a bit less severe, but I just observed one while tracking RAM below: not locked up, but very significant stalls on web page refresh, then recovering this time. This is on 7ca0777.
Lowest at 95068

revk · 2025-03-10T10:00:50Z

Odd as default malloc is meant to get from PSRAM anyway (not sure it does). A dip like that is usually only when doing s/w upgrade.

macmpi · 2025-03-10T10:30:05Z

I let it settle for about 15 minutes and played with the web interface again: this time it went as far as ERR_CONNECTION_RESET, and homepage access does not fully recover...

Note: the upper peak at 11:45 is a successful access to the settings page, but the homepage is just continuously locked up.

macmpi · 2025-03-10T13:22:24Z

So I had to reset...
Starting from that reset (level 122 572), I did 3 accesses only:

First homepage access: decreases to 121 516, then goes back to 122 332 (~220 loss).
Next access: decreases to 121 324 (same ~220 gap), then gradually back to 122 324 (~220 loss from origin).
Third access: severe drop to 113 200, then gradual recovery (with accident) to 122 312.
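Each of the three accesses above can be reduced to two numbers: how far the figure dipped during the access, and how much was still missing afterwards relative to the post-reset level. A small sketch of that arithmetic, using only the values quoted above (the function and field names are illustrative):

```python
BASELINE = 122572  # level right after the reset

# (lowest value during the access, value after recovery), in order of access
ACCESSES = [
    (121516, 122332),
    (121324, 122324),
    (113200, 122312),
]


def summarize(baseline: int, accesses: list[tuple[int, int]]) -> list[dict]:
    """Compute the dip of each access and what was not recovered versus baseline."""
    rows = []
    before = baseline
    for low, after in accesses:
        rows.append({
            "dip": before - low,                # drop from the level before this access
            "not_recovered": baseline - after,  # shortfall versus the post-reset level
        })
        before = after
    return rows


if __name__ == "__main__":
    for number, row in enumerate(summarize(BASELINE, ACCESSES), start=1):
        print(f"access {number}: dip {row['dip']}, not recovered {row['not_recovered']}")
```

With these figures the dips come out at 1056, 1008 and 9124, and the amounts not recovered at 240, 248 and 260, which is in the same range as the ~220 quoted.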

revk · 2025-03-10T13:35:15Z

None of that is remotely low enough to be an issue - I assume PSRAM remains high. And also it is recovering, which suggests no actual leak.

macmpi · 2025-03-10T13:49:07Z

Really?... 😲
The loss may be small each time, but after a few days of accesses piling up it could become significant enough. How do you explain the big, sudden drop in the 3rd access?
Do you really find the memory plot pattern perfectly fine and expected when homepage access struggles to death?
Isn't that suspect enough to investigate with some of the memory usage inspection methods from the IDF?

revk · 2025-03-10T13:57:47Z

Yes, I expect to see a leak as a gradual trend over days.

The individual accesses will use memory in various ways, and the logs are a per-minute snapshot. It may be that every access needs that much memory and you caught just one of them. Indeed, I would be amazed at an HTTP page being served using only 10kB.

macmpi · 2025-03-10T14:05:16Z

I can only provide traces I'm given.
If nothing is built into the code to provide relevant info per your judgement, I'm afraid it's going to be impossible to debug this.
What's your plan to get to the bottom of this? Happy to help if I can.

revk · 2025-03-10T14:07:11Z

Err, in my judgement.

A leak will be a gradual downward trend over days.
A spike of 10KB or 20KB is not at all surprising, as it is very likely that such an amount is needed to serve the page.

So the code to provide relevant information per my judgement is there for that!

macmpi · 2025-03-10T14:40:52Z

OK, whereas there should not be a bug, it definitely seems there is one, right?
I did provide a trace this morning with the relevant data points you built into the code: the bug did not need days to materialize, just 30 minutes until homepage lockup (10:45 to 11:15 and on).
How do we make progress from there, then?

revk · 2025-03-10T14:42:51Z

Feel free to find the bug and do an MR.

I still have not reproduced it, but a memory leak is one thing to consider, and one that will take some days of trends to see.

macmpi · 2025-03-10T16:03:55Z

I think there is a misunderstanding. How do you interpret the first trace (the long version), knowing that:

the last reset was about 12h before, and it was sitting idle (MQTT/HA only) around the 120k level at 10:45;
the first series of homepage accesses was then mostly fine, recovering to around the 120k level, with just MQTT/HA operations continuing for a little while;
the second series of accesses (sporadic manual page reloads from PC and iPhone for a minute or so) from 11:15 brought the homepage to death, while only the settings page access at 11:45 happened to go through.

Isn't the mem pattern from 11:15 to the end unexpected?
Any thoughts on what can cause this?
This materializes in a couple of minutes, not days of operation.

eMeF1 · 2025-03-12T06:57:25Z

Only to encourage the fix: Three (out of three) Faikins around me face the same issue. I try to avoid using the HTTP page as much as possible; otherwise, a restart is needed every 30 minutes. Very unstable and annoying:(

revk · 2025-03-12T06:58:43Z

I would love to get to the bottom of this - are the new RAM/PSRAM HA graphs showing memory leaks at all?

macmpi · 2025-03-12T11:12:44Z

> are the new RAM/PSRAM HA graphs showing memory leaks at all?

Can't find new sensors on HA. Updated to 7a94561 beta, restarted HA.

revk · 2025-03-12T11:14:04Z

There is a setting to enable it

macmpi · 2025-03-14T08:57:34Z

FWIW, enclosing a few data points from the provided sensors during problematic homepage accesses under 7a94561 (it seems to be more frequent when using an mDNS-based address)
history.csv

revk · 2025-03-14T09:13:19Z

I am not 100% sure of the reliability of mDNS, so yes, can we avoid using mDNS for these tests so we focus on one thing.

macmpi · 2025-03-14T10:04:15Z

same with 7a47974 history.csv
If MDNS is on the suspects list, is there a way to disable it, or log any debug data?

revk · 2025-03-14T10:10:35Z

I have just done a beta with more httpd stack

macmpi · 2025-03-14T10:30:26Z

> I have just done a beta with more httpd stack

Yes, and my previous comment was a report made after that beta (note the commit version) 😉

revk · 2025-03-14T10:36:02Z

Well it was worth a try.

macmpi · 2025-03-14T10:53:57Z

Re mDNS: it seems it got bumped to 1.8.0 last week: https://github.com/espressif/esp-protocols/tree/master/components/mdns (they also worked on memory allocation for it last month in connection with the mosquitto release...)

revk · 2025-03-14T11:26:52Z

I do update esp idf from time to time as well.

eskey0 assigned revk Apr 8, 2024

revk closed this as completed Apr 8, 2024

revk reopened this Aug 18, 2024
