I'm currently porting mainline Linux to the Google Pixel 3 XL (google-crosshatch). Google and Qualcomm have already upstreamed SDM845 support, so all we should need is a device tree to boot the phone.
(I'm making this an issue because I think it's a better place to keep a discussion than the Matrix chat. I'll move this to the wiki once it's complete.)
Current status: USB networking works.
The screen currently doesn't work, although it can be used if USB is disabled.
I'm currently compiling and booting the kernel outside of pmbootstrap, because I want incremental kernel builds.
I'm using the Linux master branch (a bit past 4.20-rc6, commit 65e08c5e86311143f45c3e4389561af3107fc8f6)
I copied config-postmarketos-mainline.aarch64 from pmaports into my kernel's .config, then ran make menuconfig and saved, to accept the defaults for newly added options
I copied the device tree from sdm845-mtp.dts to sdm845-crosshatch.dts and added it to the Makefile
I built the kernel, but intentionally added code to reboot before device initialization
I haven't changed the device tree yet, and I don't want it to damage my phone
I built a boot.img
fastboot boot accepts the boot.img
device reboots immediately after boot, as expected, confirming that the mainline kernel boots all the way to start_kernel
Edit December 19:
updated the device tree with regulator information and ramoops/memory reserve nodes
booted the mainline kernel with the resulting tree
nothing happens: no USB
made a patch that dumps the kernel log to RAM and then reboots when pmOS tries to set up USB networking
log shows the USB controller isn't being detected
What I still need to do:
Verify regulator configuration matches between downstream and mainline kernel (done)
Confirm whether mainline even has USB device support (SDM845 MTP forces host on the main USB Type-C port)
Copy other parts of the device tree from downstream (memory reservations, ramoops, etc)
Try to boot the device with updated device tree and see how it crashes
Edit December 19:
Debug the USB networking
Questions and topics:
DTBO
The Pixel 3 XL has two device trees:
the base device tree, appended to the end of the kernel
an overlay, in the dtbo partition
At boot time, the bootloader merges the two device trees and passes the result to the kernel.
Google's intention is that the kernel's device tree would contain only SoC-specific info, and the dtbo partition would contain device-specific info.
I don't think this fits well with mainline's use of device trees, where each board has only one device tree file that includes both SoC and board info.
Thus, I made an empty device tree, with no extra nodes beyond the device name. I built this into a dtbo image and flashed it. So now all device tree information should come from the device tree included in the kernel.
Is this the correct approach? I can split Pixel 3 XL specific parts into the dtbo later, if needed.
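To sanity-check a dtbo image like that before flashing it, something like the following could work. It's a sketch assuming the AOSP DTBO partition layout: a big-endian dt_table_header (magic 0xd7b7ab1e) followed by dt_table_entry records that point at FDT blobs (magic 0xd00dfeed). The function name dtbo_count_entries is hypothetical, not part of any existing tool. An image built from a single empty overlay should report exactly one entry.

```c
#include <stddef.h>
#include <stdint.h>

/* AOSP DTBO partition format: all header and entry fields are big-endian. */
#define DT_TABLE_MAGIC 0xd7b7ab1eu
#define FDT_MAGIC      0xd00dfeedu
#define DT_HEADER_SIZE 32u  /* 8 x uint32_t in dt_table_header */
#define DT_ENTRY_SIZE  32u  /* dt_size, dt_offset, id, rev, custom[4] */

static uint32_t be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns the number of overlay entries, or -1 if the image looks malformed. */
int dtbo_count_entries(const uint8_t *img, size_t len) {
    if (len < DT_HEADER_SIZE || be32(img) != DT_TABLE_MAGIC)
        return -1;
    uint32_t total_size  = be32(img + 4);
    uint32_t entry_size  = be32(img + 12);
    uint32_t entry_count = be32(img + 16);
    uint32_t entries_off = be32(img + 20);
    if (total_size > len || entry_size < DT_ENTRY_SIZE)
        return -1;
    for (uint32_t i = 0; i < entry_count; i++) {
        size_t e = (size_t)entries_off + (size_t)i * entry_size;
        if (e + DT_ENTRY_SIZE > len)
            return -1;
        uint32_t dt_size   = be32(img + e);
        uint32_t dt_offset = be32(img + e + 4);
        /* Each entry must point at a complete blob starting with the FDT magic. */
        if (dt_size < 4 || (size_t)dt_offset + dt_size > len ||
            be32(img + dt_offset) != FDT_MAGIC)
            return -1;
    }
    return (int)entry_count;
}
```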
Regulators: HPM mode
According to the guide, I need to verify the regulator voltages.
Regulators all default to HPM mode w/ no ability to switch modes. Future patches can switch things to LPM and possibly add dynamic load switching if we have determined there's a benefit. This should only be done for rails where we'll actually be able to take advantage of the lower power modes so we don't need to churn with lots of patches adding regulator_set_load() calls to drivers.
I just want to confirm: is this just for power saving? If I keep everything in HPM mode, would it damage anything?
Getting debug output: pstore-ramoops
I don't have a debug cable, and I don't trust myself to make one.
Instead, I plan to use pstore-ramoops to get logs in RAM. While the Pixel 3 XL has an encrypted pstore-ramoops (which mainline doesn't support), it does offer a /dev/access-ramoops device to dump that memory region directly, bypassing encryption.
I'll try to set up ramoops in mainline (unencrypted), then reboot into TWRP and read /dev/access-ramoops to get the ramoops log.
Does this sound feasible? Is there a better option that doesn't require soldering?
Can I force the kernel to keep the /sys/fs/pstore/console-ramoops file even on successful reboot? My Nexus 6P's kernel does this, but later kernels seem to only keep the log if the kernel panics. Previously I tried forcing a kernel oops when I need logging, but there's gotta be a better way.
(Or maybe accessing the memory region directly would give me the log even on successful reboot? I have no idea how/when pstore-ramoops wipes the log on successful reboot)
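If the raw dump route works, the console text still has to be pulled out of the ramoops zone. A sketch, assuming the dump starts at a zone header laid out like the kernel's struct persistent_ram_buffer (32-bit sig "DBGC", start, size, then data) and that ECC is disabled. Where the console zone sits inside the reserved region depends on the record-size/console-size values in the ramoops node; prz_extract is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* PERSISTENT_RAM_SIG from the kernel's fs/pstore/ram_core.c ("DBGC"). */
#define PERSISTENT_RAM_SIG 0x43474244u
#define PRZ_HEADER_SIZE    12u  /* sig, start, size: three 32-bit words */

static uint32_t le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Copy the log stored in one ramoops zone to `out`, oldest byte first.
   `zone` points at the zone header and `zone_len` is the whole zone size.
   Returns the number of bytes copied, or -1 if the zone is invalid or
   `out` is too small. Assumes ECC is disabled. */
long prz_extract(const uint8_t *zone, size_t zone_len,
                 uint8_t *out, size_t out_len) {
    if (zone_len < PRZ_HEADER_SIZE || le32(zone) != PERSISTENT_RAM_SIG)
        return -1;
    size_t capacity = zone_len - PRZ_HEADER_SIZE;
    size_t start = le32(zone + 4);  /* next write position */
    size_t size  = le32(zone + 8);  /* bytes in use, capped at capacity */
    if (size > capacity || start > size || size > out_len)
        return -1;
    const uint8_t *data = zone + PRZ_HEADER_SIZE;
    /* Ring buffer: data[start..size) is older than data[0..start), the same
       order persistent_ram_save_old() uses when it copies the old log. */
    memcpy(out, data + start, size - start);
    memcpy(out + (size - start), data, start);
    return (long)size;
}
```

This returns the log in write order even after the ring buffer has wrapped.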
HPM is just for power saving; you are fine with every regulator being in HPM.
Pstore should work fine; IIRC the ramoops node can just be copied over from 4.9, as it uses the same version.
For USB, remove all nodes relating to usb_2* and change the usb_1 node to use peripheral mode.
For the DTBO, just do as you are currently doing and add a stub dtb.
If you need output and the device isn't rebooting, you will probably need to cause a panic, but sometimes the bootloader will clear memory if panic isn't set.
Tested booting mainline. It didn't boot, but I got the dmesg up to init.
Odd things:
it still says "Machine model: Qualcomm Technologies, Inc. SDM845 MTP". This is wrong: the dtbo's supposed to override that. Will look into it.
"ramoops ramoops: failed to locate DT /reserved-memory resource" - OK, I can't write device tree files. Will try to fix.
What I think I still need to do:
Fix the ramoops device node
See if I'm missing any modules. (I already have CONFIG_USB_DWC3_QCOM=y in .config; not sure what I'm missing?)
What I've done:
Updated regulators.
Tried booting the kernel
Result: absolutely nothing happens; device stays on the initial Google splash screen; no USB.
I tried diagnosing this through pstore-ramoops (which I had to turn on in menuconfig. Why doesn't our kernel config enable this?), but as you can see in the dmesg, pstore thinks my device tree node isn't valid. Sigh.
I resorted to a more stupid way to get the dmesg out, by ioremap'ing the region (usually used by ramoops) and memcpy'ing the kernel log there.
(I may switch to using an initramfs hook to write to /dev/kmem or /dev/mem, but this is a quick and dirty solution.)
I'm pushing the device tree source right now if you want to see what I screwed up.
Edit 1: fixed the ramoops node:
printk: console [pstore-1] enabled
pstore: Registered ramoops as persistent store backend
ramoops: attached 0x200000@0xa1810000, ecc: 0/0
Still nothing shows up in pstore when rebooted into TWRP. Sigh. (/>_<)/
Edit: ok why is USB configfs disabled in our mainline kernel config (and in armhf)?! How does nokia-n900 or sony-amami (the two other devices using linux-postmarketos-stable) do usb?
The -qcom kernel does have it enabled, but that kernel variant seems to be for armhf devices only.
Regarding dmesg writing, if you have not seen it, this might be helpful.
CONFIG_CONFIGFS_FS=y is enabled in both the aarch64 and armhf configs (here).
These devices must use configfs for USB, as we only have configfs and "android_usb" supported in the initramfs.
(USB networking on the N900 doesn't always work, though; maybe playing with those options would help.)
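For anyone unfamiliar with the configfs route, here is a rough C equivalent of a typical configfs RNDIS gadget setup, which an initramfs would normally do with mkdir/echo/ln. The gadget name g1, the placeholder IDs 0x1d6b/0x0104 and the strings are illustrative assumptions, not what the pmOS initramfs actually writes. The UDC name is whatever appears in /sys/class/udc (presumably a600000.dwc3 for this SoC's first controller).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Equivalent of `echo -n value > dir/rel`. */
static int write_attr(const char *dir, const char *rel, const char *value) {
    char path[512];
    snprintf(path, sizeof path, "%s/%s", dir, rel);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* mkdir that tolerates directories configfs already created for us. */
static int make_dir(const char *path) {
    return (mkdir(path, 0755) == 0 || errno == EEXIST) ? 0 : -1;
}

/* Create an RNDIS gadget under gadget_dir (normally something like
   /sys/kernel/config/usb_gadget/g1; use an absolute path so the symlink
   target resolves) and bind it to the controller `udc`. */
int setup_rndis_gadget(const char *gadget_dir, const char *udc) {
    static const char *const subdirs[] = {
        "strings", "strings/0x409", "functions", "functions/rndis.usb0",
        "configs", "configs/c.1", "configs/c.1/strings",
        "configs/c.1/strings/0x409",
    };
    char path[512], link[512];
    if (make_dir(gadget_dir) != 0)
        return -1;
    for (size_t i = 0; i < sizeof subdirs / sizeof subdirs[0]; i++) {
        snprintf(path, sizeof path, "%s/%s", gadget_dir, subdirs[i]);
        if (make_dir(path) != 0)
            return -1;
    }
    /* Placeholder IDs and strings (hypothetical, not pmOS's values). */
    if (write_attr(gadget_dir, "idVendor", "0x1d6b") != 0 ||
        write_attr(gadget_dir, "idProduct", "0x0104") != 0 ||
        write_attr(gadget_dir, "strings/0x409/manufacturer", "postmarketOS") != 0 ||
        write_attr(gadget_dir, "strings/0x409/product", "Pixel 3 XL") != 0 ||
        write_attr(gadget_dir, "configs/c.1/strings/0x409/configuration", "rndis") != 0)
        return -1;
    /* Attach the RNDIS function to configuration c.1. */
    snprintf(path, sizeof path, "%s/functions/rndis.usb0", gadget_dir);
    snprintf(link, sizeof link, "%s/configs/c.1/rndis.usb0", gadget_dir);
    if (symlink(path, link) != 0 && errno != EEXIST)
        return -1;
    /* Writing UDC binds the gadget to the controller; in the kernel this is
       handled by gadget_dev_desc_UDC_store(). */
    return write_attr(gadget_dir, "UDC", udc);
}
```

The final UDC write is the step that actually starts the controller in gadget mode, so it's a natural place to stop when bisecting USB problems.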
In case you don't get further here, consider joining ##linux-msm on freenode and asking for help there as well.
Don't get discouraged, you're doing a great job and mainlining is hard.
Thank you so much for working on this @zhuowei!
(I think I actually based my config on the -mainline kernel's config, not -stable, but the two configs only differ in minor networking options)
The -qcom kernel has these enabled. We could probably copy the same options over.
I don't have an N900 or any other device using -stable. I'm guessing the N900 uses one of these options for USB networking instead? None of them are in the -stable kernel's aarch64 config; they're only enabled on armhf.
Re RAM console: I did take a look at the guide; the downstream ram_console isn't in mainline anymore; it's been merged with pstore-ramoops (https://lwn.net/Articles/497881/). (No, I'm not forward-porting it just for this.)
I have the ramoops node added to the device tree, and I do get the correct message at boot:
ramoops: attached 0x200000@0xa1810000, ecc: 0/0
which matches the downstream kernel. From TWRP dmesg:
but console-ramoops doesn't show up. Not even panicking the kernel causes anything to show up in TWRP after a reboot. I'm guessing I'm still not using pstore/ramoops properly, so I switched back to the stupid ioremap approach. It works, is simple, and the downstream kernel has the perfect interface for it. (Not as good as a serial cable, but I'll take it.)
I'm not discouraged - I just thought that this is probably a good time to stop for more research: I've solved one problem (USB doesn't show up because its driver waits for clock) and gained another one I don't know how to solve (clock driver freezes phone).
Thanks for the advice and encouragement!
Edit: I was looking at other mainlining attempts on the wiki. The Nexus 9 port uses the clk_ignore_unused kernel cmdline parameter, which supposedly fixes some clock issues, so I rebuilt boot.img with it.
What you should do is base your config on the defconfig (run make defconfig); this will enable all the useful options for arm64.
Once you have done that, you should enable PSTORE_RAMOOPS and USB_ETH. If USB works, you will get a notification in dmesg about a new rndis device.
If pstore still doesn't work, I would suggest diffing the drivers between 4.20 and 4.9; it's likely there is something obvious breaking it.
What specific clock is USB freezing on? If it's a branch clock, it could mean its parent RCG isn't on.
New progress: got dmesg on the kernel panic caused by the clock driver.
I was pretty sure that the freeze I got after enabling the clock driver was from a kernel panic. How do I prove that?
From the Nintendo Switch port, I remembered that fail0verflow, the developers who mainlined the Switch, used a kernel option to reboot on panic. I searched around; it turns out I need to add panic=1 to the cmdline to cause a reboot one second after a panic.
After one second, the kernel indeed rebooted, confirming it was a panic.
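A tiny helper for double-checking that a boot.img cmdline actually carries the option before flashing. The semantics in the comment follow the kernel's documentation for panic= (N > 0: reboot after N seconds; 0: wait forever; N < 0: reboot immediately); cmdline_panic_timeout is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/* Looks for "panic=N" as a separate word in a kernel command line.
   Returns 1 and stores N in *timeout if found, 0 otherwise. Per the kernel's
   parameter documentation: N > 0 reboots N seconds after a panic, N == 0
   waits forever, N < 0 reboots immediately. */
int cmdline_panic_timeout(const char *cmdline, long *timeout) {
    int found = 0;
    for (const char *p = cmdline; (p = strstr(p, "panic=")) != NULL; p += 6) {
        /* Require a word boundary so e.g. "softlockup_panic=1" doesn't match. */
        if (p != cmdline && p[-1] != ' ')
            continue;
        char *end;
        long v = strtol(p + 6, &end, 10);
        if (end != p + 6 && (*end == ' ' || *end == '\0')) {
            *timeout = v;  /* keep scanning: a later panic= overrides */
            found = 1;
        }
    }
    return found;
}
```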
I added a patch in kernel/panic.c to dump the log to ram before rebooting, and I got this panic backtrace.
Looks like the clock driver initialized correctly, but the dwc3 driver hit a null pointer while setting the clock, causing a kernel panic.
@opendata26 re. defconfig: I don't think the arm64 defconfig has been updated for SDM845 yet:
$ grep SDM845 arch/arm64/configs/defconfig
$
$ mkdir build2
$ make O=build2 defconfig
make[1]: Entering directory '/home/zhuowei/linux/build2'
  HOSTCC  scripts/basic/fixdep
  GEN     Makefile
  HOSTCC  scripts/kconfig/conf.o
  YACC    scripts/kconfig/zconf.tab.c
  LEX     scripts/kconfig/zconf.lex.c
  HOSTCC  scripts/kconfig/zconf.tab.o
  HOSTLD  scripts/kconfig/conf
*** Default configuration is based on 'defconfig'
#
# configuration written to .config
#
make[1]: Leaving directory '/home/zhuowei/linux/build2'
$ grep 845 build2/.config
CONFIG_ARM64_ERRATUM_845719=y
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_PINCTRL_SDM845 is not set
# CONFIG_SDM_CAMCC_845 is not set
# CONFIG_SDM_GCC_845 is not set
# CONFIG_SDM_VIDEOCC_845 is not set
# CONFIG_SDM_DISPCC_845 is not set
# CONFIG_MMA8452 is not set
I could try it, I guess.
Edit 2: I added some logging in clk_hw_get_parent_by_index, which showed me that it's trying to get one of the parents of gcc_usb30_prim_mock_utmi_clk_src.
Let's look up all these clocks: the last three are present in qcom/gcc-sdm845.c, but the first one, "bi_tcxo", isn't. It's in a different file, qcom/clk-rpmh.c.
That file is behind another config option, CONFIG_QCOM_CLK_RPMH, which wasn't enabled.
Doing a search in menuconfig (/ in the menu) shows that this option depends on CONFIG_QCOM_RPMH, which isn't enabled. So the kernel just silently disabled it on me without telling me. (/>_<)/ _|linux|_.
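Since Kconfig silently drops options whose dependencies aren't met, it may be worth checking the generated .config rather than trusting what was selected in menuconfig. A minimal sketch; config_enabled is a hypothetical helper, and the option names in the usage note are the ones mentioned in this thread.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `name` (e.g. "CONFIG_QCOM_RPMH") is set to y or m in the
   .config at `path`, 0 if it is absent or "is not set", -1 on I/O error. */
int config_enabled(const char *path, const char *name) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char line[512];
    size_t n = strlen(name);
    int result = 0;
    while (fgets(line, sizeof line, f)) {
        /* Require '=' right after the name so CONFIG_QCOM_RPMH doesn't
           match CONFIG_QCOM_RPMHPD. */
        if (strncmp(line, name, n) == 0 && line[n] == '=') {
            result = (line[n + 1] == 'y' || line[n + 1] == 'm');
            break;
        }
    }
    fclose(f);
    return result;
}
```

For example, checking CONFIG_QCOM_RPMH and CONFIG_QCOM_CLK_RPMH after make olddefconfig would show CONFIG_QCOM_CLK_RPMH as unset even if it had been selected earlier.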
Discovered by adding the following debug message when configuring clocks:
diff --git a/drivers/clk/qcom/clk-rcg2.c b/drivers/clk/qcom/clk-rcg2.c
index 6e3bd195d012..7fbcb8be0a2f 100644
--- a/drivers/clk/qcom/clk-rcg2.c
+++ b/drivers/clk/qcom/clk-rcg2.c
@@ -192,6 +192,8 @@ static int _freq_tbl_determine_rate(struct clk_hw *hw, const struct freq_tbl *f,
 	struct clk_rcg2 *rcg = to_clk_rcg2(hw);
 	int index;
 
+	pr_info("clk = %s\n", clk_hw_get_name(hw));
+
 	switch (policy) {
 	case FLOOR:
 		f = qcom_find_freq_floor(f, rate);
That worked around the USB clock panic on my Razer Phone 2 but I still don't have working USB.
I'm pretty sure we need to port a Type C driver from downstream (looks like an I2C device).
Overall I'm in a similar place with my mainline kernel port.
I wrote a display panel driver but it's not loading. I do have a pstore ramoops console though.
@zhuowei: I highly doubt my workaround is correct, essentially it just hides the clock so the kernel doesn't touch it. FWIW, I have all the RPMH configs enabled (CONFIG_REGULATOR_QCOM_RPMH, CONFIG_QCOM_CLK_RPMH, CONFIG_QCOM_RPMH, CONFIG_QCOM_RPMHPD) and they all probe.
and that didn't seem to work. Is your config based on linux-postmarketos-mainline or -stable?
I can try CONFIG_QCOM_RPMHPD and see if that changes anything.
Edit 2: enabled CONFIG_SERIAL_QCOM_GENI, CONFIG_SERIAL_QCOM_GENI_CONSOLE, CONFIG_QCOM_COMMAND_DB, and CONFIG_QCOM_GENI_SE based on that defconfig. Let's see what happens.
@zhuowei: I based my config on a previous porting attempt for another device (the msm8996/SD820 based ZTE Axon 7), which was in turn basically made from scratch.
I also applied a few patches from the linux-arm-msm mailing list; see https://patchwork.kernel.org/project/linux-arm-msm/list/?series=59555 for RPMH power domains. As for CONFIG_QCOM_RPMH, you'll probably need additional out-of-tree DT patches. I can't remember them off the top of my head, but I'll be back in a few hours.
@DrGitX I'm just using 4.20 and its sdm845-mtp.dts with no out-of-tree patches. Are there any patches that I absolutely need to add into 4.20 to boot on sdm845?
Edit: as far as I can tell, after enabling all these options, the device now reboots without a panic, suggesting that the firmware/TrustZone is forcing a reboot. I guess I have to do the protected clock workaround, then...
Edit 2: or not! I grabbed the log, and it seems it didn't reboot from firmware; it is hitting the reboot I added during USB initialization. So it seems the kernel configs avoided the panic, at any rate.
platform a600000.dwc3: Retrying from deferred list
bus: 'platform': driver_probe_device: matched device a600000.dwc3 with driver dwc3
bus: 'platform': really_probe: probing driver dwc3 with device a600000.dwc3
dwc3 a600000.dwc3: no pinctrl handle
of_clk_set_defaults called with dwc3, <NULL>, dwc3@a600000, not supplier
dwc3 a600000.dwc3: Failed to get clk 'ref': -2
dwc3 a600000.dwc3: failed to initialize core
platform a600000.dwc3: Driver dwc3 requests probe deferral
platform a600000.dwc3: Added to deferred list
So I guess I'm still missing a clock or two...
Edit 3: other fails:
the USB PHY doesn't probe; error:
qcom-qusb2-phy 88e2000.phy: Failed to get supply 'vdda-pll': -517
the regulators that the USB PHY depends on can't be loaded; error:
ldo19: failed to get the current voltage(-131)
Sigh.
Note that ldo19 is changed in downstream (it supplies a constant 3.3 V for the touchscreen, instead of 3.0 V in the upstream -mtp). I changed it in mainline to match; maybe I screwed it up?
It looks like when initializing rpmh regulators (like ldo19), the kernel can't read the initial voltage settings, so must set the voltages again. Maybe ldo19 needs some other settings to be changed before rpmh would allow setting it to 3.3 volts?
(On downstream kernel, this works just fine. From TWRP:
Edit: I commented out the entire ldo19 regulator. Now the device simply shuts down on boot.
I can tell this is a shutdown because when this device reboots, pressing Volume Down gets into the bootloader. However, here, pressing Volume Down after the screen goes blank has no effect; once the button is released, the device starts booting (and ignores the Volume Down).
This matches the behaviour of a shutdown followed by an automatic power-on in charger mode.
There's nothing in the in-memory dmesg.
This suggests that the kernel boot actually triggered the device's self-protection; maybe I've over- or undervolted something...
I haven't gotten the kernel to boot further, but I did spend today getting the screen sort-of working on mainline, by reusing the bootloader's framebuffer.
This is completely useless for me: this obviously won't support acceleration, and I can already get kernel logs out of access-ramoops. However, this may be useful for other Snapdragon 835/845 mainlining projects.
The bootloader on this phone sets up the screen to show the boot splash.
On old Qualcomm devices (e.g. Snapdragon 810), the bootloader freezes the screen before the kernel loads. However, the Snapdragon 845 uses UEFI firmware, which can boot Windows. Windows requires the screen to keep working after the bootloader exits, so it can display its boot progress bar without drivers. Therefore, the Snapdragon 845 doesn't freeze the screen at kernel boot.
Thus, we can write to the framebuffer that was setup by the bootloader.
This address isn't passed to Linux kernels, so I found it the hard way: by binary-searching through all of memory (set half of the remaining range to 0xff00ff00, check whether the screen turns green, repeat).
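The search above is a plain bisection. A sketch, assuming each "fill and look" step can be modelled as a callback that reports whether the first screen line turned green (0xff00ff00 reads as opaque green in a 32-bit ARGB/XRGB layout). The function names and the simulated probe are hypothetical; on the phone, each probe is a manual experiment. Narrowing 4 GiB down to one 4 KiB page takes 20 rounds.

```c
#include <stdint.h>

/* One probe: fill [lo, hi) with 0xff00ff00 and report whether the top of
   the screen turned green (nonzero) or not (0). */
typedef int (*fill_and_look_fn)(uint64_t lo, uint64_t hi, void *ctx);

/* Bisects [lo, hi) until it is at most `granule` bytes wide and returns its
   lower bound, which then contains the start of the framebuffer. */
uint64_t find_framebuffer(uint64_t lo, uint64_t hi, uint64_t granule,
                          fill_and_look_fn look, void *ctx) {
    while (hi - lo > granule) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (look(lo, mid, ctx))
            hi = mid;  /* first line went green: start is in the lower half */
        else
            lo = mid;  /* otherwise it must be in the upper half */
    }
    return lo;
}

/* Simulated probe for trying the search offline; ctx points at a made-up
   framebuffer start address. */
int simulated_look(uint64_t lo, uint64_t hi, void *ctx) {
    uint64_t fb = *(const uint64_t *)ctx;
    return fb >= lo && fb < hi;
}
```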
Later, I found there's an easy way: the address is written in plain text in the bootloader's configuration section. All I needed was to extract the factory image, then run:
And screen worked. Penguins showed at the top of the fbcon at boot, and after a while the postmarketOS splash showed up.
I wasn't able to get kernel logs displaying on screen (there's just a blinking cursor), but I think that's because I need to pass a command line param to enable the console.
It turns out the problems I had were entirely my mistakes. Now I'm stuck at the same stage as the Razer Phone mainlining: the USB controller is detected, but nothing shows up.
Mistake 1: no, the phone wasn't triggering self-protection when the screen suddenly turned black; that's just the regulator code deciding to cut power to the screen since nobody's using it.
Mistake 2: the USB2 PHY transceiver won't probe. Adding a ton of printk statements to drivers/phy/qualcomm/phy-qcom-qusb2.c showed that I needed to enable CONFIG_QCOM_QFPROM.
(I really should just give up and use the arm64 defconfig... Why doesn't the linux-postmarketos-stable kernel have all the Qualcomm options enabled? Ugh.)
Mistake 3: even after the USB2 PHY probed, the actual USB controller (dwc3) still refused to probe.
This device has USB3, but I wanted to start with USB2, so I tried disabling USB3.
I originally disabled USB3 in the kernel config, thinking that the DWC3 USB driver would automatically fall back to USB2. Haha, nope. After adding another bunch of printks to dwc3_core_init, I found that I had to manually disable USB3 in the device tree.
With this, the USB controller probes, but the device never shows up over USB, and the device reboots a minute later.
This sounds like same state that @DrGitX encountered on the Razer Phone, but I need to add more logging to be sure.
I'm not sure what to do next. Probably figure out what's doing the reboot (is it the clock panic that was worked around on the Razer Phone, or some watchdog?).
Edit: re. the reboot:
It's not a panic, since I copy logs to RAM on panic, so I would've seen the panic trace in the dmesg. So it's probably watchdog or self protection.
Adding the USB3 clocks to the list of protected clocks, per DrGit's comment, didn't help.
I'm not ruling out that I added a reboot for debugging and forgot to remove it...
Edit 2: I wonder if the display would be easier to get working than USB. Mainline's about to get support for the display on SDM845.
Edit 3: made a quick patch to keep the display on after probing regulators: now I can get kernel logs on screen via "console=tty1". (Will look at enabling the display properly later - see Edit 2)
Still not sure what's causing the reboot, so I'm rebuilding the kernel with the stock arm64 defconfig to make sure all required drivers are built.
Edit 4: rebuilt kernel with Linux's own defconfig.
The new kernel reboots in the exact same way as the old one - and only after probing USB.
I noticed that the cursor on screen freezes before the reboot - so either the device hangs before the reboot, or the firmware's saving diagnostic information prior to the reboot.
My RAM console doesn't seem to survive past this reboot.
@opendata26 Any advice? I'm still stuck on the reboot when USB is enabled.
Here's what I checked:
just probing USB is fine. I patched gadget_dev_desc_UDC_store to always return an error, and there's no reboot. The reboot only happens if I try to enable USB networking.
dwc3_gadget_start runs fine (prints both enter and exit).
dwc3_process_event_buf never runs (added a printk in there, but never gets printed on screen)
It's not a kernel panic (I would see logs printed to screen, and it won't reboot). It's probably firmware or hardware self protection?
Is there a way to check why a Qualcomm device rebooted? Alternatively, would the boot reason be dumped over serial?
Edit: After a usb reboot, I get this in TWRP dmesg:
[ 0.368558] qcom,qpnp-power-on c440000.qcom,spmi:qcom,pm8998@0:qcom,power-on@800: PMIC@SID0 Power-on reason: Triggered from Hard Reset and 'cold' boot
[ 0.368593] qcom,qpnp-power-on c440000.qcom,spmi:qcom,pm8998@0:qcom,power-on@800: PMIC@SID0: Power-off reason: Triggered from PS_HOLD (PS_HOLD/MSM controlled shutdown)
[ 0.368810] input: qpnp_pon as /devices/virtual/input/input0
[ 0.369565] qcom,qpnp-power-on c440000.qcom,spmi:qcom,pmi8998@2:qcom,power-on@800: No PON config. specified
[ 0.369624] qcom,qpnp-power-on c440000.qcom,spmi:qcom,pmi8998@2:qcom,power-on@800: PMIC@SID2 Power-on reason: Triggered from PON1 (secondary PMIC) and 'cold' boot
[ 0.369656] qcom,qpnp-power-on c440000.qcom,spmi:qcom,pmi8998@2:qcom,power-on@800: PMIC@SID2: Power-off reason: Triggered from GP1 (Keypad_Reset1)
This seems to suggest that it's the firmware that shut down the phone, not the hardware: a hardware shutdown (like the one I get by holding the power button down for a while) is clearly indicated:
[ 0.369355] qcom,qpnp-power-on c440000.qcom,spmi:qcom,pm8998@0:qcom,power-on@800: PMIC@SID0: Power-off reason: Triggered from KPDPWR_AND_RESIN (Simultaneous power key and reset line)
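To pull the reason out of a long TWRP dmesg quickly, a small helper could look like this. It only relies on the "Power-off reason: Triggered from" wording shown in the log lines above; pon_poweroff_reason is a hypothetical name.

```c
#include <string.h>

/* Copies the text after "Power-off reason: Triggered from " in a
   qpnp-power-on log line into `out` (truncated to out_len - 1 bytes).
   Returns 1 if the phrase was found, 0 otherwise. */
int pon_poweroff_reason(const char *line, char *out, size_t out_len) {
    static const char key[] = "Power-off reason: Triggered from ";
    const char *p = strstr(line, key);
    if (!p || out_len == 0)
        return 0;
    p += sizeof key - 1;
    size_t n = strcspn(p, "\n");  /* stop at end of line */
    if (n >= out_len)
        n = out_len - 1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 1;
}
```

On the SID0 (pm8998) line, a PS_HOLD reason points at the SoC side releasing PS_HOLD, while KPDPWR_AND_RESIN points at the buttons.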
@ollieparanoid Thanks! Yeah, I noticed the -qcom kernel's missing an aarch64 defconfig.
If this ever gets added to aports, I'll probably want this device on -stable or -mainline, though, since most patches in -qcom only apply to old SoCs (i.e. Nexus 5-era 32-bit devices).
Congrats on the 600 days post, by the way!
Edit: As mentioned above, I switched to using Linux's own defconfig, which does - mostly - support sdm845.
Currently Linux does a lot of setup before enabling the dwc3 USB controller. I wondered if it's possible to not touch other parts of the USB hardware (relying on the bootloader to set them up) and just enable the dwc3 by itself.
It's not possible to bypass the USB hardware setup in Linux, as far as I know.
I decided to port a simpler OS with USB support - but without the hardware setup - and see if the same issue happens there.
I got the bare minimum of Zircon (an experimental kernel from Google) booting on the Pixel 3 XL, to see if its dwc3 implementation also triggers the reboot.
Result: it does reboot in the exact same way.
I commented out code until I found that the very first command sent to the dwc3 causes the reboot. Commenting out the command-sending code avoids the reboot.
Conclusion: it's not Linux's USB setup that's triggering this, since I reproduced it on an OS that doesn't perform USB setup.
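For readers following along, issuing a dwc3 endpoint command roughly means: write the three parameter registers, then write the command with CMDACT set and poll until the controller clears it. This is a sketch, assuming the command in question is a device endpoint command; register offsets follow Linux's drivers/usb/dwc3/core.h, while the struct regs indirection and the fake controller are hypothetical scaffolding so the sequence can be exercised without hardware. In this sequence, the final DEPCMD write is where the controller starts acting on the command.

```c
#include <stdint.h>

/* Register offsets as in Linux's drivers/usb/dwc3/core.h. */
#define DWC3_DEP_BASE(n)   (0xc800u + (uint32_t)(n) * 0x10u)
#define DWC3_DEPCMDPAR2    0x00u
#define DWC3_DEPCMDPAR1    0x04u
#define DWC3_DEPCMDPAR0    0x08u
#define DWC3_DEPCMD        0x0cu
#define DWC3_DEPCMD_CMDACT (1u << 10)

/* Hypothetical accessor indirection: MMIO on hardware, a fake in tests. */
struct regs {
    uint32_t (*read)(void *ctx, uint32_t off);
    void (*write)(void *ctx, uint32_t off, uint32_t val);
    void *ctx;
};

/* Issues one endpoint command, then polls until the controller clears
   CMDACT. Returns 0 on completion, -1 if still active after max_polls. */
int dwc3_send_ep_cmd(const struct regs *r, unsigned ep, uint32_t cmd,
                     uint32_t p0, uint32_t p1, uint32_t p2, unsigned max_polls) {
    uint32_t base = DWC3_DEP_BASE(ep);
    r->write(r->ctx, base + DWC3_DEPCMDPAR0, p0);
    r->write(r->ctx, base + DWC3_DEPCMDPAR1, p1);
    r->write(r->ctx, base + DWC3_DEPCMDPAR2, p2);
    /* This write hands the command to the controller. */
    r->write(r->ctx, base + DWC3_DEPCMD, cmd | DWC3_DEPCMD_CMDACT);
    for (unsigned i = 0; i < max_polls; i++)
        if (!(r->read(r->ctx, base + DWC3_DEPCMD) & DWC3_DEPCMD_CMDACT))
            return 0;
    return -1;
}

/* Fake controller covering the 16 endpoint register blocks (0xc800..0xc8ff). */
struct fake_dwc3 { uint32_t dep[64]; int responsive; };

uint32_t fake_read(void *ctx, uint32_t off) {
    struct fake_dwc3 *f = ctx;
    return f->dep[(off - 0xc800u) / 4];
}

void fake_write(void *ctx, uint32_t off, uint32_t val) {
    struct fake_dwc3 *f = ctx;
    if ((off & 0xfu) == DWC3_DEPCMD && f->responsive)
        val &= ~DWC3_DEPCMD_CMDACT;  /* pretend the command completed at once */
    f->dep[(off - 0xc800u) / 4] = val;
}
```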
Approach two:
So I thought about what could possibly break the dwc3's command. I realized that sending a command requires the dwc3 to read system memory, unlike all the other dwc3 operations. Maybe there's some IOMMU/SMMU protection that needs to be configured before the firmware allows memory access from the USB controller.
I tested this by editing the AOSP kernel's device tree to disable the SMMU node. The resulting kernel did reboot immediately after startup, confirming that, at least on the AOSP kernel, SMMU support is essential.
So maybe enabling the SMMU would help? There's a patch for this, so I applied it.
However, with the SMMU added to the mainline device tree, the device's screen turns black immediately after booting, and seconds later the device reboots.
I'm not sure why this happens: dumping the log shows that the SMMU was successfully probed. I'll have to check what's causing the reboot.
Edit: Tried disabling SMMU (in Zircon): this didn't work.
static void smmuDisable(void) {
    // sdm845 uses an ARM CoreLink MMU-500
    const uint64_t kSmmuBase = 0x5040000;
    const uint64_t kSMMU_sCR0 = 0;
    const uint32_t kCLIENTPD = (1 << 0);
    volatile uint32_t* ptr = (volatile uint32_t*)(kSmmuBase + kSMMU_sCR0);
    // Setting the CLIENTPD bit in sCR0 disables SMMU
    *ptr |= kCLIENTPD;
}
In Zircon, the device still freezes and reboots on first USB command, like before.
I also tried this on mainline Linux: no, it didn't help.
Thanks! Wouldn't have been possible without awesome contributors like you
Still trying to get USB working, without success.
It's great that you keep us updated in detail about your different approaches.
I wonder if there is a way to trace how the downstream kernel initializes USB, then do the same trace with mainline and compare it... sort of like wireshark but at kernel level.
If you haven't done already, I can still recommend checking out ##linux-msm on freenode.
While I'm here, let's let some low-level hackers know what's going on here; maybe they have a good tip? @McBitter, @unrznbl, @cyrozap (reading the last comment from @zhuowei is enough for context)
Finishing the kernel config for arm64 would greatly help other phones with the same arch that have Qualcomm SoCs. I made an issue for this: https://gitlab.com/postmarketOS/linux-postmarketos/issues/12
I attempted to do some diffs between the armhf configs from mainline and qcom, but there was too much to keep track of. The diffs might still help you get an idea of what should be enabled or not.
(I also passed maxcpus=1 to avoid booting up more cores, just in case that's causing issues. I'm pretty sure it's the SMMU disable code that fixed the reboot, though...) (Edit: re-enabled the other cores, and it still doesn't reset.)
With this, after "usb0: MAC (random hex digits)" is printed, the cursor continues blinking, showing that the USB controller was started without causing a reset.
Bad news: USB still doesn't work. Nothing shows up when I plug it in. (/>_<)/. This matches the behaviour on the Razer phone, at least.
This is probably a ridiculous question, but could it have anything to do with having to set up the controller to provide power to the device? Most (all?) USB-C ports on phones now tend to be DRPs (Dual Role Ports), allowing them to either take power or provide power to the other device. I wonder if that's worth investigating?
I hoped that the bootloader would leave the Type-C stuff in a working state; I guess not. I'm going to test with Zircon's USB dwc3 support to see if the USB would work if the kernel doesn't touch anything...
I guess there's no Type C support in mainline because all the upstreaming is for Google's Cheza Chromebook prototype, which handles USB-C using its own Chromebook EC.
Edit: no, Zircon's dwc3 doesn't work either. So I guess the bootloader shut down the USB port; it's not anything in Mainline that's breaking it.
OK, I think I need to ask for help here, since I've exhausted most of my ideas for debugging this.
To compare how the downstream and upstream kernels initialize USB, I tried dumping the dwc3 controller's registers from RAM and comparing the values to see what changed.
I grabbed the dwc3 configuration with /dev/mem on the downstream kernel and an ioremap in my mainline port. Later I remembered that there's an easier way to get the registers through debugfs. (oh well.)
I compared the registers, consulting TI's and Intel's dwc3 USB controller documentation.
The only difference that stood out was the error report register: GBUSERRADDR is all zeroes in the downstream kernel, indicating that there were no DMA failures, while it contains an address in my mainline kernel.
This, I guess, means DMA is still broken, even after I turned off the SMMU.
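Comparing dumps by hand is error-prone, so a small diff helper may help. A sketch, assuming both dumps were loaded as arrays of 32-bit words starting at the same register offset; the GBUSERRADDR0/1 offsets (0xc130/0xc134) follow Linux's drivers/usb/dwc3/core.h, and diff_regdumps is a hypothetical name.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Global bus error address registers, per Linux's drivers/usb/dwc3/core.h. */
#define DWC3_GBUSERRADDR0 0xc130u
#define DWC3_GBUSERRADDR1 0xc134u

/* Compares two dumps of 32-bit registers that both start at base_off and
   prints every differing register to `out` (if non-NULL). Returns the
   number of differences. */
size_t diff_regdumps(const uint32_t *a, const uint32_t *b, size_t nwords,
                     uint32_t base_off, FILE *out) {
    size_t diffs = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (a[i] == b[i])
            continue;
        uint32_t off = base_off + (uint32_t)(i * 4);
        int is_buserr = off == DWC3_GBUSERRADDR0 || off == DWC3_GBUSERRADDR1;
        if (out)
            fprintf(out, "0x%05" PRIx32 ": %08" PRIx32 " -> %08" PRIx32 "%s\n",
                    off, a[i], b[i], is_buserr ? "  (GBUSERRADDR)" : "");
        diffs++;
    }
    return diffs;
}
```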
@DrGitX Do you have this issue on your device? Can you try the steps in here to grab the regdump file from your mainline port and see if the GBUSERRADDR is set? I would really appreciate it.
Does this mean that the SMMU must be turned on and configured before the USB would work?
As I mentioned above, the smmu support in mainline causes a reboot on probe. I think this is because the downstream kernel has special support for using SMMU with the continuous splash feature (bootloader leaving the screen on). So I can't enable the SMMU on mainline.
My other options for testing this hypothesis are all inefficient and likely ineffective:
I already tried guessing where the bootloader would put its DMA buffers and keep Linux's buffers in that zone:
I passed in "mem=467M" to my kernel to make sure Linux's memory allocation also falls in the bootloader's former heap.
Nope, didn't work; with SMMU enabled device still resets; with the SMMU disabling kludge USB still doesn't work.
I could remove the SMMU from the downstream device tree, add my mainline kludge for disabling SMMU instead, and see if the USB breaks. If it breaks, this means the USB depends on SMMU enabled. Problem: other unrelated things may break if I disable the SMMU there; how would I tell that apart?
Edit: Nope. Tried building downstream kernel with the same patch I used on mainline to disable the SMMU: kernel causes the device to reboot. USB didn't show up before it rebooted.
I could modify the mainline code with that downstream patch and see if I can get SMMU probing. That might take a long time, and would be a lot of wasted effort if it turns out the SMMU has nothing to do with it.
Edit: tried this; it doesn't work. It still seems to reboot after probing the SMMU. (I added an infinite loop to make sure it doesn't reboot; it rebooted anyway, so the firmware must've detected an SMMU fault.)
I could try changing my kludge: I currently disable SMMU globally; I can disable the SMMU just for the dwc3 USB controller's stream ID. I doubt that would change anything...
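The per-stream variant would mean programming one stream-match entry instead of setting CLIENTPD. A sketch using the SMMUv2 register layout as defined in Linux's arm-smmu driver (SMR at GR0 + 0x800 + 4n, S2CR at GR0 + 0xc00 + 4n, S2CR TYPE 1 = bypass). The entry index and the dwc3 stream ID 0x740 are assumptions (check the iommus property of the dwc3 node), and whether the firmware lets the kernel write these registers at all is not established here. The function only computes the writes, so they can be inspected before poking hardware.

```c
#include <stdint.h>

/* SMMUv2 global register space 0 layout, as in Linux's arm-smmu driver. */
#define ARM_SMMU_GR0_SMR(n)  (0x800u + ((uint32_t)(n) << 2))
#define ARM_SMMU_GR0_S2CR(n) (0xc00u + ((uint32_t)(n) << 2))
#define SMR_VALID            (1u << 31)
#define SMR_MASK_SHIFT       16
#define S2CR_TYPE_SHIFT      16
#define S2CR_TYPE_BYPASS     1u

struct reg_write { uint32_t offset; uint32_t value; };

/* Fills `out` with the two GR0 writes that route stream ID `sid` (ignoring
   the bits set in `mask`) through stream-match entry `idx` as bypass.
   Returns the number of writes. */
int smmu_bypass_stream(unsigned idx, uint16_t sid, uint16_t mask,
                       struct reg_write out[2]) {
    /* S2CR first, then SMR: the same order Linux's arm-smmu driver uses, so
       the entry never starts matching into a stale S2CR. */
    out[0].offset = ARM_SMMU_GR0_S2CR(idx);
    out[0].value  = S2CR_TYPE_BYPASS << S2CR_TYPE_SHIFT;
    out[1].offset = ARM_SMMU_GR0_SMR(idx);
    out[1].value  = SMR_VALID |
                    ((uint32_t)(mask & 0x7fffu) << SMR_MASK_SHIFT) |
                    (uint32_t)(sid & 0x7fffu);
    return 2;
}
```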
I tried asking ##linux-msm for help, but I haven't gotten a response yet. What other places can I ask for advice?
@ollieparanoid Yesterday, I actually emailed two kernel developers, Doug Anderson and Bjorn Andersson, who work on sdm845 mainlining. They very kindly gave me some advice:
try Andy Gross's for-next tree, which has more sdm845 patches. (I have not switched to this yet but will do that later)
$ telnet 172.16.42.1
Trying 172.16.42.1...
Connected to 172.16.42.1.
Escape character is '^]'.
Type 'pmos_continue_boot' to continue booting:
/ # uname -a
Linux (none) 5.0.0-rc3-00073-g3c004e379cc5-dirty #49 SMP PREEMPT Thu Feb 7 11:28:26 PST 2019 aarch64 Linux
The only change I made was to disable the screen in Linux, so I wouldn't have to use the fastboot command to disable the screen (which breaks fastboot boot without flashing; having to flash every time is annoying).
I didn't change anything else, and the ep0out error magically went away, so I guess maybe USB is flaky and only works sometimes? (It worked twice in a row, though.)
Edit: USB still works fine. Now I'm trying to get the display and the internal storage to work, without success.
Here's my source. (I switched to using Andy Gross's for-next branch.)
I'm following freedreno's guide. I started by copying an existing screen's driver and changing the resolution. I haven't copied the enable commands from downstream yet, so the screen doesn't turn on. However, this is enough for the msm module to probe, so at least the display controller is working.
@zhuowei thanks for your continued work on this. I'm super excited to see your progress. I went out and bought a Pixel 3 XL so I can follow along (and help, if there is anything I can do)
I'm curious what the current status is though; from what I gather you can boot an upstream kernel, and presumably, get some sort of shell / code execution via USB-Ethernet.
Can the kernel boot from internal storage? is UFS still broken? Any other interesting things working / not working?
I could just let the bootloader enable the screen and use simplefb - which worked before - but that'll require me to mess with SMMU again and I really don't want to do that. Plus, that would probably break gpu acceleration.
@z3ntu Thanks! Do you happen to know if anyone has logs for that channel? Would like to scroll through to see if anyone has similar problems and how they resolved them.
I do have logs thanks to Matrix, but as there's no simple way of dumping them to a text file, I can't be of much help, I think. But I'd suggest just asking there; the people there don't bite ;)
Bad news: does anyone else want to take over this project?
I'm probably not going to work on mainlining anymore: I have other projects I need to attend to, so I haven't done any work on this in the last four months.
If you have a Pixel 3 XL (or a regular Pixel 3), and you're interested in mainlining: let me know by replying! I'll help you get started.
I've tried to clean up my work so someone else can pick it up:
I've rebased my changes to 5.3-rc5 in !577 (merged).
I've also extracted my Kconfig changes into a file that can be used with Linux's merge_config.sh script, to make it easy to generate Kconfigs when updating the kernel:
... the modem crashes after a few seconds and reboots the phone, because I couldn't get qcom_rmtfs working. But it's still more promising than most other mainlined devices.
This is why I think it's worth mainlining Snapdragon 845 devices. The sdm845 enjoys a level of support that's practically unheard of for modern phone SoCs, thanks to the many developers working on sdm845 upstreaming.